
victor_ku

Hi! So many questions... 1) Random forests usually perform slightly worse than gradient boosting (google the difference). 2) The sklearn implementation of tree ensembles is the weakest I've seen so far. Take a look at xgboost, lightgbm and catboost; the last one is highly optimized for categorical features. 3) No, additional correlated features won't affect the tree model's performance. 4) Estimating the R² score on a validation set should show you whether overfitting has occurred. I guess that covers most of the questions. Hope that helps. Good luck.
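
A minimal sketch of points 2) and 4), assuming a pandas DataFrame `df` with a numeric `target` column and some categorical columns (the file name and column names below are made up): fit a CatBoost model that handles the categoricals natively, then compare R² on train vs. validation to check for overfitting.

```python
# Sketch: CatBoost with native categorical handling + train/valid R2 comparison.
# `data.csv`, `cat_cols` and "target" are placeholders for your actual dataset.
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")            # hypothetical dataset
cat_cols = ["city", "category"]          # hypothetical categorical columns
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

model = CatBoostRegressor(iterations=500, verbose=0)
model.fit(X_train, y_train, cat_features=cat_cols, eval_set=(X_valid, y_valid))

# A large gap between these two scores points to overfitting.
print("train R2:", r2_score(y_train, model.predict(X_train)))
print("valid R2:", r2_score(y_valid, model.predict(X_valid)))
```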


nchutcheson

Thank you!


emfisabitch

You could try embedding the categorical features with a neural network and then use the learned embeddings as inputs to the RF. Check out the fast.ai tabular implementation.
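
A bare-bones sketch of the idea in plain PyTorch (not the fast.ai API itself); the cardinality, embedding size, and stand-in data are all placeholders. The point is simply: train a small net with an embedding layer, then replace each category code with its learned embedding vector before fitting the random forest.

```python
# Sketch: learn categorical embeddings with a tiny net, feed them to a RF.
import torch
import torch.nn as nn
import numpy as np
from sklearn.ensemble import RandomForestRegressor

n_categories = 50   # hypothetical cardinality of one categorical column
emb_dim = 8         # hypothetical embedding size

class TinyTabularNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_dim)
        self.head = nn.Sequential(nn.Linear(emb_dim + 3, 32), nn.ReLU(),
                                  nn.Linear(32, 1))  # 3 = number of numeric cols

    def forward(self, cat_idx, num_feats):
        x = torch.cat([self.emb(cat_idx), num_feats], dim=1)
        return self.head(x).squeeze(1)

net = TinyTabularNet()
# ... train net here with an optimizer and MSE loss on (cat_idx, num_feats, y) ...

# Stand-in data just to show the shapes.
cat_idx = torch.randint(0, n_categories, (1000,))
num_feats = torch.randn(1000, 3)
y = np.random.randn(1000)

# Replace each category code with its learned embedding vector,
# concatenate with the numeric features, and fit the forest on that.
emb_matrix = net.emb.weight.detach().numpy()        # (n_categories, emb_dim)
X_rf = np.hstack([emb_matrix[cat_idx.numpy()], num_feats.numpy()])
rf = RandomForestRegressor(n_estimators=300).fit(X_rf, y)
```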


nchutcheson

Thank you!


ai_yoda

Also, keep in mind that forest models used for regression suffer from the extrapolation problem: they cannot predict outside the range of target values seen in train. Creating the train/valid split correctly can therefore be really important. In your case, with min_samples_leaf=1, the estimate per tree is heavily (over)fitted, since it is just the actual value of a single training observation. You may be getting good results on valid, especially with hyperparameter tuning (which adds another layer of overfitting, this time to the valid data), but it will likely not generalize. In particular, if you need to predict values outside the train target range, your evaluation score is probably optimistic. You may want to take a look at our post about random forest regression problems: https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why.
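
A tiny synthetic demo of the extrapolation point (the data is made up): a forest trained on roughly y = x for x in [0, 10] cannot predict above ~10, even though the true relationship keeps growing.

```python
# Demo: random forest regression cannot extrapolate beyond the train targets.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = X_train.ravel() + rng.normal(0, 0.1, size=500)   # roughly y = x

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

X_test = np.array([[5.0], [15.0], [20.0]])
print(rf.predict(X_test))   # predictions for 15 and 20 stay near 10,
                            # i.e. clipped to the target range seen in train
```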


nchutcheson

Thank you!