
I have generated an artificial dataset and want to distinguish two labels from each other using a random forest.

I thought that having correlated features in my dataset would decrease the algorithm's accuracy, so I identified the correlated features in the heatmap below and removed all but one from each correlated group, so that I am left with "unique" features only.
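Roughly, such a correlation filter is commonly written like this (a minimal sketch; the pandas/NumPy approach and the 0.9 threshold are my assumptions, not given in the post):

```python
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop all but one feature from each group of highly correlated features."""
    # Absolute Pearson correlation between every pair of features
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column,
    # which keeps the first member of each correlated group
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```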

However, when I plot the forest's accuracy against the number of features used for training and testing, the accuracy is lower overall after deleting the correlated features (hence the two pictures on the right).
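The curves were produced roughly as follows (a sketch, assuming scikit-learn; `make_classification` and all parameter values stand in for the artificial data, whose exact generation is not shown in the post):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the artificial dataset; n_redundant produces linearly
# correlated features, similar to the situation described above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

train_acc, test_acc = [], []
feature_counts = range(1, X_train.shape[1] + 1)
for k in feature_counts:
    # Train on the first k features only; the feature ordering is arbitrary here
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train[:, :k], y_train)
    train_acc.append(clf.score(X_train[:, :k], y_train))
    test_acc.append(clf.score(X_test[:, :k], y_test))

plt.plot(feature_counts, train_acc, label="train accuracy")
plt.plot(feature_counts, test_acc, label="test accuracy")
plt.xlabel("number of features")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```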

The two posts (Won't highly-correlated variables in random forest distort accuracy and feature-selection? and Selecting good features) seem to concern similar issues, but I do not quite understand how they apply to my case.

How can I remove the correlation from my dataset and still achieve an increase in accuracy?

[Figure: correlation heatmap of the features, and accuracy-vs-number-of-features plots with and without the correlated features]

Philipp
  • You are overfitting a fair bit; you should reduce that through limiting the tree depth (e.g. using min_samples_leaf) or increasing n_estimators (see the sketch after these comments). It is possible that the best hyperparameters depend on the number of features and on the removal of correlated features. – Jon Nordby Mar 30 '19 at 10:59
  • @jonnor How can you see that I am overfitting the model? Thanks for the answer. – Philipp Mar 31 '19 at 15:23
  • Training accuracy significantly higher than validation/test accuracy. – Jon Nordby Mar 31 '19 at 15:51
  • Random forests, whose trees are generally built to node purity for a very good reason, overfit by definition. This doesn't reduce their performance, given that a reasonable number of trees is built. IMO you shouldn't limit the tree depth. – Scholar Mar 31 '19 at 17:59
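For reference, the regularisation suggested in the first comment translates to something like this sketch (continuing from the train/test split above; the hyperparameter values are illustrative, not tuned):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,     # more trees stabilise the ensemble average
    min_samples_leaf=5,   # larger leaves indirectly limit tree depth
    random_state=0,
)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```

Comparing the two printed scores is also how the overfitting diagnosis in the comments is made: a training accuracy far above the test accuracy signals overfitting.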

0 Answers