
I have generated an artificial dataset and want to distinguish two labels from each other using a random forest.

I thought that having correlated features in my dataset would decrease the algorithm's accuracy, so I identified the correlated features in the heatmap below and removed all but one from each correlated group, so that I am left with "unique" features only.
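Roughly, such a correlation filter is commonly written like this (a minimal sketch; the pandas/NumPy approach and the 0.9 threshold are my assumptions, not given in the post):

```python
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop all but one feature from each group of highly correlated features."""
    # Absolute Pearson correlation between every pair of features
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column,
    # which keeps the first member of each correlated group
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```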

However, when I plot the forest's accuracy against the number of features used for training and testing, the accuracy is lower overall after deleting the correlated features (hence the two pictures on the right).
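The curves were produced roughly as follows (a sketch, assuming scikit-learn; `make_classification` and all parameter values stand in for the artificial data, whose exact generation is not shown in the post):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the artificial dataset; n_redundant produces linearly
# correlated features, similar to the situation described above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

train_acc, test_acc = [], []
feature_counts = range(1, X_train.shape[1] + 1)
for k in feature_counts:
    # Train on the first k features only; the feature ordering is arbitrary here
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train[:, :k], y_train)
    train_acc.append(clf.score(X_train[:, :k], y_train))
    test_acc.append(clf.score(X_test[:, :k], y_test))

plt.plot(feature_counts, train_acc, label="train accuracy")
plt.plot(feature_counts, test_acc, label="test accuracy")
plt.xlabel("number of features")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```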

The two posts (Won't highly-correlated variables in random forest distort accuracy and feature-selection? and Selecting good features) seem to concern similar issues, but I do not quite understand how they apply to my case.

How can I remove the correlation from my dataset and still achieve an increase in accuracy?

[Figure: correlation heatmap of the features, and accuracy-vs-number-of-features plots with and without the correlated features]

Philipp
  • You are overfitting a fair bit; you should reduce that through limiting the tree depth (e.g. using min_samples_leaf) or increasing n_estimators (see the sketch after these comments). It is possible that the best hyperparameters depend on the number of features and on the removal of correlated features. – Jon Nordby Mar 30 '19 at 10:59
  • @jonnor How can you see that I am overfitting the model? Thanks for the answer. – Philipp Mar 31 '19 at 15:23
  • Training accuracy significantly higher than validation/test accuracy. – Jon Nordby Mar 31 '19 at 15:51
  • Random forests, whose trees are generally built to node purity for a very good reason, overfit by definition. This doesn't reduce their performance, given that a reasonable number of trees is built. IMO you shouldn't limit the tree depth. – Scholar Mar 31 '19 at 17:59
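For reference, the regularisation suggested in the first comment translates to something like this sketch (continuing from the train/test split above; the hyperparameter values are illustrative, not tuned):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,     # more trees stabilise the ensemble average
    min_samples_leaf=5,   # larger leaves indirectly limit tree depth
    random_state=0,
)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```

Comparing the two printed scores is also how the overfitting diagnosis in the comments is made: a training accuracy far above the test accuracy signals overfitting.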

0 Answers