6

I found these sentences:

PCA before random forest can be useful not for dimensionality reduction but to give your data a shape where random forest can perform better.

I am quite sure that, in general, if you transform your data with PCA while keeping the same dimensionality as the original data, you will get better classification with random forest.

from this page: PCA on high-dimensional text data before random forest classification?

In my case I found this to be really true, for a regression problem with a dataset of ~1M records and 25 predictors. The error decreases by about 10% if I use the 25 principal components as predictors instead of the 25 original predictors.
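For concreteness, the comparison described above can be sketched as follows. This is a toy reconstruction with synthetic data standing in for the ~1M-record dataset; the model settings are illustrative, not taken from the original experiment:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real dataset: 25 predictors, one continuous target.
X, y = make_regression(n_samples=3000, n_features=25, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: random forest on the 25 original predictors.
rf_raw = RandomForestRegressor(n_estimators=50, random_state=0)
rf_raw.fit(X_tr, y_tr)
err_raw = mean_absolute_error(y_te, rf_raw.predict(X_te))

# Same model on the 25 principal components. n_components equals the original
# number of features, so no dimensionality is lost: PCA here is a pure
# rotation/reflection of the feature space.
rf_pca = make_pipeline(
    PCA(n_components=25),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
rf_pca.fit(X_tr, y_tr)
err_pca = mean_absolute_error(y_te, rf_pca.predict(X_te))
```

Whether `err_pca` comes out below `err_raw` depends entirely on the data; on this synthetic example there is no reason to expect the ~10% improvement reported above.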

Can anyone help me in understanding and clearly interpreting this result?

amoeba
    In many cases, PCA before a supervised method is not recommended, because PCA does not take the response variable into account. But will keeping the same number of features increase performance? Good question. – Haitao Du Jul 27 '17 at 17:57
    Random forest is invariant to scaling, so all the action here would have to come from rotation and reflection in the linear transformation generated by PCA? Does random forest prefer (1) some features to be extremely predictive while others are entirely useless compared to (2) all features are somewhat predictive? – Matthew Gunn Jul 27 '17 at 18:15

1 Answer

7

Random forest struggles when the decision boundary is "diagonal" in the feature space because RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that PCA re-orients the data so that splits perpendicular to the rotated & rescaled axes align well with the decision boundary, PCA will help. But there's no reason to believe that PCA will help in general, because not all decision boundaries are improved when rotated (e.g. a circle). And even if you do have a diagonal decision boundary, or a boundary that would be easier to find in a rotated space, applying PCA will only find that rotation by coincidence, because PCA has no knowledge at all about the classification component of the task (it is not "$y$-aware").
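The geometric argument above can be illustrated with a toy example (a sketch of my own, not from the original answer): data elongated along a diagonal, with the class boundary perpendicular to that diagonal. PCA's first component then lines up with the boundary-relevant direction, so shallow axis-aligned trees find the boundary much more easily in the rotated space. Note that this construction works only because the high-variance direction happens to coincide with the boundary-relevant one, which is exactly the coincidence described above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 4000
t = rng.randn(n) * 3.0   # large spread along the diagonal direction (1, 1)
s = rng.randn(n) * 0.5   # small spread perpendicular to it
X = np.column_stack([(t - s) / np.sqrt(2), (t + s) / np.sqrt(2)])
y = (t > 0).astype(int)  # boundary x1 + x2 = 0: diagonal in the raw axes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Deliberately shallow trees, so the axis-alignment of splits matters:
# a deep forest would approximate the diagonal with a staircase anyway.
rf = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)

acc_raw = rf.fit(X_tr, y_tr).score(X_te, y_te)

# PCA's first component aligns with the high-variance diagonal, turning the
# diagonal boundary into a single axis-aligned threshold near zero.
pca = PCA(n_components=2).fit(X_tr)
acc_pca = rf.fit(pca.transform(X_tr), y_tr).score(pca.transform(X_te), y_te)
```

In the rotated space a depth-1 split on the first component nearly separates the classes, while in the raw space the same shallow forest has to stitch together axis-aligned rectangles along the diagonal.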

Also, @hxd1011's caveat applies whenever PCA is used ahead of supervised learning: the rotation PCA finds may have little to no relevance to the classification objective.

Sycorax