I'd like to add my two cents to this since I thought the existing answers were incomplete.
Performing PCA can be especially useful before training a random forest (or LightGBM, or any other tree-based method) for one particular reason, illustrated in the picture below.
In short, it can make it much easier for the trees to find a clean decision boundary, because PCA rotates your training set so that it is aligned with the directions of highest variance.
Decision trees are sensitive to rotations of the data, because the decision boundaries they produce are always axis-aligned: every split is a threshold on a single feature. So if your data looks like the left picture, it takes a much deeper tree to separate the two clusters (in this example, an 8-level tree). But if you rotate the data onto its principal components (as in the right picture), you can achieve perfect separation with just one level, i.e. a single split!
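You can reproduce this effect without drawing pictures. Here's a minimal sketch (assuming NumPy and scikit-learn; the cluster shapes and sizes are arbitrary choices of mine): two elongated clusters tilted at 45° force a deep axis-aligned tree, while the same data rotated onto its principal components is separable with a single split.

```python
# Minimal sketch: two parallel, elongated clusters tilted by 45 degrees.
# An axis-aligned tree needs a "staircase" of splits on the raw data,
# but after PCA a single split on the second component is enough.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 500
# Elongated along x, separated along y, then rotated by 45 degrees
x = rng.normal(0, 3, size=2 * n)
y_offsets = np.r_[rng.normal(-1, 0.2, n), rng.normal(1, 0.2, n)]
X = np.c_[x, y_offsets]
angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = X @ R.T
labels = np.r_[np.zeros(n), np.ones(n)]

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, labels)
X_pca = PCA(n_components=2).fit_transform(X)
tree_pca = DecisionTreeClassifier(random_state=0).fit(X_pca, labels)

print("tree depth on raw data:", tree_raw.get_depth())  # typically much deeper
print("tree depth after PCA:  ", tree_pca.get_depth())  # typically just 1
```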
Of course, not every dataset is distributed like this, so PCA won't always help, but it's cheap to try and see whether it does. One reminder: standardize your features (zero mean, unit variance) before performing PCA, since PCA is sensitive to the scale of each feature!
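In scikit-learn terms, the usual ordering looks like the sketch below. Treat the parameter choices (keeping 95% of the variance, 300 trees) as placeholders rather than recommendations:

```python
# Sketch of the usual ordering: standardize, then PCA, then the forest.
# Wrapping everything in a Pipeline keeps the scaler and PCA fitted on
# training data only when you cross-validate.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

model = make_pipeline(
    StandardScaler(),                 # zero mean, unit variance per feature
    PCA(n_components=0.95),           # keep components explaining ~95% of the variance
    RandomForestClassifier(n_estimators=300, random_state=0),
)
# model.fit(X_train, y_train)
# model.score(X_test, y_test)
```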
P.S.: As for dimensionality reduction, I agree with the other answers that it's usually less of a problem for random forests than for other algorithms. Still, it can speed up training a bit: building a single decision tree takes roughly O(m · n · log(n)) time, where n is the number of training instances and m is the number of features. And although random forests already restrict each split to a random subset of the features, the smaller the fraction of features each tree gets to consider, the more trees you tend to need for good performance.
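For a rough sense of the speedup, here's a hedged sketch on synthetic data (the sample counts, feature counts, and component count are arbitrary; exact timings depend entirely on your machine and data):

```python
# Rough timing sketch: train a forest on the full feature set vs. a
# PCA-reduced one and compare wall-clock time.
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=200,
                           n_informative=20, random_state=0)

t0 = time.perf_counter()
RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("full 200 features:", round(time.perf_counter() - t0, 2), "s")

X_red = PCA(n_components=20, random_state=0).fit_transform(X)
t0 = time.perf_counter()
RandomForestClassifier(n_estimators=100, random_state=0).fit(X_red, y)
print("20 PCA components:", round(time.perf_counter() - t0, 2), "s")
```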
