
When applying the Random Forest classification technique, do we have to do preprocessing, or is it true that preprocessing is not needed for Random Forest?

Kawthar

2 Answers


In theory, RF can approximate arbitrary functions of multiple input features using the branches of the trees it builds. However, with limited training data (always the case in practice), it can matter a lot how conveniently the data are set up for RF: it is ideal if the model only needs to make a few cuts along as few dimensions as possible. See e.g. this answer to another question (it was about tree boosting, but the same reasoning applies to RF). Another thing that usually gains you a lot is representing high-cardinality categorical features well (e.g. using embeddings or target encoding) - RF is not very good at dealing with categorical features with many levels if you just one-hot-encode them.
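For illustration, here is a hedged sketch of (smoothed) target encoding for a high-cardinality categorical column. The data and the "city" column are invented; in real use you would fit the encoding on training folds only to avoid target leakage:

```python
# Sketch of smoothed target encoding before a Random Forest.
# The "city" column and all values are made up for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "city": ["a", "b", "a", "c", "b", "c", "a", "b"],  # imagine thousands of levels
    "x":    [1.0, 2.0, 0.5, 3.0, 2.5, 1.5, 0.8, 2.2],
    "y":    [0, 1, 0, 1, 1, 0, 0, 1],
})

# Replace each level by a smoothed per-level target mean,
# shrunk toward the global mean for rare levels.
global_mean = df["y"].mean()
stats = df.groupby("city")["y"].agg(["mean", "count"])
smoothing = 10.0
encoded = (stats["mean"] * stats["count"] + global_mean * smoothing) / (stats["count"] + smoothing)
df["city_te"] = df["city"].map(encoded)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(df[["city_te", "x"]], df["y"])
```

The point is that the tree now sees one ordered numeric column it can split on once, instead of hundreds of sparse one-hot columns it would have to pick through individually.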

I.e. in practice you will want to do pre-processing and feature engineering, and can expect it to improve RF performance. How much it helps depends on how good the unprocessed data were as features to start with vs. how much they can be improved.
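To make the "few cuts along few dimensions" point concrete, here is a small synthetic experiment (made-up data): a diagonal class boundary y = (x1 > x2) needs many axis-aligned cuts, but a single engineered feature x1 - x2 lets one split do the job:

```python
# Synthetic sketch: feature engineering helps RF when the natural
# decision boundary is not axis-aligned.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # diagonal boundary

def cv_acc(features):
    # Shallow trees (max_depth=2) make the effect of the feature set visible.
    rf = RandomForestClassifier(n_estimators=50, max_depth=2,
                                max_features=None, random_state=0)
    return cross_val_score(rf, features, y, cv=5).mean()

raw = cv_acc(X)                                        # axis-aligned splits only
eng = cv_acc(np.column_stack([X, X[:, 0] - X[:, 1]]))  # with engineered feature
```

With the engineered feature the forest can separate the classes with essentially one split per tree, so `eng` should come out higher than `raw` here; deeper trees would narrow the gap, but at the cost of more data needed to fit them.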

Björn

I think it pretty much depends on the quality of the data.

For example, if you are using a public dataset that has already been preprocessed, you may just proceed with any form of classification. On the other hand, if you collect your own data, you may want to clean out noisy data / outliers (this is where preprocessing comes into play).
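As one minimal sketch of such a cleaning step (column name and values are made up), dropping rows outside 1.5×IQR on a numeric column looks like:

```python
# Drop rows whose "x" value falls outside 1.5 * IQR of the column.
# Data and column name are invented for illustration.
import pandas as pd

df = pd.DataFrame({"x": [1.0, 1.2, 0.9, 1.1, 50.0, 1.05]})
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["x"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

Worth noting: because tree splits depend only on the ordering of feature values, RF is fairly robust to extreme feature outliers; mislabeled or otherwise noisy rows are usually the bigger concern.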

I remember an almost identical question was answered here:

Best Practices with Data Wrangling before running Random Forest Predictions