
Scaling a dataset for Random Forest modelling is not necessary. However, if we have already applied scaling and normalization to the dataset, will it impact our Random Forest model?

  • It shouldn't matter. Decision trees/RF should also be invariant to scaling and normalization, since they find partitions, which depend only on orderings. –  Aug 08 '21 at 17:52
  • Related: https://stats.stackexchange.com/questions/72231/decision-trees-variable-feature-scaling-and-variable-feature-normalization – Adrian Aug 10 '21 at 18:31

1 Answer


Any monotonic injective transformation of the features won't change the model with respect to how it splits the data. The reason is the same as the reason scaling is unnecessary: the random forest looks for partitions, and partitions depend only on how the data are ordered. If there is an optimal split on some scale, then by the definition of a monotonic injective transformation, the same split exists after transformation, and it's just as good (at splitting the training data, at least).
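A minimal sketch of this invariance, assuming scikit-learn's `RandomForestClassifier` and a made-up dataset (the features, target, and hyperparameters here are purely illustrative): fitting the same forest on the raw features and on a log-standardized version of them gives identical training-set predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 3))   # strictly positive features (made up)
y = (X[:, 0] + X[:, 1] > 10).astype(int)    # made-up binary target

# One monotonic injective transformation per feature: log, then standardize
Z = np.log(X)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Same hyperparameters and seed, so bootstrap samples and feature subsets match
rf_raw = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42).fit(Z, y)

# The split thresholds differ, but the induced partitions are the same,
# so the training-set predictions agree exactly
print(np.array_equal(rf_raw.predict(X), rf_scaled.predict(Z)))  # expected: True
```

The same argument applies to standardization alone, min-max scaling, or any other strictly increasing per-feature map.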

Sycorax
  • (+1) It's just that the splitting point may change after the transformation: e.g., if we split the feature $x$ between $[1, 2]$, the split point will be $1.5$, but if we split $x^2$ between $[1^2, 2^2]$, the split point will be $x^2 = 5/2 \rightarrow x = \sqrt{5/2}$. This won't affect model building or training performance, but *might* have a small effect on the test set (see the sketch after these comments). – gunes Aug 08 '21 at 19:18
  • It's fair to point out that there is a difference between training a model and measuring its performance on a holdout set, so I've edited the first sentence to clarify that the model training is unchanged. – Sycorax Aug 08 '21 at 19:21
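To make that comment concrete, here is a minimal sketch (a single-feature scikit-learn decision tree on the two hypothetical points $x = 1$ and $x = 2$ from the comment, with made-up labels): the stored threshold moves under the squaring transformation, but the partition of the training points does not.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The two points from the comment: x = 1 and x = 2, with distinct (made-up) labels
x = np.array([[1.0], [2.0]])
y = np.array([0, 1])

tree_raw = DecisionTreeClassifier(random_state=0).fit(x, y)
tree_sq = DecisionTreeClassifier(random_state=0).fit(x ** 2, y)

# Thresholds differ (1.5 on the raw scale, 5/2 = 2.5 on the squared scale),
# but both trees separate the same two training points
print(tree_raw.tree_.threshold[0])  # expected: 1.5
print(tree_sq.tree_.threshold[0])   # expected: 2.5
```

A test point such as $x = 1.55$ would fall on different sides of the two boundaries ($1.55 > 1.5$ but $1.55^2 = 2.4025 < 2.5$), which is the small test-set effect the comment refers to.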