
Scaling a dataset for Random Forest modelling is not necessary. However, if we have already applied scaling and normalization to the dataset, will it impact our Random Forest model?

  • It shouldn't matter. Decision trees/RF should also be invariant to scaling and normalization, since they find partitions, which depend only on orderings. –  Aug 08 '21 at 17:52
  • Related: https://stats.stackexchange.com/questions/72231/decision-trees-variable-feature-scaling-and-variable-feature-normalization – Adrian Aug 10 '21 at 18:31

1 Answer


Any monotonic injective transformation of the features won't change the model with respect to how it splits the data. The reason is the same as the reason scaling is unnecessary: the random forest looks for partitions, and partitions depend only on how the data are ordered. If there is an optimal split on some scale, then by the definition of a monotonic injective transformation, the same split exists after transformation, and it's just as good (at splitting the training data, at least).
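A minimal sketch of this invariance, assuming scikit-learn's `RandomForestClassifier` and a made-up dataset (the features, target, and hyperparameters here are purely illustrative): fitting the same forest on the raw features and on a log-standardized version of them gives identical training-set predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 3))   # strictly positive features (made up)
y = (X[:, 0] + X[:, 1] > 10).astype(int)    # made-up binary target

# One monotonic injective transformation per feature: log, then standardize
Z = np.log(X)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Same hyperparameters and seed, so bootstrap samples and feature subsets match
rf_raw = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42).fit(Z, y)

# The split thresholds differ, but the induced partitions are the same,
# so the training-set predictions agree exactly
print(np.array_equal(rf_raw.predict(X), rf_scaled.predict(Z)))  # expected: True
```

The same argument applies to standardization alone, min-max scaling, or any other strictly increasing per-feature map.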

Sycorax
  • (+1) It's just that the splitting point may change after the transformation: e.g., if we split the feature $x$ between $[1, 2]$, the split point will be $1.5$, but if we split $x^2$ between $[1^2, 2^2]$, the split point will be $x^2 = 5/2 \rightarrow x = \sqrt{5/2}$. This won't affect model building or training performance, but *might* have a small effect on the test set (see the sketch after these comments). – gunes Aug 08 '21 at 19:18
  • It's fair to point out that there is a difference between training a model and measuring its performance on a holdout set, so I've edited the first sentence to clarify that the model training is unchanged. – Sycorax Aug 08 '21 at 19:21
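To make that comment concrete, here is a minimal sketch (a single-feature scikit-learn decision tree on the two hypothetical points $x = 1$ and $x = 2$ from the comment, with made-up labels): the stored threshold moves under the squaring transformation, but the partition of the training points does not.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The two points from the comment: x = 1 and x = 2, with distinct (made-up) labels
x = np.array([[1.0], [2.0]])
y = np.array([0, 1])

tree_raw = DecisionTreeClassifier(random_state=0).fit(x, y)
tree_sq = DecisionTreeClassifier(random_state=0).fit(x ** 2, y)

# Thresholds differ (1.5 on the raw scale, 5/2 = 2.5 on the squared scale),
# but both trees separate the same two training points
print(tree_raw.tree_.threshold[0])  # expected: 1.5
print(tree_sq.tree_.threshold[0])   # expected: 2.5
```

A test point such as $x = 1.55$ would fall on different sides of the two boundaries ($1.55 > 1.5$ but $1.55^2 = 2.4025 < 2.5$), which is the small test-set effect the comment refers to.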