
I am currently training a random forest. After transforming a categorical feature into dichotomous columns, should I drop the first level?

For example, I have three unique values in a feature named sex:

  1. m for male
  2. f for female
  3. na for not available

Thus, I encoded sex into three columns:

sex  sex_m  sex_f  sex_na
  m      1      0       0
  f      0      1       0
 na      0      0       1

I dropped sex (obviously), but should I also drop one of the three encoded columns?
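For illustration, a minimal pandas sketch could produce either version (toy data, not my actual dataset):

    import pandas as pd

    # Toy data mirroring the example above (hypothetical)
    df = pd.DataFrame({"sex": ["m", "f", "na", "m"]})

    # Full one-hot encoding: one column per level
    full = pd.get_dummies(df, columns=["sex"])

    # Encoding with the first level dropped, as one would for a regression
    dropped = pd.get_dummies(df, columns=["sex"], drop_first=True)

    print(full.columns.tolist())     # ['sex_f', 'sex_m', 'sex_na']
    print(dropped.columns.tolist())  # ['sex_m', 'sex_na']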

Dropping the base level is necessary when running a regression to avoid multicollinearity, but this is not a problem when running a random forest. So what is the most common approach?

For reference, each tree is being trained with a randomly selected set of 8 out of 63 features.
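As a rough sketch, the setup might look like this in scikit-learn (synthetic data stands in for my real features; note that scikit-learn draws the 8 candidate features per split rather than per tree):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for the real data: 63 encoded features, binary target
    X, y = make_classification(n_samples=1000, n_features=63, random_state=0)

    # max_features=8: each split considers a random sample of 8 of the 63 features
    rf = RandomForestClassifier(n_estimators=500, max_features=8, random_state=0)
    rf.fit(X, y)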

Arturo Sbr
  • Decision trees can usually cope directly with categorical variables – Firebug May 29 '20 at 15:24
    This answer is provided in the context of gradient boosting, but the logic also applies to random forest. https://stats.stackexchange.com/questions/438875/one-hot-encoding-of-a-binary-feature-when-using-xgboost/439191#439191 – Sycorax May 29 '20 at 15:37
    Related: https://stats.stackexchange.com/questions/410939/label-encoding-vs-dummy-variable-one-hot-encoding-correctness/414729#414729, https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281 – kjetil b halvorsen May 30 '20 at 18:22

1 Answer


Technically, both will work.

However, creating dummies is rarely a good idea in a random forest: each dummy counts as a separate candidate feature when the random subset is drawn, which reduces the chance that other variables are picked for splitting.

Integer coding often does the job pretty well. The more levels there are, the more it helps to use a meaningful order. Some implementations (e.g. ranger in R) do smart ordering internally. Avoiding dummy coding also makes the models much easier to interpret with the usual tools (variable importance, partial dependence plots).
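As a minimal sketch (Python/pandas, toy data), integer coding could look like this; the level order below is arbitrary and is exactly what you would want to choose meaningfully, e.g. by the mean of the target per level:

    import pandas as pd

    # Toy data with the same three levels as in the question (hypothetical)
    df = pd.DataFrame({"sex": ["m", "f", "na", "m"]})

    # One integer column instead of three dummy columns.
    # The order is arbitrary here; with many levels, ordering the levels by
    # the mean of the target (what ranger does internally) works better.
    order = ["f", "m", "na"]
    df["sex_int"] = pd.Categorical(df["sex"], categories=order, ordered=True).codes

    print(df)
    #   sex  sex_int
    # 0   m        1
    # 1   f        0
    # 2  na        2
    # 3   m        1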

Michael M