
Since sklearn needs categorical features to be encoded before running tree-based algorithms, I was wondering what I should be careful about when analysing the outputs (predictions, feature importance in the case of a random forest, ...)?

I tried to gather some information from the internet, but it is not really clear in my head.

I understood that dummy encoding:

  • increases the number of features (to the total number of modalities across all nominal features), so it becomes computationally expensive to try all the (feature, split value) combinations (complexity problem);
  • since the CART algorithm selects the "best" feature at each split, a single modality turned into its own feature is less likely to be chosen at any step than the original nominal feature containing all the modalities (this is particularly true near the top of the tree, so if we restrict the tree depth the encoded nominal features may never appear). So its feature importance is under-estimated? (interpretation problem; see the sketch after this list)
  • tends to produce unbalanced trees (complexity problem)
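
To make the importance point concrete, here is a minimal sketch (the data and the column names `color` and `x` are made up for illustration): one-hot encode with `pandas.get_dummies`, fit a `RandomForestClassifier`, and then sum the importances of the dummy columns back onto their parent nominal feature, since sklearn reports one importance per encoded column.

```python
# Minimal sketch: aggregate dummy-column importances back to the original
# nominal feature. Data and column names are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue", "yellow"], size=n),  # nominal
    "x": rng.normal(size=n),                                          # numeric
})
y = (df["x"] + (df["color"] == "red")).gt(0.5).astype(int)

# Dummy (one-hot) encoding: one column per modality of "color".
X = pd.get_dummies(df, columns=["color"])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# sklearn gives one importance per encoded column ...
per_column = pd.Series(rf.feature_importances_, index=X.columns)

# ... so sum the dummy columns back onto their parent feature before
# comparing "color" with "x".
grouped = per_column.groupby(
    lambda c: c.split("_")[0] if c.startswith("color_") else c
).sum()
print(grouped)
```

Summing per-dummy importances like this is only a common heuristic for putting the nominal feature back on one line; permutation importance computed on the original (pre-encoding) feature is another option.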
  • CART is built to have a univariate output, so I'm guessing that you're asking what the risk is in one-hot encoding the inputs. The way that your forest subsamples the columns is going to be corrupted, because it thinks it is rejecting one column when in fact a "column" could be however many levels that variable has. Columns with many levels are going to be grossly expanded while those with few will not. This will very strongly deform your input data. – EngrStudent Apr 17 '21 at 12:13
  • See my answer at https://stats.stackexchange.com/questions/390671/random-forest-regression-with-sparse-data-in-python/430127#430127 for some ideas. – kjetil b halvorsen Apr 17 '21 at 12:49
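
A small back-of-the-envelope sketch of the point in the comment above (the cardinalities are made up): after one-hot encoding, a high-cardinality feature occupies many of the columns that the forest's per-split column subsampling (`max_features`) draws from, so it gets proportionally more chances to be considered than a low-cardinality or numeric feature.

```python
# Sketch with made-up cardinalities: share of the encoded design matrix,
# and hence of max_features draws, occupied by each original feature.
cardinalities = {"zip_code": 500, "weekday": 7, "age": 1}  # 1 = numeric, not expanded
total = sum(cardinalities.values())
for name, k in cardinalities.items():
    print(f"{name}: {k} columns -> {k / total:.1%} of the encoded columns")
```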

0 Answers