
Regarding categorical features: ordinary decision trees treat them in two main ways. CART considers only binary splits: it computes the mean response value (y_mean_i for each category i), sorts the categories by this value, and then considers only splits along that ordering.
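To make that ordering trick concrete, here is a minimal sketch for a regression target with sum of squared errors as the split criterion. The function name and the use of pandas/numpy are my own choices for illustration, not any library's actual internals.

```python
import numpy as np
import pandas as pd

def best_binary_split_by_mean(x_cat, y):
    # Compute the mean response per category, sort categories by it, and
    # scan only the k-1 split points along that ordering instead of all
    # 2^(k-1) - 1 possible category subsets.
    means = pd.Series(y).groupby(pd.Series(x_cat)).mean().sort_values()
    ordered = list(means.index)

    best_score, best_left = np.inf, None
    for i in range(1, len(ordered)):
        left = set(ordered[:i])
        mask = np.isin(x_cat, list(left))
        y_l, y_r = y[mask], y[~mask]
        # Sum of squared errors of the two children as the split quality.
        score = ((y_l - y_l.mean()) ** 2).sum() + ((y_r - y_r.mean()) ** 2).sum()
        if score < best_score:
            best_score, best_left = score, left
    return best_left, best_score

# Toy usage: prints the best left-child category subset and its SSE.
x = np.array(["a", "a", "b", "b", "c", "c", "d", "d"])
y = np.array([1.0, 1.2, 5.0, 5.1, 2.0, 2.1, 9.0, 9.2])
print(best_binary_split_by_mean(x, y))
```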

C4.5 considers multi-way splits (each category gets its own child node), which is a little more prone to overfitting.

There are other, more subtle differences, like the split-quality criterion (Gini impurity, information gain, etc.), but let's focus on the split itself.

So the question is: why do all of the implementations I know of, except R (sklearn, XGBoost, LightGBM, CatBoost), avoid using the first method?

- sklearn & XGBoost: take only numerical values, so categorical features have to be encoded in pre-processing (one-hot or other, more sophisticated methods).
- LightGBM: they recommend encoding the categorical features in pre-processing, especially when the number of categories is large.
- CatBoost (the 'new kid' in town): also uses clever encoding methods (inspired by online learning).
- R: its gbm does use the CART method as the basic tree - https://github.com/gbm-developers/gbm/issues/44
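For completeness, here is a minimal sketch of the two pre-processing routes mentioned above (the column and variable names are made up). The target encoding shown is the naive version; CatBoost's ordered target statistics are a more careful, leakage-resistant variant of the same idea.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "blue", "green", "red", "green"],
    "y":     [1.0,   0.0,    0.0,    1.0,     1.0,   0.0],
})

# One-hot encoding: one 0/1 column per category, the usual route for
# learners that only accept numerical features (sklearn, XGBoost).
one_hot = pd.get_dummies(df["color"], prefix="color")

# Naive target (mean) encoding: replace each category with the mean of y
# observed in that category. CatBoost computes an ordered, online version
# of this statistic to avoid target leakage.
target_enc = df["color"].map(df.groupby("color")["y"].mean()).rename("color_te")

print(pd.concat([df, one_hot, target_enc], axis=1))
```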

The only reason I can think of at the moment is memory efficiency: by avoiding categorical splits, a node in the tree doesn't need to store which categories are routed to the left/right child, only a single numeric threshold (a toy illustration of that difference is sketched below). Any intuition? Thanks
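To illustrate the memory point, here are two hypothetical node layouts (not taken from any of the libraries above): a numerical split only needs one threshold, while a categorical split must remember the whole subset of categories sent to one child, which grows with the feature's cardinality.

```python
from dataclasses import dataclass

@dataclass
class NumericSplitNode:
    feature: int
    threshold: float              # one float, regardless of cardinality

@dataclass
class CategoricalSplitNode:
    feature: int
    left_categories: frozenset    # size grows with the number of categories
```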

