
For the tree and randomForest packages in R, the number of levels of a factor (a categorical variable) is capped at 32. One explanation is that the number of candidate splits at each node grows exponentially: a factor with k levels admits 2^(k−1) − 1 distinct binary splits, which is already over two billion at k = 32. Why does rpart still work with factors that have a larger number of levels?
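To make the combinatorics concrete, here is a minimal sketch (the helper `n_binary_splits` is hypothetical, not part of any package) of how many binary splits a k-level factor generates:

```python
def n_binary_splits(k: int) -> int:
    # A factor with k levels can be partitioned into two non-empty groups
    # in 2^(k-1) - 1 ways: each level goes left or right (2^k assignments),
    # halved because swapping sides gives the same split, minus the one
    # trivial split with everything on one side.
    return 2 ** (k - 1) - 1

print(n_binary_splits(3))   # 3
print(n_binary_splits(32))  # 2147483647: over two billion candidate splits
```

At k = 32 an exhaustive search is already impractical, which motivates the shortcut discussed below.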

kjetil b halvorsen
Pradnyesh Joshi
  • I don't know the full reason, but CART uses a trick to reduce the number of splits considered. For regression, the levels of a categorical predictor are replaced by the mean of the outcome; for binary responses, levels are replaced by the proportion of outcomes in class 1 (see The Elements of Statistical Learning or [this answer](http://stats.stackexchange.com/questions/191055/where-in-elements-of-statistical-learning-does-it-talk-of-a-trick-to-deal-with/191057) for why this is exact). For multi-category outcomes, only approximations exist. I don't know why randomForest caps this at 32. – Peter Calhoun Nov 14 '16 at 07:21
  • Hi Peter, thanks for your help. Are you aware of any detailed documentation for the rpart package (or a research paper, perhaps)? – Pradnyesh Joshi Nov 25 '16 at 06:55
  • Recursive partitioning (CART) requires about n = 100,000 observations to be reliable. Random forests are for tall and thin datasets and often do not perform well when n is not huge and the number of features is large. – Frank Harrell Sep 27 '21 at 12:31
  • @Frank Harrell: where does this n = 100,000 figure come from? – JTH Sep 27 '21 at 13:01
  • Simulations I've done, and https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-137 – Frank Harrell Sep 27 '21 at 16:15

1 Answer


Partially answered in comments:

I don't know the full reason, but CART uses a trick to reduce the number of splits considered. For regression, the levels of a categorical predictor are replaced by the mean of the outcome; for binary responses, levels are replaced by the proportion of outcomes in class 1 (see The Elements of Statistical Learning or the linked answer for why this is exact). For multi-category outcomes, only approximations exist. I don't know why randomForest caps this at 32.

– Peter Calhoun

For some alternative ideas, see Random Forest Regression with sparse data in Python.
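The trick quoted above can be sketched in a few lines. This is a toy illustration under stated assumptions (squared-error regression, plain Python; `candidate_splits` is a hypothetical helper, not rpart's actual code): order the levels by their mean outcome, and only the k − 1 "prefix" splits along that ordering need to be evaluated, instead of all 2^(k−1) − 1 subsets.

```python
def candidate_splits(levels, x, y):
    # CART shortcut (regression case): replace each level by the mean
    # outcome over its observations, order the levels by that mean, and
    # return only the k-1 prefix splits along the ordering. One of these
    # is guaranteed to be the optimal binary split for squared error.
    def level_mean(lv):
        vals = [yi for xi, yi in zip(x, y) if xi == lv]
        return sum(vals) / len(vals)
    ordered = sorted(levels, key=level_mean)
    # each prefix of the ordered levels defines one candidate left branch
    return [set(ordered[:i]) for i in range(1, len(ordered))]

# toy data: outcome means are a = 1.5, c = 5.5, b = 10.5
x = ["a", "a", "b", "b", "c", "c"]
y = [1.0, 2.0, 10.0, 11.0, 5.0, 6.0]
print([sorted(s) for s in candidate_splits(["a", "b", "c"], x, y)])
# [['a'], ['a', 'c']]
```

With 3 levels this cuts 3 candidate subsets down to 2; with 32 levels it cuts over two billion down to 31, which is why the exhaustive-search cap need not apply when this shortcut is available.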

kjetil b halvorsen