For the tree and randomForest packages in R, the number of levels of a factor (categorical predictor) is capped at 32. A likely explanation is that the number of candidate splits at each node grows exponentially: a factor with L levels admits 2^(L-1) - 1 distinct binary partitions, which is about 2^31 at 32 levels. Why does rpart still work with factors that have more levels?
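A quick way to see the asymmetry, as a sketch (the data and variable names are made up, and the exact caps and error messages vary by package version; recent randomForest versions allow up to 53 categories):

```r
## Sketch: a 60-level factor, beyond both the 32-level cap of tree
## and the 53-level cap of recent randomForest versions.
set.seed(1)
n <- 600
g <- factor(sample(sprintf("lev%02d", 1:60), n, replace = TRUE))  # 60 levels
y <- rnorm(n) + as.integer(g) / 10                                # outcome tied to level
d <- data.frame(y = y, g = g)

library(rpart)
fit <- rpart(y ~ g, data = d)        # fits without complaint
print(fit)

library(randomForest)
try(randomForest(y ~ g, data = d))   # errors: too many categories

library(tree)
try(tree(y ~ g, data = d))           # errors: at most 32 levels
```

rpart succeeds because, for a continuous outcome, it orders the 60 levels by their mean response and only scans 59 splits, as the answer below explains.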

- I don't know the full reason, but CART uses a trick to reduce the number of splits considered. For regression, the levels of a categorical predictor are replaced by the mean of the outcome; for binary responses, levels are replaced by the proportion of outcomes in class 1 (see the Elements of Statistical Learning book or [link](http://stats.stackexchange.com/questions/191055/where-in-elements-of-statistical-learning-does-it-talk-of-a-trick-to-deal-with/191057) for the reason). For categorical responses with more than two classes, there are only approximations. I don't know why randomForest caps this at 32. – Peter Calhoun Nov 14 '16 at 07:21
- Hi Peter, thanks for your help. Are you aware of any detailed documentation for the rpart package (or a research paper, perhaps)? – Pradnyesh Joshi Nov 25 '16 at 06:55
- Recursive partitioning (CART) requires about n = 100,000 to be reliable. Random forests are for tall and thin datasets and often do not perform well when n is not huge and the number of features is large. – Frank Harrell Sep 27 '21 at 12:31
- @FrankHarrell: where does this n = 100,000 figure come from? – JTH Sep 27 '21 at 13:01
- Simulations I've done, and https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-137 – Frank Harrell Sep 27 '21 at 16:15
1 Answer
Partially answered in comments:

> I don't know the full reason, but CART uses a trick to reduce the number of splits considered. For regression, the levels of a categorical predictor are replaced by the mean of the outcome; for binary responses, levels are replaced by the proportion of outcomes in class 1 (see the Elements of Statistical Learning book or [link](http://stats.stackexchange.com/questions/191055/where-in-elements-of-statistical-learning-does-it-talk-of-a-trick-to-deal-with/191057) for the reason). For categorical responses with more than two classes, there are only approximations. I don't know why randomForest caps this at 32.
>
> – Peter Calhoun
For some alternative ideas, see *Random Forest Regression with sparse data in Python*.
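To make the quoted trick concrete, here is a minimal sketch of the idea (not rpart's actual internals; the data and the helper `sse` are invented for illustration). Ordering the levels by their mean outcome reduces the search from 2^(L-1) - 1 subsets to L - 1 ordered splits, and for squared-error regression and binary outcomes this ordering provably contains the optimal split:

```r
## Sketch of the CART ordering trick for a regression outcome.
## Not rpart's internal implementation; illustrative only.
set.seed(42)
n <- 500
g <- factor(sample(LETTERS[1:10], n, replace = TRUE))  # 10-level factor
y <- rnorm(n, mean = as.integer(g))                    # outcome depends on level

## 1. Replace each level by the mean of the outcome within that level.
level_means <- tapply(y, g, mean)

## 2. Order the levels by those means and treat the factor as ordered.
ordered_levels <- names(sort(level_means))

## 3. Only the L - 1 splits along this ordering need to be scored
##    (instead of 2^(L-1) - 1 = 511 subsets for 10 levels).
##    Score each split by its reduction in the sum of squared errors.
sse <- function(v) sum((v - mean(v))^2)
split_scores <- sapply(seq_len(length(ordered_levels) - 1), function(k) {
  left <- g %in% ordered_levels[1:k]
  sse(y) - (sse(y[left]) + sse(y[!left]))
})
best_k <- which.max(split_scores)
cat("Best split: {", paste(ordered_levels[1:best_k], collapse = ", "),
    "} vs the rest\n")
```

With 40 or 100 levels the same loop stays linear in L, which is why rpart has no need for a hard cap on the number of factor levels.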
