
For the tree and randomForest packages in R, the number of levels of a factor (a categorical variable) is capped at 32. One explanation is that the number of candidate splits at each node grows exponentially: a factor with k levels admits 2^(k−1) − 1 distinct binary splits, which is already over two billion at k = 32. Why does rpart still work with factors that have a larger number of levels?
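To make the combinatorics concrete, here is a minimal sketch (the helper `n_binary_splits` is hypothetical, not part of any package) of how many binary splits a k-level factor generates:

```python
def n_binary_splits(k: int) -> int:
    # A factor with k levels can be partitioned into two non-empty groups
    # in 2^(k-1) - 1 ways: each level goes left or right (2^k assignments),
    # halved because swapping sides gives the same split, minus the one
    # trivial split with everything on one side.
    return 2 ** (k - 1) - 1

print(n_binary_splits(3))   # 3
print(n_binary_splits(32))  # 2147483647: over two billion candidate splits
```

At k = 32 an exhaustive search is already impractical, which motivates the shortcut discussed below.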

kjetil b halvorsen
Pradnyesh Joshi
  • I don't know the full reason, but CART uses a trick to reduce the number of splits considered. For regression, the levels of a categorical predictor are replaced by the mean of the outcome; for binary responses, levels are replaced by the proportion of outcomes in class 1 (see The Elements of Statistical Learning or [this answer](http://stats.stackexchange.com/questions/191055/where-in-elements-of-statistical-learning-does-it-talk-of-a-trick-to-deal-with/191057) for why this is exact). For multi-category outcomes, only approximations exist. I don't know why randomForest caps this at 32. – Peter Calhoun Nov 14 '16 at 07:21
  • Hi Peter, thanks for your help. Are you aware of any detailed documentation for the rpart package (or a research paper, perhaps)? – Pradnyesh Joshi Nov 25 '16 at 06:55
  • Recursive partitioning (CART) requires about n = 100,000 observations to be reliable. Random forests are for tall and thin datasets and often do not perform well when n is not huge and the number of features is large. – Frank Harrell Sep 27 '21 at 12:31
  • @Frank Harrell: where does this n = 100,000 figure come from? – JTH Sep 27 '21 at 13:01
  • Simulations I've done, and https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-137 – Frank Harrell Sep 27 '21 at 16:15

1 Answer


Partially answered in comments:

I don't know the full reason, but CART uses a trick to reduce the number of splits considered. For regression, the levels of a categorical predictor are replaced by the mean of the outcome; for binary responses, levels are replaced by the proportion of outcomes in class 1 (see The Elements of Statistical Learning or the linked answer for why this is exact). For multi-category outcomes, only approximations exist. I don't know why randomForest caps this at 32.

– Peter Calhoun

For some alternative ideas, see Random Forest Regression with sparse data in Python.
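The trick quoted above can be sketched in a few lines. This is a toy illustration under stated assumptions (squared-error regression, plain Python; `candidate_splits` is a hypothetical helper, not rpart's actual code): order the levels by their mean outcome, and only the k − 1 "prefix" splits along that ordering need to be evaluated, instead of all 2^(k−1) − 1 subsets.

```python
def candidate_splits(levels, x, y):
    # CART shortcut (regression case): replace each level by the mean
    # outcome over its observations, order the levels by that mean, and
    # return only the k-1 prefix splits along the ordering. One of these
    # is guaranteed to be the optimal binary split for squared error.
    def level_mean(lv):
        vals = [yi for xi, yi in zip(x, y) if xi == lv]
        return sum(vals) / len(vals)
    ordered = sorted(levels, key=level_mean)
    # each prefix of the ordered levels defines one candidate left branch
    return [set(ordered[:i]) for i in range(1, len(ordered))]

# toy data: outcome means are a = 1.5, c = 5.5, b = 10.5
x = ["a", "a", "b", "b", "c", "c"]
y = [1.0, 2.0, 10.0, 11.0, 5.0, 6.0]
print([sorted(s) for s in candidate_splits(["a", "b", "c"], x, y)])
# [['a'], ['a', 'c']]
```

With 3 levels this cuts 3 candidate subsets down to 2; with 32 levels it cuts over two billion down to 31, which is why the exhaustive-search cap need not apply when this shortcut is available.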

kjetil b halvorsen