
In logistic regression there is the "One in Ten Rule" (https://en.wikipedia.org/wiki/One_in_ten_rule). For example, suppose there is a sample of 2,000 customers, 50 of whom belong to the positive class (the other 1,950 are negative). Then the maximum number of features we can include in the logistic regression model is 50 / 10 = 5, even though 150 features may be available.
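
For concreteness, here is the arithmetic behind that rule as a small snippet (the variable names are just for illustration, not part of the rule itself):

```python
# Illustration of the "one in ten" rule of thumb for logistic regression.
n_samples = 2000
n_positive = 50
n_minority = min(n_positive, n_samples - n_positive)  # 50 events in the minority class

events_per_parameter = 10                          # the "one in ten" heuristic
max_features = n_minority // events_per_parameter  # 50 / 10 = 5 predictors
print(max_features)  # 5
```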

My question is: For tree-based models (e.g. XGBoost, random forest), is there any constraint or rule of thumb on the number of features allowed in the model (at the same time), given the minority class sample size? Can you please also explain the reasoning? Thank you.
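
To make the question concrete, here is a minimal sketch of the kind of experiment I have in mind, assuming scikit-learn's RandomForestClassifier and a synthetic imbalanced dataset; the specific numbers and parameters are illustrative assumptions, not taken from any reference:

```python
# Sketch (not an established rule): empirically check whether adding features
# degrades a tree ensemble when the minority class is small.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Imbalanced toy data: 2000 rows, roughly 50 positives, 150 candidate features.
X, y = make_classification(
    n_samples=2000, n_features=150, n_informative=10,
    weights=[0.975, 0.025], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for n_feats in (5, 20, 50, 150):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    # Use an arbitrary subset of the candidate features of the given size.
    scores = cross_val_score(model, X[:, :n_feats], y, cv=cv, scoring="roc_auc")
    print(f"{n_feats:>3} features: mean AUC = {scores.mean():.3f}")
```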

  • I've edited the title of the post to distinguish this particular question from similar but not identical questions about the general $n \ll p$ case, such as https://stats.stackexchange.com/questions/471726/why-does-machine-learning-work-for-high-dimensional-datan-ll-p/471778#471778 These questions are not the same as this one, which asks whether the *minority class* places a limit on the number of features. – Sycorax Dec 01 '20 at 04:51
  • Note that the Wikipedia article you cite has a number of citations to papers finding that the "one in ten" rule is not a particularly good rule. This observation suggests there is unlikely to be a similarly simple rule to apply in the more general case of exotic machine learning models. Or, if there is such a rule, we wouldn't expect it to be a good one. – Sycorax Dec 01 '20 at 04:57

0 Answers