
I know that sample size affects power in any statistical method. There are rules of thumb for how many samples a regression needs per predictor.

I also often hear that the number of samples in each category of the dependent variable matters in logistic regression. Why is this?

What are the actual consequences to the logistic regression model when the number of samples in one of the categories is small (rare events)?

Are there rules of thumb that incorporate both the number of predictors and the number of samples in each level of the dependent variable?

Michael Webb
  • https://stats.stackexchange.com/questions/306122/rare-events-logistic-regression https://stats.stackexchange.com/questions/178015/feature-selection-and-pca-in-logistic-regression-with-rare-events-data (and a lot of similar unanswered questions) – kjetil b halvorsen Oct 12 '17 at 18:26
  • I think this reference may help: Manel, S., Williams, H.C., Ormerod, S.J., 2001. Evaluating presence-absence models in ecology: the need to account for prevalence. J. Appl. Ecol. 38 (5), 921–931. http://dx.doi.org/10.1046/j.1365-2664.2001.00647.x There are many more about modelling unbalanced datasets. – Rafa_Mas Oct 18 '17 at 17:54

1 Answer


The standard rule of thumb for linear (OLS) regression is that you need at least $10$ observations per variable or you will be 'approaching' saturation. For logistic regression, however, the corresponding rule of thumb is that you want $15$ observations of the less commonly occurring response category for every variable.
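As a minimal sketch of how those counts play out (my own illustration, not part of the standard rules; the function names and `per_var` defaults below just encode the numbers quoted above), assuming nothing beyond standard Python:

```python
# Rough "observations per predictor" rules of thumb, encoded as tiny helpers.
def max_predictors_ols(n_obs, per_var=10):
    """OLS rule: roughly 10 observations for every predictor."""
    return n_obs // per_var

def max_predictors_logistic(n_class_a, n_class_b, per_var=15):
    """Logistic rule: roughly 15 observations of the rarer outcome class per predictor."""
    return min(n_class_a, n_class_b) // per_var

print(max_predictors_ols(300))           # 30 -- fine by the OLS rule
print(max_predictors_logistic(297, 3))   # 0  -- the rarer class is what limits you
```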

The issue here is that binary data just don't contain as much information as continuous data. Moreover, with only a couple of actual events you can easily end up with perfect prediction (perfect separation), no matter how much data you have. To make an example that is rather extreme, but should be immediately clear, consider a case where you have $N = 300$ and so, by the $10$-per-variable rule, try to fit a model with $30$ predictors, but have only $3$ events. You simply can't estimate the association between most of your $X$-variables and $Y$.
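Here is a small simulation sketching that extreme case (my own example, assuming `numpy` and `statsmodels` are available; it is not from any particular reference). With $3$ events among $N = 300$ observations and $30$ continuous predictors, the maximum-likelihood fit typically runs into perfect separation, or it "converges" with absurdly large standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p, n_events = 300, 30, 3

X = rng.normal(size=(n, p))                            # 30 continuous predictors
y = np.zeros(n)
y[rng.choice(n, size=n_events, replace=False)] = 1.0   # only 3 "events"

try:
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    # Even if the optimizer reports success, the standard errors are typically huge,
    # i.e. the coefficients are essentially not estimable from 3 events.
    print("largest standard error:", fit.bse.max())
except Exception as err:
    # Some statsmodels versions raise PerfectSeparationError here instead.
    print("MLE failed:", err)
```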

gung - Reinstate Monica
  • +1 Also, with rare events you will need a surprisingly large number of cases to estimate the true intercept ([Harrell](http://www.springer.com/us/book/9783319194240), on p. 233, says 96 cases total to have 95% confidence of having predicted probability within 0.1 of the true value when true probability is close to 0 in an intercept-only model), and if there is unbalanced sampling you might need a [rare events correction](https://stats.stackexchange.com/q/6067/28500) – EdM Oct 12 '17 at 19:04
  • So rare events can bias the estimated intercept. Do rare events cause other specific problems (inconsistency, instability, convergence issues when computing the MLE)? – Michael Webb Oct 12 '17 at 21:21
  • @Great38 the "perfect predictions" issue in this answer can lead to problems with convergence and wide standard errors. See [this post](https://stats.stackexchange.com/q/45803/28500) and others on the Hauck-Donner effect or perfect separation. – EdM Oct 13 '17 at 00:30
  • @Great38, the question is a little unclear. There isn't really any problem w/ rare events. If I have $10^{20}$ data, but w/ 'only' $10^{6}$ events in a model with hundreds of predictors, my event rate is $0.00000000000001$. But I shouldn't expect to have any problems despite my low proportion of events & my hundreds of predictors. – gung - Reinstate Monica Oct 13 '17 at 00:50