I have a dataset with many Y=0 and few Y=1 observations. I have to run a logistic regression, so I'm using a retrospective sample to get a more balanced sample. Could someone give me some references that explain the problems that arise when logistic regression is used on an unbalanced sample? I know that the main problems are instability of the estimated coefficients and poor predictive power of the model, but I need some references.
The problem is the few $Y=1$ rather than the many $Y=0$. See [here](http://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients). If you *already have* the data, there's no benefit to throwing some of it away. – Scortchi - Reinstate Monica Mar 28 '15 at 11:27
1 Answer
Take a look at "Logistic Regression in Rare Events Data" by Gary King and Langche Zeng, *Political Analysis* 9 (2001): 137-63.
There really isn't a problem with using logistic regression in the case you described. The issue is that your estimates will have small-sample bias. You can use exact logistic regression if your sample isn't too big, or you can use the method described in the paper above, which is based on a penalized-likelihood approach.
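
Since the question mentions a retrospective (down-sampled) sample, here is a minimal sketch of the prior correction for the intercept, one of the corrections discussed in the King and Zeng paper. It is an illustration under stated assumptions, not a definitive implementation: the simulated data, the true coefficients (-5 and 1), and the helper names `fit_logistic` and `prior_correct_intercept` are all made up for the example, and the population event fraction `tau` is taken from the simulated full data, whereas in practice it would have to come from outside knowledge.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Ordinary maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)                      # IRLS weights
        grad = X.T @ (y - p)                   # score vector
        hess = X.T @ (X * w[:, None])          # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def prior_correct_intercept(b0, tau, ybar):
    """Shift the intercept fitted on a sample selected on Y.

    tau  : fraction of ones in the population (assumed known)
    ybar : fraction of ones in the selected sample
    """
    return b0 - np.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


# Hypothetical rare-events population: intercept -5, slope 1.
rng = np.random.default_rng(0)
N = 200_000
x = rng.normal(size=N)
p_true = 1.0 / (1.0 + np.exp(-(-5.0 + 1.0 * x)))
y = (rng.random(N) < p_true).astype(float)
tau = y.mean()

# Retrospective sample: keep every Y=1, draw an equal number of Y=0.
ones = np.flatnonzero(y == 1)
zeros = rng.choice(np.flatnonzero(y == 0), size=ones.size, replace=False)
idx = np.concatenate([ones, zeros])

X = np.column_stack([np.ones(idx.size), x[idx]])
b0, b1 = fit_logistic(X, y[idx])
b0_corrected = prior_correct_intercept(b0, tau, y[idx].mean())

print("slope on balanced sample:", b1)
print("raw intercept:", b0)
print("prior-corrected intercept:", b0_corrected)
```

In this setup the slope should come out close to its population value while the raw intercept is shifted upward by the sampling design; the correction moves only the intercept back. Note that this addresses the selection on Y, not the small-sample bias that comes from having few events.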

conjugateprior

StatsStudent
But the magnitude of the problem is small, and ordinary maximum likelihood estimation may suffice. – Frank Harrell Mar 28 '15 at 12:44