I have a dataset with many Y=0 and few Y=1 observations. I have to run a logistic regression, so I'm using a retrospective sample to get a more balanced sample. Could someone give me some references that explain the problems that arise when logistic regression is used on an unbalanced sample? I know that the main problems are instability of the estimated coefficients and poor predictive power of the model, but I need some references.

Luca Dibo
  • The problem is the few $Y=1$ rather than the many $Y=0$. See [here](http://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients). If you *already have* the data, there's no benefit to throwing some of it away. – Scortchi - Reinstate Monica Mar 28 '15 at 11:27

1 Answer

Take a look at "Logistic Regression in Rare Events Data" by Gary King and Langche Zeng, Political Analysis 9 (2001): 137-63.

There really isn't a problem with using logistic regression in the case you described. The issue is that your estimates will have small-sample bias. You can use exact logistic regression if your sample isn't too big, or you can use the method described in the paper above, which is based on a penalized-likelihood approach.
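
Since you are sampling retrospectively, one practical point may be worth a sketch. As I understand it, the King and Zeng paper also describes a "prior correction": when observations are selected on Y alone, the slope estimates remain consistent and only the intercept needs adjusting, by subtracting ln[((1 - tau)/tau) * (ybar/(1 - ybar))], where tau is the population fraction of Y=1 and ybar is the fraction of Y=1 in your sample. The function and variable names below are hypothetical, and the sketch assumes tau is known from outside the sample.

```python
import math


def prior_corrected_intercept(b0_sample, tau, ybar):
    """Adjust a logit intercept fitted on an outcome-dependent sample.

    b0_sample: intercept estimated on the (re)balanced sample
    tau:       assumed-known population fraction of Y=1
    ybar:      fraction of Y=1 in the sample actually used for fitting
    Slopes are left untouched; only the intercept is shifted.
    """
    if not (0 < tau < 1 and 0 < ybar < 1):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    correction = math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
    return b0_sample - correction


def sigmoid(z):
    """Inverse logit, used here to turn an intercept into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustration with made-up numbers: a 50/50 sample drawn from a
# population in which 2% of cases have Y=1. In an intercept-only model
# the sample intercept is logit(0.5) = 0.
b0 = prior_corrected_intercept(0.0, tau=0.02, ybar=0.5)
baseline_probability = sigmoid(b0)
```

With these made-up numbers the corrected intercept maps back to the population rate of 0.02 rather than the sample rate of 0.5, which is the point of the adjustment. It does nothing about the small-sample bias mentioned above, which comes from the small number of Y=1 cases.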

conjugateprior
StatsStudent