Are there any caveats when logistic regression is used on a sample with average probability of success close to one (1.4M dataset, mean prob. of success = 0.975)?
4
-
How many predictors are you using? – David Robinson Mar 01 '14 at 11:45
-
24, some of them are categorical with several levels – Andrey Chetverikov Mar 01 '14 at 11:50
-
You need to check whether you're getting extreme bias in odds ratio estimates when some predictor patterns define very small groups. – Scortchi - Reinstate Monica Mar 01 '14 at 11:51
-
For starters, what are the counts and number of successes for each category in each of those categorical variables? – David Robinson Mar 01 '14 at 11:55
-
It's a country-wide dataset, so one of the control variables included in the analysis is "region" with 80 levels, where the number of cases per level varies hugely, from 800 to 100,000. But this variable is included as a control only. The categorical variables that are the aim of the analysis have 2 to 10 levels, with no fewer than 4,000 cases per level. – Andrey Chetverikov Mar 01 '14 at 12:01
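The per-level check asked for in the comments can be sketched as follows. This is a minimal illustration, not code from the thread: the function name `rare_outcome_counts`, the toy `rows` data and the `min_rare=10` threshold are all assumptions made for the example, and the threshold in particular is arbitrary.

```python
from collections import defaultdict

def rare_outcome_counts(rows, min_rare=10):
    """Tally cases and rare outcomes (failures) per level and flag sparse levels.

    rows: iterable of (level, success) pairs, where success is a bool.
    min_rare: hypothetical cut-off below which a level is flagged as sparse.
    """
    totals = defaultdict(int)
    rare = defaultdict(int)
    for level, success in rows:
        totals[level] += 1
        if not success:
            rare[level] += 1
    return {
        level: {
            "n": totals[level],
            "rare": rare[level],
            "sparse": rare[level] < min_rare,
        }
        for level in totals
    }

# Hypothetical toy data: level "B" has plenty of cases but no failures at all.
rows = [("A", True)] * 970 + [("A", False)] * 30 + [("B", True)] * 800
summary = rare_outcome_counts(rows)
for level, stats in sorted(summary.items()):
    print(level, stats)
```

A level can have thousands of cases and still contain almost none of the rarer outcome, so it helps to tabulate the rare outcome per level and not only the case counts.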
-
So you've got 35,000 failures (the rarer outcome here) & say 250 degrees of freedom for the predictors, & no predictor patterns defining very small groups - I wouldn't expect any problems. See [here](http://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression/68917) for a problem that can happen & how to deal with it. – Scortchi - Reinstate Monica Mar 01 '14 at 12:17
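The arithmetic behind that comment can be sketched as follows. The sample size and mean success probability come from the question; the 250 degrees of freedom is the commenter's rough guess, and the variable names are only illustrative. With a mean success probability of 0.975, the rarer outcome is failure, so that is the count that limits how many parameters the data can support.

```python
n_cases = 1_400_000      # sample size from the question
mean_success = 0.975     # mean probability of success from the question
predictor_df = 250       # rough degrees of freedom guessed in the comment

# The rarer outcome (failure here) is what limits the effective sample size.
n_rare = round(n_cases * (1 - mean_success))
rare_per_df = n_rare / predictor_df

print(n_rare, rare_per_df)
```

This gives about 35,000 rare outcomes, or roughly 140 per degree of freedom, which is why no trouble is expected from the overall imbalance alone.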
-
Posts [here](http://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients) & [here](http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression) are relevant to the general question of balance. – Scortchi - Reinstate Monica Mar 01 '14 at 12:27
1 Answer
4
As you seem to come from a social-science background, King & Zeng (2001), "Logistic Regression in Rare Events Data", Political Analysis, 9, pp. 137–163 might be a good starting point - the term to search for is "rare event data". The authors claim that "popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events", and the paper has been quite influential.

Scortchi - Reinstate Monica

Julian Schuessler
-
That nice paper shows how to estimate the bias in $\hat{\beta}$ in the rare event situation, but does not demonstrate that the bias-corrected estimator is closer to the true $\beta$, i.e., that the variance of the bias correction is small enough that it doesn't matter. – Frank Harrell Mar 01 '14 at 13:55
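One way to make that point concrete is the standard mean squared error decomposition, written here for a single coefficient. The symbol $\tilde{\beta}$ for the bias-corrected estimator is notation introduced for this sketch, not taken from the paper:

$$\operatorname{MSE}(\tilde{\beta}) = \mathbb{E}\big[(\tilde{\beta} - \beta)^2\big] = \big(\mathbb{E}[\tilde{\beta}] - \beta\big)^2 + \operatorname{Var}(\tilde{\beta})$$

The correction shrinks the squared-bias term, but because it is itself estimated from the data it can inflate the variance term. The corrected estimator is closer to $\beta$ on average only if $\operatorname{MSE}(\tilde{\beta}) < \operatorname{MSE}(\hat{\beta})$.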
-
@julian: I added the full reference - it's often a good idea to do so because links "rot" – Scortchi - Reinstate Monica Mar 01 '14 at 16:35