In a logistic regression model focused on prediction, what are the problems associated with a big difference in the sample sizes of a binary variable?

Question

In short, I'm curious about the problems associated with a difference in the sample size of respondents by a binary variable when fitting a logit model focused on prediction rather than causality.

By difference in sample size, I mean that 80% of respondents have been surveyed as 0 in a binary variable, while 20% of respondents are 1. In absolute numbers, let's say that 1600 respondents are 0, while 400 are 1.

I understand the difference in sample size may be representative of the population, but does it cause any problems in the logit model? I have read that it could reduce sensitivity.

What theorems, functions, assumptions, etc. should I look into or use?

For reference, I'm working in R in case that helps in providing an example as an answer.

Thank you for your help.

[Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) — Stephan Kolassa, Apr 24 '20 at 15:08

score 0 · Answer 1 · answered Apr 24 '20 at 15:07

I like this paper, because it references some other studies which examine the Events Per Variable (EPV) in logistic regression. The paper examines the common 1-in-10 rule for logistic regression and argues there is no rationale for it (and that it might be too liberal a requirement).
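As a rough illustration of what EPV means for the numbers in the question, here is a minimal sketch. It assumes the 1600/400 split refers to the outcome variable, and it takes the rarer class as the "events". The parameter counts in `p` are hypothetical and only there to show how EPV changes with model size.

```r
# Counts taken from the question: 1600 respondents coded 0, 400 coded 1.
# Assumption: this binary variable is the outcome of the logit model.
n0 <- 1600
n1 <- 400

# EPV is computed from the rarer outcome class.
events <- min(n0, n1)

# Hypothetical numbers of estimated parameters (illustration only).
p <- c(5, 10, 20, 40, 80)

epv <- events / p
data.frame(parameters = p, epv = epv, meets_1_in_10 = epv >= 10)
```

With 400 events, the 1-in-10 rule would allow up to about 40 parameters. Since the paper argues that the rule has no strong rationale, treat that figure as a rough reference point rather than a guarantee.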

From this paper, it seems that some studies say a low EPV can result in highly variable predictions (see the first paragraph of that paper). The extent to which this is a problem depends heavily on your situation. If you are only regressing on a handful of variables which are more or less orthogonal, then you may be OK. If your variables are strongly correlated, then you are likely to see model instability. Everything in between is an "it depends" situation. In any case, when the EPV is low, I think you should bootstrap your model to examine how well the asymptotic estimates of model variance stack up against bootstrapped variance estimates.
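The bootstrap check could look something like the sketch below. The data are simulated (2000 rows with roughly 20% events), and the variable names `x1` and `x2`, the coefficients, and the number of resamples `B` are all assumptions made for illustration; swap in your own data frame and formula.

```r
set.seed(1)

# Simulated data (hypothetical): 2000 rows, roughly 20% events.
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1.6 + 0.5 * x1 - 0.3 * x2))
dat <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

# Asymptotic (model-based) standard errors.
se_asym <- sqrt(diag(vcov(fit)))

# Nonparametric bootstrap: resample rows, refit, collect coefficients.
B <- 500
boot_coef <- t(replicate(B, {
  idx <- sample.int(n, replace = TRUE)
  coef(glm(y ~ x1 + x2, family = binomial, data = dat[idx, ]))
}))

# Bootstrap standard errors: spread of each coefficient across resamples.
se_boot <- apply(boot_coef, 2, sd)

round(cbind(asymptotic = se_asym, bootstrap = se_boot), 3)
```

If the two columns agree closely, the asymptotic variance estimates are probably adequate for your sample. Large discrepancies suggest the model is less stable than the `glm` summary implies.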
