In short, I'm curious about the problems caused by unequal group sizes in a binary variable when fitting a logit model meant for prediction rather than causal inference.

By unequal group sizes, I mean that 80% of respondents are coded 0 on the binary variable, while 20% are coded 1. In absolute numbers, let's say that 1600 respondents are 0 and 400 are 1.

I understand the difference in sample size may be representative of the population, but does it cause any problems in the logit model? I have read that it could reduce sensitivity.

What theorems, functions, assumptions, etc. should I look into or use?

For reference, I'm working in R in case that helps in providing an example as an answer.

Thank you for your help.

Will M

1 Answer

I like this paper because it references other studies that examine Events Per Variable (EPV) in logistic regression. The paper examines the common 1-in-10 rule for logistic regression (at least 10 events per variable) and argues that there is no rationale for it (and that it might be too liberal a requirement).
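To make EPV concrete, here is a rough sketch using the counts from the question. It assumes the 80/20 binary variable is the outcome of the logit model, so the rarer class (400 respondents) counts as the "events". The numbers of coefficients in `n_params` are hypothetical and only illustrate how EPV shrinks as the model grows.

```r
# Counts taken from the question: 1600 respondents coded 0, 400 coded 1.
n_zero <- 1600
n_one  <- 400

# EPV conventionally uses the size of the rarer outcome class as the
# number of "events" (assumption: the binary variable is the outcome).
n_events <- min(n_zero, n_one)

# Hypothetical numbers of estimated coefficients, excluding the intercept.
n_params <- c(5, 10, 20, 40, 80)

# Events per variable for each hypothetical model size.
epv <- n_events / n_params

data.frame(n_params = n_params, epv = epv, meets_10_epv = epv >= 10)
```

Under these assumptions, a model with a handful of coefficients sits well above 10 events per variable, so the rule of thumb only starts to bind once the model has dozens of parameters.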

From this paper, it seems that some studies find that a low EPV can result in highly variable predictions (see the first paragraph of that paper). How much of a problem this is depends heavily on your situation. If you are only regressing on a handful of variables that are more or less orthogonal, then you may be OK. If your variables are correlated, then you will have model instability. Everything in between is an "it depends" situation. In any case, when the EPV is low, I think you should bootstrap your model to examine how well the asymptotic estimates of model variance stack up against bootstrapped variance estimates.
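A minimal sketch of that bootstrap check in base R follows. The data are simulated stand-ins, not the asker's survey: the predictors `x1` and `x2`, their correlation, the coefficients, and the number of resamples `B` are all assumptions chosen for illustration. The idea is simply to refit the model on resampled rows and compare the spread of the coefficients with the standard errors reported by `glm`.

```r
set.seed(1)

# Simulated stand-in for the survey: 2000 respondents, roughly 20% coded 1.
n  <- 2000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = sqrt(1 - 0.7^2))  # correlated with x1
y  <- rbinom(n, 1, plogis(-1.6 + 0.5 * x1 - 0.3 * x2))
dat <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

# Asymptotic (model-based) standard errors from the fitted model.
se_asym <- sqrt(diag(vcov(fit)))

# Nonparametric bootstrap: resample rows with replacement, refit,
# and keep the coefficients from each refit.
B <- 500
boot_coefs <- t(replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(glm(y ~ x1 + x2, family = binomial, data = dat[idx, ]))
}))

# Bootstrap standard errors: spread of each coefficient across refits.
se_boot <- apply(boot_coefs, 2, sd)

round(cbind(asymptotic = se_asym,
            bootstrap  = se_boot,
            ratio      = se_boot / se_asym), 3)
```

Ratios close to 1 suggest the asymptotic standard errors are a reasonable summary. Ratios well away from 1 would be a hint that the model is less stable than the `glm` output implies. The same loop could store predicted probabilities for a fixed set of respondents instead of coefficients if prediction variability is the main concern.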

Demetri Pananos