
I have a data set with a binary variable that is 0 for 94% of observations and 1 for 6%. If I fit a model (say, logistic regression) to predict this variable in a way that maximizes goodness of fit over the sample, then I'm liable to get a model that unconditionally predicts 0, since that prediction is accurate for 94% of the observations.

But that's not very useful for me. Theoretically, both classes are equally interesting to me. So I fit the model giving a weight of 1/0.94 to each observation with value 0 and 1/0.06 to each observation with value 1.

Is this a sensible thing to do and does it have a standard name?

  • Is the distribution in the sample similar to the distribution in the population? Do you have any reason to believe that this is actually a problem? – Tim Sep 16 '21 at 05:13
  • Yes. It is the set of workers who quit their jobs. I want to understand how those who quit are different than those who don't. – Jyotirmoy Bhattacharya Sep 16 '21 at 07:38
  • *Sounds ethically risky.* Example: you find out that parents of young children leave more often--should you take actions toward this group? This may lead to discrimination and be unethical and illegal. Moreover, it sounds like you would like to do some kind of causal inference, while ML/statistics would tell you only about correlations. Also, you would likely have many unobserved variables and confounders there. Moreover, this doesn't sound like a prediction problem, but rather testing, isn't it? – Tim Sep 16 '21 at 07:57

1 Answer


When you fit a logistic regression, you do not get flat 0 predictions. Rather, you get probabilistic predictions, which are far more useful and informative than hard 0/1 predictions. Evaluate these using proper scoring rules.
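A minimal sketch of this point, assuming scikit-learn and NumPy (the data here are simulated, not yours): even with only about 6% positives, a fitted logistic regression returns a genuine spread of predicted probabilities, which you can then evaluate with proper scoring rules such as the Brier score or the log loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=(n, 1))

# Simulate ~6% positives overall, with the probability rising in x
p = 1 / (1 + np.exp(-(-3.2 + 1.0 * x[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(x, y)
probs = model.predict_proba(x)[:, 1]   # probabilistic predictions, not flat 0s

print(probs.min(), probs.max())        # a spread of probabilities, not a constant
print(brier_score_loss(y, probs))      # proper scoring rule
print(log_loss(y, probs))              # another proper scoring rule
```

Both scores are minimized in expectation by the true class probabilities, which is exactly why they are the right tools here, rather than accuracy.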

Do not blindly compare the predicted class membership probabilities to some threshold, unless you know what you are doing, and know that the threshold you are using is actually useful for your context. (Hint: a threshold of 0.5, or one that maximizes accuracy, will usually not be useful.)
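If you do eventually need hard decisions, the threshold should come from the costs of the two error types, not from accuracy. A sketch with made-up costs: if missing an actual 1 (say, an unnoticed quitter) is ten times as costly as a false alarm, the expected-cost-minimizing threshold on the predicted probability is cost_fp / (cost_fp + cost_fn) ≈ 0.09, far below 0.5.

```python
# Hypothetical costs -- plug in your own.
cost_fn = 10.0  # cost of missing an actual 1 (false negative)
cost_fp = 1.0   # cost of a false alarm (false positive)

# Expected-cost-minimizing threshold for probabilistic predictions:
threshold = cost_fp / (cost_fp + cost_fn)   # = 1/11, roughly 0.09

probs = [0.02, 0.05, 0.12, 0.40]            # example predicted probabilities
flags = [p > threshold for p in probs]      # [False, False, True, True]
```

Note that the modeling step and the decision step stay separate: the fitted probabilities do not change when your costs do.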

Weighting your observations like this is essentially the same as oversampling the minority class (the scheme is sometimes called class weighting or inverse-frequency weighting), and it will only bias your parameter estimates; see below. Don't do it, again, unless you know what you are doing.
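To see the bias concretely, here is a sketch (again with simulated data and scikit-learn) comparing an unweighted fit to one with the 1/0.94 and 1/0.06 weights from the question. The weighting shifts the intercept up sharply, so the mean predicted probability lands near 0.5 instead of near the true 6% base rate, while the slope is left roughly alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-(-3.0 + 1.0 * x[:, 0])))   # ~6% positives
y = rng.binomial(1, p)

unweighted = LogisticRegression().fit(x, y)

# Balance the classes by inverse frequency, as in the question
w = np.where(y == 1, 1 / y.mean(), 1 / (1 - y.mean()))
weighted = LogisticRegression().fit(x, y, sample_weight=w)

print(unweighted.intercept_, weighted.intercept_)   # intercept shifts up a lot
print(unweighted.coef_, weighted.coef_)             # slope roughly unchanged
print(unweighted.predict_proba(x)[:, 1].mean())     # close to the 6% base rate
print(weighted.predict_proba(x)[:, 1].mean())       # inflated toward 0.5
```

In other words, the weighting buys you nothing that recalibrating the intercept (or just picking a sensible threshold) wouldn't, at the price of miscalibrated probabilities.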

More information here:

Stephan Kolassa