As the title suggests, the distribution of my DV is as follows: 0=4,230; 1=41. There are 9 primary IVs I'd like to regress it on as well as 7 CVs. Something tells me that the skew in my DV will compromise any test of significance/effects. Is this true? And, if so, how should I proceed in this situation? Thanks in advance for the help!
-
2It may be worth your while to read [this thread](https://stats.stackexchange.com/q/291371/7290) & my answer there. – gung - Reinstate Monica Aug 11 '17 at 02:23
-
1I could not understand your question because of a few unclear things. 1) What made you call your binary DV specifically a "dummy"? (Was there a nominal variable? How is it used?) 2) What does the fractional `0=4,230` mean? 3) What is a CV (a covariate)? Please edit the question to make everything clear. – ttnphns Aug 11 '17 at 06:04
-
2Ah, I see: 4,230 is 4230 – ttnphns Aug 11 '17 at 06:10
-
2In addition to gung's link and the answers below, please search for "complete or quasi-complete separation in logistic regression". – ttnphns Aug 11 '17 at 07:09
2 Answers
With only 41 1's and that many independent variables, it seems quite possible that you'll have degenerate data (for example, some levels of a categorical independent variable might contain only 0's as responses).
Even if the model can be fit, the variance around the parameter estimates will probably be very large, the estimate of that variance will be suspect, and confidence intervals may not be reliable. You'll have a high chance of a type II error. Essentially you might not have enough data to find any significant effects.
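A quick way to see whether this kind of degeneracy is present is to cross-tabulate each categorical or ordinal predictor against the DV and look for levels with no 1s at all. This is only a sketch: it assumes a pandas DataFrame `df` with the DV in a column `"y"`, and the file name and predictor names are hypothetical placeholders for your own variables.

```python
import pandas as pd

# Assumes a DataFrame `df` with the binary DV in column "y";
# the file name and predictor names below are placeholders.
df = pd.read_csv("mydata.csv")

for col in ["iv1", "iv2", "iv3"]:  # hypothetical predictor names
    tab = pd.crosstab(df[col], df["y"])
    if 1 not in tab.columns:
        print(f"{col}: no level of this predictor has any 1s")
        continue
    empty_levels = tab.index[tab[1] == 0].tolist()
    if empty_levels:
        print(f"{col}: levels with only 0 responses -> {empty_levels}")
```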

-
I actually went ahead and ran the model. There were several significant effects. But I suppose beyond the p-values, the results will largely be uninterpretable? – Zach Goldberg Aug 11 '17 at 04:44
-
`all levels of an independent variable might only have 0's as responses` How can that be? – ttnphns Aug 11 '17 at 07:20
-
I think you may have misunderstood me. My IVs are all normally distributed ordinal and continuous scales. It's the dependent variable that has mostly 0s. – Zach Goldberg Aug 11 '17 at 17:29
First, your main problem is not going to be with the significance tests, but with bias in your estimates. The root of the problem, as I also mention below, is not so much the "skew" in the distribution of your DV as the very small number of 1s.
As far as I know, there are a couple of ways of dealing with rare events. One is at the sampling stage: retain all the $y_i=1$ cases and sample from your $y_i=0$ cases so that the event is no longer rare. How many 0s you need to sample could be chosen by cross-validation, especially given that you have so few 1s. Another approach is to run some sort of penalized logit (e.g., the Firth model), which also deals with the separation issues you'll probably face.
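Here is a minimal sketch of both ideas. All names (`df`, `"y"`, `X_cols`, the file name, the 10:1 sampling ratio) are assumptions, not anything from your data, and the penalized fit uses a plain ridge (L2) logit as a stand-in: Firth's penalty itself is usually fit in R with `logistf` or `brglm2` and is not implemented here.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# All names below (df, "y", X_cols, file name) are placeholders.
df = pd.read_csv("mydata.csv")
X_cols = ["iv1", "iv2", "iv3"]

# 1) Case-control style down-sampling: keep every 1, sample a subset of 0s.
ones = df[df["y"] == 1]
zeros = df[df["y"] == 0].sample(n=10 * len(ones), random_state=1)  # 0:1 ratio is a tuning choice
sub = pd.concat([ones, zeros])

fit_sub = sm.Logit(sub["y"], sm.add_constant(sub[X_cols])).fit()
print(fit_sub.summary())
# Down-sampling changes the base rate, so the intercept (not the slopes)
# needs the usual prior correction before you use predicted probabilities.

# 2) Penalized logit on the full data. Ridge (L2) penalty as a stand-in
#    for Firth's penalized likelihood.
fit_pen = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
fit_pen.fit(df[X_cols], df["y"])
print(fit_pen.coef_)
```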
A good (semi-recent) review of this issue and discussion of solutions can be found here. The canonical discussion in my discipline is here.
Finally, a similar issue has been discussed in a previous post, but I'm not considering yours a duplicate because of the extremely small number of 1s you have. I think this is a very important distinction: on top of any other problems associated with rare events, you also potentially have very limited variation in the Xs associated with the $y_i=1$ cases.
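You can get a rough sense of that last point by summarizing the predictors within the 41 cases where $y_i=1$. Again, this is only a sketch with the same hypothetical `df` and column names as above.

```python
import pandas as pd

# Same hypothetical data and column names as in the earlier sketch.
df = pd.read_csv("mydata.csv")
X_cols = ["iv1", "iv2", "iv3"]

# Summarize the predictors within the cases where y = 1.
ones_only = df.loc[df["y"] == 1, X_cols]
print(ones_only.describe().T[["mean", "std", "min", "max"]])
# Predictors with near-zero spread among the 1s are the ones most likely
# to produce separation or wildly unstable coefficient estimates.
```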
