Unbalanced distribution of sample size between groups in logistic regression: should one worry?

Question

I need to fit logistic regression models to a dataset where infection (present/absent) is my dependent variable and neighborhood (a factor with three levels: Rich, Poor, Very Poor) is my independent variable.

According to a reviewer who (like me) is not well versed in stats, one potential problem with my data is that the variable neighborhood has quite unevenly distributed sample sizes across its levels:

Rich = 853  
Poor = 100  
Very poor = 131

The reviewer suggested randomly subsetting the "Rich" group down to about 100 observations, so as to meet this alleged assumption of approximately equal sample sizes between groups within the same variable.

Because of the hypothesis behind our study, I need to set "Rich" as the reference category against which to compare the remaining two.

Is the reviewer's suggestion well founded? AFAIK, no assumption of logistic regression is violated if the categories of the independent variable are unbalanced, or even sparse, and none is violated even if it's the dependent variable that is unbalanced.

If some categories are very rare in *absolute* terms the odds-ratio estimates can be significantly biased away from one. (Imagine what your model would look like if you had only one poor neighbourhood with infection either present or absent.) But that's got nothing to do with *relative* rarity, & of course isn't fixed by throwing away observations from other categories. Various kinds of down-sampling are sometimes done, as explained [here](http://stats.stackexchange.com/questions/67903), but it's to reduce computation time for large data-sets while minimizing the loss of precision. — Scortchi - Reinstate Monica, Dec 10 '13 at 18:34
I don't think this is a duplicate. The linked thread is about an unbalanced Y variable. This thread is about an unbalanced X variable. — gung - Reinstate Monica, Nov 03 '16 at 13:10

score 9 · Accepted Answer · edited Apr 13 '17 at 12:44

You are right that logistic regression does not make any assumptions about the distribution of your independent variable. The consequence of the imbalance is that you will have less power than you would with the same total $N$ divided equally among the groups. Reducing the $n$ in the Rich group does not get you there, though: it only lowers your power further. Although written in a different context (viz., t-tests), my answer to "How should one interpret the comparison of means from different sample sizes?" gives the general idea.
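
A rough sketch of the power argument, assuming hypothetical infection rates of 20% in Rich and 30% in Poor (these numbers are made up for illustration and are not taken from the question's data). With a single categorical predictor, the Wald standard error of the log odds ratio against the reference level is $\sqrt{1/a + 1/b + 1/c + 1/d}$, where $a, b, c, d$ are the cell counts of the corresponding 2×2 table. The code plugs in expected cell counts for three designs: the data as collected, Rich subsampled to 100, and the same total $N$ split equally across the three groups.

```python
import math


def approx_se_log_or(n_ref, p_ref, n_cmp, p_cmp):
    """Approximate SE of the log odds ratio (comparison vs. reference group).

    Uses expected cell counts n*p and n*(1-p) in the formula
    sqrt(1/a + 1/b + 1/c + 1/d), which simplifies to the expression below.
    """
    var_ref = 1.0 / (n_ref * p_ref * (1.0 - p_ref))
    var_cmp = 1.0 / (n_cmp * p_cmp * (1.0 - p_cmp))
    return math.sqrt(var_ref + var_cmp)


# Hypothetical infection rates (assumptions, not from the question).
P_RICH, P_POOR = 0.20, 0.30

# Group sizes from the question.
N_RICH, N_POOR, N_VERY_POOR = 853, 100, 131
N_TOTAL = N_RICH + N_POOR + N_VERY_POOR

# 1) Data as collected: Rich (reference) vs. Poor.
se_full = approx_se_log_or(N_RICH, P_RICH, N_POOR, P_POOR)

# 2) Reviewer's suggestion: throw away Rich observations down to 100.
se_subsampled = approx_se_log_or(100, P_RICH, N_POOR, P_POOR)

# 3) Same total N, equally divided among the three groups.
n_equal = N_TOTAL // 3
se_balanced = approx_se_log_or(n_equal, P_RICH, n_equal, P_POOR)

for label, se in [("as collected", se_full),
                  ("Rich subsampled to 100", se_subsampled),
                  ("same N, balanced", se_balanced)]:
    print(f"{label:>24}: SE(log OR) ~ {se:.3f}")
```

Under these assumptions the balanced design gives the smallest standard error, the data as collected come next, and the subsampled design is the worst of the three. A smaller standard error means more power, so discarding Rich observations moves in the wrong direction.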
