
I'm peer reviewing an academic journal article and the authors wrote the following as justification for not reporting any inferential statistics (I deidentified the nature of the two groups):

In total, 25 of the 2,349 (1.1%) respondents reported X. We appropriately refrain from presenting analyses that statistically compare group X to group Y (the other 2,324 participants) since those results could be heavily driven by chance with an outcome this rare.

My question is: are the authors of this study justified in throwing in the towel with respect to comparing the groups? If not, what might I recommend to them?

Aaron Duke

2 Answers


Statistical tests do not make assumptions about the relative sizes of the groups. There are, of course, differing assumptions with various tests (e.g., normality), but equality of the sample sizes is not one of them. Unless the test used is inappropriate in some other way (I can't think of an issue right now), the type I error rate will not be affected by drastically unequal group sizes. However, the authors' phrasing implies (to my mind) that they believe it will. Thus, they are confused about these issues.
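To illustrate (a quick simulation sketch of my own, not part of the original thread, with all values placeholders): under the null, with the other assumptions met, the empirical type I error rate of a $t$-test stays near the nominal $\alpha$ even with a 25 vs. 2,324 split.

```python
# Simulation sketch: under the null, a t-test's type I error rate is
# unaffected by drastically unequal group sizes (assumptions otherwise met).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n1, n2, alpha, reps = 25, 2324, 0.05, 10_000

rejections = 0
for _ in range(reps):
    x = rng.normal(0, 1, n1)  # both groups drawn from the same population
    y = rng.normal(0, 1, n2)
    rejections += stats.ttest_ind(x, y).pvalue < alpha

print(f"Empirical type I error rate: {rejections / reps:.3f}")  # ~ 0.05
```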

On the other hand, type II error rates very much will be affected by highly unequal $n$s. This is true no matter the test: the $t$-test, the Mann-Whitney $U$-test, and the $z$-test for equality of proportions will all be affected in this way. For an example of this, see my answer here: How should one interpret the comparison of means from different sample sizes? Thus, they may well be "justified in throwing in the towel" with respect to this issue. (Specifically, if you expect to get a non-significant result whether the effect is real or not, what is the point of running the test?)
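To see the power problem concretely, here is a short sketch (mine, with an assumed effect size of $d = 0.3$) using statsmodels to show how power collapses as a fixed total $N$ is split more and more unequally:

```python
# Sketch: power of a two-sample t-test at fixed total N as the split
# becomes more unequal. The effect size d = 0.3 is an assumption.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
total_n = 2349
for n1 in (1175, 200, 50, 25):  # from balanced down to the 25/2,324 split
    n2 = total_n - n1
    power = analysis.power(effect_size=0.3, nobs1=n1, ratio=n2 / n1, alpha=0.05)
    print(f"n1 = {n1:4d}, n2 = {n2:4d}, power = {power:.3f}")
```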

As the sample sizes diverge, statistical power will converge to $\alpha$. This fact actually leads to a different suggestion, which I suspect few people have ever heard of and which would probably be hard to get past reviewers (no offense intended): a compromise power analysis. The idea is relatively straightforward: in any power analysis, $\alpha$, $\beta$, $n_1$, $n_2$, and the effect size $d$ exist in relationship to each other; having specified all but one, you can solve for the last. Typically, people do what is called an a priori power analysis, in which you solve for $N$ (generally assuming $n_1 = n_2$). Alternatively, you can fix $n_1$, $n_2$, and $d$, and solve for $\alpha$ (or equivalently $\beta$), provided you specify the ratio of type I to type II error rates you are willing to live with. Conventionally, $\alpha = .05$ and $\beta = .20$, which amounts to saying that type I errors are four times worse than type II errors. A given researcher might of course disagree with that ratio, but having specified one, you can solve for the $\alpha$ you should use in order to maintain adequate power. This approach is a logically valid option for the researchers in this situation, although I acknowledge its exoticness may make it a tough sell to a research community that has probably never heard of such a thing.
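As a concrete sketch of what this could look like for a two-sample $t$-test (the function names and the effect size $d = 0.3$ are mine, chosen purely for illustration): fix $n_1$, $n_2$, $d$, and the error-rate ratio $q = \beta/\alpha$, then solve numerically for $\alpha$.

```python
# Compromise power analysis sketch: choose alpha so that beta / alpha = q,
# given fixed n1, n2, and an assumed effect size d.
import numpy as np
from scipy import stats, optimize

def power_t2n(alpha, n1, n2, d):
    """Two-sided power of an unpaired t-test at significance level alpha."""
    df = n1 + n2 - 2
    ncp = d * np.sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # probability mass of the noncentral t beyond the critical values
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

def compromise_alpha(n1, n2, d, q):
    """Find alpha such that beta = q * alpha (q = 4 mirrors alpha=.05, beta=.20)."""
    f = lambda a: (1 - power_t2n(a, n1, n2, d)) - q * a
    return optimize.brentq(f, 1e-6, 0.5)

alpha = compromise_alpha(n1=25, n2=2324, d=0.3, q=4)
print(f"alpha = {alpha:.3f}, implied power = {power_t2n(alpha, 25, 2324, 0.3):.3f}")
```

With groups this unbalanced, the $\alpha$ that keeps the two error rates in a 4:1 proportion comes out well above the conventional .05, which is exactly the trade-off the approach makes explicit.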

gung - Reinstate Monica
  • This is incredibly helpful. I also found your response to [How should one interpret the comparison of means from different sample sizes?](http://stats.stackexchange.com/questions/31326/how-should-one-interpret-the-comparison-of-means-from-different-sample-sizes/31330#31330) useful in my own understanding of this issue. After reading your response, I will bring up the possibility of a compromise power analysis to the authors (it sounds like a safe bet that they are not familiar with it) and maybe suggest being more specific in their comments with respect to concerns about power. – Aaron Duke Jan 19 '14 at 03:17
  • 2
  • You're welcome, @AaronD. In my opinion, you should definitely encourage them to change their phrasing at a minimum, as it is either misleading or implies they misunderstand the topic. I would predict that they won't attempt the compromise power analysis, but they could also just report descriptive statistics (means & SDs) & an effect size w/ appropriate confidence intervals. – gung - Reinstate Monica Jan 19 '14 at 03:22
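To make that fallback concrete, reporting descriptives plus Cohen's $d$ with a confidence interval might look like the sketch below (the data are simulated placeholders, and the CI uses a common normal-approximation formula for the standard error of $d$):

```python
# Sketch of the fallback suggested in the comment above: report descriptives
# and an effect size with a CI rather than a significance test.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.2, 1, 25)    # placeholder for group X
y = rng.normal(0.0, 1, 2324)  # placeholder for group Y
n1, n2 = len(x), len(y)

print(f"X: mean = {x.mean():.2f}, SD = {x.std(ddof=1):.2f}, n = {n1}")
print(f"Y: mean = {y.mean():.2f}, SD = {y.std(ddof=1):.2f}, n = {n2}")

# Cohen's d with a pooled SD
sp = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2))
d = (x.mean() - y.mean()) / sp

# approximate SE of d and a 95% CI (normal approximation)
se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
print(f"d = {d:.2f}, 95% CI [{d - 1.96 * se:.2f}, {d + 1.96 * se:.2f}]")
```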

While the answer from @gung is excellent, I think there is one important issue to consider when looking at wildly different group sizes. Generally, as long as all the requirements of the test are fulfilled, the difference in group sizes is not important.

However, in some cases the difference in group sizes has a dramatic effect on the robustness of the test against violations of its assumptions. The classical two-sample unpaired t-test, for example, assumes homogeneity of variance and is robust against violations only if both groups are of similar size (within an order of magnitude). Otherwise, higher variance in the smaller group inflates the type I error rate. With the t-test this is not much of a problem, since the Welch t-test, which does not assume homogeneity of variance, is commonly used instead. However, similar effects can arise in linear models.
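A brief simulation sketch (mine, with placeholder values) shows the size of this effect: under the null, when the smaller group has the larger variance, the classic pooled test rejects far more often than the nominal 5% level, while the Welch test stays close to it.

```python
# Sketch: type I error rates of the pooled (Student) vs. Welch t-test when
# the smaller group has the larger variance, under the null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1, n2, alpha, reps = 25, 2324, 0.05, 10_000

student_rej = welch_rej = 0
for _ in range(reps):
    x = rng.normal(0, 3, n1)  # smaller group, SD = 3
    y = rng.normal(0, 1, n2)  # larger group, SD = 1
    student_rej += stats.ttest_ind(x, y, equal_var=True).pvalue < alpha
    welch_rej += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha

print(f"Student t: {student_rej / reps:.3f}, Welch t: {welch_rej / reps:.3f}")
```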

In summary, I would say that this is in no way a hindrance to a statistical analysis, but it has to be kept in mind when deciding how to proceed.

Erik
  • I believe the crux of the matter here is not applicability of tests but rather their meaningfulness and interpretability. The question refers to "respondents." This strongly suggests the possibility of a nonzero non-response rate. Even a tiny non-response rate (a small fraction of one percent) relative to the study size would amount to an enormous non-response rate relative to the small subgroup. That calls into question the representativeness of any subgroup this small. As a result, it's a huge hindrance to any statistical analysis. – whuber Jan 20 '14 at 16:34