
I am running a study with two experimental groups, each with slight modifications to their variables, and I want to be able to compare both to the same control. I have heard the control group should be larger than the experimental groups by a factor of $\sqrt{\text{number of experimental arms}}$.
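To make the rule concrete, here is a tiny sketch with made-up numbers (just my reading of the rule, not taken from any source):

k = 2                # number of experimental arms
n = 14               # hypothetical size of each experimental arm
ceiling(sqrt(k)*n)   # control-arm size the rule would suggest; here 20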

Does anyone know why this would be the case?

mdewey
Ali Shah
  • I've found that, on this site, we get a lot of questions of the form "I've heard people say you should do X, but they haven't explained why. Why is that?" In most cases, I can only reply that I don't see any reason to do X myself. This is such a case. If you could cite somebody saying this, or remember somebody's rationale for it, then I might be able to address that. Otherwise, I don't know what else to tell you. – Kodiologist Nov 25 '16 at 01:31
  • @Kodiologist is quite right - please do cite/explain this requirement. – Scortchi - Reinstate Monica Nov 26 '16 at 11:25
  • I would also like to know a citation for that rule (which seems very reasonable). The only source I can find is this lecture note, which doesn't even list an author: https://onlinecourses.science.psu.edu/stat503/node/16 Any published articles that give this method? – Harvey Motulsky Apr 27 '18 at 00:17

1 Answer


This is about the power to detect a difference when you have multiple comparisons and attempt to control for that appropriately; specifically, when you're using Dunnett's test. In a situation such as yours, a person could just run two t-tests (I am not suggesting you do this). If you did that, you would be making multiple comparisons (literally two), which would inflate the familywise type I error rate. In addition, the two tests would not be independent of each other. For example, imagine that there is no effect (thus type I errors are possible), but that the mean for the control arm bounced lower by chance alone. Then both contrasts could come out significant, since both treatments are being compared to the same misleadingly low value.
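To see that inflation concretely, here is a minimal sketch (separate from the main simulation below, using made-up group sizes of 16) that runs two uncorrected t-tests against a shared control when the complete null is true; the proportion of 'experiments' with at least one significant result comes out noticeably above the nominal $0.05$:

set.seed(1)
B = 10000
fw_error = replicate(B, {
  ctrl = rnorm(16);  t1 = rnorm(16);  t2 = rnorm(16)  # no true effects anywhere
  p1 = t.test(t1, ctrl)$p.value   # two naive, uncorrected t-tests
  p2 = t.test(t2, ctrl)$p.value   #  against the same control group
  (p1 < .05) | (p2 < .05)         # at least one false positive?
})
mean(fw_error)                    # familywise error rate, noticeably above .05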

For cases like this (multiple comparisons to a single group), it is common to use Dunnett's test. Now let's imagine that you are doing this, but that the null does not hold (thus we are now concerned with whether your power will be adequate). Given a fixed total $N$, how should we divvy those observations into three groups to optimize our power? In general, you would want equal $n$s in your groups (cf. my answer here: How should one interpret the comparison of means from different sample sizes?). However, this situation is special: you will be making multiple comparisons against the same control group, so maybe there is some other way to partition the units that optimizes power. That is where the idea you refer to comes from. Below is a simple simulation in R with a total $N = 48$. That total lets us allocate the data $1$ to $1$ to $1$ ($16$ each), $\sqrt{2} \approx 1.41$ to $1$ to $1$ ($20$ in the control arm and $14$ in each of the treatment arms), or, going even further, $2$ to $1$ to $1$ ($24$ in the control arm and $12$ each in the treatment arms).

library(multcomp)  # provides glht() & mcp(), used for Dunnett's tests below
set.seed(7078)     # this makes the simulation exactly reproducible
# N = 48           # the total N is 48
B = 100000         # I will do 100k simulations
sig_11_t1   = vector(length=B)  # these vectors will hold the results / p-values
sig_11_t2   = vector(length=B)  #  of the various tests of the treatments 
                                #  against the control
sig_11.4_t1 = vector(length=B)
sig_11.4_t2 = vector(length=B)
sig_21_t1   = vector(length=B)
sig_21_t2   = vector(length=B)
for(i in 1:B){
  er = rnorm(48)  # here I generate standard normal data for the errors
                  # below I assign the data to the different arms according
                  #  to the different schemes for distributing the ns
                  #  (note that the effects are the same in all cases):
  c_11   = er[1:16];  t1_11   = er[17:32];  t2_11   = 1+er[33:48]
  c_11.4 = er[1:20];  t1_11.4 = er[21:34];  t2_11.4 = 1+er[35:48]
  c_21   = er[1:24];  t1_21   = er[25:36];  t2_21   = 1+er[37:48]

                  # this combines the data above into datasets, the data
                  #  will be called 'values', & the treatments will be called
                  #  'ind' (indicator for which group):
  d_11   = stack(list(c=c_11,   t1=t1_11,   t2=t2_11))
  d_11.4 = stack(list(c=c_11.4, t1=t1_11.4, t2=t2_11.4))
  d_21   = stack(list(c=c_21,   t1=t1_21,   t2=t2_21))

                  # these fit the three models:
  m_11   = lm(values~ind, d_11)
  m_11.4 = lm(values~ind, d_11.4)
  m_21   = lm(values~ind, d_21)

                  # these are the Dunnett's tests (I'm not bothering w/ 
                  #  an omnibus test for this simulation, but you would
                  #  probably do that for real data)
  D_11   = summary(glht(m_11,   linfct=mcp(ind="Dunnett")))
  D_11.4 = summary(glht(m_11.4, linfct=mcp(ind="Dunnett")))
  D_21   = summary(glht(m_21,   linfct=mcp(ind="Dunnett")))

                  # these extract the p-values from the Dunnett's tests, &
                  #  store them in the appropriate vectors
  sig_11_t1[i]   = D_11$test$pvalues[1]
  sig_11_t2[i]   = D_11$test$pvalues[2]
  sig_11.4_t1[i] = D_11.4$test$pvalues[1]
  sig_11.4_t2[i] = D_11.4$test$pvalues[2]
  sig_21_t1[i]   = D_21$test$pvalues[1]
  sig_21_t2[i]   = D_21$test$pvalues[2]
}
# these determine the proportion 'significant':
mean(sig_11_t1<.05)    # [1] 0.02764
mean(sig_11_t2<.05)    # [1] 0.70979
mean(sig_11.4_t1<.05)  # [1] 0.02707
mean(sig_11.4_t2<.05)  # [1] 0.72071
mean(sig_21_t1<.05)    # [1] 0.02568
mean(sig_21_t2<.05)    # [1] 0.70380

Notice that the way I set up the simulation, the same errors are used for each set of tests, which should make the comparison across allocation schemes cleaner. I made the null true for the tests of t1, but false for t2, which means we can interpret the significance rates for t1 as type I error rates and the significance rates for t2 as estimates of the power of the test / experiment. The individual type I error rates are held down appropriately: they are clearly $<0.05$ because Dunnett's test controls the familywise error rate at approximately $0.05$ across both comparisons. With regard to the power of the tests, it is maximized when you use the ratio of $1.41$ to $1$ to $1$: you have less power whether you deviate from that ratio by moving back toward equal allocation or further away from it. That specific ratio yields roughly $1$-$2\%$ more power (the result was significant over 1,000 more times out of 100,000 simulated 'experiments') than the alternatives. (Although the effect here seems small enough not to bother about, this simulation isn't sophisticated enough to show how the additional power plays out as sample and effect sizes vary.)
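As for where the $\sqrt{k}$ ratio comes from in the first place, here is a sketch of the standard variance argument (with $k$ treatment arms of size $n_t$ each, a control arm of size $n_c$, and a common error variance $\sigma^2$): each treatment-vs-control comparison has variance $\sigma^2(1/n_t + 1/n_c)$, so minimizing the sum of those variances for a fixed total sample size (via a Lagrange multiplier) gives

$$\min_{n_t,\,n_c}\ \sigma^2\!\left(\frac{k}{n_t}+\frac{k}{n_c}\right)\quad\text{subject to}\quad n_c + k\,n_t = N \quad\Longrightarrow\quad n_c^2 = k\,n_t^2 \quad\Longrightarrow\quad \frac{n_c}{n_t}=\sqrt{k}.$$

With $k=2$, that is the $\sqrt{2}\approx 1.41$ to $1$ to $1$ allocation that came out best in the simulation.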

gung - Reinstate Monica
  • Great answer. Can you add a few words about whether the value of this recommendation is wholly specific to Dunnett's test? – Kodiologist Nov 27 '16 at 19:32
  • @Kodiologist, I believe this is specific to Dunnett's test, unless there is another test / correction that is specific to the many vs. 1 comparison issue that I'm unaware of--I don't think this would come up with a generic correction. – gung - Reinstate Monica Nov 27 '16 at 19:54