This is about the power to detect a difference when you have multiple comparisons and attempt to control for that appropriately; specifically, when you're using Dunnett's test. In a situation such as yours, a person could just run two t-tests (I am not suggesting you do this). If you did that, you would be making multiple comparisons (literally two), and that could lead to inflated familywise type I error rates. In addition, the two tests would not be independent of each other. For example, imagine that there is no effect (thus type I errors are possible), but the mean for the control arm bounced lower by chance alone. Then both contrasts could become significant (since they are both being compared to a misleadingly lower value).
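To see both problems concretely, here is a quick side sketch (separate from the main simulation below; the group sizes, number of iterations, and cutoff are just illustrative choices of mine). Both nulls are true, yet the chance that at least one unadjusted t-test comes out 'significant' is noticeably above $.05$, and the two tests tend to be significant together because they share the same control group.
set.seed(1234)           # arbitrary seed for this side sketch
B2 = 10000               # a smaller number of iterations is fine here
p1 = vector(length=B2)   # p-values for treatment 1 vs. control
p2 = vector(length=B2)   # p-values for treatment 2 vs. control
for(i in 1:B2){
  ctrl = rnorm(16);  trt1 = rnorm(16);  trt2 = rnorm(16)   # no effects anywhere
  p1[i] = t.test(trt1, ctrl)$p.value
  p2[i] = t.test(trt2, ctrl)$p.value
}
mean(p1<.05 | p2<.05)                          # familywise error rate; noticeably above .05
cor(as.numeric(p1<.05), as.numeric(p2<.05))    # positive: the two tests are not independent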
For cases like this (multiple comparisons against a single control group), it is common to use Dunnett's test. Now let's imagine that you are doing this, but that the null does not hold (so we are now concerned with whether your power will be adequate). Given a fixed total $N$, how should we divide those observations among the three groups to optimize our power? In general, you would want equal $n$s in your groups (cf. my answer here: How should one interpret the comparison of means from different sample sizes?). However, this situation is special: you will be making multiple comparisons against the same control group, so perhaps there is some other way to partition the units that improves power. That is where the idea you refer to comes from. Below, I have worked up a simple simulation in R using a total $N = 48$. This lets us allocate the data $1$ to $1$ to $1$ ($16$ in each arm), $\sqrt{2} \approx 1.41$ to $1$ to $1$ ($20$ in the control arm and $14$ in each treatment arm), or, going even further, $2$ to $1$ to $1$ ($24$ in the control arm and $12$ in each treatment arm).
library(multcomp)  # provides glht() & mcp(), used for the Dunnett's tests below
set.seed(7078)     # this makes the simulation exactly reproducible
# N = 48           # the total N is 48
B = 100000         # I will do 100k simulations
sig_11_t1 = vector(length=B) # these vectors will hold the results / p-values
sig_11_t2 = vector(length=B) # of the various tests of the treatments
# against the control
sig_11.4_t1 = vector(length=B)
sig_11.4_t2 = vector(length=B)
sig_21_t1 = vector(length=B)
sig_21_t2 = vector(length=B)
for(i in 1:B){
  er = rnorm(48)  # here I generate standard normal data for the errors
  # below I assign the data to the different arms according
  # to the different schemes for distributing the ns
  # (note that the effects are the same in all cases):
  c_11   = er[1:16]; t1_11   = er[17:32]; t2_11   = 1+er[33:48]
  c_11.4 = er[1:20]; t1_11.4 = er[21:34]; t2_11.4 = 1+er[35:48]
  c_21   = er[1:24]; t1_21   = er[25:36]; t2_21   = 1+er[37:48]
  # this combines the data above into datasets, the data
  # will be called 'values', & the treatments will be called
  # 'ind' (indicator for which group):
  d_11   = stack(list(c=c_11,   t1=t1_11,   t2=t2_11))
  d_11.4 = stack(list(c=c_11.4, t1=t1_11.4, t2=t2_11.4))
  d_21   = stack(list(c=c_21,   t1=t1_21,   t2=t2_21))
  # these fit the three models:
  m_11   = lm(values~ind, d_11)
  m_11.4 = lm(values~ind, d_11.4)
  m_21   = lm(values~ind, d_21)
  # these are the Dunnett's tests (I'm not bothering w/
  # an omnibus test for this simulation, but you would
  # probably do that for real data):
  D_11   = summary(glht(m_11,   linfct=mcp(ind="Dunnett")))
  D_11.4 = summary(glht(m_11.4, linfct=mcp(ind="Dunnett")))
  D_21   = summary(glht(m_21,   linfct=mcp(ind="Dunnett")))
  # these extract the p-values from the Dunnett's tests, &
  # store them in the appropriate vectors:
  sig_11_t1[i]   = D_11$test$pvalues[1]
  sig_11_t2[i]   = D_11$test$pvalues[2]
  sig_11.4_t1[i] = D_11.4$test$pvalues[1]
  sig_11.4_t2[i] = D_11.4$test$pvalues[2]
  sig_21_t1[i]   = D_21$test$pvalues[1]
  sig_21_t2[i]   = D_21$test$pvalues[2]
}
# these determine the proportion 'significant':
mean(sig_11_t1<.05) # [1] 0.02764
mean(sig_11_t2<.05) # [1] 0.70979
mean(sig_11.4_t1<.05) # [1] 0.02707
mean(sig_11.4_t2<.05) # [1] 0.72071
mean(sig_21_t1<.05) # [1] 0.02568
mean(sig_21_t2<.05) # [1] 0.70380
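Before interpreting these, here is a quick sanity check I'm adding on top of the output above (just a simple binomial approximation): the differences among the three t2 proportions are comfortably larger than the Monte Carlo noise.
p_hat = c(r11=0.70979, r11.4=0.72071, r21=0.70380)  # the three t2 proportions above
sqrt(p_hat*(1-p_hat)/B)    # Monte Carlo SEs; each is roughly 0.0014
# the ~0.011 gap between the 1.41:1:1 & 1:1:1 schemes is several SEs wide, &
# reusing the same errors across the schemes should make the comparison tighter still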
Notice that, the way I set the simulation up, the same errors are used for each allocation scheme, which makes the comparison among them cleaner. I made the null true for the tests of t1, but false for t2, which means we can interpret the rates of significance for t1 as type I error rates, and the rates of significance for t2 as estimates of the power of the test / experiment. The per-comparison type I error rates are held appropriately low (each is clearly $<0.05$; it is the familywise error rate that should be approximately $0.05$). With regard to power, it is maximized when you use the ratio of $1.41$ to $1$ to $1$: you have less power whether you deviate from that ratio toward equality or further away from it. That specific ratio yields roughly an extra $1$ to $2\%$ of power (the result was significant over 1,000 more times out of 100,000 simulated 'experiments' than under the other allocations). Although the gain here seems small enough not to bother about, this simple simulation cannot tell us how the advantage plays out as sample sizes and effect sizes vary.