First, we have to think clearly about what tests we're going to conduct. I'm not a huge fan of using the linear probability model, although it can be done in this case, as you only have categorical explanatory variables. (Note that you cannot have a constant SD with binary data where the proportions differ.) Do you just want a test that the conditions differ? Do you have to follow that up with planned comparisons? What test for those? How do you want to account for multiple comparisons? You can't just 'do a power analysis' until a lot of decisions have been made. None of that is meant to be critical; I'm trying to point out why you can't just find a simple answer by Googling.
I'm not generally a fan of canned power analyses, unless the situation is very simple and maps directly onto a simple, canonical test. In general, I prefer to simulate the alternative hypothesis / data-generating process that I am proposing and to conduct the sequence of tests that I intend. Doing so also helps me think through the statistical analysis plan for the study: what the data could look like, how I would reason about them, and what I would conclude. There are a lot more nuances than people often realize. For a more detailed exposition, it may help you to read through my answer here: Simulation of logistic regression power analysis - designed experiments (the code is rather clunky, but hopefully easy to follow).
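To give the flavor of that approach before turning to your specific design, here is a minimal sketch for a single two-group comparison of proportions (the probabilities, $n$, and the choice of prop.test() are illustrative placeholders, not recommendations for your study):
set.seed(1)                                     # for reproducibility
B = 1000;  n = 200                              # simulation runs & per-group n (placeholders)
sig = replicate(B, {                            # for each simulated dataset...
  y1 = rbinom(1, size=n, prob=.20)              # events in group 1
  y2 = rbinom(1, size=n, prob=.10)              # events in group 2
  prop.test(c(y1, y2), c(n, n))$p.value < .05   # run the test you actually plan to use
})
mean(sig)  # the proportion of significant runs estimates the power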
Fortunately perhaps, your situation does correspond to simple analyses where it is easy to apply a canned power analysis. Specifically, if you just want to see if the three conditions differ, given that you only have three categorical conditions and the outcome data are binary (lived/died), this corresponds to a chi-squared test of a 2x3 contingency table. Alternatively, if you just want to test if $T_1$ differs from $C$, and if $T_2$ differs from $C$, you can conduct two $z$-tests of differences in proportions. Those won't be independent, so you might want to use a Bonferroni correction, in which case you simply use $\alpha=.025$ in your power analyses and then use whichever $n$ is bigger. I can demonstrate these using the pwr library in R. (It might help you to work through the introductory vignette.)
First, I input the probabilities you have specified as the alternative hypothesis. Then I compute Cohen's measure of effect size, $w$, for a two-way contingency table. The contingency table will have $(r-1)(c-1)=2$ degrees of freedom, so we can simply get the required $N$ using the canned function ?pwr.chisq.test:
library(pwr)
#           C    T1   T2     # conditions
P = rbind(c(.20, .15, .10),  # prob die
          c(.80, .85, .90) ) # prob live
P = P/3;  P  # matrix of cell probabilities
#            [,1]      [,2]       [,3]
# [1,] 0.06666667 0.0500000 0.03333333
# [2,] 0.26666667 0.2833333 0.30000000
w = ES.w2(P=P); w # [1] 0.1143324 # Cohen's measure of effect size w
pwr.chisq.test(w=w, N=NULL, df=2, sig.level=.05, power=.80)
#
# Chi squared power calculation
#
# w = 0.1143324
# N = 737.0537
# df = 2
# sig.level = 0.05
# power = 0.8
#
# NOTE: N is the number of observations
ceiling(737.0537/3) # [1] 246 # you'll need n=246 participants in each condition
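If you like, you can sanity-check that canned result by simulation (a quick sketch: I draw the 2x3 table under your alternative with $n=246$ per arm and count how often the chi-squared test rejects):
set.seed(2)
n = 246  # the per-arm n from the canned analysis above
sig = replicate(1000, {
  deaths = c(rbinom(1, n, .20), rbinom(1, n, .15), rbinom(1, n, .10))
  tab    = rbind(deaths, n-deaths)  # 2x3 table: died / lived, by condition
  chisq.test(tab)$p.value < .05
})
mean(sig)  # should land in the ballpark of .80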
A different approach is simply to conduct two separate tests of the treatment conditions against the control. Since these aren't independent, we can test both against a lower alpha. Once again, first we stipulate the probabilities you want to be able to detect, then compute Cohen's measure of effect size, $h$. From there, it's easy to get the required $N$ from the canned function ?pwr.2p.test:
h1 = ES.h(.20, .15); h1 # [1] 0.1318964 # Cohen's measure of effect size h
h2 = ES.h(.20, .10); h2 # [1] 0.2837941
pwr.2p.test(h=h1, n=NULL, sig.level=0.025, power=.80)
#
# Difference of proportion power calculation for binomial distribution
# (arcsine transformation)
#
# h = 0.1318964
# n = 1092.743
# sig.level = 0.025
# power = 0.8
# alternative = two.sided
#
# NOTE: same sample sizes
pwr.2p.test(h=h2, n=NULL, sig.level=0.025, power=.80)
#
# Difference of proportion power calculation for binomial distribution
# (arcsine transformation)
#
# h = 0.2837941
# n = 236.0353
# sig.level = 0.025
# power = 0.8
# alternative = two.sided
#
# NOTE: same sample sizes
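For reference, Cohen's $h$ is just the difference between the arcsine-transformed proportions, $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$, so you can reproduce ES.h() by hand:
2*asin(sqrt(.20)) - 2*asin(sqrt(.15))  # [1] 0.1318964  # matches h1 above
2*asin(sqrt(.20)) - 2*asin(sqrt(.10))  # [1] 0.2837941  # matches h2 above
The arcsine transformation approximately stabilizes the variance of a proportion, which is why equal $h$'s imply (approximately) equal power wherever the proportions sit.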
This route implies you'll need $1093$ participants in each condition. That's a lot of data! However, it may be closer to what you really want to be able to demonstrate. It's worth remembering at this point that there is very little information in a binary data point, that there is even less as the probability approaches the upper or lower bound, and that $.15$ is really quite close to $.20$ (although I acknowledge that every life is precious, so the small difference may nonetheless be clinically meaningful).
If you're really committed to using the linear probability model, and want to show that each condition differs from the others, we need to move to a simulation-based approach. How do you want to address the necessary heteroscedasticity? Among other options, you could use weighted least squares—I'll do that here. How do you want to conduct the multiple comparisons? There are lots of ways; in this case I'll use Tukey's test.
The power analyses above give me a ballpark estimate of where to start. This will require a lot of computation, so I take some steps to make it faster: I generate all the data and the weights ahead of time. I try to minimize the number of calculations that I'm asking R to perform, etc. Done this way, it only takes my old machine about 15 seconds. I'm assuming the analytical plan is to first determine if there is a significant global effect, and if so, you want to go further and show that all three conditions differ. Thus, you want four significant p-values for the study to be considered successful. That is, we are solving for all-ways power (see my linked answer at top).
set.seed(906) # this makes the example exactly reproducible
n = 1093 # number of patients per arm
B = 1000 # number of iterations in the simulation
p.mat = matrix(NA, nrow=4, ncol=B) # matrix to store the p-values
cond = rep(c("C", "T1", "T2"), each=n) # condition variable
y.mat = matrix(c(rbinom(n*B, size=1, prob=.20),    # resulting data: rows 1:n are C,
                 rbinom(n*B, size=1, prob=.15),    #  rows (n+1):(2*n) are T1,
                 rbinom(n*B, size=1, prob=.10) ),  #  rows (2*n+1):(3*n) are T2
               nrow=n*3, ncol=B, byrow=T)          # 1 column per simulated study
w.mat = matrix(NA, nrow=n*3, ncol=B) # matrix to store the weights
i2s = n+1; i2e = 2*n; i3s = (2*n)+1; i3e = 3*n # row indexes
for(j in 1:B){  # weights: the inverse estimated binomial variance, 1/(n*p.hat*(1-p.hat)), per arm
  w.mat[1:n,    j] = 1/(n*mean(y.mat[1:n,    j])*(1-mean(y.mat[1:n,    j])))
  w.mat[i2s:i2e,j] = 1/(n*mean(y.mat[i2s:i2e,j])*(1-mean(y.mat[i2s:i2e,j])))
  w.mat[i3s:i3e,j] = 1/(n*mean(y.mat[i3s:i3e,j])*(1-mean(y.mat[i3s:i3e,j])))
}
for(j in 1:B){ # fitting the models & storing the p-values
m = aov(y.mat[,j]~cond, weights=w.mat[,j])
p.mat[1,j] = summary(m)[[1]][1,5] # global p-value
p.mat[2:4,j] = TukeyHSD(m)$cond[,4] # 3 p-values for comparisons
}
## power: i.e., the proportion of runs where all p's were significant
mean(apply(p.mat, 2, function(j){ mean(j<.05)==1 })) # [1] 0.676
With this analytical strategy, using $n = 1093$ in each arm ($N = 3279$), I estimate you have $\approx 68\%$ power to show that all three conditions differ from each other. If you want, you can search over larger $n$'s to find how many patients it would take to achieve $80\%$ power.
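If it helps, here is one way to run that search: wrap the simulation in a function of $n$ and step upward until the estimated all-ways power clears $80\%$. The function sim.power() below is a hypothetical wrapper I'm defining for illustration; it re-implements the simulation above more compactly (and therefore more slowly, since the data and weights aren't pre-generated):
sim.power = function(n, B=1000){
  cond = rep(c("C", "T1", "T2"), each=n)
  hits = replicate(B, {
    y = c(rbinom(n, 1, .20), rbinom(n, 1, .15), rbinom(n, 1, .10))
    w = 1/ave(y, cond, FUN=function(g){ n*mean(g)*(1-mean(g)) })  # same WLS weights as above
    m = aov(y~cond, weights=w)
    p = c(summary(m)[[1]][1,5], TukeyHSD(m)$cond[,4])  # global p-value + 3 comparisons
    all(p < .05)                                       # all-ways success?
  })
  mean(hits)
}
sim.power(1200)  # then try, e.g., 1300, 1400, ... until the estimate reaches .80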