
I used the R package pwr (G*Power gives the same result), which says that for a one-sided t-test of means with alpha = .05 and beta = .20, I would need 40 samples in each of the control and treatment groups (n = 80) to detect a 35% reduction (corresponding to an effect size of .57) in the mean of the outcome variable Y. The effect size is calculated from a prior study's mean and variance for the control group, as well as the variance of the treatment group.
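For reference, this is roughly what that calculation looks like in R (the exact call is my assumption; the question doesn't show it):

library(pwr)

# Solve for n per group: effect size d = 0.57, alpha = .05, power = .80,
# one-sided ("greater") alternative
pwr.t.test(d = 0.57, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "greater")
# n comes out just under 40 per group, which rounds up to the quoted 40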

http://www.quantitativeskills.com/sisa/statistics/t-test.php?mean1=13.98&mean2=9.087&N1=25&N2=25&SD1=10.47959&SD2=6.31&CI=95&ES=true&Submit1=Calculate

However, using just 25 samples per group, the result linked above (a 35% decrease from the mean) gives a p-value of .0253 for the one-sided test. (I again used data we would reasonably expect based on a prior study.)
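For the record, those numbers can be checked in R from the summary statistics alone. The sketch below uses Welch's unpooled version, so the p-value differs slightly from the calculator's:

# Summary statistics from the linked calculator
m1 <- 13.98;    m2 <- 9.087
s1 <- 10.47959; s2 <- 6.31
n1 <- 25;       n2 <- 25

# Welch two-sample t statistic, one-sided p-value, and Cohen's d
se <- sqrt(s1^2/n1 + s2^2/n2)
t  <- (m1 - m2) / se
df <- se^4 / ((s1^2/n1)^2 / (n1 - 1) + (s2^2/n2)^2 / (n2 - 1))
pt(t, df, lower.tail = FALSE)           # about .026, near the quoted .0253
(m1 - m2) / sqrt((s1^2 + s2^2) / 2)     # about .57, the effect size above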

So why do power calculations say that 40 per group would be needed, yet apparently 25 per group also works?

StatsNTats

1 Answer


Because these are not looking at the same thing.

Power is the probability of finding a statistically significant result, given that there is an effect of a certain size.

The p-value is the probability of obtaining a result at least as extreme as the one observed, given that the effect in the population is zero.

Here, I generate some data where d = 0.5 (exactly) and I have a sample size of 32 per group.

> t.test(scale(rnorm(32)), scale(rnorm(32)) + 0.5)

    Welch Two Sample t-test

data:  scale(rnorm(32)) and scale(rnorm(32)) + 0.5
t = -2, df = 62, p-value = 0.04989
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9997428793 -0.0002571207
sample estimates:
   mean of x    mean of y 
7.087972e-18 5.000000e-01 

So it seems like I had a large enough sample?

This gives me a p-value of 0.05. If I do a power analysis for that effect:

> power.t.test(n = 32, delta = 0.5, sd = 1)

     Two-sample t test power calculation 

              n = 32
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.5035956
    alternative = two.sided

NOTE: n is number in *each* group

I have only 50% power. So if I were to repeat that study, I have a 50% chance of obtaining a statistically significant result. That's not enough.

You can test that by generating data that match the population, sampling from them, and seeing whether you get a statistically significant result:

> set.seed(1234)
> t.test(rnorm(32), rnorm(32) + 0.5)

    Welch Two Sample t-test

data:  rnorm(32) and rnorm(32) + 0.5
t = -1.2128, df = 60.537, p-value = 0.2299
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7940351  0.1945365
sample estimates:
  mean of x   mean of y 
-0.25831390  0.04143539 

That time I did not. Half the time I will, and half the time I won't.
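You can make "half the time" concrete by repeating that experiment many times and counting how often p < .05 (a quick simulation I'm adding here, not part of the single run above):

set.seed(1234)
# Rerun the n = 32 per group, d = 0.5 experiment 10,000 times
pvals <- replicate(10000, t.test(rnorm(32), rnorm(32) + 0.5)$p.value)
mean(pvals < 0.05)   # about 0.50, matching the power calculation above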

25 per group was enough this time because you got lucky. You're not going to get lucky every time.

(Also, one-sided tests are usually ill-advised; G*Power makes them the default, which I find weird.)

Jeremy Miles
    Ah, so it seems my lack of understanding was because I didn't take into account that the means I used in my link are unreliable (i.e. I forgot the whole concept of standard deviation), so it's as if I did your experiment but used a set.seed() which just happened to work (as you said, got lucky). Thank you! – StatsNTats Aug 28 '18 at 17:11