I am running an experiment that is unpleasant and time-consuming, so I want to minimize the number of times I have to do it. I have done a sample size calculation to determine that I need $100$ repetitions of the experiment.
A colleague noticed how many times I have to repeat this experiment and mentioned that, if I get a sufficiently narrow confidence interval, I can stop before I reach the full $100$. This sounds wonderful experimentally and awful statistically, and when I run a simulation to see what happens if I quit as soon as $p \le 0.05$, the t-test's performance is indeed awful.
That approach is p-hacking.
The gist of my code is that I simulate two samples of $100$ from the same distribution and save the p-value from testing the full samples against one another. Then I test the first $10$, $20$, $30$, $\dots$ observations (per group) against one another. If any of these interim tests gives $p \le 0.05$, I record that p-value instead and move on to the next iteration.
set.seed(2021)
hacked <- unhacked <- rep(NA, 1000)
for (i in 1:1000) {
  # two samples from the same distribution, so the null hypothesis is true
  x0 <- rnorm(100, 0, 1)
  y0 <- rnorm(100, 0, 1)
  # p-value from the full planned sample of 100 per group
  unhacked[i] <- hacked[i] <- t.test(x0, y0)$p.value
  # peek after every 10 observations; stop at the first p <= 0.05
  for (size in seq(10, 100, 10)) {
    x1 <- x0[1:size]
    y1 <- y0[1:size]
    p <- t.test(x1, y1)$p.value
    if (p <= 0.05) {
      hacked[i] <- p
      break
    }
  }
}
# empirical CDFs of the p-values; under the null both should lie on the diagonal
plot(unhacked, ecdf(unhacked)(unhacked))
points(hacked, ecdf(hacked)(hacked), col = 'red')
abline(0, 1)
# empirical false-positive rates at nominal levels 1%, 5%, 10%
ecdf(unhacked)(c(0.01, 0.05, 0.10))
ecdf(hacked)(c(0.01, 0.05, 0.10))
For the hacked test, I get a false-positive rate of $20\%$ at the nominal $5\%$ level, four times what it is supposed to be. The unhacked test behaves as it should.
However, that simulation stops on the p-value, not on the confidence interval width. When I mimic the simulation but stop based on confidence interval width, I do not get the same awful performance. If I quit as soon as I reach the confidence interval width I would expect with the full $100$ observations, the results differ slightly from the unhacked test, but the differences appear to be minimal.
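For reference, here is a sketch of the confidence-interval-width version (not my exact code, and the target width is computed from the known $\sigma = 1$, which I would not have with real data):

set.seed(2021)
hacked_ci <- unhacked_ci <- rep(NA, 1000)
# expected 95% CI width for the mean difference at n = 100 per group,
# using the known sigma = 1: about 0.56
target_width <- 2 * qt(0.975, df = 198) * sqrt(2 / 100)
for (i in 1:1000) {
  x0 <- rnorm(100, 0, 1)
  y0 <- rnorm(100, 0, 1)
  unhacked_ci[i] <- hacked_ci[i] <- t.test(x0, y0)$p.value
  # peek after every 10 observations; stop once the CI is at least
  # as narrow as the width expected from the full sample
  for (size in seq(10, 100, 10)) {
    tt <- t.test(x0[1:size], y0[1:size])
    if (diff(tt$conf.int) <= target_width) {
      hacked_ci[i] <- tt$p.value
      break
    }
  }
}
ecdf(unhacked_ci)(c(0.01, 0.05, 0.10))
ecdf(hacked_ci)(c(0.01, 0.05, 0.10))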
Is my colleague's suggestion a valid statistical procedure?
If not, what would I want to simulate to show the flaws of such a procedure?
I wonder if some of this has to do with the fact that, in the simulation, I know what the standard deviation is, so I can express the confidence interval width in standard-deviation units, rather than in absolute units like "3 nanometers" or "9 femtoseconds" as I would have to with experimental data where I do not know the population parameters.
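To make that concrete: for two groups of size $n$ with common standard deviation $\sigma$, the expected width of the $95\%$ confidence interval for the difference in means is roughly $2\, t_{0.975,\, 2n-2}\, \sigma \sqrt{2/n}$, which at $n = 100$ is about $0.56\sigma$. That is where the target width in the sketch above comes from; with real data I would have to plug in an estimate of $\sigma$ instead, in whatever units the measurements happen to be in.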