I am running an experiment that is unpleasant and time-consuming, so I want to minimize the number of times I have to do it. I have done a sample size calculation to determine that I need $100$ repetitions of the experiment.
A colleague noticed how many times I have to repeat this experiment and mentioned that, if I get a sufficiently narrow confidence interval, I can stop before I reach the full $100$. This sounds wonderful experimentally and awful statistically, and when I run a simulation to see what happens if I quit as soon as $p \le 0.05$, the t-test's performance is indeed awful.
That approach is p-hacking.
The gist of my code is that I simulate two samples of $100$ from the same distribution and save the p-value from testing the full samples against one another. Then I test the first $10$, $20$, $30$, $\dots$ observations (per group) against one another. If any of these interim tests gives $p \le 0.05$, I record that p-value instead and move on to the next iteration.
set.seed(2021)
hacked <- unhacked <- rep(NA, 1000)
for (i in 1:1000) {
  # two samples from the same distribution, so the null hypothesis is true
  x0 <- rnorm(100, 0, 1)
  y0 <- rnorm(100, 0, 1)
  # p-value from the full planned sample of 100 per group
  unhacked[i] <- hacked[i] <- t.test(x0, y0)$p.value
  # peek after every 10 observations; stop at the first p <= 0.05
  for (size in seq(10, 100, 10)) {
    x1 <- x0[1:size]
    y1 <- y0[1:size]
    p <- t.test(x1, y1)$p.value
    if (p <= 0.05) {
      hacked[i] <- p
      break
    }
  }
}
# empirical CDFs of the p-values; under the null both should lie on the diagonal
plot(unhacked, ecdf(unhacked)(unhacked))
points(hacked, ecdf(hacked)(hacked), col = 'red')
abline(0, 1)
# empirical false-positive rates at nominal levels 1%, 5%, 10%
ecdf(unhacked)(c(0.01, 0.05, 0.10))
ecdf(hacked)(c(0.01, 0.05, 0.10))
For the hacked test, I get a false-positive rate of $20\%$ at the nominal $5\%$ level, four times what it is supposed to be. The unhacked test behaves as it should.
However, that simulation stops on the p-value, not on the confidence interval width. When I mimic the simulation but stop based on confidence interval width, I do not get the same awful performance. If I quit as soon as I reach the confidence interval width I would expect with the full $100$ observations, the results differ slightly from the unhacked test, but the differences appear to be minimal.
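For reference, here is a sketch of the confidence-interval-width version (not my exact code, and the target width is computed from the known $\sigma = 1$, which I would not have with real data):

set.seed(2021)
hacked_ci <- unhacked_ci <- rep(NA, 1000)
# expected 95% CI width for the mean difference at n = 100 per group,
# using the known sigma = 1: about 0.56
target_width <- 2 * qt(0.975, df = 198) * sqrt(2 / 100)
for (i in 1:1000) {
  x0 <- rnorm(100, 0, 1)
  y0 <- rnorm(100, 0, 1)
  unhacked_ci[i] <- hacked_ci[i] <- t.test(x0, y0)$p.value
  # peek after every 10 observations; stop once the CI is at least
  # as narrow as the width expected from the full sample
  for (size in seq(10, 100, 10)) {
    tt <- t.test(x0[1:size], y0[1:size])
    if (diff(tt$conf.int) <= target_width) {
      hacked_ci[i] <- tt$p.value
      break
    }
  }
}
ecdf(unhacked_ci)(c(0.01, 0.05, 0.10))
ecdf(hacked_ci)(c(0.01, 0.05, 0.10))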
Is my colleague's suggestion a valid statistical procedure?
If not, what would I want to simulate to show the flaws of such a procedure?
I wonder if some of this has to do with the fact that, in the simulation, I know what the standard deviation is, so I can express the confidence interval width in standard-deviation units, rather than in absolute units like "3 nanometers" or "9 femtoseconds" as I would have to with experimental data where I do not know the population parameters.
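To make that concrete: for two groups of size $n$ with common standard deviation $\sigma$, the expected width of the $95\%$ confidence interval for the difference in means is roughly $2\, t_{0.975,\, 2n-2}\, \sigma \sqrt{2/n}$, which at $n = 100$ is about $0.56\sigma$. That is where the target width in the sketch above comes from; with real data I would have to plug in an estimate of $\sigma$ instead, in whatever units the measurements happen to be in.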