
I'm involved in running experiments where we want a sample size sufficient to achieve a certain CI width (or, equivalently, a certain power).

We currently run a pilot of a few hundred units, calculate the variance (we ignore the size of the effect), and then estimate the sample size that would be required to obtain the CI width we desire.
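
Roughly, the calculation looks like this (a minimal sketch assuming a normal-approximation 95% CI for a mean; the pilot SD and target width below are made-up illustrative values, not from our actual experiments):

s_pilot      <- 1.3            # SD estimated from the pilot (made up)
target_width <- 0.2            # desired full width of the CI (made up)
z            <- qnorm(0.975)   # 95% two-sided normal critical value
# Full CI width = 2 * z * s / sqrt(n), so n = (2 * z * s / width)^2
n_required <- ceiling((2 * z * s_pilot / target_width)^2)
n_required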

The sample size is an estimate, so sometimes (roughly half the time) the resulting CI ends up narrower than we expected, and sometimes wider. When it's wider, the customer is unhappy.

One approach that's been suggested is to keep sampling until the CI width is sufficiently small. This feels uncomfortably close to p-hacking to me, but we're not calculating p-values, and we're (still) not looking at the size of the effect.

Is this legitimate?

Jeremy Miles
  • If you do it straight as you describe it, it's indeed invalid. The keyword for what you're looking for is "Sequential Analysis" https://en.wikipedia.org/wiki/Sequential_analysis although I'm not an expert on that. – Christian Hennig Aug 23 '21 at 20:35
  • Keep in mind that hypothesis tests and confidence intervals are inverses of each other. – Dave Aug 23 '21 at 20:41
  • @Lewian - I don't think this is the same as sequential analysis, or has the same issues, because I'm not testing for significance or looking at where the upper or lower CI limits land. All I care about is the width of the CI. Even if it's significant early, I keep going; even if it's not significant later, I stop. – Jeremy Miles Aug 23 '21 at 20:46
  • @Dave - they are the inverse if I look at whether the CIs include zero. I don't. – Jeremy Miles Aug 23 '21 at 20:47
  • This [recent Q & A](https://stats.stackexchange.com/questions/540671/continuous-sample-size-determination-based-on-the-control-group/540705#540705) discusses why a sequential approach such as yours is inappropriate if you use standard tests (or formulas for CIs). [Strictly speaking I guess it can't be "P-hacking" unless you're using P-values.] // In your work a sequential approach might be useful. However, if you do use a sequential approach, then you need to use methods that take your particular sequential scheme into account. (Hence @Lewian's suggested link to sequential methods.) – BruceET Aug 23 '21 at 21:11
  • @BruceET - that discussion seems to mix up p-values, effect sizes, and CIs/SEs. All I'm looking at is the SE, so I'm not sure I need a sequential approach. (Which I would need if I were looking at p.) – Jeremy Miles Aug 23 '21 at 21:22
  • Sometimes a bad idea is so tempting that excuses abound and objectivity is lost. – BruceET Aug 23 '21 at 21:26
  • Indeed - some might say that about p-values themselves. :) I don't think I've lost objectivity - I want to see if there's a problem with this approach, 'cos if there is I won't do it (and say sorry to the customer). – Jeremy Miles Aug 23 '21 at 22:15
  • Why not take account of the error (i.e. the fact that it's a random variable, not a known population quantity) in the pilot estimate of variance when forming the coverage calculation? If you have a model, forming a pivot and calculating coverage is at least amenable to simulation at the pilot stage, or you could take a bootstrapping approach in large samples. – Glen_b Aug 24 '21 at 06:41
  • That's a good idea. Thanks. – Jeremy Miles Aug 24 '21 at 16:53
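
A minimal sketch of the kind of simulation Glen_b suggests (one reading of his comment; the pilot numbers below are made up): for normal data, $(n-1)s^2/\sigma^2$ is a chi-squared pivot, so one can generate true SDs consistent with the observed pilot SD and look at how the implied required sample size spreads.

set.seed(1)
n_pilot      <- 300   # pilot size (hypothetical)
s_pilot      <- 1.3   # pilot SD (hypothetical)
target_width <- 0.2   # desired full CI width (hypothetical)
# Invert the chi-squared pivot (n_pilot - 1) * s^2 / sigma^2 to draw
# true SDs that are consistent with the observed pilot SD.
sigma_sim <- s_pilot * sqrt((n_pilot - 1) / rchisq(10000, df = n_pilot - 1))
n_sim <- ceiling((2 * qnorm(0.975) * sigma_sim / target_width)^2)
quantile(n_sim, c(0.5, 0.9, 0.99))  # e.g. plan for the 90th percentile of n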

2 Answers


As a partial answer to my own question, I ran a simulation to see whether this stopping rule leads to inappropriate rejection of $H_0$.

set.seed(1234)

start_n     <- 20    # initial sample size
increment_n <- 20    # extra observations added while the SE is too large

target_se <- 0.05    # stop once the SE of the mean is at or below this

vec_p    <- numeric()
vec_se   <- numeric()
vec_n    <- numeric()
vec_mean <- numeric()

# H0 true: data are drawn from N(0, 1), so any rejection is a type I error
for (i in 1:1000) {
  y <- rnorm(start_n)
  repeat {
    se <- sd(y) / sqrt(length(y))   # current standard error of the mean
    p  <- t.test(y)$p.value         # one-sample t-test of mu = 0
    if (se <= target_se) break      # target reached: stop sampling
    y <- c(y, rnorm(increment_n))   # otherwise collect another batch
  }
  vec_se   <- c(vec_se, se)
  vec_p    <- c(vec_p, p)
  vec_n    <- c(vec_n, length(y))
  vec_mean <- c(vec_mean, mean(y))
}

cat("Type I error rate: ", mean(vec_p < 0.05), "\n")
table(vec_n)

Which gives:

Type I error rate:  0.045
vec_n
320 340 360 380 400 420 440 460 480 500 520 
  1   2  17  56 166 242 289 161  55   9   2 

(vec_n is the sample size at which each simulated experiment stopped.)

The type I error rate tends to be a touch lower than 0.05, which is explained by @Michael Lew's answer.

Jeremy Miles

Sampling until a nominated confidence interval width is obtained is technically similar to sequential testing, and might be thought by some to be similar to p-hacking, but that does not mean that you should not do it!

If your concern is accurate estimation of the population variance, then a 'stop when the CI is narrow enough' strategy is going to give you low estimates more often than not, because sampling is more likely to stop after an observation that lowers the sample standard deviation than after one that raises it. However, that bias may be quite small and so might well be of no practical concern. It will depend on the sample size, and thus on the nominated CI width: the bias will be smaller with a large sample, because a large sample has a relatively stable CI estimate prior to stopping, whereas a small-sample CI fluctuates much more with each new observation.

If your concern is accurate estimation of the population mean, then I don't think there is any issue, because your stopping rule does not depend on the sample mean.
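
Both claims are easy to check by simulation. Here is a minimal sketch (assuming standard normal data, so the true mean is 0 and the true SD is 1, and the same batch-sampling scheme as in the other answer):

set.seed(42)
# Run the 'sample until SE <= 0.05' rule many times and record the
# sample SD and sample mean at the moment of stopping.
sims <- replicate(2000, {
  y <- rnorm(20)
  while (sd(y) / sqrt(length(y)) > 0.05) y <- c(y, rnorm(20))
  c(sd = sd(y), mean = mean(y))
})
# Per the argument above: the SD should come out slightly below the true
# value of 1 (biased low), while the mean should stay close to 0 (unbiased).
rowMeans(sims)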

P-hacking is not an all-or-none phenomenon, and procedures that might sometimes be illegitimate may in other circumstances be good practice! It depends on inferential objectives as well as experimental design considerations. See section 3 here: https://link.springer.com/chapter/10.1007/164_2019_286

Michael Lew
  • Ah, thanks. This makes sense. I ran simulations which gave a type I error rate a touch below 0.05, which I couldn't understand. Now I realize it's because I'm underestimating the variance (but I'm not very concerned about that). – Jeremy Miles Aug 23 '21 at 21:05
  • IMHO the advice that it's OK to go ahead with a sequential scheme without appropriate adjustments is not appropriate or helpful. – BruceET Aug 23 '21 at 21:18
  • @BruceET - my simulation seems to show that the type I error rate is not inflated (and is deflated by a small amount). – Jeremy Miles Aug 23 '21 at 21:26
  • I think that @BruceET should read the linked document and then explain his concern as something more than an unsupported opinion. – Michael Lew Aug 23 '21 at 23:25