
I came across a study (in an internal presentation; I don't think it was published externally) where the researchers ran participants through one of two conditions, with a single outcome variable.

Every single time someone finished the experiment, they ran a significance test, and when the p-value reached 0.05 they stopped collecting data. (Note that there was no risk of harm: the study was done using Mechanical Turk, and the researchers were interested in minimizing costs.)

This seems like an interim analysis on steroids. There are established methods of handling interim analysis, but I've never seen anything like this before (or since).

One argument that could be made is that they were interested in the parameter estimate, not the significance value, and if there were no effect, then it is likely that the final estimate from this analysis would be very small. (I guess I could write a simulation to see).

How should one go about interpreting a significant result from such a study? Is it possible to correct that p-value?

Jeremy Miles
    Related: http://stats.stackexchange.com/questions/3967/sequential-hypothesis-testing-in-basic-science – Andrew M Nov 14 '15 at 03:04

2 Answers


If they test after every subject but perform each test as if it were the only one done, the overall type I error rate is huge (and the p-values don't mean anything; it's like retossing a pair of dice in Monopoly and ignoring the results until you get the one that lands you on the property you need... when you finally get it, it's no longer an impressive feat).

They should probably be using proper sequential hypothesis testing (sequential analysis), which accounts for this effect.
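
For a sense of what a corrected procedure can look like, here is a minimal base-R sketch (not the original study's design; the five equally spaced looks, 40 subjects per arm per look, and the Pocock-style constant boundary are all assumptions) that uses simulation to find a single nominal per-look level keeping the overall type I error near 0.05:

set.seed(1)
n.looks <- 5        #Number of planned interim analyses (assumed)
n.per.look <- 40    #Subjects per arm added before each look (assumed)
n.sims <- 5000

#Under the null, record the smallest p-value seen across the five looks
min.p <- replicate(n.sims, {
  x <- rnorm(n.looks * n.per.look)    #Arm A, no true effect
  y <- rnorm(n.looks * n.per.look)    #Arm B
  looks <- seq_len(n.looks) * n.per.look
  min(sapply(looks, function(k) t.test(x[1:k], y[1:k])$p.value))
})

#Rejecting at any look whenever p falls below this constant level keeps the
#family-wise type I error rate at about 0.05; it should land near Pocock's
#tabled value of roughly 0.016 for five looks
quantile(min.p, 0.05)

The point is only that the per-look threshold has to sit well below 0.05 once multiple looks are allowed; testing after every single subject would push the required threshold far lower still.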

How should one go about interpreting a significant result from such a study?

I'd tend to interpret it as noise.

Is it possible to correct that p-value?

If the test is two-tailed and there's nothing that stops you collecting more data other than hitting p=0.05, the effective p-value may be quite high. Some initial simulations under some plausible guesses at the situation suggest the p-value probably well exceeds 0.7.

The number of trials until the stopping rule was hit will tell you something, but I'd probably just simulate to get an idea of how often you'd stop at least that early when there was no effect (arguably a kind of p-value). I doubt I'd bother with carefully trying to work out the p-value, though -- because if they were prepared to do that, what else did they do wrong?
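
A minimal sketch of that simulation idea is below; the minimum sample size of 20, the cap of 1000 subjects, the two-sample t-test, and the observed stopping point of 150 are all assumptions for illustration, not details from the original study.

set.seed(1)
n.max <- 1000    #Cap on the number of subjects (assumed)
n.sims <- 200    #Brute force, so this is slow

#Under the null, find the first sample size at which p drops below 0.05
#when testing after every new subject (NA if it never happens by n.max)
stop.at <- replicate(n.sims, {
  x <- sample(c(0, 1), n.max, replace=TRUE)    #Condition assignment
  y <- rnorm(n.max)                            #Outcome, no true effect
  hit <- NA
  for(k in 20:n.max){
    if(t.test(y[1:k] ~ x[1:k])$p.value < 0.05){
      hit <- k
      break
    }
  }
  hit
})

n.obs <- 150    #Hypothetical observed stopping point
mean(!is.na(stop.at) & stop.at <= n.obs)    #How often you'd stop at least that early by chance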

Glen_b

I wrote a simulation to test this.

With a small effect size (a difference of 0.1, population SD of 1 in both groups), a maximum sample size of 5000, and 1000 simulations, it finds a significant result 99% of the time. In 93% of cases the result is in the correct direction. Interestingly, the effect is massively overestimated: the average absolute difference is 0.3, as can be seen in the histogram below.

[Histogram of effect sizes]

d <- 0.1     #Population difference
n <- 5000    #Maximum number of subjects to run


set.seed(123456)

nSims <- 1000
ds <- rep(NA, nSims)    #Store the differences
ps <- rep(NA, nSims)    #Store the p-values


for(loop in 1:nSims){
  df <- data.frame(x=sample(c(0, 1), n, replace=TRUE)  , y=rnorm(n))
  df$y <- df$y + df$x * d

  stop <- FALSE
  count <- 20   #Start testing once 20 subjects have been run, then after every additional subject
  while(stop == FALSE){
    tRes <- t.test(df[1:count,]$y ~ df[1:count,]$x)   #Do t-test
    if(tRes$p.value < 0.05) {
      ds[loop] <- tRes$estimate[2] - tRes$estimate[1]
      ps[loop] <- tRes$p.value
      stop <- TRUE
    }
    count <- count + 1
    if(count == n) stop <- TRUE   #Give up when the maximum sample size is reached
  }
  #cat(loop)
  print(tRes)   #Show the last test from this simulation (debugging output)
  print(loop)   #Progress counter

}
ps[!is.na(ps)]   #p-values from simulations that stopped early
ds[!is.na(ds)]   #Corresponding estimated differences
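
The summary figures quoted above can be pulled from ps and ds along these lines (a sketch, assuming the vectors from the run above are still in the workspace; not necessarily how they were originally computed):

mean(!is.na(ps))             #Proportion of simulations that reached p < .05
mean(ds[!is.na(ds)] > 0)     #Proportion of significant results in the correct direction
mean(abs(ds[!is.na(ds)]))    #Average absolute difference
hist(ds[!is.na(ds)], main="Histogram of effect sizes", xlab="Estimated difference")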

I reran it with an effect size of 0; this took about 16 hours to run. Only slightly more than half of the simulations reached a significant result, and the histogram of effect sizes looked like this:

[Histogram of effect sizes for d = 0]

And the mean absolute difference was 0.40.

In short, everything @Glen_b said.

Jeremy Miles
  • +1 That's a nice simulation (my simulations under the null used `replicate` and ran faster, but what I did was also slightly different). However, I couldn't find what @gung had to say. Can you point me to where he said it? Did he delete a comment or something? – Glen_b Nov 17 '15 at 03:55
  • @Glen_b - yes, replicate would have been faster. Apologies, I wrote gung when I meant glen_b. – Jeremy Miles Nov 17 '15 at 09:45