
Here's my scenario. I take 1000 samples each from my control and treatment groups and note that the difference in means is not statistically significant (stat-sig) under null hypothesis testing. I then add 1000 more samples each to control and treatment and test the entire batch of 2000 samples for stat-sig (I am using a paired t-test). I repeat this up to 5 times (until I have 5000 samples each for control and treatment), stopping earlier if my p-value is stat-sig. Is this p-hacking? If so, is there something like the Benjamini-Hochberg correction that I should apply every time I calculate the p-value for an augmented batch (the original batch + 1000 new samples)?

More details on why I don't simply start with 5000 samples or use power analysis to determine my sample size:

  1. After sampling, I need to perform a manual evaluation of each sample (I can't get into the details, but it is something that cannot be automated). This is expensive, so if I can get away with a smaller batch size, I save money and time. This is my main motivation for checking whether there is a stat-sig difference between control and treatment at a smaller batch size before I add more samples.
  2. I don't know the effect size and cannot estimate it, so estimating a required sample size up front seems to be a no-go.
    This sounds like a *group sequential* experimental design. It's [the worst kind of p-hacking](https://stats.stackexchange.com/questions/20676) if you use p-values for fixed sample sizes, but if you use p-values computed properly for the design, it's perfectly legitimate. – whuber Aug 17 '20 at 19:29
  • Thanks @whuber. Since the sample size continuously grows (in steps of 1000), does that mean that this is legitimate? – Data Max Aug 17 '20 at 22:07
  • That's why this is called "group" sequential rather than just "sequential" (which concerns adding one observation at a time). – whuber Aug 18 '20 at 13:24

1 Answer


I found out that the scenario I described in my question does constitute p-hacking. To do this properly, we have to use a method similar to what's described in this paper: https://projecteuclid.org/download/pdf_1/euclid.ss/1177012099. Essentially, we allocate a "budget" of the overall significance level and distribute it across the successive looks at the data (this idea is commonly known as alpha spending). For example, with an overall significance level of 0.05, we might start off by allocating 0.001 to the first batch (meaning the p-value must be < 0.001 for us to reject the null), allocate 0.006 to the next combined batch, and so on. The per-look thresholds won't simply add up to 0.05, and how to allocate them is discussed in various papers on this subject.
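As a rough illustration of the allocation idea, here is a sketch in Python that computes a per-look significance budget for 5 equally spaced looks (1000, 2000, ..., 5000 samples) using the O'Brien-Fleming-type spending function. The specific spending function and the 5-look schedule are assumptions for illustration; the exact numbers from your design may differ, and strictly speaking the proper boundaries account for the correlation between looks rather than treating each increment as an independent threshold.

```python
from scipy.stats import norm

ALPHA = 0.05   # overall two-sided significance level
LOOKS = 5      # interim analyses at 1000, 2000, ..., 5000 samples

def obrien_fleming_spent(t, alpha=ALPHA):
    """Cumulative alpha spent at information fraction t (0 < t <= 1),
    using the O'Brien-Fleming-type spending function."""
    z = norm.ppf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / t ** 0.5))

# Information fractions for equally spaced looks: 0.2, 0.4, ..., 1.0.
fractions = [(k + 1) / LOOKS for k in range(LOOKS)]
cumulative = [obrien_fleming_spent(t) for t in fractions]

# Spend only the *increment* of alpha at each look.
increments = [cumulative[0]] + [
    cumulative[k] - cumulative[k - 1] for k in range(1, LOOKS)
]

for t, inc in zip(fractions, increments):
    print(f"look at {int(t * LOOKS * 1000)} samples: spend alpha = {inc:.5f}")
```

Note how the early looks get a tiny slice of the budget (far below 0.001 here) and most of the 0.05 is reserved for the final analysis, which is exactly why stopping early under this scheme requires very strong evidence.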
