63

Why exactly does collecting data until a significant result (e.g., $p \lt .05$) is obtained (i.e., p-hacking) increase the Type I error rate?

I would also highly appreciate an R demonstration of this phenomenon.

Reza
    You probably mean "p-hacking," because "harking" refers to "Hypothesizing After Results are Known" and, although that could be considered a related sin, it's not what you seem to be asking about. – whuber Oct 26 '17 at 17:35
  • 2
    Once again, xkcd answers a good question with pictures. https://xkcd.com/882/ – Jason Oct 26 '17 at 21:02
  • 9
    @Jason I have to disagree with your link; that doesn't talk about cumulative collection of data. The fact that even cumulative collection of data about the same thing and *using all data you have* to compute the $p$-value is wrong is much more nontrivial than the case in that xkcd. – JiK Oct 27 '17 at 09:35
  • 1
    @JiK, fair call. I was focused on the "keep trying until we get a result we like" aspect, but you're absolutely correct, there's much more to it in the question at hand. – Jason Oct 27 '17 at 11:04
  • @whuber and user163778 gave very similar replies as discussed for the practically identical case of "A/B (sequential) testing" in this thread: https://stats.stackexchange.com/questions/244646/why-is-it-wrong-to-stop-an-a-b-test-before-optimal-sample-size-is-reached/244823#244823 There, we argued in terms of Family Wise Error rates and necessity for p-value adjustment in repeated testing. This question in fact can be looked at as a repeated testing problem! – tomka Oct 30 '17 at 20:01
  • I do not think my reply would contribute much new, which is why I linked the other thread. If you read my answer there, you will find the problem you posed is equivalent to collecting data and testing after every, say, n observations whether the hypothesis is rejected. This makes it a repeated testing problem, where tests are correlated. One way of dealing with this is adjusting the p-value to control the family wise error rate at a nominal level as I described there. – tomka Oct 30 '17 at 23:00
  • I asked a related question at https://stats.stackexchange.com/questions/549511/why-is-chi-squared-for-paired-coin-tosses-so-sensitive-to-slight-deviations-from?noredirect=1#comment1009758_549511. In my question I share a google sheet that shows p value from chi square analysis of paired coin flips, and graphs this. It isn't R, but perhaps may be helpful. – lamplamp Oct 24 '21 at 16:06

5 Answers

95

The problem is that you're giving yourself too many chances to pass the test. It's just a fancy version of this dialog:

I'll flip you to see who pays for dinner.

OK, I call heads.

Rats, you won. Best two out of three?


To understand this better, consider a simplified--but realistic--model of this sequential procedure. Suppose you will start with a "trial run" of a certain number of observations, but are willing to continue experimenting longer in order to get a p-value less than $0.05$. The null hypothesis is that each observation $X_i$ comes (independently) from a standard Normal distribution. The alternative is that the $X_i$ come independently from a unit-variance normal distribution with a nonzero mean. The test statistic will be the mean of all $n$ observations, $\bar X$, divided by their standard error, $1/\sqrt{n}$. For a two-sided test, the critical values are the $0.025$ and $0.975$ percentage points of the standard Normal distribution, $ Z_\alpha=\pm 1.96$ approximately.

This is a good test--for a single experiment with a fixed sample size $n$. It has exactly a $5\%$ chance of rejecting the null hypothesis, no matter what $n$ might be.
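
As a quick sanity check (a minimal sketch, not part of the simulation code further down), you can verify that fixed-$n$ rejection rate directly; the names `n.fixed` and `z` below are just illustrative:

# Fixed-n z-test under the null: the rejection rate is about 5% for any n.
set.seed(1)
n.fixed <- 50
z <- replicate(1e4, mean(rnorm(n.fixed)) * sqrt(n.fixed)) # equals the mean divided by 1/sqrt(n)
mean(abs(z) > qnorm(0.975))                               # close to 0.05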

Let's algebraically convert this to an equivalent test based on the sum of all $n$ values, $$S_n=X_1+X_2+\cdots+X_n = n\bar X.$$

Thus, the data are "significant" when

$$\left| Z_\alpha\right| \le \left| \frac{\bar X}{1/\sqrt{n}} \right| = \left| \frac{S_n}{n/\sqrt{n}} \right| = \left| S_n \right| / \sqrt{n};$$

that is,

$$\left| Z_\alpha\right| \sqrt{n} \le \left| S_n \right| .\tag{1}$$

If we're smart, we'll cut our losses and give up once $n$ grows very large and the data still haven't entered the critical region.

This describes a random walk $S_n$. The formula $(1)$ amounts to erecting a curved parabolic "fence," or barrier, around the plot of the random walk $(n, S_n)$: the result is "significant" if any point of the random walk hits the fence.

It is a property of random walks that if we wait long enough, it's very likely that at some point the result will look significant.

Here are 20 independent simulations out to a limit of $n=5000$ samples. They all begin testing at $n=30$ samples; from then on, we check whether each point lies outside the barriers drawn according to formula $(1)$. From the point at which the statistical test is first "significant," the simulated data are colored red.

[Figure: 20 simulated random walks $(n, S_n)$ with the parabolic barriers $\pm Z_\alpha\sqrt{n}$; points after the first barrier crossing are shown in red.]

You can see what's going on: the random walk whips up and down more and more as $n$ increases. The barriers spread apart at about the same rate--but not always fast enough to avoid the random walk.

In 20% of these simulations, a "significant" difference was found--usually quite early on--even though in every one of them the null hypothesis is absolutely correct! Running more simulations of this type indicates that the true test size is close to $25\%$ rather than the intended value of $\alpha=5\%$: that is, your willingness to keep looking for "significance" up to a sample size of $5000$ gives you a $25\%$ chance of rejecting the null even when the null is true.

Notice that in all four "significant" cases, as testing continued, the data stopped looking significant at some points. In real life, an experimenter who stops early is losing the chance to observe such "reversions." This selectiveness through optional stopping biases the results.

In honest-to-goodness sequential tests, the barriers are lines. They spread faster than the curved barriers shown here.

library(data.table)
library(ggplot2)

alpha <- 0.05   # Test size
n.sim <- 20     # Number of simulated experiments
n.buffer <- 5e3 # Maximum experiment length
i.min <- 30     # Initial number of observations
#
# Generate data.
#
set.seed(17)
X <- data.table(
  n = rep(0:n.buffer, n.sim),
  Iteration = rep(1:n.sim, each=n.buffer+1),
  X = rnorm((1+n.buffer)*n.sim)
)
#
# Perform the testing.
#
Z.alpha <- -qnorm(alpha/2)
X[, Z := Z.alpha * sqrt(n)]
X[, S := c(0, cumsum(X))[-(n.buffer+1)], by=Iteration]
X[, Trigger := abs(S) >= Z & n >= i.min]
X[, Significant := cumsum(Trigger) > 0, by=Iteration]
#
# Plot the results.
#
ggplot(X, aes(n, S, group=Iteration)) +
  geom_path(aes(n,Z)) + geom_path(aes(n,-Z)) +
  geom_point(aes(color=!Significant), size=1/2) +
  facet_wrap(~ Iteration)
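
To attach a number to the inflated size rather than reading it off the panels, here is a minimal Monte Carlo sketch (not part of the original answer) that reuses `alpha`, `n.buffer`, `i.min`, and `Z.alpha` from the code above; `n.sim.2` and `hit` are new, illustrative names:

# Estimate the chance of ever "rejecting" when testing after every observation
# from i.min up to n.buffer, under the null.
set.seed(17)
n.sim.2 <- 1e3
hit <- replicate(n.sim.2, {
  S <- cumsum(rnorm(n.buffer))        # the random walk S_n
  n <- i.min:n.buffer
  any(abs(S[n]) >= Z.alpha * sqrt(n)) # did the walk ever cross barrier (1)?
})
mean(hit)                             # well above alpha = 0.05 (roughly 20-25%, as reported above)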
whuber
  • 12
    +1. Does any given random walk eventually cross the barriers with probability 1? I know that the expected distance after $n$ steps is $\mathcal O(\sqrt{n})$ and I looked up now that the proportionality constant is $\sqrt{2/\pi}$, which is less than 1.96. But I am not sure what to make of it. – amoeba Oct 26 '17 at 20:21
  • 10
    @amoeba That is a great question, which I tried my best to dodge :-). If I could have computed the answer quickly (or knew it offhand) I would have posted it. Unfortunately I'm too busy to address it analytically right now. The longest simulation I have done was 1,000 iterations looking out as far as $n=5,000,000$ with $\alpha=0.05$. The proportion of "significant" results seems to stabilize near $1/4$. – whuber Oct 26 '17 at 21:06
  • Great post! Might one possibly change the plot from `!Significant->FALSE` to `Significant->TRUE` to ease the mental load of reading? – Hatshepsut Oct 27 '17 at 00:46
  • 4
    The question about the probability of hitting the $\alpha=0.05$ boundary is interesting. I imagine that Einstein's theory of Brownian motion, relating it to a diffusion equation, could be an interesting angle. We have a distribution function spreading out with a rate $\sim \sqrt{n}$ and a "loss of particles" equal to half the value of the distribution function at this boundary (half moves away from zero, over the border, the other half goes back). As this distribution function spreads out, and gets thinner, the "loss" becomes less. I imagine this effectively will create a limit, i.e. this 1/4. – Sextus Empiricus Oct 27 '17 at 01:41
  • 2
    @amoeba : ​ https://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm ​ ​ ​ ​ –  Oct 27 '17 at 10:45
  • 6
    Intuitive reason why you'll get a $p<0.05$ at some point almost surely: Let $n_1=10$ and $n_{k+1}=10^{n_k}$. The $p$-value after the first $n_{k+1}$ trials is pretty much independent of the $p$-value after the first $n_k$ trials. So you'll have an infinite number of "almost" independent $p$-values, so one of them is guaranteed to be $<0.05$. Of course, the real convergence is much faster than this argument says. (And if you don't like $10^{n_k}$, you may try $A(n_k)$ or $\mathrm{BB}(n_k)$...) – JiK Oct 27 '17 at 11:34
  • (cont'd) And I believe you can formalize this argument quite easily: Just let $n_{k+1}=f(n_k)$ so that $f(n)$ is the number so that getting a $p<0.05$ over $f(n)$ trials has probability at least $0.01$ even if $n$ first trials are opposite to what you want to achieve; such a number should always exist. – JiK Oct 27 '17 at 11:47
  • 2
    `In 20% of these simulations, a "significant" difference was found` – CL. Oct 27 '17 at 12:40
  • 1
    @Hatshepsut You caught me--I used a quick and dirty method of making the "significant" points red. I have replaced the legend in the figure. – whuber Oct 27 '17 at 14:02
  • @JiK Yes, I had developed a similar argument overnight. I'm unconvinced that the convergence is as fast as you suggest, though. – whuber Oct 27 '17 at 14:03
  • 11
    @CL. I [anticipated your objection several years ago:](https://stats.stackexchange.com/a/157646/919) 17 is my public seed. In fact, in early (much longer) trials I was consistently getting rates of significance substantially larger than 20%. I set the seed at 17 to create the final image and was disappointed that the effect was not so dramatic. *C'est la vie.* A related post (illustrating your point) is at https://stats.stackexchange.com/a/38067/919. – whuber Oct 27 '17 at 14:05
  • @whuber +1 for a nice demonstration of the issue, but I am going to point out the elephant in the room: the problem only arises if you intend to decide the "significance" of your result by choosing an arbitrary p-value. In that case, you are right, optional stopping will result in a probability of falsely rejecting the null substantially greater than the intended 5%. But it is well known that a Bayesian analysis is not affected by optional stopping. See https://alexanderetz.com/2016/03/06/sunday-bayes-optional-stopping-is-no-problem-for-bayesians/ Moral: give up p-values and all will be well! – gareth Oct 27 '17 at 15:28
  • 2
    @gareth The essence of why Bayesian analysis is "not affected" by optional stopping is that Bayesians don't care about error rates. Clearly if one is doing whuber's simulation, checking Bayes Factor instead of p-value on each iteration, and deciding that there is enough evidence once BF>10, then this procedure will also result in always, sooner or later, finding evidence under the null. To that Bayesians say: whatever, we don't care about it. That's fine, but that's not what I would call "not affected". – amoeba Oct 27 '17 at 15:44
  • 1
    The test will be significant infinitely often: Let $T_{n}$ be a sequence of variables so that $T_{n}\to T$ in distribution, and let $A$ satisfy $P\left(T\in A\right)>0$. Then $P\left(T_{n}\in A\textrm{ i.o.}\right)=1$. For by Fatou's lemma, $P\left(\lim\sup T_{n}\in A\right)\geq\lim\sup P\left(T_{n}\in A\right)>0$, but Hewitt-Savage's $0-1$ law tells us that the probability is either $0$ or $1$, so $1$ it is. Now take $T_{n}=n^{-\frac{1}{2}}S_{n}$ and let $T$ be $N(0,1)$-distributed, and let $A$ be the (two-sided) tails corresponding to any $\alpha$-value of interest. – Grassie Oct 27 '17 at 17:11
  • 1
    @amoeba I think I have a pretty good answer to the question in your comment. See my answer below. – Jarle Tufto Oct 27 '17 at 18:32
  • It's true that this process almost surely eventually hits the boundary. But practically speaking note that if it takes a long time, the final sample mean will be very close to zero so the "effect" which you claim to have found will be almost indistinguishable from the null hypothesis. E.g. in the case of the standard normal distribution, if it takes you one trillion experiments to hit p=0.05 then your final sample mean will be about 0.0000016. Such a small purported effect size shouldn't be impressive even to somebody who didn't know that you cheated to get it. – fblundun Mar 17 '21 at 00:04
  • @fblundun That's interesting, but isn't the relevant consideration the risk that the process will *not* take a very long time and terminate quickly with an erroneous determination? Judging from the examples, that risk can be substantially greater than the nominal limit $\alpha.$ – whuber Mar 17 '21 at 13:29
  • 1
    @whuber yes I agree. Just thought it was noteworthy that since the purported effect size decreases as the sample size increases, it isn't possible to use this sampling method to almost surely generate arbitrarily convincing evidence of a given effect size. – fblundun Mar 18 '21 at 10:15
  • 1
    @fblundun That's an interesting and useful observation. To produce such "evidence" almost surely, one would have to discard earlier results and keep starting over. (That's not a bad model for *J. Parapsychology* articles.) – whuber Mar 18 '21 at 13:27
18

People who are new to hypothesis testing tend to think that once a p value goes below .05, adding more participants will only decrease the p value further. But this isn't true. Under the null hypothesis, a p value is uniformly distributed between 0 and 1 and can bounce around quite a bit in that range.
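
To see the uniformity for yourself, here is a small sketch (not part of the original answer): repeated two-sample t-tests on pure noise give a roughly flat histogram of p-values, with about 5% of them below .05.

# Both groups are drawn from N(0,1), so the null is true in every replication.
set.seed(1)
p.null <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)
hist(p.null, breaks = 20) # approximately uniform on [0, 1]
mean(p.null < .05)        # approximately 0.05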

I've simulated some data in R (my R skills are quite basic). In this simulation, I collect 5 data points, each with a randomly selected group membership (0 or 1) and each with a randomly selected outcome measure ~N(0,1). Starting with participant 6, I conduct a t-test at every iteration.

df <- data.frame(group = rep(0:1, length.out = 5),  # first 5 participants (alternating groups here so the t-test is defined)
                 outcome = rnorm(5),
                 p = NA_real_)

for (i in 6:150) {
  df[i, 1] <- round(runif(1))                   # random group membership (0 or 1)
  df[i, 2] <- rnorm(1)                          # outcome ~ N(0, 1)
  df[i, 3] <- t.test(df[, 2] ~ df[, 1])$p.value # p-value using all data collected so far
}

The p values are in this figure. Notice that I find significant results when the sample size is around 70-75. If I stop there, I'll end up believing that my findings are significant, because I'll have missed the fact that my p values jumped back up with a larger sample (this actually happened to me once with real data). Since I know both populations have a mean of 0, this must be a false positive. This is the problem with adding data until p < .05. If you conduct enough tests, p will eventually cross the .05 threshold and you can find a significant effect in any data set.

[Figure: p-values from the sequential t-tests plotted against sample size; the p-value dips below .05 around n = 70-75 and later rises again.]

TPM
  • 1
    Thanks but your `R` code doesn't run at all. – Reza Oct 26 '17 at 22:49
  • 3
    @Reza you'd need to create `df` first (preferably at its final size). Since the code starts writing at row 6 the implication (which fits with the text of the answer) is that df already exists with 5 rows already filled in. Maybe something like this was intended: `n150 – Glen_b Oct 27 '17 at 04:45
  • @user263778 Very focused, useful and pertinent answer. But there is too much confusion about interpreting the p-value, the so-called "dancing beauty". –  Oct 27 '17 at 08:31
  • @user163778 - you should include the code to initialize everything too – Dason Oct 27 '17 at 19:41
17

This answer only concerns the probability of ultimately getting a "significant" result and the distribution of the time to this event under @whuber's model.

As in the model of @whuber, let $S(t)=X_1 + X_2 + \dots + X_t$ denote the value of the test statistic after $t$ observations have been collected and assume that the observations $X_1,X_2,\dots$ are iid standard normal. Then $$ S(t+h)|S(t)=s_0 \sim N(s_0, h), \tag{1} $$ such that $S(t)$ behaves like a continuous-time standard Brownian motion, if we for the moment ignore the fact that we have a discrete-time process (left plot below).

Let $T$ denote the first passage time of $S(t)$ across the time-dependent barriers $\pm z_{\alpha/2}\sqrt{t}$ (the number of observations needed before the test turns significant).

Consider the transformed process $Y(\tau)$ obtained by scaling $S(t)$ by its standard deviation at time $t$ and by letting the new time scale $\tau=\ln t$ such that $$ Y(\tau)=\frac{S(t(\tau))}{\sqrt{t(\tau)}}=e^{-\tau/2}S(e^\tau). \tag{2} $$ It follows from (1) and (2) that $Y(\tau+\delta)$ is normally distributed with \begin{align} E(Y(\tau+\delta)|Y(\tau)=y_0) &=E(e^{-(\tau+\delta)/2}S(e^{\tau+\delta})|S(e^\tau)=y_0e^{\tau/2}) \\&=y_0e^{-\delta/2} \tag{3} \end{align} and \begin{align} \operatorname{Var}(Y(\tau+\delta)|Y(\tau)=y_0) &=\operatorname{Var}(e^{-(\tau+\delta)/2}S(e^{\tau+\delta})|S(e^\tau)=y_0e^{\tau/2}) \\&=1-e^{-\delta}, \tag{4} \end{align} that is, $Y(\tau)$ is a zero-mean Ornstein-Uhlenbeck (O-U) process with a stationary variance of 1 and return time 2 (right plot below). An almost identical transformation is given in Karlin & Taylor (1981), eq. 5.23.

[Figure: left, the random walk $S(t)$ with the square-root barriers $\pm z_{\alpha/2}\sqrt{t}$; right, the transformed process $Y(\tau)$ with constant barriers $\pm z_{\alpha/2}$.]

For the transformed model, the barriers become time-independent constants equal to $\pm z_{\alpha/2}$. It is then known (Nobile et al. 1985; Ricciardi & Sato, 1988) that the first-passage time $\mathcal{T}$ of the O-U process $Y(\tau)$ across these barriers is approximately exponentially distributed with some parameter $\lambda$ depending on the barriers at $\pm z_{\alpha/2}$ (estimated as $\hat\lambda=0.125$ for $\alpha=0.05$ below). There is also an extra point mass of size $\alpha$ at $\tau=0$. "Rejection" of $H_0$ eventually happens with probability 1. Hence, $T=e^\mathcal{T}$ (the number of observations that need to be collected before getting a "significant" result) is approximately Pareto distributed with density $f_T(t)=f_\mathcal{T}(\ln t)\frac{d\tau}{dt}=\lambda/t^{\lambda+1}$. The expected value is $$ ET\approx \alpha+(1-\alpha)\int_0^\infty e^\tau \lambda e^{-\lambda \tau}d\tau.\tag{5} $$ Thus, $T$ has a finite expectation only if $\lambda>1$ (that is, only for sufficiently large significance levels $\alpha$).

The above ignores the fact that $T$ for the real model is discrete and that the real process is discrete- rather than continuous-time. Hence, the above model overestimates the probability that the barrier has been crossed (and underestimates $ET$) because the continuous-time sample path may cross the barrier only temporarily in-between two adjacent discrete time points $t$ and $t+1$. But such events should have negligible probability for large $t$.

The following figure shows a Kaplan-Meier estimate of $P(T>t)$ on log-log scale together with the survival curve for the exponential continuous-time approximation (red line).

[Figure: Kaplan-Meier estimate of $P(T>t)$ on log-log scale with the survival curve of the exponential continuous-time approximation (red line).]

R code:

# Fig 1
par(mfrow=c(1,2),mar=c(4,4,.5,.5))
set.seed(16)
n <- 20
npoints <- n*100 + 1
t <- seq(1,n,len=npoints)
subset <- 1:n*100-99
deltat <- c(1,diff(t))
z <- qnorm(.975)
s <- cumsum(rnorm(npoints,sd=sqrt(deltat)))
plot(t,s,type="l",ylim=c(-1,1)*z*sqrt(n),ylab="S(t)",col="grey")
points(t[subset],s[subset],pch="+")
curve(sqrt(t)*z,xname="t",add=TRUE)
curve(-sqrt(t)*z,xname="t",add=TRUE)
tau <- log(t)
y <- s/sqrt(t)
plot(tau,y,type="l",ylim=c(-2.5,2.5),col="grey",xlab=expression(tau),ylab=expression(Y(tau)))
points(tau[subset],y[subset],pch="+")
abline(h=c(-z,z))

# Fig 2
nmax <- 1e+3
nsim <- 1e+5
alpha <- .05
t <- numeric(nsim)
n <- 1:nmax
for (i in 1:nsim) {
  s <- cumsum(rnorm(nmax))
  t[i] <- which(abs(s) > qnorm(1-alpha/2)*sqrt(n))[1]
}
delta <- ifelse(is.na(t),0,1)
t[delta==0] <- nmax + 1
library(survival)
par(mfrow=c(1,1),mar=c(4,4,.5,.5))
plot(survfit(Surv(t,delta)~1),log="xy",xlab="t",ylab="P(T>t)",conf.int=FALSE)
curve((1-alpha)*exp(-.125*(log(x))),add=TRUE,col="red",from=1,to=nmax)
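
For completeness, here is one way $\hat\lambda$ could be obtained from the simulation above (a sketch, not part of the original code): under the Pareto approximation, $\log P(T>t)\approx \log(1-\alpha)-\lambda\log t$, so a least-squares fit to the empirical survival function on the log-log scale estimates $\lambda$. The censored runs (stored as `t = nmax + 1`) genuinely survive past any time up to `nmax`, so the empirical survival is valid over that range. The names `surv.emp`, `fit` and `lambda.hat` are illustrative.

u <- 1:nmax
surv.emp <- sapply(u, function(v) mean(t > v)) # empirical P(T > v)
keep <- surv.emp > 0
fit <- lm(log(surv.emp[keep]) ~ log(u[keep]))  # slope is approximately -lambda
lambda.hat <- -coef(fit)[2]
lambda.hat                                     # close to 0.125 for alpha = 0.05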
Jarle Tufto
  • Thanks! Do you have any (standard) references for these results? For instance, why is the Y process an Ornstein-Uhlenbeck and where can we find the passage time result stated? – Grassie Oct 28 '17 at 08:03
  • 1
    I haven't seen this transformation anywhere else, but I believe that (3) and (4), which follow easily from (1) and (2), together with normality completely characterise the O-U process. Google scholar returns a lot of results on approximate exponentiality of first-passage time distributions for the O-U process. But I believe that $\mathcal{T}$ in this case (within the continuous-time approximation) is exactly exponentially distributed (except for an extra point mass in $\tau=0$) because $Y(0)$ comes from the stationary distribution of the process. – Jarle Tufto Oct 28 '17 at 08:47
  • @Grassie Also see https://math.stackexchange.com/questions/1900304/brownian-motion-hitting-probability-for-square-root-boundaries – Jarle Tufto Oct 28 '17 at 09:16
  • @Grassie Actually, my argument based on memorylessness was flawed. The durations of excursions away from the boundaries are not exponentially distributed. Hence, based on the same argument as in https://stats.stackexchange.com/questions/298828/how-to-calculate-average-length-of-adherence-to-vegetarianism-when-we-only-have/298866#298866, even though $Y(0)$ comes from the stationary distribution, the first passage time is not exactly exponentially distributed. – Jarle Tufto Oct 28 '17 at 10:08
  • @Grassie I just came across the transformation of Brownian motion to O-U in Karlin & Taylor's second course in stochastic procesess. – Jarle Tufto Feb 12 '20 at 12:12
6

It needs to be said that the above discussion is for a frequentist world view for which multiplicity comes from the chances you give data to be more extreme, not from the chances you give an effect to exist. The root cause of the problem is that p-values and type I errors use backwards-time backwards-information flow conditioning, which makes it important "how you got here" and what could have happened instead. On the other hand, the Bayesian paradigm encodes skepticism about an effect on the parameter itself, not on the data. That makes each posterior probability be interpreted the same whether you computed another posterior probability of an effect 5 minutes ago or not. More details and a simple simulation may be found at http://www.fharrell.com/2017/10/continuous-learning-from-data-no.html
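
The linked post contains its own simulation; as a bare-bones illustration of the kind of sequential monitoring being described (a minimal sketch, not the blog's code), suppose $x_i \sim N(\mu,1)$ with a skeptical conjugate prior $\mu \sim N(0, \tau^2)$, where $\tau=0.5$ is an arbitrary choice here. The posterior probability that the effect is positive can then be recomputed in closed form after every observation:

# Sequential monitoring of P(mu > 0 | data) under a conjugate-normal model.
set.seed(1)
tau <- 0.5
x <- rnorm(500, mean = 0)                      # the true mu is 0 here
n <- seq_along(x)
post.var  <- 1 / (n + 1 / tau^2)               # posterior variance after n observations
post.mean <- post.var * cumsum(x)              # posterior mean after n observations
post.prob <- pnorm(post.mean / sqrt(post.var)) # P(mu > 0 | x_1, ..., x_n)
plot(n, post.prob, type = "l", ylab = "P(mu > 0 | data)")
abline(h = 0.95, lty = 2)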

Frank Harrell
  • 1
    Let's imagine a lab led by Dr B, who is a devout Bayesian. The lab is studying social priming and has produced a steady stream of papers showing various effects of priming, each time supported by Bayes factor BF>10. If they never do sequential testing, it looks pretty convincing. But let's say I learn that they always do sequential testing and keep getting new subjects until they obtain BF>10 *in favour of priming effects*. Then clearly this whole body of work is worthless. The fact that they do sequential testing + selection makes a huge difference, no matter if it's based on p-values or BF. – amoeba Oct 29 '17 at 14:39
  • 1
    I don't use Bayes' factors. But had they used posterior probabilities and had run each experiment until the posterior probability of a positive effect $\geq 0.95$, there would be absolutely nothing wrong with these probabilities. Look at the quote at the start of my blog article - see the link in my answer above. The degree of belief about the priming effect comes from the data and the prior belief. If you (like me) are very dubious of such priming effects, you'd better use a fairly skeptical prior when computing the posterior probabilities. That's it. – Frank Harrell Oct 29 '17 at 20:20
  • 1
    I read your blog post, noticed the quote, and looked at a similar paper ([Optional stopping: No problem for Bayesians](http://rouder.psyc.missouri.edu/sites/default/files/Rouder-PBR-2014.pdf)) that somebody else linked to in the comments to another answer. I still don't get it. If the "null" (absent priming effects) is true, then if Dr B is willing to sample long enough, he will be able to get posterior probability >0.95 *every single time he runs the experiment* (exactly as Dr F would be able to get p<0.05 every single time). If this is "absolutely nothing wrong" then I don't know what is. – amoeba Oct 29 '17 at 20:42
  • @amoeba: Bayesian statisticians do not work by collecting statistically significant results rejecting the null hypothesis. Instead they produce a posterior distribution. In your suggestion that they continue until the point a (hidden sequential) frequentist might claim $p \lt 0.05$, this posterior distribution is likely to be highly concentrated (how many tests do you think are likely to be needed?) close to the value in question, and would draw the conclusion that the difference which the frequentist claimed to be *significant* is likely to be very small. – Henry Oct 30 '17 at 08:05
  • 1
    @Henry That's fine. A frequentist, after obtaining p<0.05, will also see that the difference is very small, and will be able to draw the same conclusion. – amoeba Oct 30 '17 at 08:32
  • 1
    But they don't; they get enamored with "statistical significance". And as my simulation shows, if there is truly no difference, Bayesians will not very often end up finding posterior probability > 0.95 in a length of time that a budget could actually support. Bigger point: the posterior probabilities at *any moment* are well-calibrated. If the post prob is 0.98 at one moment you will find that the prob. that the parameter is > 0 is exactly 0.98, i.e, you would be wrong 0.02 of the time in stopping and claiming a positive effect. – Frank Harrell Oct 30 '17 at 15:24
  • 3
    Well, I dispute this "bigger point". I don't think this is true. As I keep repeating, under the null of zero effect and with any given prior (let's say some broad continuous prior centered at zero), repeated sampling will *always* sooner or later yield >0.98 posterior probability concentrated above zero. A person who is sampling until this happens (i.e. applying this stopping rule), will be wrong *every single time*. How can you say that this person will be wrong only 0.02 of the time? I don't understand. Under these particular circumstances, no he will not, he will *always* be wrong. – amoeba Oct 30 '17 at 21:49
  • You are stuck in the frequentist world and need to realize that even though the parameter may have a single value if only it were known, attaching a probability distribution to it is the best way to handle our uncertainty about that. Once you do that, and run simulations off that distribution, you'll quickly see the "wrong only 0.02 of the time" point. Computing probabilities that are forward in time/information flow changes everything. – Frank Harrell Oct 31 '17 at 11:36
  • 3
    I don't think I am. My bigger point is that it is unfair & inconsistent to simultaneously blame frequentist procedures for suffering from sequential testing and defend Bayesian procedures as unaffected by sequential testing. My point (which is a mathematical fact) is that they both are affected in exactly the same way, meaning that sequential testing can increase Bayesian type I error all the way to 100%. Of course if you say that you are, as a matter of principle, not interested in type I error rates, then it's irrelevant. But then frequentist procedures should not be blamed for that either. – amoeba Oct 31 '17 at 21:31
  • That means you can identify the error in my simulation code. Kindly do so. Bayesian posterior probabilities are free of the problem you accuse them of. Bayesians encode skepticism in the parameter space, frequentists in the data space. – Frank Harrell Oct 31 '17 at 22:28
  • Frank, please consider this: it is well-known that one-sided p-value can be interpreted as Bayesian posterior probability of negative effect under a broad prior. [Marsman and Wagenmakers 2017](http://www.ejwagenmakers.com/2017/MarsmanWagenmakers2017ThreeInsights.pdf) is one recent reference that discusses this, but the fact is long known (see refs there). One-sided p-value is 1/2 of the two-sided p-value. So p=0.02 can be interpreted as posterior probability P=0.01 of negative effect under very wide prior. If so, how can they possibly have different properties with respect to seq testing?! – amoeba Nov 01 '17 at 11:29
  • That works only in the special case that assumes (1) a certain prior and (2) the sample size is fixed and there is only one look at the data. When the sample size is a random variable this breaks down. To properly criticize posterior probabilities you'd need to show they are poorly calibrated. – Frank Harrell Nov 01 '17 at 12:50
  • 1
    @amoeba and FrankHarrell, great discussion! I understand amoeba's point and logic, but I do not entirely understand the response, most likely because I know too little about the Bayesian way of thinking. It is both very exciting and puzzling to observe that FrankHarrell is not convinced by amoeba's argument. Are you guys sure you are talking about the same thing, or do you happen to silently disagree on definitions or so such that the contradiction is only apparent but not real? I would like to see a separate post on this if possible :) – Richard Hardy Nov 01 '17 at 13:10
  • Since most people are trained solely in classical stats, it is very difficult to think from an orthogonal viewpoint. The contradiction is real because of conditioning on the opposite assertion, just as the P(female | you are a US senator) is 0.21 but P(being a US senator | female) is 21/165M. Conditioning on data to obtain probability distributions of parameters (the foundation of Bayes-Laplace) is foreign to frequentists who compute P(someone else's data more extreme than mine | their data were generated by the same process as mine except with "no effect" inserted for the treatment effect). – Frank Harrell Nov 01 '17 at 13:18
  • @RichardHardy Perhaps consider asking a separate question. One has to be careful to formulate it such that it is specific, answerable, does not get closed as opinion-based, and preferably does not get converted to "communiti wiki" status either. – amoeba Nov 01 '17 at 13:26
  • Frank, yes, of course this assumes a certain prior. This is a specific example designed to demonstrate my point. "When the sample size is a random variable this breaks down" - why would it? Once we fix the null hypothesis, the prior, the test, etc., p-value is a specific function of the data, and the posterior probability as well. Under the conditions I listed above, they are equal for all sample sizes. If we do sequential testing, they will keep being equal. – amoeba Nov 01 '17 at 13:33
  • There are many answers to your question. The simplest is that no one knows how to even **compute** the p-value in the purely sequential framework. – Frank Harrell Nov 01 '17 at 16:11
  • Plus don't forget that when computing type I error you are computing the probability of an event that Bayesians (and most scientists) don't care about. Put another way, if you really want to compute something to compare with Bayes in such a way that you assume the null hypothesis to be true, you can even dispense with running any experiment at all because you already know the answer. – Frank Harrell Nov 01 '17 at 16:36
  • As I said above, if one does not care about type I error rate, then it's fine, I am not here to debate that. The only thing I've been saying here is that **if** we talk about type I error rate at all, **then** type I error rate of Bayesian procedures does "suffer" from sequential testing. A Bayesian lab studying mind reading and doing sequential testing until obtaining posterior 0.98 in favor of mind reading will be wrong 100% of the times. It's not clear to me if you agree with this statement (but claim that it does not matter because type I error rate does not matter) or if you disagree. – amoeba Nov 01 '17 at 19:55
  • (1) If you care about type I error that much, you are assuming you know the truth when doing the simulation that shows Bayes is wrong. (2) Type I error applies to a sequence of trials, not to your trial. (3) I don't care about type I error. A judge who prides herself in type I error would be disbarred. She should instead maximize the probability of getting the right verdict for the defendant in front of her, not losing sleep over her long-run operating characteristics. – Frank Harrell Nov 01 '17 at 20:31
  • OK, but could you please explicitly confirm that you agree with my statement from the previous comment, namely: A hypothetical Bayesian lab studying mind reading (let's assume they measure some value $m$ and want to demonstrate that it is positive) and always employing sequential testing until obtaining posterior 0.98 in favor of mind reading (i.e. their stopping rule is posterior $P(m>0)\ge 0.98$) will be wrong in 100% of their experiments/papers? I appreciate that you don't consider this a problem; that's OK, I don't want to debate this now. But can we agree that this statement is true? – amoeba Nov 02 '17 at 20:15
  • No, can't agree with that. That would require an infinite allowable sample size. Clearest way to think about the big picture is that if I use a smooth prior distribution and you choose a conflicting prior, say one that has an absorbing state, e.g., a discontinuity that puts a large point mass at a zero effect, then in your eyes I can sample to a foregone conclusion. – Frank Harrell Nov 03 '17 at 03:23
  • Yes, I was assuming infinite allowable sample size in my previous comment. Infinite sample size is clearly not realistic, but that is the setting of whuber's and other answers in this thread: if one keeps sampling until $p – amoeba Nov 09 '17 at 23:09
  • You are stating too simply what we agree on. You have to specify what we are assuming in each setting. You are assuming there is no effect, so no experiment needs to be run, or you are assuming that the prior distribution has an absorbing state. You also assume that Bayesians like to spend time on type I error probabilities :-). It is important to note that the error probabilities you are interested in (they are not _rates_) are nothing that an experimenter would be interested in for her single experiment. – Frank Harrell Nov 10 '17 at 13:39
  • @amoeba It is not true that the process of sampling repeatedly until a target posterior probability or Bayes factor is reached always terminates. $\sum X_i / \sqrt{n}$ gets arbitrarily large but $\sum X_i / n$ doesn't, so you can always reach any target p-value but not always any target posterior probability. E.g. suppose my prior for the parameter $p$ of a Bernoulli distribution is that it is equally likely to be 1/4 or 3/4. To get my posterior probability that $p$ = 3/4 above 1/2, I need to sample until I get more 1s than 0s. If $p$ = 1/4, the probability that this ever happens is only 1/3. – fblundun Mar 16 '21 at 23:33
  • @fblundun Thanks, but I am not sure how this particular example relates to the case I was discussing above, namely: "under the null of zero effect and with [...] some broad continuous prior centered at zero[,] repeated sampling will always sooner or later yield >0.98 posterior probability concentrated above zero". – amoeba Mar 18 '21 at 08:58
  • @amoeba that statement is only true if the "null hypothesis" is true, i.e. the true population mean $\mu$ is exactly 0. If $\mu$ is really $-\epsilon$, it is *not* almost sure that repeated sampling will eventually yield >0.98 probability that $\mu$ is > 0. But as your prior is continuous, the probability that $\mu$ is exactly 0 is 0 - so the issue will never arise. It doesn't make sense to criticize a sampling methodology for a problem which happens with frequency 0. You need to assign a positive prior probability to the null hypothesis to justify caring about what happens when it is true. – fblundun Mar 18 '21 at 10:09
  • @fblundun But if $\mu$ is really $-\epsilon$ then I think (?) it is not almost sure that repeated sampling will eventually yield a p-value < 0.05 for the null hypothesis that $\mu=0$. Which is what is discussed in this thread (see simulations in other answers). If it "doesn't make sense to criticize" Bayesian methodology in this scenario, then it doesn't make sense to criticize the frequentist methodology in the SAME scenario either. – amoeba Mar 18 '21 at 12:29
  • @amoeba you're right that arbitrary p-values can only be reached almost surely if $\mu \ge 0$. I wasn't criticizing frequentism, just disputing your claims that your hypothetical Bayesian lab's work is worthless. That said, a sample generated with an unknown stopping rule can be used for Bayesian inference in the normal way (without risk of being systematically tricked into believing in mind reading). But as far as I can tell, a frequentist can't make use of the sample without knowing the stopping rule? That would be a concrete advantage the Bayesian has over the frequentist in such scenarios. – fblundun Mar 18 '21 at 13:36
  • @fblundun I don't think these comments are the right place to have this discussion :) So let's have it some other time. As I mentioned above, one-sided p-value (for positive effect) can be interpreted as Bayesian posterior probability of the negative effect under a broad prior. From this point of view, everything that applies to p-values can apply to posterior probabilities and vice versa. – amoeba Mar 18 '21 at 16:49
  • @amoeba so are you persuaded that Dr B's research isn't "worthless"? – fblundun Mar 18 '21 at 19:04
  • @fblundun Well, I still maintain that it's exactly as worthy or worthless as the research of hypothetical Dr F who has published a steady stream of papers demonstrating social priming with p<0.01, where for each paper they keep increasing the sample size until obtaining p<0.01. Whether this should be called "worthless" or not, you can decide. – amoeba Mar 19 '21 at 10:25
  • 1
    @amoeba Dr F can almost surely produce such papers if the null hypothesis is true. But either Dr B assigns a positive prior probability to the null hypothesis, in which case he *can't* a.s. produce such papers unless social priming is real; or he assigns it 0 prior probability and so do you, in which case you would agree the probability is 0 that he can wrongly produce such papers a.s.; or he assigns it 0 prior probability and you don't, in which case your problem shouldn't be with his stopping rule but with his prior which you think is "infinitely" wrong about the null hypothesis. – fblundun Mar 19 '21 at 18:53
  • @fblundun I see. It is an interesting point. Thanks. I upvoted your last comment a long time ago, and wasn't sure how to reply to that. May get back to it at some point later. – amoeba Dec 16 '21 at 16:47
3

We consider a researcher collecting a sample of size $n$, $x_1$, to test some hypothesis $\theta=\theta_0$. He rejects if a suitable test statistic $t$ exceeds its level-$\alpha$ critical value $c$. If it does not, he collects another sample of size $n$, $x_2$, and rejects if the test rejects for the combined sample $(x_1,x_2)$. If he still obtains no rejection, he proceeds in this fashion, up to $K$ times in total.

This problem seems to already have been addressed by P. Armitage, C. K. McPherson and B. C. Rowe (1969), Journal of the Royal Statistical Society. Series A (132), 2, 235-244: "Repeated Significance Tests on Accumulating Data".

The Bayesian point of view on this issue, also discussed here, is, by the way, discussed in Berger and Wolpert (1988), "The Likelihood Principle", Section 4.2.

Here is a partial replication of Armitage et al's results (code below), which shows how significance levels inflate when $K>1$, as well as possible correction factors to restore level-$\alpha$ critical values. Note the grid search takes a while to run---the implementation may be rather inefficient.

Size of the standard rejection rule as a function of the number of attempts $K$


Size as a function of increasing critical values for different $K$


Adjusted critical values to restore 5% tests as a function of $K$


reps <- 50000

K <- c(1:5, seq(10,50,5), seq(60,100,10)) # the number of attempts a researcher gives herself
alpha <- 0.05
cv <- qnorm(1-alpha/2)

grid.scale.cv <- cv*seq(1,1.5,by=.01) # scaled critical values over which we check rejection rates
max.g <- length(grid.scale.cv)
results <- matrix(NA, nrow = length(K), ncol=max.g)

for (kk in 1:length(K)){
  g <- 1
  dev <- 0
  K.act <- K[kk]
  while (dev > -0.01 & g <= max.g){
    rej <- rep(NA,reps)
    for (i in 1:reps){
      k <- 1
      accept <- 1
      x <- rnorm(K.act)
      while(k <= K.act & accept==1){
        # each of our test statistics for "samples" of size n are N(0,1) under H0, so just scaling their sum by sqrt(k) gives another N(0,1) test statistic
        rej[i] <- abs(1/sqrt(k)*sum(x[1:k])) > grid.scale.cv[g] 
        accept <- accept - rej[i]
        k <- k+1
      }
    }
    rej.rate <- mean(rej)
    dev <- rej.rate-alpha
    results[kk,g] <- rej.rate
    g <- g+1
  }
}
plot(K,results[,1], type="l")
matplot(grid.scale.cv,t(results), type="l")
abline(h=0.05)

cv.a <- data.frame(K,adjusted.cv=grid.scale.cv[apply(abs(results-alpha),1,which.min)])
plot(K,cv.a$adjusted.cv, type="l")
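
As a possible shortcut for the slow grid search (a sketch, not part of the replication above): for a single $K$ and critical value, the size can be estimated without the inner loops by simulating all $K$ standardized cumulative sums at once. The function name `size.K` is illustrative.

size.K <- function(K, cv, reps = 1e5) {
  x  <- matrix(rnorm(reps * K), nrow = reps)          # K batch test statistics per replication
  cs <- x %*% (1 * upper.tri(diag(K), diag = TRUE))   # running sums S_1, ..., S_K
  z  <- sweep(cs, 2, sqrt(1:K), "/")                  # each column is N(0,1) under H0
  mean(rowSums(abs(z) > cv) > 0)                      # reject if any of the K stages exceeds cv
}
size.K(5, qnorm(0.975)) # roughly 0.14 rather than the nominal 0.05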
Christoph Hanck