8

The $p$-value is used to report how strongly we can presume against a hypothesis. This $p$-value is itself estimated from data, and if new data were collected under the same conditions, the new $p$-value would very likely differ from the first.

Halsey, Curran-Everett, Vowler & Drummond (2015), in a commentary in Nature Methods, showed that the uncertainty surrounding a $p$-value can be quite large. In a reply, Lazzeroni, Lu & Belitskaya-Lévy (2016, same journal) gave an example of an observed $p$-value of 0.049 whose confidence interval runs from 0.00000008 to 0.99.

My question is: do we know the sampling distribution of $p$-values? According to the latter authors, it does not depend on sample size (nor, presumably, on the sample's standard deviation, since both are used to "standardize" the test statistic). Presumably it might depend on the test procedure?

I know that if $H_0$ is true, the distribution of $p$-values is uniform over the range 0 to 1 (although I can't remember where I learned this). As $H_0$ becomes more and more inadequate, the distribution of $p$-values becomes increasingly peaked near 0 (for left-tailed tests).
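
To make that picture concrete, here is a minimal simulation sketch in Python; the one-sample two-sided t-test, the sample size and the effect sizes are my own illustrative choices, not anything prescribed by the papers cited above. It draws many samples at each effect size, computes a p-value for each, and tabulates the resulting distribution: close to uniform when $H_0$ is true (effect 0), increasingly piled up near 0 as the true effect grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 20, 10_000

# Distribution of two-sided one-sample t-test p-values for H0: mu = 0,
# at a few true effect sizes; effect = 0.0 means H0 is actually true.
for effect in (0.0, 0.3, 0.8):
    samples = rng.normal(loc=effect, scale=1.0, size=(reps, n))
    p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
    share, _ = np.histogram(p_values, bins=10, range=(0.0, 1.0))
    print(f"effect = {effect:.1f}, share of p-values per decile:",
          np.round(share / reps, 2))
```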

It is fairly easy to get a visual representation of the distribution of $p$-values with the bootstrap. However, a more satisfying answer would be a formula (closed-form, ideally), so that we could know exactly which characteristics affect that distribution and, hence, the width of the confidence interval.
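
As a concrete illustration of that bootstrap idea, here is a minimal sketch; the made-up sample, the one-sample t-test and the resampling scheme are my own assumptions, not a recommended procedure. Whether an interval built this way has the interpretation one would like is a separate question (see the comments below).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
observed = rng.normal(loc=0.4, scale=1.0, size=20)   # stand-in for a real sample

# Resample the observed data with replacement and recompute the p-value
# each time; the spread of these values pictures its variability.
boot_p = np.array([
    stats.ttest_1samp(rng.choice(observed, size=observed.size, replace=True),
                      popmean=0.0).pvalue
    for _ in range(5_000)
])

print("observed p-value:", stats.ttest_1samp(observed, popmean=0.0).pvalue)
print("bootstrap quantiles (2.5%, 50%, 97.5%):",
      np.quantile(boot_p, [0.025, 0.5, 0.975]))
```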

Do you know of such a formula, or if it is even possible to have one?

amoeba
  • 93,463
  • 28
  • 275
  • 317
Denis Cousineau
  • 189
  • 1
  • 14
  • In Lazzeroni LC, Lu Y, Belitskaya-Lévy I. _P-values in genomics: Apparent precision masks high uncertainty_. Molecular Psychiatry 9: 1336-1340 (2014), the authors suggest converting the observed p into a z-score, finding the lower and upper bounds of that z-score, then converting those bounds back to probabilities (a rough sketch of that recipe appears after these comments). Does it make sense? – Denis Cousineau Jan 02 '17 at 21:42
  • Testing the idea of Lazzeroni et al. (2014), I realized that it does not work: the confidence intervals obtained from simulations are not the same across different sample sizes, which contradicts the idea of using z-scores. – Denis Cousineau Jan 04 '17 at 17:55
  • 2
    I believe what you want would be a *prediction interval* for future p-values *constructed under the same conditions as the original p-value*? Perhaps you do mean confidence interval rather than prediction interval, but talking about a confidence interval for an observed value is very confusing to me. Whether you meant prediction or confidence interval, I'm pretty sure you want to specify that the interval refers to the mean of future p-values from future studies. – Cliff AB Jan 04 '17 at 20:17
  • 2
  • @Cliff If you accept that there is a sampling distribution of p-values (which seems uncontroversial), then the fact that p-values are bounded implies this sampling distribution has an expectation. Its expectation evidently is a property of the underlying distribution *within the context of a specific model and specific test statistic.* Given that, it looks like this expectation could reasonably be viewed as a property of the distribution itself, permitting one to apply all the conventional concepts of estimate, estimator, and confidence interval. – whuber Jan 04 '17 at 20:27
  • 3
    The Halsey et al. paper that the OP mentioned, and the reasoning behind it, are discussed at great length in this recent thread: http://stats.stackexchange.com/questions/250269 - which I would say is perhaps even a duplicate (@whuber). The general conclusion of that thread is that Halsey et al. (who borrow their claims from earlier work by Cumming) are sloppy and do not state their assumptions. I strongly dislike their paper. – amoeba Jan 04 '17 at 20:31
  • @whuber: I believe I understand the argument, but the question seems to have more clarity if you specify that you are interested in characterizing the distribution of p-values under the given conditions rather than a "confidence interval of a p-value", which can easily be interpreted as a single observed value. – Cliff AB Jan 04 '17 at 20:34
  • Thanks to @amoeba, I indeed meant that the conditions are identical: same sample size, same testing procedure, same sampling method. – Denis Cousineau Jan 04 '17 at 20:37
  • @amoeba I'm glad you found this thread and weighed in. Because it appears to be focused on the technical issue of defining and computing a distribution of p-values, rather than on the underlying philosophy and meaning of such a distribution, I am reluctant to identify it as a duplicate of your thread. – whuber Jan 04 '17 at 20:42
  • 2
    @whuber Yes, I agree. Still it might be useful for the OP to read those discussions. – amoeba Jan 04 '17 at 20:44
  • @whuber I find it interesting that the OP here mentioned the bootstrap. The bootstrap is considered to be a general technique for constructing confidence intervals around basically any statistic. The p-value is obviously a statistic. So if we apply the bootstrap we will obtain some interval around it; what is its meaning? I suspect it will not be the same kind of interval that Cumming (and also Halsey et al. and Lazzeroni et al., mentioned here) talk about. – amoeba Jan 04 '17 at 20:46
  • 3
    @Amoeba Be careful: one does not construct a CI for a statistic; a CI refers to a *parameter.* In classical situations (Z tests, t tests, etc.) there is a one-to-one correspondence between the statistic and the p-value. To the extent a statistic can estimate something (typically an effect size), *a fortiori* a p-value must be estimating something, too. But *what* it might be estimating has nothing to do with *how* one constructs a CI. A plausible candidate for its estimand is the expected p-value (for a given model, given statistic, and given effect size). The chief difficulty, it seems to me, – whuber Jan 04 '17 at 20:58
  • 1
    (contd) is that the usual asymptotic theory of increasing sample sizes makes no sense: as the sample size changes, the expectation of the p-value changes. Its limit is either $0$, $1/2$, or $1$, depending on whether $H_A$ holds or $H_0$ holds (which, if it's composite, can result in a limiting p-value of $1$). Thus, a p-value does not estimate a property of an underlying distribution: it's a property that attaches to the distribution *and the specific sample size,* as well as to the test statistic. (I need to stop writing until I have thought this through further...) – whuber Jan 04 '17 at 21:01
  • @whuber: the distinction between a property and a parameter is to me quite blurry: as long as it has a stable nature (i.e., remains unchanged), both can be called parameters. If a parameter $\pi$, estimated by $p$, defines the current situation, then it is a parameter, isn't it? – Denis Cousineau Jan 04 '17 at 21:06
  • The subtlety is that ordinarily we think of parameters or properties as being independent of how we go about making observations: they would be the same if we took two or two hundred observations. P-values don't behave that way. – whuber Jan 04 '17 at 21:10
  • @whuber: What if we define the population as composed of elements $X_i$ that are realizations of $N \left(100+t_{n-1,\pi} \times \sigma /\sqrt{n} \right)$ in which $t_{n-1,\pi}$ is the $\pi$ quantile of the Student t distribution with $n-1$ degrees of freedom, and where $n$ is determined when sampling is done (and cannot change in the course of sampling). For this very specific (and strange) population, $\pi$ is constant irrespective of sample size with respect to a null hypothesis $H_0 : \mu = 100$. – Denis Cousineau Jan 04 '17 at 21:19
  • "We define π [the parameter estimated by a p-value] as the p-value that would be seen if the true, unknown population parameter values were used in place of the sample estimates in the p-value formula. For a given population, the π-value is a fixed probability. It depends on both population effect size and sample size, but is independent of all data." (From Lazzeroni *et al.*, Supplemental Information, p. 4.) – whuber Jan 04 '17 at 21:26
  • 2
    This article may also be useful: http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-correctly-interpret-p-values. This seems to stem from a poor understanding of what $p$ is and how it comes about: "The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error). There are several reasons why P values can’t be the error rate. First, P values are calculated based on the assumptions that the null is true for the population and that the difference in the sample is caused entirely by random chance." – Tavrock Jan 04 '17 at 21:37
  • "...Consequently, P values can’t tell you the probability that the null is true or false because it is 100% true from the perspective of the calculations. Second, while a low P value indicates that your data are unlikely assuming a true null, it can’t evaluate which of two competing cases is more likely: The null is true but your sample was unusual. The null is false. Determining which case is more likely requires subject area knowledge and replicate studies." – Tavrock Jan 04 '17 at 21:37
  • What seems to get lost in this answer: http://stats.stackexchange.com/questions/250269/cumming-2008-claims-that-distribution-of-p-values-obtained-in-replications-dep/251454#251454 is that the paper revolves around the idea that, given a $p$ value, I can tell you what your Confidence Interval for your test was, and reverse engineer your data. Somehow, we don't want to make the same claim of "tell me what your $\alpha$ is for your confidence interval, and I can predict your $p$ value" though it is essentially the same question. – Tavrock Jan 04 '17 at 21:58
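
For reference, below is a minimal sketch of one naive reading of the p-to-z recipe mentioned in the first comment above. The one-sided z test and the fixed ±1.96 band are my own assumptions, not the authors' published method; it does not reproduce the (0.00000008, 0.99) interval quoted in the question, and, as noted in the comments, the idea did not hold up in simulations either.

```python
from scipy import stats

def p_interval_via_z(p_obs, level=0.95):
    """Naive p -> z -> interval -> p recipe, assuming a one-sided z test."""
    z = stats.norm.isf(p_obs)               # observed p-value expressed as a z-score
    half = stats.norm.isf((1 - level) / 2)  # about 1.96 for a 95% band
    # A larger z-score corresponds to a smaller p-value, so the bounds swap.
    return stats.norm.sf(z + half), stats.norm.sf(z - half)

print(p_interval_via_z(0.049))   # roughly (0.00015, 0.62) under these assumptions
```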

1 Answer

3

The problem is that a p value is not an estimate of a parameter so the idea of a confidence interval does not apply. It also does not make sense to talk about the uncertainty surrounding a p value. The p value is certain; the conclusion you draw from it is not.

David Lane
  • 1,194
  • 1
  • 8
  • 9
  • 6
    You appear to be denying the premises of the question, including the point of view that the p-value is uncertain. That's going to be controversial, because it's well known--and intuitively obvious--that when an experiment is repeated a different p-value is almost sure to arise. You might find the thread at http://stats.stackexchange.com/questions/181611 somewhat relevant. – whuber Jan 04 '17 at 20:09
  • Hello @David, glad to see you on StackExchange. Although I agree with you that in general, p is not a parameter, I am sure we could imagine a world in which populations are characterized by a parameter $\pi$. In this world, all the samples would have a constant size and the sampling method would be constant as well. In this improbable world (if you allow the pun), $\pi$ is a parameter, and $p \equiv \hat\pi$ is probably the best, unbiased estimate of $\pi$. Hence, if I frame my question relative to this world, can we have a confidence interval around an observed $p$? – Denis Cousineau Jan 04 '17 at 20:12
  • Hi @Denis. That makes a lot of sense. However, I think the critique of significance testing that others make, namely that p values are uninformative because they differ across replications, is incorrect. Of course different replications will provide different degrees of conclusiveness about the direction of an effect (I assume the effect is almost never 0). That doesn't bear on the conclusiveness of a given study. – David Lane Jan 04 '17 at 20:27
  • 2
    Sure, as @David says, $p$ *are* informative. They are just variable. If we could get some confidence interval, and find that the whole interval is very narrow and close to zero, that would add additional strength to a conclusion. – Denis Cousineau Jan 04 '17 at 20:32
  • 1
    @whuber Of course the p value is uncertain before you do the experiment. However, the important uncertainty is the direction of the effect, not the p value. The p value is a tool to guide inference, not the object of the inference. That's why it doesn't make sense to say the data provide evidence for a significant effect. – David Lane Jan 04 '17 at 20:38
  • Imagine a diagnostic test for determining whether a disease is present. The conclusiveness of the test even for a given patient varies for a number of reasons. Assume the test is conclusive in a particular instance. I argue that the interpretation of this diagnostic test in this instance does not depend on how conclusive it might have been on another occasion or the distribution of conclusiveness over numerous other occasions. – David Lane Jan 04 '17 at 21:05
  • p-value [is a random variable](http://andrewgelman.com/2016/08/05/the-p-value-is-a-random-variable/) so there is uncertainty. – Tim Jan 04 '17 at 21:16
  • $p$ values are based on the $\alpha$ you choose in your hypothesis test. The $p$ value simply reflects back the information you provided (which is why it does not have a confidence interval). The following articles may be helpful: http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests:-confidence-intervals-and-confidence-levels , http://blog.minitab.com/blog/michelle-paret/alphas-p-values-confidence-intervals-oh-my , and http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-correctly-interpret-p-values – Tavrock Jan 04 '17 at 21:25
  • This is a very long thread. I have gone through most of the comments and the answer but not the numerous links. I may have missed it, but I did not see the definition of the p-value, which is "the probability, under the null hypothesis, of seeing a value as extreme as or more extreme than the observed statistic." In that statement we have "null hypothesis assumed true" and "observed statistic". The null hypothesis assumes certain values for the parameter, and the problem may assume that a specific family of distributions is being considered. – Michael R. Chernick Jan 07 '17 at 13:09
  • The observed statistic means a sample was taken and the test statistic was computed. So the quoted p-value varies for many reasons. I see it as an estimate of a quantity we could call the true p-value (a parameter), the observed p-value itself being a statistic. The observed p-value therefore, at least conceptually, has a distribution and a confidence interval associated with it. – Michael R. Chernick Jan 07 '17 at 13:15
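
To make the last two comments concrete, here is a minimal sketch, assuming a one-sided, one-sample t-test; the function name and the population values are hypothetical. It computes the "true p-value" obtained by plugging the population mean and standard deviation into the p-value formula, next to an observed p-value computed from a simulated sample.

```python
import numpy as np
from scipy import stats

def true_p_value(mu_true, sigma_true, mu0, n):
    """One-sided p-value formula evaluated at the population mean and SD
    instead of the sample estimates: the 'true p-value', a fixed parameter."""
    t_true = (mu_true - mu0) / (sigma_true / np.sqrt(n))
    return stats.t.sf(t_true, df=n - 1)

rng = np.random.default_rng(3)
n, mu_true, sigma_true, mu0 = 25, 0.5, 1.0, 0.0

sample = rng.normal(mu_true, sigma_true, size=n)
t_obs = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

print("true p-value (parameter):", true_p_value(mu_true, sigma_true, mu0, n))
print("observed p-value (statistic):", stats.t.sf(t_obs, df=n - 1))
```

Note that this quantity depends on the sample size as well as on the effect size, which echoes the earlier point that it attaches to a particular study design rather than to the population alone.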