36

I've been reading up on $p$-values, type 1 error rates, significance levels, power calculations, effect sizes and the Fisher vs Neyman-Pearson debate. This has left me feeling a bit overwhelmed. I apologise for the wall of text, but I felt it was necessary to provide an overview of my current understanding of these concepts, before I moved on to my actual questions.


From what I've gathered, a $p$-value is simply a measure of surprise: the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. Fisher originally intended for it to be a continuous measure.

In the Neyman-Pearson framework, you select a significance level in advance and use it as an (arbitrary) cut-off point. The significance level is equal to the type 1 error rate. It is defined as a long-run frequency: if you were to repeat an experiment 1000 times at a significance level of 0.05 and the null hypothesis is true, about 50 of those experiments would result in a significant effect, due to sampling variability. By choosing a significance level, we are guarding ourselves against these false positives with a certain probability. $P$-values traditionally do not appear in this framework.
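A minimal simulation sketch of this long-run reading (the two-sample t-test, the group size of 30 and $\alpha = 0.05$ are my own arbitrary choices here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05          # significance level chosen in advance
n_experiments = 1000  # hypothetical repetitions of the same experiment

false_positives = 0
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution, so the null is true.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# Roughly alpha * n_experiments (about 50) false positives are expected.
print(false_positives)
```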

If we find a $p$-value of 0.01, this does not mean that the type 1 error rate is 0.01; the type 1 error rate is stated a priori. I believe this is one of the major arguments in the Fisher vs N-P debate, because $p$-values are often reported as 0.05*, 0.01**, 0.001***. This could mislead people into saying that the effect is significant at a certain $p$-value, instead of at a certain significance level.

I also realise that the $p$-value is a function of the sample size. Therefore, it cannot be used as an absolute measurement. A small $p$-value could point to a small, non-relevant effect in a large-sample experiment. To counter this, it is important to perform a power/effect size calculation when determining the sample size for your experiment. $P$-values tell us whether there is an effect, not how large it is. See Sullivan (2012).
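A small sketch of that dependence, assuming a two-sided one-sample z test and treating the observed standardized effect as fixed while the sample size grows (the numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

d = 0.05  # a tiny standardized effect, arguably of no practical relevance
for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    z = d * np.sqrt(n)          # z statistic of a one-sample z test
    p = 2 * stats.norm.sf(z)    # two-sided p-value
    print(f"n = {n:>9,}   p = {p:.2g}")
```

The same negligible effect goes from clearly non-significant to an extremely small $p$-value purely because of the sample size, which is why a power/effect size calculation is needed up front.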

My question: How can I reconcile the facts that the $p$-value is a measure of surprise (smaller = more convincing) while at the same time it cannot be viewed as an absolute measurement?

What I am confused about is the following: can we be more confident in a small $p$-value than a large one? In the Fisherian sense, I would say yes: we are more surprised. In the N-P framework, choosing a smaller significance level would imply we are guarding ourselves more strongly against false positives.

But on the other hand, $p$-values are dependent on sample size. They are not an absolute measure. Thus we cannot simply say 0.001593 is more significant than 0.0439. Yet this is what Fisher's framework would imply: we would be more surprised by such an extreme value. There's even discussion about the term "highly significant" being a misnomer: Is it wrong to refer to results as being "highly significant"?

I've heard that $p$-values in some fields of science are only considered important when they are smaller than 0.0001, whereas in other fields values around 0.01 are already considered highly significant.

Related questions:

Zenit
  • Also, do not forget that a "significant" p value does not tell you anything about your theory. This is even admitted by the most ardent defenders: [Precis of Statistical significance: Rationale, validity, and utility. Siu L. Chow. BEHAVIORAL AND BRAIN SCIENCES (1998) 21, 169–239](http://websites.psychology.uwa.edu.au/labs/cogscience/Publications/Lewandowsky-Mayberry%20%281996%29%20-%20Critics%20Rebuttted.pdf) Data is interpreted when being turned into evidence. The assumptions an interpretation is based on need to be enumerated and then, if possible, checked. What is being measured? – Livid Feb 14 '15 at 20:18
  • 2
    +1, but I would encourage you to focus the question and remove the side questions. If you are interested why some people argue that confidence intervals are better than p-values, ask a separate question (but make sure it hasn't been asked before). – amoeba Feb 16 '15 at 14:44
  • 3
    Apart from that, how is your question not a duplicate of [Why are lower p-values not more evidence against the null?](http://stats.stackexchange.com/questions/63499) Have you seen that thread? Perhaps you can add it to the list in the end of your post. See also a similar question [What sense does it make to compare p-values to each other?](http://stats.stackexchange.com/questions/21419), but I am reluctant to recommend that thread, because the accepted answer there is IMHO incorrect/misleading (see discussion in the comments). – amoeba Feb 16 '15 at 14:52
  • After your update, there is hardly any question left! It makes it a bit of a confusing thread for future references. Suggestion: post your update (with 4 bullet points) as an *answer*; remove it from the question; directly edit the original question to make it more concise, clear, and focused (as I suggested before). Then it will be a nice clear thread for future references. In addition: triggered by your post, I have posted a new answer in [Why is it wrong to refer to results as being "highly significant"?](http://stats.stackexchange.com/questions/107640). You might want to take a look. – amoeba Feb 17 '15 at 15:12
  • I've carried out your suggestions. I do apologise for the lack of coherence in my original post and update. It was a direct result of my own confusion on the topic. I hope the bold question manages to capture the essence of my original question for future readers. Thanks for the additional answer as well! – Zenit Feb 18 '15 at 16:34
  • 2
    Gelman has much of relevance to say about p-values. e.g. 1. [here (Gelman and Stern, Am.Stat. 2006 pdf)](http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf), 2. [here on his blog](http://andrewgelman.com/2011/09/09/the-difference-between-significant-and-not-significant/), 3. [his blog again](http://andrewgelman.com/2013/01/09/the-difference-between-significant-and-non-significant-is-not-itself-statistically-significant/) and perhaps also 4. [here (Gelman, 2013 published comment on another paper, pdf)](http://www.stat.columbia.edu/~gelman/research/published/pvalues3.pdf) – Glen_b Feb 21 '15 at 00:21
  • 2
    Thanks for the links, @Glen_b; I know the Gelman & Stern paper well and often refer to it myself, but haven't seen this 2013 paper or its discussion before. However, I would like to caution OP about interpreting Gelman & Stern in the context of his/her question. G&S offer a nice example with two studies estimating an effect as $25\pm 10$ and $10\pm 10$; in one case $p<0.01$, in another $p>0.05$, but the *difference* between estimates is not significant. This is important to keep in mind, but if now, following OP, we ask if the first study is more convincing, I would certainly say yes. – amoeba Feb 21 '15 at 15:26
  • probably this adds to the discussion: http://stats.stackexchange.com/questions/166323/misunderstanding-a-p-value/166327#166327 –  Jul 30 '16 at 07:09

4 Answers

23

Are smaller $p$-values "more convincing"? Yes, of course they are.

In the Fisher framework, $p$-value is a quantification of the amount of evidence against the null hypothesis. The evidence can be more or less convincing; the smaller the $p$-value, the more convincing it is. Note that in any given experiment with fixed sample size $n$, the $p$-value is monotonically related to the effect size, as @Scortchi nicely points out in his answer (+1). So smaller $p$-values correspond to larger effect sizes; of course they are more convincing!
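As a rough numerical sketch of that monotone relationship (assuming a two-sided one-sample t test with a fixed $n = 30$; the effect sizes are arbitrary):

```python
import numpy as np
from scipy import stats

n = 30                                      # fixed sample size
for d in [0.1, 0.2, 0.5, 0.8, 1.2]:         # observed standardized effect (Cohen's d)
    t = d * np.sqrt(n)                      # one-sample t statistic
    p = 2 * stats.t.sf(abs(t), df=n - 1)    # two-sided p-value
    print(f"d = {d:.1f}   p = {p:.5f}")
```

Larger observed effects map to smaller $p$-values, and vice versa, as long as $n$ stays fixed.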

In the Neyman-Pearson framework, the goal is to obtain a binary decision: either the evidence is "significant" or it is not. By choosing the threshold $\alpha$, we guarantee that the long-run rate of false positives will not exceed $\alpha$. Note that different people can have different $\alpha$ in mind when looking at the same data; perhaps when I read a paper from a field that I am skeptical about, I would not personally consider as "significant" results with e.g. $p=0.03$, even though the authors do call them significant. My personal $\alpha$ might be set to $0.001$ or something. Obviously, the lower the reported $p$-value, the more skeptical readers it will be able to convince! Hence, again, lower $p$-values are more convincing.

The currently standard practice is to combine the Fisher and Neyman-Pearson approaches: if $p<\alpha$, then the results are called "significant", and the $p$-value is [exactly or approximately] reported and used as a measure of convincingness (by marking it with stars, using expressions such as "highly significant", etc.); if $p>\alpha$, then the results are called "not significant" and that's it.

This is usually referred to as a "hybrid approach", and indeed it is hybrid. Some people argue that this hybrid is incoherent; I tend to disagree. Why would it be invalid to do two valid things at the same time?

Further reading:

amoeba
  • 1
    (+1) But see Section 4.4 of Michael Lew's paper: some would rather equate the amount of evidence with the likelihood than with the p-value, which makes a difference when p-values from experiments with different sampling spaces are being compared. So they talk of "indexing" or "calibrating" the evidence/likelihood. – Scortchi - Reinstate Monica Feb 18 '15 at 16:08
  • Sorry, I meant to say, more precisely, that, in this view, the relative "evidence" (or "support") for different values a parameter may take is the ratio of their likelihood functions evaluated for the observed data. So in Lew's example, one head out of six tosses is the same evidence against the null hypothesis, regardless of whether the sampling scheme is binomial or negative binomial; yet the p-values differ - you might say that under one sampling scheme you were less likely to amass as much evidence against the null. (Of course rights to the word "evidence", as with "significant", ... – Scortchi - Reinstate Monica Feb 18 '15 at 18:15
  • ... haven't yet been firmly established.) – Scortchi - Reinstate Monica Feb 18 '15 at 18:16
  • Hmmm, thanks a lot for drawing my attention to this section; I read it before but apparently missed its importance. I must say that at the moment I am confused by it. Lew writes that the p-values should not be "adjusted" by taking stopping rules into account; but I don't see any adjustments in his formulas 5-6. What would "unadjusted" p-values be? – amoeba Feb 18 '15 at 18:45
  • He *seems* to be saying that the statistician using the negative binomial distribution has, perhaps unwittingly, "adjusted" his p-values to account for the sequential design of the experiment: he should have calculated p-values for a binomial sampling scheme. I'm not sure the statistician would see it that way - he's simply noted what was fixed & what was random in his experiment & used an appropriate model. Most of the paper is concerned to show that using p-values needn't conflict with the weak likelihood principle; the attempt here to bring them into line with the strong LP is unorthodox. – Scortchi - Reinstate Monica Feb 18 '15 at 18:59
  • 1
    @Scortchi: Hmmm. I really don't understand why one of these p-values is "adjusted" and another one not; why not vice versa? I am not at all convinced by Lew's argument here, and I don't even fully understand it. Thinking about that, I found [Lew's question from 2012](http://stats.stackexchange.com/questions/40856) about the likelihood principle and p-values, and posted an answer there. The point is that one doesn't need different stopping rules to get different p-values; one can simply consider different test statistics. Perhaps we can continue to discuss there, I would appreciate your input. – amoeba Feb 18 '15 at 23:05
  • You say that lower p-values are more evidence against the null. I would say that a p-value is the probability of observing (a value as extreme or more extreme for the test statistic of) your sample when the null is true. So if (the value for the test statistic of) your sample has a lower p-value, couldn't it be that you had good luck with the sample? I would agree with @Scortchi saying that he doesn't "know what's meant by smaller p-values being 'better'". I would say that because a lower p-value is either "more evidence" or "more luck with the sample". –  Jul 31 '16 at 08:46
  • @fcop When additional evidence is brought into a court, can we say there is now "more evidence", or do we have to say that it's "either more evidence or more luck with getting something that superficially appears to be evidence but is actually not"? In any case, I largely agree with Scortchi's answer here (it has my +1). He says in the 2nd sentence that lower p-values provide more reason to be "surprised" by the data if we believed the null. That's what is meant by "more evidence" in the frequentist testing paradigm. – amoeba Jul 31 '16 at 14:37
  • [It would be great if a downvoter would identify themselves and provide some critical comments.] – amoeba Jul 31 '16 at 19:19
  • I'm an upvoter, but *Obviously the lower the reported $p$-value, the more skeptical readers it will be able to convince!* I think you are skeptical about their *a priori* choice of $\alpha$ rather than the *a posteriori* observed $p$? If you consider a smaller $p$ to be more convincing, then (at that moment) you are using the Fisherian framework, because IMO Neyman & Pearson don't even care about how "convincing" a single sample is, and instead focus on not making mistakes too often in the long term. – nalzok Jul 22 '19 at 16:35
10

I don't know what's meant by smaller p-values being "better", or by us being "more confident in" them. But regarding p-values as a measure of how surprised we should be by the data, if we believed the null hypothesis, seems reasonable enough; the p-value is a monotonic function of the test statistic you've chosen to measure discrepancy with the null hypothesis in a direction you're interested in, calibrating it with respect to its properties under a relevant procedure of sampling from a population or random assignment of experimental treatments. "Significance" has become a technical term to refer to p-values' being either above or below some specified value; thus even those with no interest in specifying significance levels & accepting or rejecting hypotheses tend to avoid phrases such as "highly significant"—mere adherence to convention.

Regarding the dependence of p-values on sample size & effect size, perhaps some confusion arises because e.g. it might seem that 474 heads out of 1000 tosses should be less surprising than 2 out of 10 to someone who thinks the coin is fair—after all the sample proportion only deviates a little from 50% in the former case—yet the p-values are about the same. But true or false don't admit of degrees; the p-value's doing what's asked of it: often confidence intervals for a parameter are really what's wanted to assess how precisely an effect's been measured, & the practical or theoretical importance of its estimated magnitude.
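A quick check of those two coin examples (a sketch; it assumes SciPy's exact binomial test, `scipy.stats.binomtest`):

```python
from scipy import stats

# Null hypothesis: the coin is fair (p = 0.5).
p_large = stats.binomtest(474, n=1000, p=0.5, alternative="two-sided").pvalue
p_small = stats.binomtest(2, n=10, p=0.5, alternative="two-sided").pvalue

# Both p-values come out around 0.1, despite very different sample proportions.
print(p_large, p_small)
```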

Scortchi - Reinstate Monica
  • 1
    +1. I think what the question was getting at, is: are smaller p-values more convincing -- that's how I understand "better" in the title (in general, the question would greatly benefit if the OP tried to focus it)? If one gets $p=0.04$ or $p=0.000004$, one would perhaps call the results "significant" in both cases, but are they more *convincing* in the latter case? The practice of putting "stars" near p-values assumes that they are; are they? (This is essentially asking about the often-criticized "hybrid" between Fisher and Neyman-Pearson; personally, I don't have a problem with it.) – amoeba Feb 18 '15 at 10:48
1

Thank you for the comments and suggested readings. I've had some more time to ponder on this problem and I believe I've managed to isolate my main sources of confusion.

  • Initially I thought there was a dichotomy between viewing the p-value as a measure of surprise versus stating that it's not an absolute measure. Now I realise these statements don't necessarily contradict each other. The former allows us to be more or less confident in the extremeness (unlikeliness, even?) of an observed effect, compared to other hypothetical results of the same experiment. Whereas the latter only tells us that what might be considered a convincing p-value in one experiment might not be impressive at all in another, e.g. if the sample sizes differ.

  • The fact that some fields of science utilise a different baseline of strong p-values could be a reflection of the difference in typical sample sizes (astronomy vs. clinical vs. psychological experiments) and/or an attempt to convey effect size in a p-value. The latter, however, is an incorrect conflation of the two.

  • Significance is a yes/no question based on the alpha that was chosen prior to the experiment. A p-value can therefore not be more significant than another one, since they are either smaller or larger than the chosen significance level. On the other hand, a smaller p-value will be more convincing than a larger one (for a similar sample size/identical experiment, as mentioned in my first point).

  • Confidence intervals inherently convey the effect size, making them a nice choice to guard against the issues mentioned above (see the sketch below).
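A minimal sketch of that last point, assuming two normally distributed groups and a pooled-variance two-sample t test (the data and the effect are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
treatment = rng.normal(loc=0.3, scale=1.0, size=200)   # hypothetical treatment group
control = rng.normal(loc=0.0, scale=1.0, size=200)     # hypothetical control group

t_stat, p = stats.ttest_ind(treatment, control)        # pooled-variance t test

# 95% confidence interval for the mean difference, using the pooled variance.
n1, n2 = len(treatment), len(control)
diff = treatment.mean() - control.mean()
sp2 = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

print(f"p = {p:.4f}")
print(f"difference = {diff:.2f}, 95% CI = ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
```

The p-value alone says the difference is unlikely under the null; the interval additionally shows how large the difference plausibly is.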

Zenit
0

The p-value cannot be a measure of surprise because it is only a measure of probability when the null is true. If the null is true, then each possible value of p is equally likely. One cannot be surprised at any p-value prior to deciding to reject the null. Once one decides there is an effect, the p-value's meaning vanishes. One merely reports it as a link in a relatively weak inductive chain to justify the rejection, or not, of the null. But if the null was rejected, the p-value actually no longer has any meaning.
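A quick simulation sketch of the uniformity claim for a continuous test statistic (a one-sample t test with a true null; the sample size and number of repetitions are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pvals = [stats.ttest_1samp(rng.normal(size=25), popmean=0.0).pvalue
         for _ in range(10_000)]

# With a continuous statistic and a true null, p is uniform on (0, 1):
# each of the ten bins should hold roughly 10% of the simulated p-values.
print(np.histogram(pvals, bins=10, range=(0, 1))[0])
```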

John
  • +1 for the fact that "when the null is true, then every p-value is equally likely"; however, I think this holds only for continuous random variables? –  Jul 30 '16 at 07:04
  • Note that I said every "possible" value of p is equally likely. So this is true for discrete or continuous variables. With discrete variables the number of possible values is lower. – John Jul 30 '16 at 16:07
  • are you sure that the distribution of the p-values (under $H_0$) is always uniform for discrete variables? Because this link seems to say something different: http://stats.stackexchange.com/questions/153249/non-uniform-distribution-of-p-values-when-simulating-binomial-tests-under-the-nu –  Jul 31 '16 at 07:38
  • I believe the leading answer demonstrates that this is a non-issue. The reason that the distribution looks non-uniform is because the possible p-values are unequally spaced. Glen_b even calls it quasi-uniform. I suppose it's possible that with some very sparse tests of binomial data with small Ns the probability of specific p-values is unequal, but if you consider the probability of p-values in a given range it will be closer to uniform. – John Jul 31 '16 at 08:21
  • That could be, but uniform is a well-defined concept, and I think it's good to be precise. That's probably the reason why @Glen_b calls it "pseudo". If it's all the same, then should the question referred to above be classified as "irrelevant" or not? –  Jul 31 '16 at 08:52
  • John, I am really confused by your answer and am not sure I understand what you mean. Say somebody runs a between-subject experiment with two groups of 10 people (treatment and control), performing a t-test between the groups. Imagine one gets $p=0.04$ or $p=0.0000000004$, but in both cases rejects the null based on the commonly used $\alpha=0.05$. Imagine you read these two papers. Are you seriously saying you will not think that the second study has more empirical support? CC to @fcop. – amoeba Jul 31 '16 at 14:40
  • 1
    @amoeba: let's say that the t-test you mention tests $H_0: \mu=0.5$ and you get $p=0.0000000004$. It could be that, with the same sample you test $H_0: \mu=0.45$ and you get $p=0.0000000001$, would you then say that there is more evidence for $\mu=0.45$ ? –  Jul 31 '16 at 15:15
  • I would not say the second study has more empirical support *based on the p-value* because now that I've decided to reject the null the p-value has no meaning. It's the probability of finding a result as extreme, or moreso, if the null was true. I'm quite sure the null is not true. So, the p-value no longer means anything because I've deemed the conditions under which it does mean something implausible. In your situation there would be other genuinely meaningful statistics on which I may make relative judgments. – John Jul 31 '16 at 15:36
  • But p-value is just a monotonic function of a t-statistic which (for fixed $n$) is just a monotonic function of the effect size, which is as "meaningful statistic" as it gets. (That said, I fully agree that one should be rather looking at "meaningful statistics".) – amoeba Jul 31 '16 at 15:43
  • @fcop Heh? Why would I say there is more evidence for $\mu=0.45$? There is some misunderstanding here. In your example the sample mean is clearly further from $0.45$ than from $0.5$ so there is more evidence AGAINST $0.45$. – amoeba Jul 31 '16 at 15:45
  • @amoeba, the p-value is a monotonic function of the t-statistic when applied to a distribution of t-values that assume the null is true. If the null is not true that distribution does not exist and therefore the p-value is no longer calculable. It becomes a nonsense statistic. – John Jul 31 '16 at 16:32
  • I don't think I follow, @John (even though I believe that deep down we are in agreement). What is a sensible statistic when you compare two groups (and are happy to assume approximate normality)? I guess Cohen's $d$ is one possible answer. For fixed $n$, the $p$-value for $H_0:\mu=0$ is a monotonic function of $d$, whether the null is true or not. – amoeba Jul 31 '16 at 16:55
  • Sure, mathematically p is a function of the effect. It doesn't much matter which effect you pick. But the p-value is only sensible if there is no effect in the population. If you believe there is one it no longer means anything. The fact that they're mathematically related doesn't mean you can just throw out all of the logic that went into every calculating p in the first place. – John Jul 31 '16 at 17:23
  • Consider this, I can make up null distributions of many many shapes and calculate p-values that are monotonic functions of an effect. But most of those shapes you wouldn't believe are representative of a null population difference. You said it yourself, you cared about the normality assumption. That went into the specific p-value you're calculating. How much more absurd then is the p when the assumption of no difference is no longer considered plausible. – John Jul 31 '16 at 17:25
  • This makes sense, @John, but the whole point of this thread (as I see it) is about this hypothetical: `If you believe there is one [effect]`. When *do* you believe it? Example: there is some research hypothesis and you are skeptical about it. An experimenter walks in with $p=0.04$ and you remain skeptical. Another experimenter walks in with $p=0.000004$ and you have to admit that they are on to something. That's what I mean when I say that smaller p-values are "move convincing". Earlier you wrote `now that I've decided to reject the null the p-value has no meaning`. But *when* do you decide?! – amoeba Jul 31 '16 at 19:15
  • I don't see the thread as that at all and was only addressing the premise that it was a measure of surprise. That's a foundation of the argument presented by the questioner. This is becoming inappropriate for a series of comments on SE. But to briefly address your question and scenario, you decide based on an a priori criterion. That can be done a lot of ways. Apparently yours was somewhere between 0.04 and 0.000004. Whatever it was though, based on one experiment you should remain skeptical. The argument that p-values are measures of evidence was abandoned by even Fisher in the end. – John Aug 01 '16 at 05:07
  • @John I take it that you are firmly in favour of the strict Neyman-Pearson approach to statistical testing, and do not accept Fisher's approach as meaningful. All right, let's leave it here then. – amoeba Aug 01 '16 at 09:47