20

Why do statisticians discourage us from referring to results as "highly significant" when the $p$-value is well below the conventional $\alpha$-level of $0.05$?

Is it really wrong to trust a result that has a 99.9% chance of not being a Type I error ($p=0.001$) more than a result where that chance is only 99% ($p=0.01$)?

amoeba
z8080
  • 16
    It may be worthwhile to read @gung's [answer here](http://stats.stackexchange.com/a/51823/21054). In short: for the decision "significant vs. not significant" or "reject the null hypothesis vs. don't reject the null hypothesis", it only matters whether the $p$-value is below your $\alpha$, which you set *before* the study (Neyman & Pearson). On the other hand, you can regard the $p$-value as a continuous measure of evidence against the null hypothesis which has no "cutoff" (Fisher). – COOLSerdash Jul 11 '14 at 19:05
  • 10
    You appear to have a serious misconception about p-values (p-values **are not** error probabilities) that, if corrected, might help you understand why you might hear certain things from statisticians. – guy Jul 11 '14 at 19:12
  • And let's just follow on guy's point by reiterating the definition of the $p$-value which *is the probability of observing a test statistic as or more extreme than the one produced by your data IF the null hypothesis is true*. – Alexis Jul 11 '14 at 19:19
  • 10
    I confess that I sometimes use phrases like "highly significant." Elsewhere in the reports many of the initial results might have to be adjusted for multiple testing, wherein "highly significant" acquires the more technical meaning of "remains significant even after appropriate adjustment for multiple comparisons." Even when all readers agree on the appropriate $\alpha$ to use (which is rare for analyses used by multiple stakeholders), what is "significant" or not depends on the set of hypotheses each reader had in mind before looking at the report. – whuber Jul 11 '14 at 19:20
  • @longtalker: It'd probably be better if you cited something in particular instead of merely asserting that "statisticians discourage us..." – Steve S Jul 11 '14 at 19:29
  • @whuber - do you think that it is generally meaningful to differentiate "highly" without having a criterion for it? – EngrStudent Jul 11 '14 at 20:10
  • 3
    @Engr No, I don't. When I write reports I explain the meanings of such terms. For instance, if the client is likely to use an $\alpha$ of $0.05$ but $10$ different hypotheses might be tested, then for their convenience, with a full explanation, I might label p-values less than $0.05$ as "significant" and p-values less than $0.05/10$ as "highly significant." I try to avoid such language but often that is impossible when readers are *expecting* to see statements about statistical significance (even if they're hazy on what that really means). – whuber Jul 11 '14 at 20:56
  • 7
    Not all statisticians say it's wrong. I use the term myself on (admittedly rare) occasion - e.g. to signify that on this data the null would have been rejected by people operating at substantially lower significance levels than the one I was using, but it's important not to attach more meaning to it than it has. I'd simply say that one must exercise caution - sometimes quite a lot of it - when *interpreting* the meaning of such a phrase, rather than it being specifically *wrong*. Some of the points [here](http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf) would be relevant. – Glen_b Jul 12 '14 at 05:23
  • 7
    (ctd)... by comparison, I think a bigger concern is people using hypothesis tests that simply don't answer their question of interest (which I think is the case very often). Better to focus on that glaring and important issue, rather than be overly dogmatic about a minor infelicity in the way they express a very small p-value. – Glen_b Jul 12 '14 at 05:28
  • @Glen_b - Thank you for the link; it was good. Sometimes my answers are addressed to "ghosts", to younger versions of myself who are not there. "We speak the truth. -Temperance Brennan". Statisticians are the door-guardians of scientific truth, most importantly in the presence of variation, uncertainty, and complexity. Philosophically and morally they must have the highest duty toward the truth, because for the technically illiterate, such things are inaccessible through any other avenue. – EngrStudent Jul 13 '14 at 01:01
  • 1
    It probably is fine provided you are speaking to a statistical audience, who will (amongst other things) know that there is a difference between "highly [statistically] significant" and "practically significant" (i.e. effect size etc). A non-statistical audience is likely to assume that means the finding is in some way important and/or certain and the phrase is best avoided IMHO. A p-value of 0.001 does not mean that you are 99.9% sure that the null hypothesis is incorrect (c.f. http://stats.stackexchange.com/questions/43339/whats-wrong-with-xkcds-frequentists-vs-bayesians-comic). – Dikran Marsupial Feb 17 '15 at 12:31
  • Don't listen to them, it's fine. It's absolutely fine. The term has a formally defined, well-established meaning in widespread use and is widely documented in stats textbooks. – Owl Aug 07 '18 at 14:28

3 Answers

18

I think there is not much wrong in saying that the results are "highly significant" (even though yes, it is a bit sloppy).

It means that if you had set a much smaller significance level $\alpha$, you would still have judged the results as significant. Or, equivalently, if some of your readers have a much smaller $\alpha$ in mind, then they can still judge your results as significant.

Note that the significance level $\alpha$ is in the eye of the beholder, whereas the $p$-value is (with some caveats) a property of the data.

Observing $p=10^{-10}$ is just not the same as observing $p=0.04$, even though both might be called "significant" by the standard conventions of your field ($\alpha=0.05$). A tiny $p$-value means stronger evidence against the null (for those who like Fisher's framework of hypothesis testing); it means that the confidence interval around the effect size will exclude the null value by a larger margin (for those who prefer CIs to $p$-values); it means that the posterior probability of the null will be smaller (for Bayesians with some prior); this is all equivalent and simply means that the findings are more convincing. See "Are smaller p-values more convincing?" for more discussion.
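
As a quick illustration (a minimal sketch with made-up, simulated data; NumPy and SciPy are assumed, and the sample sizes and effect sizes are arbitrary choices of mine), a larger underlying effect produces both a far smaller $p$-value and a 95% CI that excludes the null value of $0$ by a much wider margin:

```python
# Minimal sketch with simulated data: a smaller p-value goes hand in hand with a
# 95% CI that excludes the null value (here, a mean of 0) by a wider margin.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def summarize(sample, label):
    _, p = stats.ttest_1samp(sample, popmean=0.0)   # H0: the true mean is 0
    ci = stats.t.interval(0.95, df=len(sample) - 1,
                          loc=sample.mean(), scale=stats.sem(sample))
    print(f"{label}: p = {p:.1e}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")

summarize(rng.normal(loc=0.45, scale=1.0, size=25), "borderline effect")
summarize(rng.normal(loc=1.60, scale=1.0, size=25), "large effect")
```

The first result tends to land near the conventional $\alpha=0.05$ threshold, while the second excludes the null value by a wide margin.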

The term "highly significant" is not precise and does not need to be. It is a subjective expert judgment, similar to observing a surprisingly large effect size and calling it "huge" (or perhaps simply "very large"). There is nothing wrong with using qualitative, subjective descriptions of your data, even in the scientific writing; provided of course, that the objective quantitative analysis is presented as well.


See also some excellent comments above, +1 to @whuber, @Glen_b, and @COOLSerdash.

amoeba
  • 2
    Agreed. The $P$-value is a quantitative indicator; hence talk like this, although imprecise outside some context, is not _ipso facto_ invalid, any more than saying "Bill is tall" and "Fred is really tall" is invalid use of English. We should want to see the numbers too and their context, etc., etc. None of this stops those who want or need to make sharp decisions at $P < 0.05$ or whatever doing exactly as they wish, but their preferences don't rule on this. – Nick Cox Feb 16 '15 at 23:26
  • It's not sloppy at all. It's well documented as having a formal definition. – Owl Jul 27 '18 at 08:19
3

A test is a tool for a black-white decision, i.e. it tries to answer a yes/no question like "is there a true treatment effect?". Often, especially if the data set is large, such a question is quite a waste of resources. Why ask a binary question when it is possible to get an answer to a quantitative question like "how large is the true treatment effect?", which implicitly answers the yes/no question as well? So instead of answering an uninformative yes/no question with high certainty, we often recommend the use of confidence intervals, which contain much more information.
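
As an illustration (a minimal sketch with made-up, simulated data; the group sizes, effect, and use of NumPy/SciPy are all assumptions of the example), the 95% CI for the treatment effect answers "how large?", and checking whether it covers $0$ reproduces the yes/no decision of a two-sided test at $\alpha = 0.05$:

```python
# Minimal sketch (hypothetical data): the 95% CI for the treatment effect answers
# the quantitative question and, by checking whether it covers 0, also answers the
# yes/no question of a two-sided test at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(loc=1.2, scale=2.0, size=80)   # made-up outcomes, treated group
control = rng.normal(loc=0.0, scale=2.0, size=80)   # made-up outcomes, control group

_, p = stats.ttest_ind(treated, control)             # pooled two-sample t-test
n1, n2 = len(treated), len(control)
diff = treated.mean() - control.mean()
sp2 = ((n1 - 1) * treated.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))                 # SE of the mean difference
half_width = stats.t.ppf(0.975, df=n1 + n2 - 2) * se
lo, hi = diff - half_width, diff + half_width

print(f"estimated effect = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f}), p = {p:.2g}")
print("reject H0 at alpha = 0.05:", not (lo <= 0.0 <= hi))
```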

Nick Cox
Michael M
  • 2
    +1 Although you might be more explicit in how this answers the OP's question (it's not so obvious). –  Jul 11 '14 at 19:02
  • @Matthew: I fully agree. – Michael M Jul 11 '14 at 19:06
  • Thanks Michael. But I guess the confidence intervals (that give the "continuous scale" answer) would refer to effect size, right? Even so, isn't there a need for a binary answer as well to complement the continuous answer, i.e. whether or not this effect (whose size is described by the CIs) meets the agreed α-level? Or maybe you can even give CIs for the p-value itself? – z8080 Jul 12 '14 at 13:03
  • (A) "Effect size" is usually referring to a standardized version of the treatment effect and thus less easy to interprete than the effect itself. (B) CI for p values are sometimes added for simulated p values to express simulation uncertainty. *(C)* If your level is 0.05, then in almost every test situation, the black/white decision from the test can be derived by looking at the corresponding 95% ci. – Michael M Jul 14 '14 at 14:08
  • (cont.) Your question is somewhat related to the following one: is it more useful to state that even the 99.9999% CI is incompatible with the null, or that even the lower bound of the 95% CI for the true effect is very promising? – Michael M Jul 14 '14 at 14:11
3

This is a common question.

A similar question is "Why is $p \le 0.05$ considered significant?" (http://www.jerrydallal.com/LHSP/p05.htm)

@Michael-Mayer gave part of the answer: statistical significance is only part of the story. With enough data, some parameters will usually show up as "significant" (look up the Bonferroni correction). Multiple testing is a particular problem in genetics, where large studies looking for significance are common and $p$-values below $10^{-8}$ are often required (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2621212/).
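
To make the multiple-testing point concrete, here is a small simulation (a sketch of my own, not from the linked paper; NumPy and SciPy are assumed): when many true null hypotheses are tested, a sizeable number of $p$-values fall below $0.05$ by chance alone, whereas a Bonferroni-corrected threshold of $0.05/m$ leaves essentially none.

```python
# Small simulation: 10,000 tests where every null hypothesis is true. About 5% of
# p-values fall below 0.05 by chance, while the Bonferroni threshold 0.05 / m
# produces essentially no false rejections.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
m = 10_000
z = rng.normal(size=m)                 # test statistics under the null
p = 2 * stats.norm.sf(np.abs(z))       # two-sided p-values

print("below 0.05:    ", int((p < 0.05).sum()))       # expect roughly 500
print("below 0.05 / m:", int((p < 0.05 / m).sum()))   # expect roughly 0
```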

Also, one issue with many analyses is that they were opportunistic and not pre-planned (i.e. "If you torture the data enough, nature will always confess." - Ronald Coase).

Generally, if an analysis is pre-planned (with appropriate corrections for repeated analyses), its significant results can be taken more seriously. Often, repeated testing by multiple individuals or groups is the best way to confirm that something works (or not), and replication of results is most often the right test of significance.

Bill Denney