17

This is a discussion question at the intersection of statistics and other sciences. I often face the same problem: researchers in my field tend to say that there is no effect when the p-value is not less than the significance level. At first, I would reply that this is not how hypothesis testing works. Given how often this question arises, I would like to discuss the issue with more experienced statisticians.

Let us consider a recent paper in a scientific journal from "the best publishing group", Nature Communications Biology (there are multiple examples, but let's focus on one).

The researchers interpret a not-statistically-significant result in the following way:

Thus chronic moderate caloric restriction can extend lifespan and enhance health of a primate, but it affects brain grey matter integrity without affecting cognitive performances.

Proof:

However, performances in the Barnes maze task were not different between control and calorie-restricted animals (LME: F = 0.05, p = 0.82; Fig. 2a). Similarly, the spontaneous alternation task did not reveal any difference between control and calorie-restricted animals (LME: F = 1.63, p = 0.22; Fig. 2b).

The authors also suggest an explanation for the absence of the effect, but the key point is not the explanation but the claim itself. The provided plots look quite different "by eye" to me (Figure 2).

Moreover, the authors ignore prior knowledge:

deleterious effects of caloric restriction on cognitive performance have been reported for rats and for cerebral and emotional functions in humans

I could understand such a claim for huge sample sizes (no detected effect then means no practically significant effect), but in this particular situation complex tests were used, and it is not obvious to me how to perform power calculations.

Questions:

  1. Did I overlook any details that make their conclusions valid?

  2. Taking into account the need to report negative results in science, how can one show with statistics that this is not merely "the absence of a result" (which is all we get with $p > \alpha$) but a "negative result" (e.g. there is no difference between the groups)? I understand that for huge sample sizes even small deviations from the null cause rejection, but let's assume that we have ideal data and still need to show that the null is practically true.

  3. Should statisticians always insist on mathematically correct conclusions like "with this power we were not able to detect an effect of a relevant size"? Researchers from other fields strongly dislike such formulations of negative results.

I would be glad to hear any thoughts on the problem; I have read and understood the related questions on this site. There is a clear answer to questions 2 and 3 from the point of view of statistics, but I would like to understand how these questions should be answered in an interdisciplinary dialogue.

UPD: I think a good example of a negative result is the first stage of medical trials, safety. When can scientists decide that a drug is safe? I guess they compare two groups and do statistics on the data. Is there a way to say that a drug is safe? Cochrane uses the careful wording "no side effects were found", but doctors say that the drug is safe. Where is the balance between accuracy and simplicity of description, so that we can say "there are no consequences for health"?

German Demidov
  • 1,501
  • 10
  • 22
  • 3
    You call results which are not statistically significant a "negative" study. This is defenestrating language. I revised it to call it what it is: not statistically significant, i.e. $p > \alpha$. If I am wrong, please tell me how. Otherwise, it is useful language for you and your collaborators for describing a study. $p > \alpha$ only means that $p > \alpha$. If $n = 500,000$ that can be a very "positive" finding in some respects; perhaps this is the first large-scale epidemiologic study to inspect the relation of a chemical exposure and human health which finds that it is in fact safe. – AdamO Apr 09 '18 at 16:13
  • @AdamO you are right. My question is tricky and I cannot describe it clearly. In other words, when an author wants to call his paper "Study reveals no difference between A and B", and there are thousands of such papers, he claims "I found a negative result and nobody should ever try to repeat it". What should the author do to be able to claim this? I understand practical and statistical significance, but non-statisticians are asking me "how do you prove negative results?" - and I do not know how to answer. Your formulation was correct; I changed it to show the way non-statisticians think. – German Demidov Apr 09 '18 at 16:23
  • 5
    Side note: I would **never** suggest using Nature as a guideline for how to properly use statistics. – Cliff AB Apr 09 '18 at 16:26
  • 1
    @AdamO I have an example of two papers published at more or less the same time: in one paper the authors claimed a strongly negative result (it was their main conclusion); in the second, more powerful study, they found an effect. But if the first author had written "having power of 80% for an effect size of 1, we were not able to find a significant effect", he would not have been published even in a journal of negative results. – German Demidov Apr 09 '18 at 16:27
  • @CliffAB yes, but papers from this particular journal may have a tremendous effect on people. E.g., suppose there is a cognitive decline as a consequence of caloric restriction, but somebody reads in this paper that there is no effect and decides to apply calorie restriction to himself, or even to motivate other people... – German Demidov Apr 09 '18 at 16:35
  • 2
    *but non-statisticians are asking me "how do you prove negative results?" - and I do not know how to answer.* What about [hypothesis often used in equivalence trials](https://en.wikipedia.org/wiki/Equivalence_test)? This includes an extra term as the "margin of equivalence" and can take the mean difference into account. – Penguin_Knight Apr 09 '18 at 16:37
  • 1
    @GermanDemidov I don't doubt studies *have* been reported in such a way, I'm sure there are *many, many* examples of that. However, I don't think it is inline with sound statistical practices. What you will find, in terms of advice on this forum, is recommendations for the latter rather than mimicking what has been done before. Since we have precise language--(not) statistically significant--to refer to test results, let's use it (if we must) and avoid ambiguity. The dirty truth is that reviewers don't publish studies which are called "negative" (1/2) – AdamO Apr 09 '18 at 16:39
  • 1
    @GermanDemidov However, I have published many papers where we argued that we applied epidemiologically sound methods on well-collected data, and found results which indicated a lack of effect (or where we found a significant effect of confounding). After describing limitations, we described the study results as not-statistically significant, and have managed to get publications into relatively respected journals. A "negative" study seems to bear two meanings: 1) not statistically significant and 2) not worth publishing. This is a cardinal sin against science as the epi journals argue. (2/2) – AdamO Apr 09 '18 at 16:43
  • @AdamO thank you, it has become much clearer to me! I think parts of these comments are worth including in your answer. – German Demidov Apr 09 '18 at 16:45
  • The paper you quote **is not from *Nature*!** It is from *Communications Biology*, a new journal from the Nature publishing group. CC @CliffAB. – amoeba Apr 09 '18 at 21:05
  • @amoeba sorry you are right. – German Demidov Apr 09 '18 at 21:12
  • 2
    It's a common mistake that Nature Publishing Group is exploiting, but the difference in prestige between the journals is enormous. That said, of course papers in Nature itself can also have sloppy statistics. – amoeba Apr 09 '18 at 21:26
  • 1
    @AdamO Why aren't you looking for evidence of equivalence in those studies? There are even [FDA guidelines on preferred forms of equivalence testing and such in the US](https://www.fda.gov/downloads/drugs/guidances/ucm070244.pdf). To quote Altman and Bland, "absence of evidence isn't evidence of absence." – Alexis Apr 09 '18 at 21:39
  • In all sincerity, how are non-statistically significant (@ α = .05) papers treated compared to statistically significant ones, both in review and after publication? What are the statistics on each one's propensity to get published? – smci Apr 10 '18 at 05:16

4 Answers

12

Speaking to the title of your question: we never accept the null hypothesis, because testing $H_{0}$ only provides evidence against $H_{0}$ (i.e. conclusions are always with respect to the alternative hypothesis: either you found evidence for $H_{A}$ or you failed to find evidence for $H_{A}$).

However, we can recognize that there are different kinds of null hypothesis:

  • You have probably learned about one-sided null hypotheses of the form $H_{0}: \theta \ge \theta_{0}$ and $H_{0}: \theta \le \theta_{0}$

  • You have probably learned about two-sided null hypotheses (aka two-tailed null hypotheses) of the form $H_{0}: \theta = \theta_{0}$, or synonymously $H_{0}: \theta - \theta_{0} = 0$ in the one-sample case, and $H_{0}: \theta_{1} = \theta_{2}$, or synonymously $H_{0}: \theta_{1} - \theta_{2} = 0$ in the two-sample case. I suspect this specific form of null hypothesis is what your question is about. Following Reagle and Vinod, I term null hypotheses of this form positivist null hypotheses, and make this explicit with the notation $H^{+}_{0}$. Positivist null hypotheses provide, or fail to provide evidence of difference or evidence of an effect. Positivist null hypotheses have an omnibus form for $k$ groups: $H_{0}^{+}: \theta_{i} = \theta_{j};$ for all $i,j \in \{1, 2, \dots k\};$ $\text{ and }i\ne j$.

  • You may just now be learning about joint one-sided null hypotheses, which are null hypotheses of the form $H_{0}: |\theta - \theta_{0}|\ge \Delta$ in the one-sample case, and $H_{0}: |\theta_{1} - \theta_{2}|\ge \Delta$ in the two-sample case, where $\Delta$ is the minimum relevant difference that you care about a priori (i.e. you say up front that differences smaller than this do not matter). Again, following Reagle and Vinod, I term null hypotheses of this form negativist null hypotheses, and make this explicit with the notation $H^{-}_{0}$. Negativist null hypotheses provide, or fail to provide, evidence of equivalence (within $\pm\Delta$), or evidence of absence of an effect (larger than $|\Delta|$). Negativist null hypotheses have an omnibus form for $k$ groups: $H_{0}^{-}: |\theta_{i} - \theta_{j}|\ge \Delta;$ for all $i,j \in \{1, 2, \dots k\};$ $\text{ and }i\ne j$ (Wellek, chapter 7).

The very cool thing to do is combine tests for difference with tests for equivalence. This is termed relevance testing, and it places both statistical power and effect size explicitly within the conclusions drawn from a test, as detailed in the description of the [tost] tag. Consider: if you reject $H_{0}^{+}$, is that because there is a true effect of a size you find relevant? Or is it because your sample size was simply so large that your test was over-powered? And if you fail to reject $H_{0}^{+}$, is that because there is no true effect, or because your sample size was too small and your test under-powered? Relevance tests address these issues head-on.
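For illustration, here is a minimal sketch (my own, not from Reagle and Vinod) of how the two rejection decisions can be combined into one relevance-style conclusion. The outcome labels roughly follow Tryon and Lewis (2009), and the p-values passed in are made up; in practice they would come from your own difference and equivalence tests for your chosen $\Delta$:

```python
# Sketch: combining a test of difference (H0+) with a test of equivalence (H0-).
def relevance_conclusion(p_difference, p_equivalence, alpha=0.05):
    reject_positivist = p_difference < alpha   # evidence of *some* difference
    reject_negativist = p_equivalence < alpha  # evidence the difference lies within +/- Delta
    if reject_positivist and reject_negativist:
        return "trivial difference: real, but smaller than the relevant Delta"
    if reject_positivist:
        return "relevant difference: an effect, possibly larger than Delta"
    if reject_negativist:
        return "statistical equivalence: no effect larger than Delta"
    return "indeterminate: likely under-powered, no conclusion either way"

# Made-up p-values purely for illustration.
print(relevance_conclusion(p_difference=0.82, p_equivalence=0.03))
```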

There are a few ways to perform tests for equivalence (whether or not one combines them with tests for difference):

  • Two one-sided tests (TOST) translates the general negativist null hypothesis expressed above into two specific one-sided null hypotheses (a minimal numerical sketch follows this list):
    • $H^{-}_{01}: \theta - \theta_{0} \ge \Delta$ (one-sample) or $H^{-}_{01}: \theta_{1} - \theta_{2} \ge \Delta$ (two-sample)
    • $H^{-}_{02}: \theta - \theta_{0} \le -\Delta$ (one-sample) or $H^{-}_{02}: \theta_{1} - \theta_{2} \le -\Delta$ (two-sample)
  • Uniformly most powerful tests for equivalence, which tend to be much more arithmetically sophisticated than TOST. Wellek is the definitive reference for these.
  • A confidence interval approach, which I believe was first motivated by Schuirmann and refined by others, such as Tryon.
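As a concrete illustration of the TOST logic in the two-sample case, here is a minimal sketch (hypothetical data and a hypothetical margin $\Delta$; a plain pooled-variance t version, not Wellek's UMP test):

```python
import numpy as np
from scipy import stats

def tost_two_sample(x1, x2, delta):
    """Two one-sided tests for H0-: |mu1 - mu2| >= delta (pooled-variance t)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    diff = x1.mean() - x2.mean()
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    df = n1 + n2 - 2
    p_lower = stats.t.sf((diff + delta) / se, df)   # H0_1: mu1 - mu2 <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)  # H0_2: mu1 - mu2 >= +delta
    return diff, max(p_lower, p_upper)              # reject H0- if the max p < alpha

# Hypothetical example: claim equivalence within Delta = 0.5 if the TOST p < 0.05.
rng = np.random.default_rng(0)
diff, p = tost_two_sample(rng.normal(0.0, 1.0, 50), rng.normal(0.1, 1.0, 50), delta=0.5)
print(f"observed difference = {diff:.2f}, TOST p = {p:.3f}")
```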


References

Reagle, D. P. and Vinod, H. D. (2003). Inference for negativist theory using numerically computed rejection regions. Computational Statistics & Data Analysis, 42(3):491–512.

Schuirmann, D. A. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657–680.

Tryon, W. W. and Lewis, C. (2008). An inferential confidence interval method of establishing statistical equivalence that corrects Tryon’s (2001) reduction factor. Psychological Methods, 13(3):272–277.

Tryon, W. W. and Lewis, C. (2009). Evaluating independent proportions for statistical difference, equivalence, indeterminacy, and trivial difference using inferential confidence intervals. Journal of Educational and Behavioral Statistics, 34(2):171–189.

Wellek, S. (2010). Testing Statistical Hypotheses of Equivalence and Noninferiority. Chapman and Hall/CRC Press, second edition.

Alexis
  • 26,219
  • 5
  • 78
  • 131
  • 1
    Whoever down-voted me should step up with some feedback about why: it should be clear that I provide detailed answers, and am responsive to input. – Alexis Apr 09 '18 at 21:03
9

You are referring to standard inference practice taught in statistics courses:

  1. form $H_0,H_a$
  2. set the significance level $\alpha$
  3. compare p-value with $\alpha$
  4. either "reject $H_0$, accept $H_a$" or "fail to reject $H_0$"

This is fine, and it's used in practice. I would even venture to guess this procedure could be mandatory in some regulated industries such as pharmaceuticals.
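As a concrete illustration of steps 1–4, here is a minimal sketch with hypothetical data (the group means, sample sizes, and $\alpha$ are all made up for the example):

```python
import numpy as np
from scipy import stats

# Step 1: H0: mu_control = mu_treated, Ha: mu_control != mu_treated (hypothetical data).
rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=25)
treated = rng.normal(loc=10.8, scale=2.0, size=25)

alpha = 0.05                                         # Step 2: set the significance level.
t_stat, p_value = stats.ttest_ind(control, treated)  # Step 3: compute the p-value.

if p_value < alpha:                                  # Step 4: compare and decide.
    print(f"p = {p_value:.3f} < {alpha}: reject H0, accept Ha")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0 (which is not 'accept H0')")
```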

However, this is not the only way statistics and inference are applied in research and practice. For instance, take a look at this paper: "Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC". The paper was the first to present evidence for the existence of the Higgs boson, from the so-called ATLAS experiment. It was also one of those papers where the list of authors is as long as the actual content :)

  • The paper mentions neither $H_0$ nor $H_a$. The term "hypothesis" is used, and you can guess from the text what their $H_0$ was.
  • They use the term "significance", but not as an $\alpha$-significance threshold as in "standard" inference. They simply express the distance in standard deviations, e.g. "the observed local significances for mH = 125 GeV are 2.7$\sigma$".
  • they present "raw" p-values, and don't run them through "reject/fail to reject" comparisons with significance levels $\alpha$, as I wrote earlier they don't even use the latter
  • They present confidence intervals at the usual confidence levels, such as 95%.

Here's how the conclusion is formulated: "These results provide conclusive evidence for the discovery of a new particle with mass 126.0 ± 0.4 (stat) ± 0.4 (sys) GeV." The label "stat" refers to statistical and "sys" to systematic uncertainties.

So, as you can see, not everyone follows the four-step procedure that I outlined at the beginning of this answer. Here, the researchers report the p-value without pre-establishing a threshold, contrary to what is taught in statistics classes. Secondly, they don't do the "reject/fail to reject" dance, at least not formally. They cut to the chase and say "here's the p-value, and that's why we say we found a new particle with a mass of 126 GeV."

Important note

The authors of the Higgs paper did not yet declare the discovery of the Higgs boson itself. They only asserted that a new particle had been found and that some of its properties, such as its mass, were consistent with the Higgs boson.

It took a couple of years to gather additional evidence before it was established that the particle is indeed the Higgs boson. See this blog post with an early discussion of the results. Physicists went on to check other properties, such as zero spin. As the evidence accumulated, CERN eventually declared that the particle is the Higgs boson.

Why is this important? Because the process of scientific discovery cannot be reduced to some rigid statistical inference procedure. Statistical inference is just one of the tools used.

When CERN was looking for this particle, the focus was first on finding it; that was the ultimate goal. Physicists had an idea of where to look. Once they found a candidate, they focused on proving it was the one. Eventually it was the totality of the evidence, not a single experiment with a p-value and a significance level, that convinced everyone that we had found the particle. Include here all the prior knowledge and the Standard Model. This is not just statistical inference; the scientific method is broader than that.

Aksakal
  • 55,939
  • 5
  • 90
  • 176
  • wow, your answer is great! this is a really good example. I hope that within at most 10 years life scientists will also adopt this reporting style! – German Demidov Apr 09 '18 at 17:49
8

I think it is at times appropriate to interpret non-statistically-significant results in the spirit of "accept the null hypothesis". In fact, I have seen statistically significant studies interpreted in such a fashion; the study was so precise that its results were consistent with a narrow range of non-null but clinically insignificant effects. Here's a somewhat blistering critique of a study (or rather its press coverage) about the relation between chocolate/red wine consumption and its "salubrious" effect on diabetes. The probability curves for the insulin resistance distributions by high/low intake are hysterical.

Whether one can interpret findings as "confirming $H_0$" depends on a great number of factors: the validity of the study, the power, the uncertainty of the estimate, and the prior evidence. Reporting the confidence interval (CI) instead of the p-value is perhaps the most useful contribution you can make as a statistician. I remind researchers and fellow statisticians that statistics do not make decisions, people do; omitting p-values actually encourages a more thoughtful discussion of the findings.

The width of the CI describes a range of effects which may or may not include the null, and may or may not include clinically very significant values, such as life-saving potential. A narrow CI, however, pins down one type of effect: either the latter kind, which is "significant" in the true sense, or the former kind, which may be the null or something very close to it.
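As a rough sketch of what reading such a CI against a clinical margin might look like (entirely hypothetical data; the ±0.5 margin stands in for whatever difference would count as clinically significant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 40)
treated = rng.normal(0.1, 1.0, 40)

# Pooled-variance 95% CI for the difference in means.
n1, n2 = len(treated), len(control)
diff = treated.mean() - control.mean()
sp2 = ((n1 - 1) * treated.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
ci = diff + np.array([-1.0, 1.0]) * stats.t.ppf(0.975, n1 + n2 - 2) * se

margin = 0.5  # hypothetical smallest clinically relevant difference
if ci[0] > -margin and ci[1] < margin:
    print(f"95% CI {np.round(ci, 2)} sits entirely inside ±{margin}: a narrow, near-null effect")
elif ci[0] > margin or ci[1] < -margin:
    print(f"95% CI {np.round(ci, 2)} excludes ±{margin}: a clinically relevant effect")
else:
    print(f"95% CI {np.round(ci, 2)} spans ±{margin}: inconclusive about relevance")
```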

Perhaps what is needed is a broader sense of what "null results" (and null effects) are. What I find disappointing in research collaboration is when investigators cannot state a priori what range of effects they are targeting: if an intervention is meant to lower blood pressure, by how many mmHg? If a drug is meant to cure cancer, by how many months will the patient's survival be extended? Someone who is passionate about research and "plugged in" to their field and science can rattle off the most amazing facts about prior research and what has been done.

In your example, I can't help but notice that the p-value of 0.82 suggests an estimate very close to the null. From that, all I can tell is that the CI is centered on a null value. What I do not know is whether it also encompasses clinically significant effects. If the CI is very narrow, the interpretation they give is, in my opinion, correct, but the data as reported do not support it: that would be a minor edit. In contrast, the second p-value of 0.22 is relatively closer to its significance threshold (whatever it may be). The authors correspondingly interpret it as "not giving any evidence of difference", which is consistent with a "do not reject $H_0$"-type interpretation. As for the relevance of the article, I can say very little. I hope that you browse the literature and find more salient discussions of study findings! As for the analyses, just report the CI and be done with it!

AdamO
  • 52,330
  • 5
  • 104
  • 209
  • 1
    AdamO, isn't the *F* statistic closest to the null equal to the *mean* of the *F* distribution for a given numerator and denominator degrees of freedom? If anything, I think an *F* statistic close to **0** implies omnibus evidence of equivalence. In fact, Wellek motivates precisely this in the 2010 *Testing Statistical Hypotheses of Equivalence and Noninferiority*, section 7.2 $F$-test for equivalence of $k$ normal distributions, pages 221–225. – Alexis Apr 09 '18 at 14:52
  • @Alexis Thanks for pointing out the F-test properties. Without knowing the degrees of freedom, it's hard for me to comment intelligently about the test. Perhaps I should revise the answer to point solely to the $p$-values. At any rate, the main point of my answer is that we cannot hold the two hypotheses $\mu=\mu_0$ and $\mu \ne \mu_0$ with equal intrigue: one of these is always true, so testing makes no sense. We have to use descriptive methods, but they can be made rigorous with a confidence interval. – AdamO Apr 09 '18 at 15:03
  • Of course! (and +1 if that was not clear) But seriously, you should get savvy about equivalence testing: it emerged within clinical epidemiology and biostatistics (an honorable heritage for the field!), but is of general import to frequentist inference. :) – Alexis Apr 09 '18 at 15:14
  • @AdamO thank you for the great answer. I totally agree with the points you made, but I only know how to calculate CIs (and express effect sizes) for simple tests like comparisons of means; even for proportions the effect sizes become trickier, and the authors of this particular paper use some kind of survival test (which I am not familiar with) - it is not clear to me what CIs should indicate in this case. Another issue is that tests are weak/strong against different alternatives (e.g. Shapiro is weak against multimodality) - can we somehow include this in the consideration? With permutations/simulations? – German Demidov Apr 09 '18 at 15:33
  • 1
    @GermanDemidov I take a hard line on these matters: I think complicated analyses shouldn't be considered if their effects cannot be interpreted. They *do* have an interpretation. Survival Analysis, 2nd ed., by Hosmer, Lemeshow, and May has a whole chapter (4) dedicated to the interpretation of Cox model output. The deficiencies of tests like Shapiro's are best addressed using plots (this often precludes the test itself). Resampling statistics provide a powerful means to calculate CIs under a wide variety of modeling conditions, but they require sound theory to be used correctly. – AdamO Apr 09 '18 at 16:08
  • 3
    In a rigid inference framework there is no such thing as "0.82 is close to the null," because the p-value is a random number and its particular level is irrelevant. The p-value can't be large or small in absolute terms. Its level only matters in relation to the pre-established threshold, the significance level $\alpha$. You compare it with the threshold and, based on the outcome of the comparison, reject or fail to reject $H_0$. – Aksakal Apr 09 '18 at 18:37
  • @Aksakal that's right, but in your answer you argue in favour of reporting raw p-values instead of the Neyman-Pearson framework, so (as I've read in some other question) the raw p-value itself provides some evidence. – German Demidov Apr 10 '18 at 08:39
  • @Aksakal I think we all agree that testing, if done, must be done properly. I agree with you. I will say it until I choke: if they had reported the damn confidence interval, we might have had some *real* evidence to cross reference their claims. – AdamO Apr 10 '18 at 13:35
6

There are ways to approach this that don't rely on power calculations (see Wellek, 2010). In particular, you can test whether you can reject the null that the effect is of an a priori meaningful magnitude.

Daniël Lakens advocates equivalence testing in this situation. Lakens in particular uses TOST (two one-sided tests) for mean comparisons, but there are other ways to get at the same idea.

In TOST you test a compound null: the one-sided null hypothesis that your effect is more negative than the smallest negative difference of interest, and the one-sided null that your effect is more positive than the smallest positive difference of interest. If you reject both, then you can claim that there is no meaningful difference. Note that this can happen even if the effect is significantly different from zero, but in no case does it require endorsing the null.
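If an off-the-shelf routine helps your collaborators, I believe statsmodels ships a TOST for two independent samples; a hedged sketch with hypothetical data and a hypothetical ±0.5 margin (check the documentation of your statsmodels version):

```python
# Sketch: TOST via statsmodels (ttost_ind), assuming the API is as I recall it.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 60)
b = rng.normal(0.05, 1.0, 60)

# low/upp are the smallest negative and positive differences of interest.
p_tost, lower_test, upper_test = ttost_ind(a, b, low=-0.5, upp=0.5)
print(f"TOST p-value = {p_tost:.3f}")  # reject the compound null (claim equivalence) if p < alpha
```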

Lakens, D. (2017). Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355-362.

Wellek, S. (2010). Testing Statistical Hypotheses of Equivalence and Noninferiority. Chapman and Hall/CRC Press, second edition.

Alexis
  • 26,219
  • 5
  • 78
  • 131
Patrick Malone
  • 394
  • 1
  • 6