I would put it simply: null hypothesis testing is really only about the null hypothesis. And generally the null hypothesis isn't what is of interest, and may not even be "the status quo" - especially in regression-type hypothesis testing. Often in social science there is no status quo, so the null hypothesis can be quite arbitrary. This makes a huge difference to the analysis, because the starting point is undefined: different researchers start with different null hypotheses, most likely based on whatever data they have available. Compare this to something like Newton's laws of motion - it makes sense to take these as the null hypothesis and try to find better theories from that starting point.
Additionally, p-values don't calculate the right probability: we don't care about tail probabilities unless the alternative hypothesis becomes more likely as you move further into the tails. What you really want to know is how well each theory predicts what was actually seen. For example, suppose I predict a 50% chance of a "light shower" and my competitor predicts a 75% chance, and a light shower is what we observe. When deciding which weather-person did better, you shouldn't give my prediction extra credit for also assigning a 40% chance to a "thunderstorm", or take credit away from my competitor for assigning "thunderstorm" a 0% chance.
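To make that comparison concrete, here is a minimal sketch in Python. The outcome categories and probabilities are the ones given above; labelling the leftover probability mass "something else" is my own assumption, made only so that each forecast is a proper distribution.

```python
# Toy sketch of the forecaster example above.
my_forecast         = {"light shower": 0.50, "thunderstorm": 0.40, "something else": 0.10}
competitor_forecast = {"light shower": 0.75, "thunderstorm": 0.00, "something else": 0.25}

observed = "light shower"

# Score each forecaster only on the probability assigned to what actually happened.
my_score = my_forecast[observed]                  # 0.50
competitor_score = competitor_forecast[observed]  # 0.75

print(competitor_score / my_score)  # likelihood ratio of 1.5 in favour of the competitor
```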
A bit of thought shows that what matters is not so much how well a given theory fits the data, but how poorly every alternative explanation fits the data. Working in terms of Bayes factors, with prior information $I$, data $D$, and some hypothesis $H$, the Bayes factor is given by:
$$BF=\frac{P(D|HI)}{P(D|\overline{H}I)}$$
If the data are impossible given that $H$ is false, then $BF=\infty$ and we become certain of $H$. The p-value typically gives you the numerator (or some approximation/transformation thereof). But note also that a small p-value only constitutes evidence against the null if there is an alternative hypothesis that actually fits the data. You could invent situations where a p-value of $0.001$ actually provides support for the null hypothesis - it all depends on what the alternative is.
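To spell out why this ratio is the relevant quantity: it is exactly the factor that converts prior odds into posterior odds,

$$\frac{P(H|DI)}{P(\overline{H}|DI)}=BF\times\frac{P(H|I)}{P(\overline{H}|I)}$$

so the data can only shift your belief in $H$ through how much better (or worse) $H$ predicts the data than $\overline{H}$ does.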
There is a well known and easily misunderstood empirical example of this, in which a coin is tossed $n=104,490,000$ times and comes up heads $y=52,263,471$ times - slightly more than half. The null model is $y\sim Bin(n,0.5)$ and the alternative is $y|\theta\sim Bin(n,\theta)$ with $\theta\sim U(0,1)$, giving the marginal model $y\sim BetaBin(n,1,1)$, which is $DU(0,\dots,n)$ (DU = discrete uniform). The p-value for the null hypothesis is very small, $p=0.00015$, so reject the null and publish, right? But look at the Bayes factor, given by:
$$BF=\frac{{n\choose y}2^{-n}}{\frac{1}{n+1}}=\frac{(n+1)!}{2^ny!(n-y)!}=11.90$$
How can this be? The Bayes factor supports the null hypothesis in spite of the small p-value? Well, look at the alternative: it assigns the observed count a probability of only $\frac{1}{n+1}\approx 0.0000000096$ - the alternative does not provide a good explanation of the facts, so the null is more likely, but only relative to that alternative. Note that the null itself only does marginally better, assigning the observed count a probability of about $0.00000011$ - tiny in absolute terms, but still roughly twelve times larger than the alternative's.
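If you want to check these numbers yourself, here is a minimal sketch using scipy; the one-sided tail probability $P(Y\ge y)$ under the null is what roughly reproduces the quoted $p=0.00015$.

```python
from scipy import stats

n, y = 104_490_000, 52_263_471

# Tail probability under the null theta = 0.5: P(Y >= y), roughly the quoted 0.00015.
p_tail = stats.binom.sf(y - 1, n, 0.5)

# Probability each model assigns to the count actually observed.
null_prob = stats.binom.pmf(y, n, 0.5)  # ~ 0.00000011
alt_prob = 1.0 / (n + 1)                # BetaBin(n, 1, 1) is uniform on 0..n, ~ 0.0000000096

print(p_tail, null_prob, alt_prob, null_prob / alt_prob)  # Bayes factor ~ 11.9
```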
This is especially true for the example that Gelman criticises - there was only ever really one hypothesis tested, and not much thought went into a) what the alternative explanations are (particularly confounding and effects not controlled for), b) how much the alternatives are supported by previous research, and, most importantly, c) what predictions they make (if any) which are substantively different from the null.
But note that $\overline{H}$ is undefined - it basically represents every other hypothesis consistent with the prior information. The only way you can really do hypothesis testing properly is to specify a range of alternatives that you are going to compare. And even if you do that, say with $H_1,\dots,H_K$, you can only report that the data support $H_k$ relative to the set you specified. If you leave important hypotheses out of that set, you can expect nonsensical results. Additionally, a given alternative may fit much better than the others and still not be likely. If you have one test where the p-value is $0.01$ but one hundred other tests where the p-value is $0.1$, it is much more likely that the "best hypothesis" ("best" has better connotations than "true") actually comes from the group of "almost significant" results.
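One rough way to put numbers on that last claim (not part of the original argument, so treat it as a sketch only): convert each p-value into the Sellke-Bayarri-Berger upper bound on the evidence against the null, $1/(-e\,p\ln p)$, give every candidate hypothesis equal prior weight, and see where the posterior mass ends up.

```python
import math

# Purely illustrative: turn each p-value into an upper bound on the evidence
# against the null via the Sellke-Bayarri-Berger calibration 1 / (-e * p * ln p).
# These are upper bounds, so the real support could be much weaker.
def max_bayes_factor(p):
    return 1.0 / (-math.e * p * math.log(p))

single = max_bayes_factor(0.01)        # ~ 8.0, the one "significant" result
group = [max_bayes_factor(0.1)] * 100  # ~ 1.6 each, the "almost significant" ones

# With equal prior weight on all 101 candidates, posterior mass is proportional
# to these weights (a simplifying assumption for illustration only).
total = single + sum(group)
print(single / total)      # ~ 0.05
print(sum(group) / total)  # ~ 0.95
```

Even though the single hypothesis has the strongest individual support, under these (very rough) assumptions about 95% of the mass sits in the "almost significant" group.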
The major point to stress is that a hypothesis can never exist in isolation from the alternatives. For, after specifying $K$ theories/models, you can always add a new hypothesis
$$H_{K+1}=\text{Something else not yet thought of}$$
In effect, this type of hypothesis is what progresses science - someone has a new idea/explanation for some effect and then tests the new theory against the current set of alternatives. It's $H_{K+1}$ vs $H_1,\dots,H_K$, and not simply $H_0$ vs $H_A$. The simplified version only applies when there is one very strongly supported hypothesis among $H_1,\dots,H_K$ - i.e., of all the ideas and explanations we currently have, there is one dominant theory that stands out. This is definitely not true for most areas of social/political science, economics, and psychology.