What is the history of $p < 0.05$ or 95% confidence?

Question

I'm wondering about the history of using $p < 0.05$ or a 95% confidence interval. I know that more nuanced reasoning would argue that there is nothing special about 0.05 or 95% (I think decision theory offers guidance on what level of risk to accept), and that the use of these numbers has more to do with "tradition" and with repeating what was taught in stats courses that themselves don't properly discuss why these numbers should be used.

Wasn't it Fisher who suggested that $p < 0.05$ was small enough to discard a hypothesis, and mostly just as a suggestion rather than a rule? Or am I wrong on that?

The first appearance of a statement like this in Fisher occurs in the first statistical chapter, "Tests of Goodness of Fit, Independence and Homogeneity:" "In preparing this [chi-squared] table we have borne in mind that in practice we do not want to know the exact value of $P$ for any observed $\chi^2,$ but, in the first place, whether or not the observed value is open to suspicion. ... We shall not often be astray if we draw a conventional line at $.05,$ and consider that higher values of $\chi^2$ indicate a real discrepancy." *Stat. Methods for Research Workers* 5th Ed. (1934) p. 82. — whuber, Oct 10 '21 at 18:52

Sextus Empiricus · Answer 1 · 2021-10-10T22:54:28.883

Fisher suggested the 0.05 level indirectly. He mentioned that two standard deviations is an easy rule for significance, and the 0.05 level is what approximately corresponds to it.

From Fisher's 1925 *Statistical Methods for Research Workers*:

> If, therefore, we know the standard deviation of a population, we can calculate the standard deviation of the mean of a random sample of any size, and so test whether or not it differs significantly from any fixed value. If the difference is many times greater than the standard error, it is certainly significant, and it is a convenient convention to take twice the standard error as the limit of significance; this is roughly equivalent to the corresponding limit $P=.05$, already used for the $\chi^2$ distribution.
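
The "roughly equivalent" can be checked numerically. This is a minimal sketch in base R, assuming a normal distribution and a two-sided test; the variable names are my own, not Fisher's.

```r
# Two-sided tail probability beyond 2 standard errors under a normal distribution
p_two_se <- 2 * pnorm(-2)
# Cut-off, in standard errors, that gives exactly P = .05 (two-sided)
z_05 <- qnorm(1 - 0.05 / 2)
print(round(c(p_two_se = p_two_se, z_05 = z_05), 4))
```

Under these assumptions, twice the standard error corresponds to a two-sided $P$ of about $.0455$, while $P=.05$ corresponds to about $1.96$ standard errors.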

He also mentions that this level was already in use, which refers to Pearson's chi-squared test. In the same book he writes about the construction of a table of values for the $\chi^2$ distribution:

> we have not reprinted Elderton's table, but have given a new table (Table III. p. 98) in a form which experience has shown to be more convenient. Instead of giving the values of $P$ corresponding to an arbitrary series of values of $\chi^2$, we have given the values of $\chi^2$ corresponding to specially selected values of $P$. We have thus been able in a compact form to cover those parts of the distributions which have hitherto not been available, namely, the values of $\chi^2$ less than unity, which frequently occur for small values of $n$, and the values exceeding $30$, which for larger values of $n$ become of importance.
>
> ...
>
> In preparing this table we have borne in mind that in practice we do not want to know the exact value of $P$ for any observed $\chi^2$, but, in the first place, whether or not the observed value is open to suspicion. If $P$ is between $.1$ and $.9$ there is certainly no reason to suspect the hypothesis tested. If it is below $.02$ it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at $.05$, and consider that higher values of $\chi^2$ indicate a real discrepancy.

So the .05 level stems from two types of convenience:

- It relates to the 68-95-99.7 rule and the 2 sigma value.
- It relates to the lack of computers in the old days and the need to look up values of distributions in tables. To make these tables easier to use, Fisher thought it would be better to give $\chi^2$ as a function of $p$ instead of the other way around, so convenient levels of $p$ had to be chosen to construct this new type of table.
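
To illustrate the table layout, here is a sketch in base R that gives $\chi^2$ for selected values of $P$. The particular $P$ levels and degrees of freedom are my own illustrative choices, not a reproduction of Fisher's Table III.

```r
# Illustrative choices only; not the layout of Fisher's actual table
P  <- c(.90, .50, .10, .05, .02, .01)  # upper-tail probabilities
df <- 1:5                              # degrees of freedom (Fisher's n)
# For each (n, P) pair: the chi-squared value exceeded with probability P
tab <- outer(df, P, function(n, p) qchisq(p, df = n, lower.tail = FALSE))
dimnames(tab) <- list(n = as.character(df), P = as.character(P))
print(round(tab, 3))
```

Each row can be read off directly against an observed $\chi^2$, which is the kind of lookup the quoted passage describes: one only needs to see whether the observed value falls beyond the column for the chosen level.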

BruceET · Accepted Answer · 2021-10-10T19:14:25.770


See this historical article by Stigler (2008) in Chance, about Fisher's influence (as you suggest).

Much of early significance testing used the standard normal distribution. As cut-off values move below $-2.0$, the tail probability diminishes rapidly. So if one wants a relatively small tail probability without insisting on $z$-values too far from $0,$ cut-off points around $\pm 2$ give a reasonable trade-off between more extreme $z$-values and smaller probabilities. If one wants a "round" number for the sum of the two tail probabilities, such as $0.01, 0.02, 0.03,$ $0.04, 0.05, 0.06,$ etc., then something near $0.05=5\%$ seems reasonable.

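```r
# Lower-tail probabilities from 0.01 to 0.10 and the matching standard normal quantiles
p = seq(.01, .1, by = .01); z = qnorm(p)
plot(z, p, ylim = c(0, .1))
```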

edited Oct 10 '21 at 19:14

answered Oct 10 '21 at 19:02

BruceET


I am going to "accept" this answer over @SextusEmpiricus because while his answer is very good, I think the article cited here gives a larger context, in that Fisher was far from the first in suggesting some cut-off value for determining whether statistical evidence was indicative of something. – cgmil Oct 11 '21 at 00:36
Thanks. I have up-voted the other answers, each of which has something useful. – BruceET Oct 11 '21 at 00:39

It is interesting to add the actual link to Edgeworth. In ['methods of statistics'](https://www.jstor.org/stable/25163974) he described significance of a z-test to compare means: *"Consider whether the difference between the observed Means exceeds two or three times the modulus of that curve. If it does, the difference is not accidental"*. But he did not suggest explicit p-values. The modulus is in this context $c$ in the formula $e^{-x^2/c^2}$, so it corresponds to $k\sqrt{2}\sigma$ which relates to a p-value of $0.0046$ if $k=2$. – Sextus Empiricus Oct 11 '21 at 16:21
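
The figure quoted for $k=2$ can be checked with a short sketch in base R, assuming a normal curve and a two-sided tail. The names `k`, `sigma`, and `modulus` are illustrative only.

```r
k <- 2
sigma <- 1                  # any positive value; the tail probability is scale-free
modulus <- sqrt(2) * sigma  # c in exp(-x^2 / c^2)
# Two-sided probability of a normal deviation beyond k times the modulus
p_k <- 2 * pnorm(-k * modulus, sd = sigma)
print(signif(p_k, 3))
```

This comes out at about $0.00468$, consistent with the value quoted above up to rounding.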

score 2 · Answer 3 · answered Oct 10 '21 at 18:20


As I remember it, it was indeed Fisher who threw 0.05 out there as a suggestion, and it has been taken as law in many circles since then. I don't have the book at hand, but I read the passage once, and it can probably be found through a quick Google search.

I am not that familiar with decision theory, so I can't definitively say that Fisher is the only reason for this.


Nicholas Fitzhugh


