
After reading a lot of great answers on the topic of Fisherian versus Neyman & Pearson, I still cannot understand how Fisher carries out his test.

Here is my understanding of his workflow:

  1. Ask a question.
  2. Propose a null hypothesis based on the question.
  3. Do some experiments, and collect some data.
  4. Assuming the null is true, calculate the $p$-value.
  5. Report the exact $p$-value without comparing it with any explicit criteria.
  6. Take a look at your $p$:
    1. If you subjectively think it's too small, discard the data at hand and go back to step 2 above, proposing another null hypothesis.
    2. If you subjectively think it's large enough, discard the data at hand, do another experiment, collect some new data, and find better ways to test your theory on them.

(From my understanding, the data used in the NHST stage cannot be used in further steps, regardless of whether we are following Fisher or Neyman & Pearson.)

I'm not sure if I'm correct, but there is a problem even if I am. Conventionally, the $p$-value is defined as

the probability of obtaining a test statistic at least as extreme as the actual sample value obtained given that the null hypothesis is true.

OK, but how do you define extreme without an alternative hypothesis? Everyone is a bit loose with their language when it comes to this issue, so while illustrative examples don't hurt, please don't post answers containing only examples. I am essentially asking for a high-level description of how Fisher chooses his critical region.

Note that this is not a problem for the Neyman–Pearson approach: they didn't even mention the $p$-value in the 1933 paper that proposes the Neyman–Pearson lemma, so we can define Neyman–Pearson hypothesis testing in terms of critical regions, and then establish a bijection between $p$-values and critical regions. Fisher, on the other hand, doesn't seem to have a single paper that clearly documents his approach, and I'm a little confused.
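
To make the ambiguity concrete, here is a toy sketch of my own (not from any of the sources discussed here): under a skewed binomial null, ordering outcomes by their distance from the null expectation and ordering them by their null likelihood yield different "two-sided" $p$-values.

```python
# Two plausible readings of "at least as extreme" for X ~ Bin(20, 0.3)
# under the null; the sample size, null, and observed count are made up.
from scipy.stats import binom

n, p0, k_obs = 20, 0.3, 9
pmf = [binom.pmf(k, n, p0) for k in range(n + 1)]

# Reading 1: extremeness = distance of k from the null expectation n*p0.
p_distance = sum(pmf[k] for k in range(n + 1)
                 if abs(k - n * p0) >= abs(k_obs - n * p0))
# Reading 2: extremeness = low likelihood under the null.
p_likelihood = sum(pmf[k] for k in range(n + 1)
                   if pmf[k] <= pmf[k_obs] * (1 + 1e-9))
print(p_distance, p_likelihood)   # the two "two-sided" p-values differ
```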

nalzok
  • Fisher typically\* uses the likelihood to denote what's more extreme -- lower likelihood is more extreme. \*(at least where he doesn't have an explicit test statistic which makes the ordering obvious)... e.g. consider the two tailed version of what's usually called the Fisher exact test (and its extension to $r\times c$ tables), where the tables are unambiguously ordered by their likelihood. (See the sketch after these comments.) – Glen_b Jul 21 '19 at 02:01
  • Actually, this is explicit in the question [P-value: Fisherian vs. contemporary frequentist definitions](https://stats.stackexchange.com/questions/386369/p-value-fisherian-vs-contemporary-frequentist-definitions). (This question was even in the "Related" questions list, which you can presently see in the right hand sidebar -- always a good thing to check.) $\,$ Using likelihood leads to what I see as the central difference between Fisher's approach and the Neyman-Pearson approach: Typically, a Fisher test is an "omnibus" test in the sense that every alternative that lowers likelihood ... ctd – Glen_b Jul 21 '19 at 06:17
  • ctd ... will tend to lead to rejection, while a Neyman-Pearson test is designed to have power against a specific alternative (or, more generally, against some specific sequence of alternatives). To me that doesn't make them especially competing notions of testing at all, but tools designed for somewhat different situations (i.e. as Alecos mentions [here](https://stats.stackexchange.com/a/112786/805) when discussing work by Spanos, *complementary*), each good at what they're trying to do, and one may quite reasonably choose one or the other depending on the circumstances. – Glen_b Jul 21 '19 at 06:25
  • (This is not presently an answer for two reasons; (i) I am debating whether this should be considered a duplicate, and (ii) if this isn't a duplicate and my comment ('it's based on the likelihood') were to be expanded into an answer, I should like to quote Fisher directly - though I don't expect he will mention the word likelihood specifically in this context.) – Glen_b Jul 21 '19 at 07:02
  • 'how do you define extreme without an alternative hypothesis?' is answered in your question 'as extreme as the actual sample value obtained' – ReneBt Jul 21 '19 at 07:07
  • @nalzok It seems you have a clear understanding of Fisher's $p$ test. The calculation is apparently based on simulations and observed $t$ statistics. –  Jul 21 '19 at 14:47
  • @Glen_b Can you point me to a typical Fisherian test? I think the lady-tasting-tea experiment misses something: its null hypothesis is "the lady gives random guesses about if the milk went in first", but rejecting it only implies "the lady does not give random guesses", rather than "the lady's guesses are always correct". – nalzok Jul 21 '19 at 19:41
  • The lady tasting tea won't do. If you do the 4 vs 4 cups version, there's only one probability to deal with (no ''more extreme" to deal with), and if you extend the number of cups (which Fisher does discuss) and don't require the taster to be perfect (but simply to beat chance) then it's a one-tailed test (so 'more extreme' is otherwise obvious). You need to either go to the two tailed version or the $r\times c$ table version (mentioned in my initial comment above), where some method of deciding more extreme is required. It's there that you see that ordering by likelihood under the null occurs – Glen_b Jul 22 '19 at 04:24
  • At http://math.arizona.edu/~piegorsch/571A/TR194.pdf: "Defined simply, a P-value is a data-based measure that helps indicate departure from a specified null hypothesis, $H_0$, in the direction of a specified alternative $H_a$. Formally, it is the probability of recovering a response as extreme as or more extreme than that actually observed, when $H_0$ is true. (Note that 'more extreme' is defined in the context of $H_a$. For example, when testing $H_0: \theta = \theta_0$ vs. $H_a: \theta > \theta_0$, 'more extreme' corresponds to values of the test statistic supporting $\theta > \theta_0$.)" –  Jul 22 '19 at 14:56
  • @Glen_b. I'm not sure Fisher did typically use the likelihood to index extremeness. Yates (1984), JRSS A, **147**, "Tests of Significance for 2x2 Contingency Tables", p. 444, quotes Fisher's reply to a letter from D.J Finney asking about two-tailed tests for FET (1946): "I believe I can defend the simple solution of doubling the total probability, not because it corresponds to any discrete subdivision of cases of the other tail, but because it corresponds with halving the probability, supposedly chosen in advance, with which the one observed is to be compared. [...] How does this strike you?". – Scortchi - Reinstate Monica Jul 29 '19 at 16:20
  • @Scortchi I wouldn't for a moment suggest that it was something he was always and everywhere consistently insisting on - especially across about six decades - but even so there's still a suggestion of it in that quote as what might otherwise have been done (specifically, the "discrete subdivision of cases of the other tail"). He's saying that alternatively to that you can make an argument that you could argue to halve the significance level and look in the observed tail. – Glen_b Jul 29 '19 at 22:38
  • @Glen_b: Sure, but - & I should've said this - I'm not aware of Fisher's having commented on the issue *at all* except in this letter. It's Finney who brings up the matter of correspondence to "discrete subdivision of cases of the other tail" as a criticism of the double-the-one-tailed-p-value approach, though he doesn't suggest any particular alternative approach. – Scortchi - Reinstate Monica Jul 30 '19 at 09:47
  • Yes, certainly it's necessary to provide something like a quote to support my belief (rather than secondary sources making the same claim I did) -- which is why this isn't an answer. – Glen_b Jul 30 '19 at 10:25
  • @Glen_b: Perhaps not necessary: if there's a modern Fisherian approach - one drawing heavily on Fisher's ideas, but rejecting some (e.g. fiducial inference) & extending or formalizing others - it'll only be detailed in secondary sources. – Scortchi - Reinstate Monica Jul 30 '19 at 11:30
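
To make Glen_b's ordering concrete, here is a minimal sketch (my own construction, with made-up numbers; not code from any of the commenters): enumerate every $2\times2$ table sharing the observed margins, and sum the null hypergeometric probabilities of all tables no more likely than the observed one.

```python
# A sketch of "more extreme = lower likelihood under the null" for the
# two-tailed Fisher exact test.  The function name and the example table
# are hypothetical.
from scipy.stats import fisher_exact, hypergeom

def fet_two_sided_by_likelihood(a, b, c, d):
    """Sum the null probabilities of every 2x2 table (same margins)
    whose hypergeometric probability is <= that of the observed table."""
    M, n, N = a + b + c + d, a + b, a + c       # total, row-1 margin, col-1 margin
    support = range(max(0, n + N - M), min(n, N) + 1)
    probs = [hypergeom.pmf(k, M, n, N) for k in support]
    p_obs = hypergeom.pmf(a, M, n, N)
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))  # tolerance for ties

# As I understand it, scipy's two-sided Fisher exact test uses this same
# ordering, so the two numbers should agree up to floating-point tolerance:
print(fet_two_sided_by_likelihood(8, 2, 1, 5), fisher_exact([[8, 2], [1, 5]])[1])
```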

1 Answer


Fisher's approach, in a fully parametric framework, was to reduce the data $X$ to a (one-dimensional) statistic sufficient, or conditionally sufficient, for the parameter of interest $\theta$, & to base inference on its distribution under the null hypothesis $\theta=\theta_0$. Typically he used the (or a) maximum-likelihood estimate $\hat\theta(X)$ (in any case the MLE, when unique, will be a one-to-one function of any one-dimensional sufficient statistic when there is one); though I don't recall any explicit discussion, viewing the maximum-likelihood estimate $\hat\theta(X_1)$ as more extreme than $\hat\theta(X_2)$ because it's further away from $\theta_0$ in the same direction follows naturally enough. See Fisher (1934), Proc. Royal Soc. Lond. A, 144, "Two New Properties of Mathematical Likelihood", § 2.6, for his emphasis on the connection between (maximum-likelihood) estimation & significance testing.
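
As a small illustration of this (my own sketch, not an example from Fisher): for a normal mean with known variance, the sample mean is both the MLE and a sufficient statistic, and "more extreme" simply means an MLE lying further from $\theta_0$ in the observed direction.

```python
# Extremeness via the MLE's distance from theta_0, for N(theta, sigma^2)
# with sigma known; the data are made up.
import numpy as np
from scipy.stats import norm

theta0, sigma = 0.0, 1.0
x = np.array([0.8, 1.1, -0.2, 0.9, 1.4, 0.5])  # hypothetical sample
mle = x.mean()                                  # sample mean = MLE, sufficient
se = sigma / np.sqrt(len(x))                    # sd of the MLE under the null
# Probability, under theta = theta0, of an MLE at least this far out
# in the direction actually observed (a one-tailed p-value):
p = norm.sf(abs(mle - theta0) / se)
print(mle, p)
```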

He doesn't seem to have given a great deal of thought to the calculation of p-values for two-tailed tests (at least for test statistics having discrete distributions). Yates (1984), JRSS A, 147, "Tests of Significance for 2x2 Contingency Tables", p. 444, quotes Fisher's (1946) reply to a letter from D. J. Finney asking about two-tailed p-values for Fisher's Exact Test:

I believe I can defend the simple solution of doubling the total probability, not because it corresponds to any discrete subdivision of cases of the other tail, but because it corresponds with halving the probability, supposedly chosen in advance, with which the one observed is to be compared. [...] How does this strike you?

On the face of it, this argument belongs more to the Neyman–Pearson approach.
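
A quick numeric check (my own toy table, not one from Yates or Fisher) shows the two rules genuinely differ: doubling the smaller tail probability, as in the quote, versus summing the probabilities of all tables no more likely than the observed one, which is, as far as I know, how scipy.stats.fisher_exact computes its two-sided p-value.

```python
# Doubling the one-tailed p-value vs. ordering tables by null likelihood,
# for a hypothetical asymmetric 2x2 table [[1, 5], [7, 2]].
from scipy.stats import fisher_exact, hypergeom

a, b, c, d = 1, 5, 7, 2
M, n, N = a + b + c + d, a + b, a + c
lower = hypergeom.cdf(a, M, n, N)        # Pr(X <= a) under the null
upper = hypergeom.sf(a - 1, M, n, N)     # Pr(X >= a) under the null
p_doubled = 2 * min(lower, upper)        # Fisher's "simple solution": ~0.070
p_ordered = fisher_exact([[a, b], [c, d]])[1]  # likelihood ordering: ~0.041
print(p_doubled, p_ordered)              # the two rules need not agree
```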

Fisher (1973), Statistical Methods & Scientific Inference, pp. 49–50, draws a distinction between testing a "general hypothesis" (a model) as a whole, & testing for a particular value of one of its parameters. In the latter case he reiterates the approach above; in the former his advice is this:

In choosing the grounds upon which a general hypothesis should be rejected, personal judgement may & should properly be exercised. The experimenter will rightly consider all points on which, in the light of current knowledge, the hypothesis may be imperfectly accurate, & will select tests, so far as possible, sensitive to these possible faults, rather than to others.

Which doesn't seem poles apart from the approach of stipulating an alternative hypothesis precisely & basing your choice of test statistic on considerations of power.

Scortchi - Reinstate Monica
  • (+1) In the Neyman–Pearson framework, a test on a $\chi^2$ statistic is usually one-tailed, but Fisher considered it as a two-tailed test. Why do you think it belongs more to the Neyman–Pearson approach? – nalzok Jul 30 '19 at 18:29
  • @nalzok: There are lots of test statistics following a chi-squared distribution - can you give a context/reference? But anyway, I was referring to "the probability, supposedly chosen in advance, with which the one observed is to be compared" - Fisher is, it seems to me, justifying the p-value $2\min[\Pr(X \leq x), \Pr(X \geq x)]$ by its correspondence with an N-P style accept/reject test having that as its pre-specified significance level - controlling the Type I error. – Scortchi - Reinstate Monica Jul 30 '19 at 21:17
  • I was referring to the goodness of fit test since you talked about 2x2 contingency tables. Thanks for the clarification. I can see your point now. – nalzok Jul 30 '19 at 23:11