
There exists a certain school of thought according to which the most widespread approach to statistical testing is a "hybrid" between two approaches: that of Fisher and that of Neyman-Pearson; these two approaches, the claim goes, are "incompatible" and hence the resulting "hybrid" is an "incoherent mishmash". I will provide a bibliography and some quotes below, but for now suffice it to say that a lot has been written about this in the Wikipedia article on Statistical hypothesis testing. Here on CV, this point was repeatedly made by @Michael Lew (see here and here).

My question is: why are the F and N-P approaches claimed to be incompatible, and why is the hybrid claimed to be incoherent? Note that I read at least six anti-hybrid papers (see below), but still fail to understand the problem or the argument. Note also that I am not suggesting we debate whether F or N-P is the better approach; nor am I proposing to discuss frequentist vs. Bayesian frameworks. Instead, the question is: accepting that both F and N-P are valid and meaningful approaches, what is so bad about their hybrid?


Here is how I understand the situation. Fisher's approach is to compute the $p$-value and take it as evidence against the null hypothesis. The smaller the $p$, the more convincing the evidence. The researcher is supposed to combine this evidence with his background knowledge, decide if it is convincing enough, and proceed accordingly. (Note that Fisher's views changed over the years, but this is what he seems to have eventually converged to.) In contrast, the Neyman-Pearson approach is to choose $\alpha$ ahead of time and then to check if $p\le\alpha$; if so, call the result significant and reject the null hypothesis (here I omit a large part of the N-P story that has no relevance for the current discussion). See also an excellent reply by @gung in When to use Fisher and Neyman-Pearson framework?

The hybrid approach is to compute the $p$-value, report it (implicitly assuming that the smaller the better), and also call the results significant if $p\le\alpha$ (usually $\alpha=0.05$) and nonsignificant otherwise. This is supposed to be incoherent, but how it can be invalid to do two valid things simultaneously beats me.
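
To make this concrete, here is a minimal sketch of the three reporting styles, assuming an already-computed $p$-value; the helper names and the example value $p=0.012$ are mine, purely for illustration:

```python
def fisher_report(p):
    """Fisherian reading: report the exact p-value as graded evidence against H0."""
    return f"p = {p:.4f} (smaller p = stronger evidence against H0)"

def np_decision(p, alpha=0.05):
    """Neyman-Pearson reading: a binary decision at a pre-specified alpha."""
    return "significant, reject H0" if p <= alpha else "not significant, do not reject H0"

def hybrid_report(p, alpha=0.05):
    """The 'hybrid': do both -- report the exact p AND state the decision at alpha."""
    return f"{fisher_report(p)}; {np_decision(p, alpha)} at alpha = {alpha}"

print(hybrid_report(0.012))   # assumed p-value, for illustration only
```

Each of the first two functions on its own is standard practice; the hybrid simply composes them, which is exactly what the anti-hybridists object to.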

The anti-hybridists view as particularly incoherent the widespread practice of reporting $p$-values as $p<0.05$, $p<0.01$, or $p<0.001$ (or even $p\ll0.0001$), where the strongest inequality that holds is always chosen. The argument seems to be that (a) the strength of evidence cannot be properly assessed because the exact $p$ is not reported, and (b) people tend to interpret the right-hand number in the inequality as $\alpha$ and view it as a type I error rate, and that is wrong. I fail to see a big problem here. First, reporting the exact $p$ is certainly the better practice, but nobody really cares whether $p$ is e.g. $0.02$ or $0.03$, so rounding it on a log scale is not so bad (and going below $\sim 0.0001$ does not make sense anyway, see How should tiny p-values be reported?). Second, if the consensus is to call everything below $0.05$ significant, then the error rate will be $\alpha=0.05$ even though $p \ne \alpha$, as @gung explains in Interpretation of p-value in hypothesis testing. Even though this is potentially a confusing issue, it does not strike me as more confusing than other issues in statistical testing (outside of the hybrid). Also, every reader can have her own favourite $\alpha$ in mind when reading a hybrid paper, and her own error rate as a consequence. So what is the big deal?
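
For illustration, a small sketch (mine, not taken from any of the cited papers) of this rounding convention, reporting only the strongest inequality that holds:

```python
def strongest_inequality(p, thresholds=(0.001, 0.01, 0.05)):
    """Round p 'on a log scale': report the tightest conventional bound it satisfies."""
    for t in thresholds:                     # thresholds in increasing order
        if p < t:
            return f"p < {t}"
    return f"p = {p:.2f} (n.s. at 0.05)"     # exact p only when nothing is 'significant'

for p in (0.0004, 0.02, 0.03, 0.20):
    print(strongest_inequality(p))           # 0.02 and 0.03 both become "p < 0.05"
```

Note how $0.02$ and $0.03$ collapse into the same report, which is objection (a); objection (b) is that readers then read the bound as if it were $\alpha$.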

One of the reasons I want to ask this question is that it literally hurts to see how much of the Wikipedia article on Statistical hypothesis testing is devoted to lambasting the hybrid. Following Halpin & Stam, it claims that a certain Lindquist is to blame (there is even a large scan of his textbook with "errors" highlighted in yellow), and of course the Wikipedia article about Lindquist himself starts with the same accusation. But then, maybe I am missing something.


References

Quotes

Gigerenzer: What has become institutionalized as inferential statistics in psychology is not Fisherian statistics. It is an incoherent mishmash of some of Fisher's ideas on one hand, and some of the ideas of Neyman and E. S. Pearson on the other. I refer to this blend as the "hybrid logic" of statistical inference.

Goodman: The [Neyman-Pearson] hypothesis test approach offered scientists a Faustian bargain -- a seemingly automatic way to limit the number of mistaken conclusions in the long run, but only by abandoning the ability to measure evidence [a la Fisher] and assess truth from a single experiment.

Hubbard & Bayarri: Classical statistical testing is an anonymous hybrid of the competing and frequently contradictory approaches [...]. In particular, there is a widespread failure to appreciate the incompatibility of Fisher's evidential $p$ value with the Type I error rate, $\alpha$, of Neyman-Pearson statistical orthodoxy. [...] As a prime example of the bewilderment arising from [this] mixing [...], consider the widely unappreciated fact that the former's $p$ value is incompatible with the Neyman-Pearson hypothesis test in which it has become embedded. [...] For example, Gibbons and Pratt [...] erroneously stated: "Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error."

Halpin & Stam: Lindquist's 1940 text was an original source of the hybridization of the Fisher and Neyman-Pearson approaches. [...] rather than adhering to any particular interpretation of statistical testing, psychologists have remained ambivalent about, and indeed largely unaware of, the conceptual difficulties implicated by the Fisher and Neyman-Pearson controversy.

Lew: What we have is a hybrid approach that neither controls error rates nor allows assessment of the strength of evidence.

amoeba
  • +1 for this well researched (even if long) question. It would help, I think, to specify further what exactly is confusing. Is it enough to know that for Fisher there doesn't exist an alternative hypothesis at all, whereas for NP the world of possibilities is exhausted by the null and the alternative? Seems incoherent enough to me, but alas I do the hybrid thing all the time because you can't avoid it, so ingrained has it become. – Momo Aug 21 '14 at 13:15
  • And a note on the wording in the anti-hybrid papers (or other contested things in stats, for that matter): never forget that they are often polemic to bring their point across. Rarely do these papers really reflect the subtleties, normativities and necessities involved in such a debate. Everyone's a critic. – Momo Aug 21 '14 at 13:22
  • @Momo: to your question about "what exactly is confusing" -- well, what confuses me is the frenzy of the anti-hybrid rhetoric. "Incoherent mishmash" are strong words, so I would like to see a pretty bad inconsistency. What you said about the alternative hypothesis does not sound like one to me (in the garden variety case of $H_0: \mu=0$ the alternative is obviously $H_1: \mu \ne 0$, and I don't see much room for inconsistency), but if I am missing your point then maybe you would like to provide it as an answer. – amoeba Aug 21 '14 at 15:42
  • Having just read Lew (and realizing I'd read it before, probably around 2006), I found it quite good, but I don't think it represents how I use p-values. My significance levels - on the rare occasions I use hypothesis testing at all* - are always up front and, where I have any control over sample size, chosen after consideration of power, some consideration of the cost of the two error types and so on - essentially Neyman-Pearson. I still quote p-values, but not in the framework of Fisher's approach .... (ctd) – Glen_b Aug 22 '14 at 02:26
  • (ctd) ... * (I often steer people away from hypothesis testing - so often their actual questions are related to measuring effects, and are better answered by constructing intervals). The specific problem Lew raised for the 'hybrid' procedure applies to something I don't do and would tend to caution people against doing. If there are people really doing the mix of approaches he implies, the paper seems fine. The earlier discussion of the meaning of p-values and the history of the approaches seems excellent. – Glen_b Aug 22 '14 at 02:28
  • @Glen_b, Lew's historical overview is very nice and clear, I fully agree. My trouble is specifically with the hybrid issue (section "Which approach is most used?"). Certainly there *are* people doing what he describes there, i.e. reporting the strongest of p<.05, p<.01, or p<.001; I see it all the time, e.g. in neuroscience. If you follow the NP framework and choose alpha=.05 ahead of time, how would your wording differ when you get e.g. p=.00011? – amoeba Aug 22 '14 at 11:02
  • For me the wording *might* differ slightly in that it might point out that the sample would have led to rejection at smaller significance levels than the one used (which is logically true, but not [especially meaningful](http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf) in the context of my test). Since I'm in effect using the p-value itself as a test statistic, there's no more problem reporting it than there is reporting any other test statistic -- the issue is the interpretation. ...(ctd) – Glen_b Aug 22 '14 at 11:55
  • (ctd) ... I think that counts as technically hybrid, but it's really not like what Lew lists for the Fisher approach. [There are particular circumstances where I do something more like a Fisher approach, however - but those would not be formal hypothesis tests, as one would report in a paper.] – Glen_b Aug 22 '14 at 12:03
  • (ctd) ... One example is a situation where I am explaining to someone else how to carry out a test and I don't know what significance level they want to use; I may point out at which typical significance levels they'd reject. – Glen_b Aug 22 '14 at 12:10
  • @Glen_b: That is interesting, thank you. It seems that you strongly prefer NP, and don't particularly appreciate Fisher's "strength of evidence" approach, even outside of the hybrid. Personally, I would be happy to see your comments joined together in one reply (especially if you elaborate on them a bit); this thread seems to become quite popular and to have your point of view clearly expressed here would be a great addition. – amoeba Aug 22 '14 at 13:30
  • @amoeba I don't think Fisher's approach is wrong - the intuitive appeal is clear - it's just that there are enough caveats to go along with it that I tend not to use it when being formal. Which is odd, because I'd have described myself as using a hybrid approach. I guess it depends on the situation. – Glen_b Aug 23 '14 at 08:32
  • @amoeba thanks for the clarification from above, good to know you were thinking something along those lines of "why this rhetoric?". I think the wording has to be strong to be heard (I recently read something in the same vein that advocates against using CIs). – Momo Aug 23 '14 at 11:54
  • Thanks for the interesting question. Now, ~7 years later, have you come to a resolution? Would you mind posting it? – dariober May 22 '21 at 17:31

6 Answers


I believe the papers, articles, posts, etc. that you diligently gathered contain enough information and analysis as to where and why the two approaches differ. But being different does not mean being incompatible.

The problem with the "hybrid" is that it is a hybrid and not a synthesis, and this is why it is treated by many as a hybris, if you excuse the word-play.
Not being a synthesis, it does not attempt to reconcile the differences between the two approaches and either create one unified, internally consistent approach, or keep both approaches in the scientific arsenal as complementary alternatives, in order to deal more effectively with the very complex world we try to analyze through statistics. (Thankfully, this last outcome is what appears to be happening with the other great civil war of the field, the frequentist-Bayesian one.)

The dissatisfaction with it comes, I believe, from the fact that it has indeed created misunderstandings in applying statistical tools and interpreting statistical results, mainly among scientists who are not statisticians, misunderstandings that can have very serious and damaging effects (thinking about the field of medicine helps give the issue its appropriate dramatic tone). This misapplication is, I believe, widely accepted as a fact, and in that sense the "anti-hybrid" point of view can be considered widespread (at least because of the consequences it has had, if not for its methodological issues).

I see the evolution of the matter so far as a historical accident (but I don't have a $p$-value or a rejection region for my hypothesis), due to the unfortunate battle between the founders. Fisher and Neyman/Pearson fought bitterly and publicly for decades over their approaches. This created the impression that this is a dichotomous matter: one approach must be "right", and the other must be "wrong".

The hybrid emerged, I believe, out of the realization that no such easy answer existed, and that there are real-world phenomena to which one approach is better suited than the other (see this post for such an example, in my view at least, where the Fisherian approach seems more suitable). But instead of keeping the two "separate and ready to act", they were rather superficially patched together.

I offer a source which summarizes this "complementary alternatives" view: Spanos, A. (1999). Probability Theory and Statistical Inference: Econometric Modeling with Observational Data. Cambridge University Press, ch. 14, especially Section 14.5, where, after presenting the two approaches formally and distinctly, the author is in a position to point clearly to their differences and also to argue that they can be seen as complementary alternatives.

Alecos Papadopoulos
  • (+1) I appreciate your comments and agree with many of them. But I am not sure what exactly you are referring to when you say that the hybrid "created misunderstandings" (and moreover, that this is "accepted widely as a fact"). Could you give some examples? To be an attack on the hybrid, it should be examples of misunderstandings that do not arise in either F or N-P approaches alone. Are you referring to the potential confusion between $p$ and $\alpha$ that I mentioned in my question, or to something else? Apart from that, I am already reading Section 14.5 in Spanos, thanks. – amoeba Aug 21 '14 at 16:07
  • The obvious issue is indeed the $p-\alpha$ issue. More subtle, and I believe more important, is the fact that the hybrid mixes the exploratory flavor of Fisher (which moreover leaves the matter of decision to the researcher) with the more formal approach of N-P. So researchers approached the matter in a Fisherian spirit, but then claimed the strong "rejection/acceptance" weight of the N-P approach, which in principle gives more credibility to the conclusions. CONTD – Alecos Papadopoulos Aug 21 '14 at 16:18
  • CONTD For me, this is the "have your cake and eat it too" issue of the hybrid approach. For example, an N-P approach without power calculations should be unthinkable, yet all the time we see tests posed in the N-P framework with no mention of power calculations. – Alecos Papadopoulos Aug 21 '14 at 16:19
  • Off topic, but... Since you are citing Aris Spanos, I wonder if you might be able to answer [this question](https://stats.stackexchange.com/questions/303887/effects-of-model-selection-and-misspecification-testing-on-inference-probabilis) about his methodology? (I once asked the question to Aris Spanos directly, and he kindly put down some effort in answering it. Unfortunately, his answer was in the same language as his papers, thus it did not help me much.) – Richard Hardy Dec 17 '19 at 18:23
  • @RichardHardy I had a look - my problem is that I need to revisit concepts like "pre-test bias" in order to maybe be able to say something useful on your issue. – Alecos Papadopoulos Dec 18 '19 at 20:58
  • Thank you for having taken a look! I had the impression that you knew the answer but did not have time to write it up earlier when I contacted you last time. But perhaps my memory is tricking me, perhaps it was a different question. – Richard Hardy Dec 18 '19 at 21:33

My own take on my question is that there is nothing particularly incoherent in the hybrid (i.e. the accepted) approach. But since I was not sure whether I was perhaps failing to comprehend the validity of the arguments presented in the anti-hybrid papers, I was happy to find the discussion published together with the Hubbard & Bayarri paper quoted above.

Unfortunately, two replies published as a discussion were not formatted as separate articles and so cannot be properly cited. Still, I would like to quote from both of them:

Berk: The theme of Sections 2 and 3 seems to be that Fisher did not like what Neyman and Pearson did, and Neyman did not like what Fisher did, and therefore we should not do anything that combines the two approaches. There is no escaping the premise here, but the reasoning escapes me.

Carlton: the authors adamantly insist that most confusion stems from the marriage of Fisherian and Neyman-Pearsonian ideas, that such a marriage is a catastrophic error on the part of modern statisticians [...] [T]hey seem intent on establishing that P values and Type I errors cannot coexist in the same universe. It is unclear whether the authors have given any substantive reason why we cannot utter "p value" and "Type I error" in the same sentence. [...] The "fact" of their [F and NP] incompatibility comes as surprising news to me, as I'm sure it does to the thousands of qualified statisticians reading the article. The authors even seem to suggest that among the reasons statisticians should now divorce these two ideas is that Fisher and Neyman were not terribly fond of each other (or each other's philosophies on testing). I have always viewed our current practice, which integrates Fisher's and Neyman's philosophies and permits discussion of both P values and Type I errors -- though certainly not in parallel -- as one of our discipline's greater triumphs.

Both responses are very worth reading. There is also a rejoinder by the original authors, which does not sound convincing to me at all.

amoeba
  • It is one thing to co-exist, it is another for the one to be considered as the other. But indeed, this strand of the anti-hybrid approach is in the spirit of "there can be no synthesis whatsoever" - which I strongly disagree with. But I don't see the current hybrid as a _successful_ marriage. – Alecos Papadopoulos Aug 21 '14 at 16:30
  • The problem with those quotes is that they do not justify the "hybrid". This is what many "anti-hybrid" people are looking for, and the lack of such a justification presented front and center is a glaring red flag. Anyone who investigates the history can see it clearly arose from some informal process. We cannot deduce from this that it is wrong, but what rationale is it based upon? When is it not applicable? What evidence is there to support its use with regards to different problems? How does it compare with other approaches (e.g., pure Fisher, pure N-P, Bayesian, non-nil null hypothesis, etc.)? – Livid Feb 14 '15 at 22:29
  • @Livid, these quotes are taken from a polemic discussion following an anti-hybrid paper. The purpose of these discussion notes is not so much to "justify the hybrid" as such, but to question the validity of the arguments put forward in the main anti-hybrid paper. This is also what my whole question was about. I was interested in what the anti-hybrid arguments are and whether they hold any water; I am not so much interested in the pro-hybrid arguments. By the way, I might still revive this topic (by editing and putting a bounty). – amoeba Feb 14 '15 at 22:37
  • @amoeba You imply above that your background is in neuroscience. Think about the people who claim to detect "neuroplasticity" after various treatments by measuring the number of dendritic branches in Golgi-Cox stained tissue. A treatment could result in this by e.g. 1) killing smaller neurons or 2) causing neurons to grow branches. One is clearly deleterious, the other maybe good. The hybrid tells us to see if there is a difference in branching between control and treatment groups. How has that helped us?... – Livid Feb 14 '15 at 23:22
  • If there is a difference, we still do not know if that is a good thing or a bad thing. Further, how does it compare to what is reported from longitudinal in vivo 2-photon studies (no one has yet reported observing any dendritic branch formation in adult mammals)? Also, shouldn't we be trying to come up with theories to explain the entire branching pattern anyway, and not just whether there is a difference between groups? This was already done 100 years ago but most people have ignored it. An exception is: http://www.ncbi.nlm.nih.gov/pubmed/20700495. It isn't clear to me what use the hybrid has. – Livid Feb 14 '15 at 23:25
  • @Livid, thanks for your comments, this is interesting, but I would like to refrain from further discussion here. I would rather encourage you to post a new answer, if you wish. But if you decide to do so, try to focus on the main issue, which is: what is so bad about the "hybrid", as compared to both Fisher and N-P alone. You seem to hate the whole approach of significance testing, the "nil null hypothesis", etc., but this is **not** what this question is about! – amoeba Feb 14 '15 at 23:33
  • @amoeba Ok, but my comments are relevant. It is important to recognize that the "nil null hypothesis" is a distinguishing characteristic of the hybrid. Get rid of that, and I think it may be valid. – Livid Feb 14 '15 at 23:37
  • @Livid: Hmmm, can you actually clarify why you say that it is a distinguishing characteristic of the hybrid? What would the null be in pure Fisher or in pure NP? Say you have two groups and want to test for a significant difference ("nil null"). Can't one approach this situation with all three approaches: pure Fisher, pure NP, and the hybrid? – amoeba Feb 14 '15 at 23:42
  • @amoeba People can (mis)use Fisher and N-P (or Bayes factors or anything else one can come up with) in that way, but it is **always** done with the hybrid, regardless of the problem. I am not talking about cases where a nil null is reasonable (e.g., mind reading does not exist, no difference in radioisotope decay rates in cancer patients vs normal people). The null hypothesis is supposed to be predicted by some theory or accepted explanation. Once you start disproving "chance" instead, often it is irrelevant. There is only an extremely limited set of cases where I can see this as useful. – Livid Feb 14 '15 at 23:53
  • @Livid, I understand your arguments against the nil null, I just think that this issue is orthogonal to the issue of the hybrid. I have to refresh the anti-hybrid papers in memory, but as far as I remember their critique of the hybrid is not at all centered on the nil null. Instead, it is about combining Fisher and NP. Again, if you disagree with this, please consider posting an answer; for the moment, let's leave it at that. – amoeba Feb 15 '15 at 00:06
  • A note to myself: I should incorporate into this answer some quotes from this paper: Lehmann 1992, [The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?](http://digitalassets.lib.berkeley.edu/sdtr/ucb/text/333.pdf) – amoeba Apr 21 '15 at 17:12

An often-seen (and supposedly accepted) union (or better: "hybrid") of the two approaches is as follows (a minimal code sketch of this procedure follows the lists below):

  1. Set a prespecified level $\alpha$ (say 0.05)
  2. Then test your hypothesis, e.g. $H_0: \mu = 0$ vs. $H_1: \mu \ne 0$
  3. State the p value and formulate your decision based on the level $\alpha$:

    If the resulting p value is below $\alpha$, you could say

    • "I reject $H_0$" or
    • "I reject $H_0$ in favor of $H_1$" or
    • "I am $100(1-\alpha)\%$ certain that $H_1$ holds"

    If the p value is not small enough, you would say

    • "I cannot reject $H_0$" or
    • "I cannot reject $H_0$ in favor of $H_1$"

Here, aspects from Neyman-Pearson are:

  • You decide something
  • You have an alternative hypothesis at hand (although it is just the contrary of $H_0$)
  • You know the type I error rate

Fisherian aspects are:

  • You state the p value. Any reader thus has the possibility to use their own level (e.g. strictly correcting for multiple testing) for the decision
  • Basically, only the null hypothesis is required, since the alternative is just its contrary
  • You don't know the type II error rate. (But you could immediately get it for specific values of $\mu \ne 0$.)
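
A minimal runnable sketch of this procedure (my own illustration with made-up data; the chosen $\alpha$, the simulated sample, and the alternative value $\mu=0.5$ used for the power illustration are all assumptions):

```python
import numpy as np
from scipy import stats

alpha = 0.05                                    # 1. pre-specified level
rng = np.random.default_rng(0)
x = rng.normal(loc=0.4, scale=1.0, size=30)     # made-up sample

# 2.-3. two-sided one-sample t-test of H0: mu = 0 vs H1: mu != 0
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(f"p = {p_value:.4f}")                                      # Fisherian aspect: exact p
print("reject H0" if p_value <= alpha else "cannot reject H0")   # N-P aspect: decision at alpha

# A type II error rate exists only for a specific alternative, e.g. mu = 0.5:
n = len(x)
nc = 0.5 * np.sqrt(n) / x.std(ddof=1)           # noncentrality if the true mean were 0.5
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
power = stats.nct.sf(t_crit, n - 1, nc) + stats.nct.cdf(-t_crit, n - 1, nc)
print(f"approx. power at mu = 0.5: {power:.2f} (type II error rate ~ {1 - power:.2f})")
```

The last few lines echo the final Fisherian bullet above: a type II error rate (and hence power) is only defined once a specific alternative value is fixed.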

ADD-ON

While it is good to be aware of the discussion about the philosophical problems of Fisher's, NP's or this hybrid approach (as taught with almost religious frenzy by some), there are much more relevant issues in statistics to fight against:

  • Asking uninformative questions (like binary yes/no questions instead of quantitative "how much" questions, i.e. using tests instead of confidence intervals)
  • Data-driven analysis methods that lead to biased results (stepwise regression, testing assumptions, etc.)
  • Choosing wrong tests or methods
  • Misinterpreting results
  • Using classic statistics for non-random samples
Michael M
  • (+1) This is a good description of the hybrid (and why exactly it is a hybrid), but you did not explicitly say what your evaluation of it is. Do you agree that what you described is an "incoherent mishmash"? If so, why? Or do you think it is a reasonable procedure? If so, do the people claiming it is incoherent have a point, or are they simply wrong? – amoeba Aug 21 '14 at 15:35
  • I often test hypotheses in exactly this manner... But there are other mishmashes I would not accept (e.g. not showing p values above $\alpha$), etc. – Michael M Aug 21 '14 at 15:50

I fear that a real response to this excellent question would require a full-length paper. However, here are a couple of points that are not present in either the question or the current answers.

  1. The error rate 'belongs' to the procedure but the evidence 'belongs' to the experimental results. Thus it is possible, with multi-stage procedures that have sequential stopping rules, to have a result with very strong evidence against the null hypothesis but a non-significant hypothesis test result. That can be thought of as a strong incompatibility. (A minimal numeric sketch follows this list.)

  2. If you are interested in the incompatibilities, you should be interested in the underlying philosophies. The philosophical difficulty comes from a choice between compliance with the Likelihood Principle and compliance with the Repeated Sampling Principle. The LP says roughly that, given a statistical model, the evidence in a dataset relevant to the parameter of interest is completely contained in the relevant likelihood function. The RSP says that one should prefer tests that give error rates in the long run that equal their nominal values.
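
A minimal numeric sketch of point 1, using assumed numbers (ten planned looks, a crude Bonferroni-style split of $\alpha$ across looks, and an assumed final $z$ statistic); real sequential designs use more refined boundaries, but the tension is the same:

```python
from scipy import stats

alpha_total = 0.05
n_looks = 10
alpha_per_look = alpha_total / n_looks      # 0.005 spent at each interim look

z_final = 2.6                               # assumed z statistic at the final look
p_final = 2 * stats.norm.sf(z_final)        # two-sided p, roughly 0.009

print(f"p at final look: {p_final:.4f}")
print("significant under the sequential procedure?", p_final <= alpha_per_look)
# The data (and the likelihood function) constitute fairly strong evidence against
# the null, yet the procedure's long-run error-rate accounting says "not significant".
```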

Michael Lew
  • J. O. Berger and R. L. Wolpert's monograph "The Likelihood Principle" (2nd ed. 1988) is a calm, balanced, and good exposition of point 2, in my opinion. – Alecos Papadopoulos Aug 21 '14 at 23:55
  • Berger and Wolpert is indeed a good exposition, and authoritative too. However, I prefer the more practically directed and less mathematical book "Likelihood" by A. W. F. Edwards. Still in print, I think. http://books.google.com.au/books/about/Likelihood.html?id=LL08AAAAIAAJ – Michael Lew Aug 22 '14 at 01:27
  • @MichaelLew has explained that a valid use of p values is as a summary of effect size. He has done a great thing by writing this paper: http://arxiv.org/abs/1311.0081 – Livid Feb 15 '15 at 02:25
  • @Livid The paper is very interesting, but for the new reader it's worth noting the following: the main idea, that p values 'index' (presumably: are in one-to-one relation with) likelihood functions, is generally understood to be false because there are cases where the same likelihood corresponds to different p-values depending on the sampling scheme. This issue is discussed a bit in the paper, but indexing is a very unusual position (which doesn't necessarily make it wrong, of course). – conjugateprior Jan 09 '16 at 14:15
  • @conjugateprior The p-value to likelihood function indexing is one-to-one within the scope of a single statistical model, but the p-value from one model does not point to the same likelihood function as a numerically identical p-value from a different statistical model. cont... – Michael Lew Jan 26 '22 at 00:54
  • [...cont] The situation most often pointed to where a different p-value points to the same likelihood function is positive and negative binomial sampling where any particular result will give different p-values but both point to a proportional likelihood function. The positive and negative binomial experiments are analysed using different models to get the p-values and so that situation does not serve as a counter-example to my postulated one-to-one relationship between p-value and likelihood function (within a statistical model). – Michael Lew Jan 26 '22 at 00:58

accepting that both F and N-P are valid and meaningful approaches, what is so bad about their hybrid?

Short answer: the use of a nil (no difference, no correlation) null hypothesis regardless of the context. Everything else is a "misuse" by people who have created myths for themselves about what the process can achieve. The myths arise from people attempting to reconcile their (sometimes appropriate) use of trust-in-authority and consensus heuristics with the inapplicability of the procedure to their problem.

As far as I know, Gerd Gigerenzer came up with the term "hybrid":

I asked the author [a distinguished statistical textbook author, whose book went through many editions, and whose name does not matter] why he removed the chapter on Bayes as well as the innocent sentence from all subsequent editions. “What made you present statistics as if it had only a single hammer, rather than a toolbox? Why did you mix Fisher’s and Neyman–Pearson’s theories into an inconsistent hybrid that every decent statistician would reject?”

To his credit, I should say that the author did not attempt to deny that he had produced the illusion that there is only one tool. But he let me know who was to blame for this. There were three culprits: his fellow researchers, the university administration, and his publisher. Most researchers, he argued, are not really interested in statistical thinking, but only in how to get their papers published [...]

The null ritual:

  1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.

  2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as $p < 0.05$, $p < 0.01$ , or $p < 0.001$ (whichever comes next to the obtained $p$-value).

  3. Always perform this procedure.

Gigerenzer, G (November 2004). "Mindless statistics". The Journal of Socio-Economics 33 (5): 587–606. doi:10.1016/j.socec.2004.09.033.

Edit: And we should always mention, because the "hybrid" is so slippery and ill-defined, that using the nil null to get a p-value is perfectly fine as a way to compare effect sizes given different sample sizes. It is the "test" aspect that introduces the problem.

Edit 2: @amoeba A p-value can be fine as a summary statistic; in this case the nil null hypothesis is just an arbitrary landmark: http://arxiv.org/abs/1311.0081. However, as soon as you start trying to draw a conclusion or make a decision (i.e. "test" the null hypothesis), it stops making sense. In the example of comparing two groups, we want to know how different the two groups are and what the various possible explanations are for differences of that magnitude and type.

The p value can be used as a summary statistic telling us the magnitude of the difference. However, using it to "disprove/reject" zero difference serves no purpose that I can tell. Also, I think many of these study designs that compare average measurements of living things at a single timepoint are misguided. We should want to observe how individual instances of the system change over time, then come up with a process that explains the pattern observed (including any group differences).
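
As a hedged sketch of this 'summary statistic' reading (my own illustration, not code from the linked paper): within a fixed one-sample $t$-test model, a two-sided $p$-value together with $n$ can be mapped back to the observed standardized effect size, so reporting $p$ (plus $n$) summarizes where the data fell rather than delivering a verdict:

```python
import numpy as np
from scipy import stats

def observed_effect_from_p(p, n):
    """Recover the observed |t| and standardized effect |d| implied by a two-sided
    one-sample t-test p-value at sample size n (illustration only)."""
    t_obs = stats.t.ppf(1 - p / 2, df=n - 1)
    d_obs = t_obs / np.sqrt(n)
    return t_obs, d_obs

for n in (10, 100, 1000):
    t_obs, d_obs = observed_effect_from_p(0.04, n)
    print(f"n = {n:4d}: |t| = {t_obs:.2f}, |d| = {d_obs:.2f}")
# The same p = 0.04 corresponds to a much smaller standardized effect at large n,
# which is why p alone cannot answer the "how different are the groups" question.
```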

Livid
  • +1, thanks for your answer and for the link. It seems I haven't read this particular paper; I will take a look. As I said before, I was under the impression that the "nil null" is an issue orthogonal to the issue of the "hybrid", but I guess I should re-read Gigerenzer's writings to check that. Will try to find time in the following days. Apart from that: could you please clarify your last paragraph ("edit")? Did I understand correctly that you meant that having a nil null when comparing two effect sizes is okay, but having a nil null when comparing an effect size to zero is not okay? – amoeba Feb 15 '15 at 14:19

I see that those with more expertise than I have already provided answers, but I think mine can still add something, so I'll offer this as another layman's perspective.

Is the hybrid approach incoherent?  I'd say it depends on whether or not the researcher ends up acting inconsistently with the rules that they started out with: specifically the yes/no rule that comes into play with the setting of an alpha value.

Incoherent

Start with Neyman-Pearson.  Researcher sets alpha=0.05, runs the experiment, calculates p=0.052.  Researcher looks at that p-value and, using Fisherian inference (often implicitly), considers the result to be sufficiently incompatible with the test hypothesis that they will still claim "something" is going on.  The result is somehow "good enough" even though the p-value was greater than the alpha value.  Often this is paired with language such as "nearly significant" or "trending towards significance" or some wording along those lines.

However, setting an alpha value before running the experiment means that one has chosen the approach of Neyman-Pearson inductive behavior.  Choosing to ignore that alpha value after calculating the p-value, and thus claiming something is still somehow interesting, undermines the entire approach that one started with.  If a researcher starts down Path A (Neyman-Pearson), but then jumps across to another path (Fisher) once they don't like the path they are on, I consider that incoherent.  They are not being consistent with the (implied) rules that they started with.

Coherent (possibly)

Start with N-P.  Researcher sets alpha=0.05, runs the experiment, calculates p=0.0014.  Researcher observes that p < alpha, and thus rejects the test hypothesis (typically a no-effect null) and accepts the alternative hypothesis (the effect is real).  At this point the researcher, in addition to deciding to treat the outcome as a real effect (N-P), decides to infer (Fisher) that the experiment provides very strong evidence that the effect is real.  They have added nuance to the approach they started with, but have not contradicted the rules set in place by choosing an alpha value at the beginning.

Summary

If one starts by choosing an alpha value, then one has decided to take the Neyman-Pearson path and follow the rules for that approach.  If they, at some point, violate those rules using Fisherian inference as the justification, then they have acted inconsistently/incoherently.

I suppose one could go a step further and declare that, because it is possible to use the hybrid incoherently, the approach is inherently incoherent; but that gets deeper into the philosophical aspects, on which I don't consider myself qualified to offer an opinion.

Hat tip to Michael Lew.  His 2006 article helped me understand these issues better than any other resource.

MichiganWater