A common misinterpretation of a p-value is that it represents the probability of a false positive in the context of hypothesis testing. Here a "positive" means rejecting the null.
There are many ways to explain why this is a misinterpretation. Here is my favorite: in most two-tailed hypothesis testing applications, the probability that the null hypothesis is true is precisely 0 even before we consider any data, because a real effect size is never exactly 0. A false positive means rejecting a true null, so if the null is never true, the probability of a false positive is zero regardless of the p-value.
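To make that concrete, here is a toy simulation (my own illustration; the continuous prior on the effect size is arbitrary). Because the true $\beta$ is almost surely nonzero, every rejection is a rejection of a false null, so the false-positive rate is exactly zero however many rejections we make.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 100_000, 50

# Toy setup: true effects drawn from a continuous distribution are almost
# surely nonzero, so the point null H0: beta = 0 is false in every "study".
beta = rng.normal(0.0, 0.2, size=n_sims)
se = 1 / np.sqrt(n)
beta_hat = rng.normal(beta, se)                  # one estimate per study
p = 2 * stats.norm.sf(np.abs(beta_hat) / se)     # two-sided p-values

reject = p < 0.05
false_positive = reject & (beta == 0.0)          # rejecting a TRUE null
print(f"rejection rate:      {reject.mean():.3f}")
print(f"false-positive rate: {false_positive.mean():.3f}")  # exactly 0.000
```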
This explanation is interesting for far more than understanding p-values. It highlights the tension in trying to compare a set of measure 0 ($H_0: \beta = 0$) against [typically] everything else ($H_1: \beta \neq 0$). There are two basic approaches to fixing that asymmetry:
1. Replace the alternative $H_1: \beta \neq 0$ with a single value, like $H_1: \beta = \beta^*$. Thus we weigh evidence for two specific values of $\beta$, namely $0$ vs. $\beta^*$.
2. Define both $H_0$ and $H_1$ in terms of sets with strictly positive measure. For example, set $H_0: \beta < 0$ and $H_1: \beta > 0$.
With (1), a standard approach is to apply a minimum Bayes factor (MBF), essentially an upper bound on how strongly the data can favor any particular alternative parameter value (such as the MLE) over the null value, which in turn bounds the posterior probability of $H_0$ from below. A much-cited result is that even this lower bound on the probability of $H_0$ can easily be substantially greater than the p-value. The typical framing of this result is that using an MBF is somehow better than thinking about a p-value.
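To see the size of that gap, here is a quick numerical sketch. I'm assuming a z-test, the familiar $e^{-z^2/2}$ bound for the MBF, and 50/50 prior odds on $H_0$; the numbers are only illustrative.

```python
import numpy as np
from scipy import stats

def mbf_posterior(p_two_sided, prior_odds_h0=1.0):
    """Return the minimum Bayes factor exp(-z^2/2) for a z-test and the
    implied lower bound on P(H0 | data), here with 1:1 prior odds on H0."""
    z = stats.norm.isf(p_two_sided / 2)      # |z| that matches the p-value
    mbf = np.exp(-z**2 / 2)                  # min BF of H0 vs best alternative
    post_odds = prior_odds_h0 * mbf          # lower bound on posterior odds of H0
    return mbf, post_odds / (1 + post_odds)

for p in (0.05, 0.01, 0.001):
    mbf, post_h0 = mbf_posterior(p)
    print(f"p = {p:<6}  MBF = {mbf:.3f}  P(H0 | data) >= {post_h0:.3f}")
# p = 0.05 gives P(H0 | data) >= 0.128, well above the p-value itself.
```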
But let's also consider a specific flavor of (2), which I'll call directional correctness. Specifically, suppose the hypothesis we're actually interested in is a function of the data: $H_0$ is the event that the true value of $\beta$ does NOT have the same sign as our estimator $\hat \beta$, and $H_1$ is the complement of $H_0$. Basically, if I have an estimated effect, the next thing I often want to know is "how certain can I be that the estimate is not so far off from the truth that it has the opposite sign?".
Evaluating directional correctness aligns conceptually with looking at a posterior distribution and measuring the smaller tail (with respect to 0, or any other center of interest). And in the special cases where a credible interval coincides with a standard confidence interval, the one-sided p-value is precisely that tail probability (the usual two-sided p-value is twice it).
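Here is a minimal sketch of that correspondence, assuming a flat prior and a Gaussian likelihood, so the posterior for $\beta$ is normal, centered at $\hat\beta$ with the usual standard error; the numbers are made up.

```python
from scipy import stats

# Sketch under simple assumptions: flat prior + Gaussian likelihood, so the
# posterior for beta is N(beta_hat, se^2). The posterior probability that beta
# has the opposite sign from beta_hat then equals the one-sided p-value.
beta_hat, se = 0.8, 0.5                     # hypothetical estimate and std. error
z = beta_hat / se

p_one_sided = stats.norm.sf(abs(z))
p_two_sided = 2 * p_one_sided

posterior = stats.norm(loc=beta_hat, scale=se)
wrong_sign = posterior.cdf(0.0) if beta_hat > 0 else posterior.sf(0.0)

print(f"posterior P(wrong sign) = {wrong_sign:.4f}")
print(f"one-sided p-value       = {p_one_sided:.4f}   # identical")
print(f"two-sided p-value       = {p_two_sided:.4f}   # twice the tail")
```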
I'm tempted to conclude that for purposes of evaluating directional correctness of effects, plain old p-values are a more reasonable place to start than getting into MBFs. What is the weakest link in this line of reasoning?