4

Consider diagnostic testing of a fitted model, e.g. testing whether regression residuals are autocorrelated (a violation of an assumption) or not (no violation). I have a feeling that the null hypothesis and the alternative hypothesis in diagnostic tests often tend to be exchanged/flipped w.r.t. what we would ideally like to have.

If we are interested in persuading a sceptic that there is a (nonzero) effect, we usually take the null hypothesis to be that there is no effect, and then we try to reject it. Rejecting $H_0$ at a sufficiently low significance level produces convincing evidence that $H_0$ is incorrect, and we are therefore comfortable concluding that there is a nonzero effect. (There are of course a bunch of other assumptions which must hold, as otherwise the rejection of $H_0$ may result from a violation of one of those assumptions rather than $H_0$ actually being incorrect. And we never have 100% confidence but only, say, 95% confidence.)

Meanwhile, in diagnostic testing of a model, we typically have $H_0$ that the model is correct and $H_1$ that there is something wrong with the model. E.g. $H_0$ is that regression residuals are not autocorrelated while $H_1$ is that they are autocorrelated. However, if we want to persuade a sceptic that our model is valid, we would have $H_0$ consistent with a violation and $H_1$ consistent with validity. Thus the usual setup in diagnostic testing seems to exchange $H_0$ with $H_1$, and so we do not get to control the probability of the relevant error.
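
For concreteness, here is a minimal sketch of this standard setup (assuming Python with statsmodels; the Breusch-Godfrey test is just one common choice of diagnostic):

```python
# Standard diagnostic setup: H0 = "no autocorrelation in the residuals",
# H1 = "the residuals are autocorrelated".
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # errors here really are i.i.d.

fit = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(fit, nlags=4)
print(lm_pval)  # a large p-value is read as "no evidence of autocorrelation"
```

Non-rejection here controls the probability of flagging a correct model (size); it does not control the probability of retaining a model whose assumption is violated, which is the error I would like to control.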

Is this a valid concern (philosophically and/or practically)? Has it been addressed and perhaps resolved?

Richard Hardy
  • Why do you think your autocorrelation test exchanges $H_0$ and $H_1$? If it fails to reject $H_0$, I would have thought your conclusion was that any autocorrelation you see in the residuals might have happened by chance, so you go on using the model. – Henry Jun 09 '21 at 08:40
  • @Henry, I have no problem with the conclusion you describe. However, I think we want something else. We want to limit the probability of not rejecting a model when the assumption is violated to $\alpha$ (say, $\alpha=0.05$). But we do not actually do that. Instead, we limit the probability of rejecting a model under no violation of assumptions. I think the former is more relevant than the latter, hence the problem. In other words, I think nonrejection of a model when an assumption is violated should be type I error, but in the usual setup it happens to be type II error instead. – Richard Hardy Jun 09 '21 at 09:43
  • That is moving toward power analysis. Otherwise, if you want there to be a small probability of using a model with no autocorrelation when there is in fact some underlying autocorrelation (even if it is tiny) then this could push you towards never using a model with no autocorrelation. Your choice; there are other cases where that kind of approach makes sense, such as never using an unpaired $t$-test *assuming equal variances* and instead always using Welch's test. – Henry Jun 09 '21 at 09:53
  • @Henry, I think my question is more on the philosophical side. I start by motivating diagnostic testing from first principles and only then compare the ideal setup I have arrived at to the reality of diagnostic testing as currently observed in practice. So while power analysis may be a related topic, it is probably not the level on which the core of the discussion would be based. Though I may be mistaken. (And since you mentioned power analysis, the notion of *severe testing* by Mayo and Spanos might also be relevant.) – Richard Hardy Jun 09 '21 at 09:57
  • This sounds like it is close to equivalence testing. In equivalence testing (I'm thinking of TOST), we show that the true difference (or autocorrelation for you) is within some amount of zero that we consider practically insignificant. We do not, however, show that the value is zero. – Dave Jun 09 '21 at 14:41
  • @Dave, I think it is simpler than that. I am interested in which hypothesis should be set as the null hypothesis based on subject-matter argumentation. Implementation (and the technical impossibility thereof) comes second, but I am not there yet; I care about the first stage in which we formulate the problem. – Richard Hardy Jun 10 '21 at 10:04
  • Have you read Paul Meehl's work on this topic? It's **very** important. There's a nice short overview [here](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1090.1393&rep=rep1&type=pdf), and [this](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693.8918&rep=rep1&type=pdf) is probably the most relevant single paper. – Eoin Aug 19 '21 at 15:48
  • @Eoin, I read the short overview and liked it. It appears to argue that $H_0$ is false in soft psychology, so with sufficiently powerful tests we should reject it about 50% of the time for a directional alternative (and, I guess, reject almost always for a nondirectional alternative). Treating the rejection as support for some theory is thus very weak evidence for the theory. Makes sense. Now, in relation to my question, what point are you making? – Richard Hardy Aug 19 '21 at 19:17
  • I think Meehl's more relevant point is that in physics, $H_0$ is that the data is consistent with the theory, and researchers try to disconfirm it. More broadly though, my point is that Meehl has written extensively about these kinds of practical and philosophical questions in theory testing for about 50 years, and is well worth reading in full if you want a proper answer. – Eoin Aug 23 '21 at 12:59
  • 1/2 With respect to autocorrelation (stationarity vs unit root specifically), do you know that there is a robust literature about (a) tests with $\text{H}_{0}$: data are strongly stationary, $\text{H}_1$: there is evidence the data have unit root, **and also** (b) tests with $\text{H}_{0}$: data have unit root, and $\text{H}_{1}$:there is evidence the data are strongly stationary? – Alexis Aug 24 '21 at 04:24
  • 2/2 More generally, in the frequentist world one need not privilege a particular theory with the burden of evidence (although frequentist hypothesis testing is mostly taught this way, with consequences for routinizing confirmation bias). For example, one can pose a test for difference **and also** a test for equivalence, and combine the inferences of both tests (very much like combining the inference from a test for stationarity with inference from a test for unit root). – Alexis Aug 24 '21 at 04:27
  • @Alexis, re 1/2, yes, I do, with the exception that I think they concern weak stationarity (strong stationarity allows for infinite variance or even mean; I think that is a technical challenge; think about CLT and related tools which no longer apply). Re 2/2, so you do not think that if one is going to use a model, one must present compelling evidence the model's assumptions are not violated? And specifically, do you not think that $H_0$: model is wrong is more relevant than $H_0$: model is correct from the philosophy of science pov? – Richard Hardy Aug 24 '21 at 06:26
  • I think about model assumptions quite a bit, thank you, and am glad you do also. My point is that frequentist hypothesis testing has a problem with privileging which view of the world gets to be called $H_{0}$, and that privilege itself is seldom theorized or treated rigorously by applied researchers, with consequences that lead to confirmation bias. In my comment (not answer) I raised frequentist relevance testing as a useful tool for confronting that kind of bias by not only looking for evidence to falsify equality, but also to look for evidence to falsify relevant difference. – Alexis Aug 24 '21 at 17:05

5 Answers

4

The somewhat unsettling truth is that misspecification testing is not suitable for "persuading a skeptic that the model is valid". Generally, as you obviously understand, not rejecting the $H_0$ does not imply that the $H_0$ is true, and this holds in misspecification testing as well. What the test does is something weaker: it just tells you that certain observable problems with the model assumptions have not occurred. Still, the misspecification test does not rule out that the data were generated in a way that violates the model assumptions, possibly badly. For example, an evil dependence structure could be at work that forces the data to show the seemingly innocent pattern you see, yet is contrived enough not to look suspicious to your favourite test for independence (I'm not claiming that this is realistic, just that a misspecification test cannot rule out that it is technically possible).

Misspecification testing can to a certain extent reassure you, but it cannot secure model assumptions to be true.
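
As a rough illustration of this point, consider a sketch along the following lines (assuming Python with statsmodels): ARCH-type errors are serially uncorrelated but clearly dependent, so a test aimed at autocorrelation will typically not flag them.

```python
# ARCH(1)-type errors: serially uncorrelated but strongly dependent through
# their conditional variance, so an autocorrelation test has little to see.
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(1)
n = 1000
e = np.empty(n)
e[0] = rng.normal()
for t in range(1, n):
    sigma_t = np.sqrt(0.2 + 0.7 * e[t - 1] ** 2)  # volatility depends on the past
    e[t] = sigma_t * rng.normal()

print(acorr_ljungbox(e, lags=[10]))       # levels: typically no rejection
print(acorr_ljungbox(e ** 2, lags=[10]))  # squares: the dependence shows up
```

The levels pass the autocorrelation diagnostic while the squares reveal the dependence, so "passing the test" only means that the particular violations the test can see did not show up.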

Note that some would argue that the term "valid" is weaker than the term "true", and A. Spanos (2018) argues that if you do misspecification testing in the right way (i.e., testing all assumptions in a reasonable order, so that the misspecification test of one assumption is not sabotaged by the failure of another), you can ultimately indeed be sure that the model is "valid" for the data, even though this doesn't mean it's "true". The way he does this is essentially by defining "valid" as passing all those tests, because then, according to him, we know that the data look like a typical realisation from the model. I think this is misleading, though, because, as I have argued above, it does not rule out that model assumptions are in fact violated in harmful ways.

A message from this is that misspecification testing is never a substitute for thinking about the subject matter and the data generating process in order to know whether there are problems with the assumptions that you couldn't see from the data alone.

The following are additions that were made taking into account comments and discussion:

  1. In a comment, you already made reference to "severe testing" (Mayo and Spanos). Note that in their work you'll never find severity calculations that refer to misspecification tests, and for good reasons. Models can be violated in far too many and too complex ways for all violations (or even just all the relevant ones) to be ruled out, even if only up to a certain error probability.

  2. There's TOST as in the response by Dave. This can work if we focus on one particular assumption (for example an autocorrelation parameter $\alpha$ to be zero) and take everything else in the model specification for granted. And even then we can only reject $|\alpha|>c$ for some $c>0$ (how small $c$ can be will depend on the sample size); we cannot reject $\alpha\neq 0$.

  3. The original question was "how to choose the $H_0$", which I haven't really addressed up to now; instead of answering it, I will argue that we can't do much better than what is usually done. Remark 2 above is about an $H_0$ that isn't exactly the complement of the model assumption; rather, rejecting it would secure (with the usual error probability) that the true $\alpha$ is close to zero, i.e., close to the model assumption. This is really the best we can hope for, and it is no accident that even this can only be achieved while taking a host of other assumptions for granted. The thing is that we can never rule out too rich a class of distributions, because such a class will contain distributions that, despite having $\alpha\neq 0$, are so close to the model assumption that they cannot be distinguished by any finite amount of data (a small simulation after this list illustrates this for a tiny autocorrelation), or even distributions that are very different in terms of interpretation (like the "evil dependence structure" mentioned above) but can emulate perfectly whatever we observe and can therefore not be rejected from the data. Famous early results in this vein are in Bahadur and Savage (1956) and Donoho (1988). In particular, there is no way to make sure that the underlying process has a density, let alone that it is normal or anything specific. (There is less work on evil dependence structures as far as I'm aware, because detecting them is outright hopeless.)

  4. Furthermore, a problem with TOST is that I'd suspect it has a higher probability of rejecting a true model than the standard misspecification-testing approach, and this is bad: not only would it be a (type II) error, but it would also worsen the problem that running a model-based analysis conditionally on the "correct" outcome of a misspecification test can be biased, since the theory behind standard analyses doesn't take MS-testing into account; see the Shamsudheen and Hennig arXiv paper for this issue and some more literature.
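
The simulation promised in remark 3 above, as a sketch of mine (assuming Python with statsmodels; the numbers are arbitrary): with a tiny true error autocorrelation, the usual diagnostic rejects hardly more often than its nominal size, so the violated model is retained with probability close to $1-\alpha$.

```python
# A violation too small to detect: AR(1) errors with rho = 0.03.
# The Breusch-Godfrey test at the 5% level rejects only slightly more often
# than it would under no violation at all.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(2)
n, reps, rho = 200, 500, 0.03
rejections = 0
for _ in range(reps):
    x = rng.normal(size=n)
    u = np.empty(n)
    u[0] = rng.normal()
    for t in range(1, n):
        u[t] = rho * u[t - 1] + rng.normal()  # weakly autocorrelated errors
    y = 1.0 + 2.0 * x + u
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    _, pval, _, _ = acorr_breusch_godfrey(fit, nlags=1)
    rejections += pval < 0.05

print(rejections / reps)  # typically close to the nominal 0.05
```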

References:

Bahadur, R. and Savage, L. (1956). The nonexistence of certain statistical procedures in nonparametric problems. Annals of Mathematical Statistics 27, 1115–1122.

Donoho, D. (1988). One-sided inference about functionals of a density. Annals of Statistics 16, 1390–1420.

Spanos, A. (2018). Mis-specification testing in retrospect. Journal of Economic Surveys 32, 541–577.

There's also this (with which I agree more): Shamsudheen, M. I. and Hennig, C. (2020). Should we test the model assumptions before running a model-based test? https://arxiv.org/abs/1908.02218

Christian Hennig
  • *it just tells you that certain observable problems with the model assumptions have not occurred*: yes, but we do not control power, while we do control size. I argue that what we would like to control is power, not size. – Richard Hardy Aug 19 '21 at 14:51
  • I agree, but we simply can't do that. Only exception is if we concentrate on one specific model assumption while taking everything else for granted by assumption. If we're just testing a single autocorrelation parameter, we could probably use severity/power calculations to switch the test around and to test $H_0:\ |\alpha|>c$ for some $c>0$, however this can only work if you are happy to tolerate all other assumptions for this model first. – Christian Hennig Aug 19 '21 at 14:55
  • I don't see a way to reject $\alpha \ne 0$. – Dave Aug 19 '21 at 15:03
  • Yep, I had just added this to my answer. – Christian Hennig Aug 19 '21 at 15:03
  • @RichardHardy: Some more discussion on why we can't just do things the other way round added. – Christian Hennig Aug 19 '21 at 15:15
  • Just trying to clarify: I am not asking what is possible, I am asking what is logically desirable. I am trying to avoid a *sour grapes* type of situation where we change our taste based on what is available to us. I want to discuss taste separately from our options, and only afterwards try to match them. – Richard Hardy Aug 19 '21 at 15:18
  • That something is not available is one thing, that it is impossible is quite another. I don't think it makes much sense to say that something is "logically desirable" if it is mathematically impossible. (The term "logically desirable" looks a bit like a contradiction in terms anyway. Logic isn't driven by desire, is it?) – Christian Hennig Aug 19 '21 at 15:21
  • Scratch *logically* then, leave *desirable*. Fair point on this. Regarding the main point, let me reiterate: the question stands as posed. It concerns specifically what is desirable. Only after that there comes the follow-up question of what might be possible. – Richard Hardy Aug 19 '21 at 16:00
  • Fair enough. As you have probably figured out by now, I don't see any desirable alternative, at least not when it comes to choosing the $H_0$ in ways different from what is currently done. Putting the assumption of interest up as $H_0$ is in my view not a problem. Testing is not the right tool to convince your skeptic, however it's done. Much more of a concern in my view is what exactly we mean by a model being "valid" - surely not that it's true (because it isn't anyway). – Christian Hennig Aug 19 '21 at 20:02
  • What *can* be done is to compute severities/power of MS-tests against specific alternatives (violations of model assumptions) of interest, in order to characterise better what these tests do. – Christian Hennig Aug 19 '21 at 20:04
  • I now think the term *logically desirable* was not that bad. If we have the logic of science and the logic of statistics, then we can use *logically desirable* to stand for the basis of the desire. Also, *logically desirable* does not imply logic is driven by desire but the converse. In any case, your last idea to compute power/severities of MS-tests is a good candidate for the alternative solution that I have asked for. On the other hand, *Putting the assumption of interest up as $H_0$ is in my view not a problem* sounds quite unorthodox w.r.t. the general logic of hypothesis testing. – Richard Hardy Aug 21 '21 at 12:59
  • I see why you write this, and maybe I'm too focused on what I can imagine to be (im)possible. What I mean is rather that I don't see any better option as long as it should be done in the framework of hypothesis testing. – Christian Hennig Aug 21 '21 at 14:45
  • Thanks! Also, consider including @RichardHardy in the comments, otherwise I do not get notified about them. – Richard Hardy Aug 23 '21 at 08:48
  • The bounty goes to you not only for the content of the answer but also for your effort and engagement. I appreciate that. – Richard Hardy Aug 26 '21 at 16:57
  • @RichardHardy Much appreciated, thanks. At some point I had asked myself the same question; I realise that my answer doesn't really settle it, but I'm not looking for a better one anymore. – Christian Hennig Aug 26 '21 at 20:00
3

This is a great question. If we were to set autocorrelation as the null hypothesis we would have to be very specific about the type and amount. If we reject this hypothesis we have not brought evidence against all types or amounts of autocorrelation, just the one we tested. For this reason we set no autocorrelation as the null hypothesis, with the general alternative being some form and amount of autocorrelation. This is in agreement with Henry's comment. While I see a similarity between a diagnostic test and a TOST, these are not the same. In a TOST we are hopeful to reject the null hypothesis in favor of the alternative. In a diagnostic test we are hopeful for a failure to reject the null hypothesis.

We typically think of a small p-value as evidence against the null, reducing the null to the absurd and showing it is implausible. By the same logic, a large p-value could be seen as evidence in favor of the null (weak evidence against the null), showing that it is not absurd and remains plausible. Of course, no hypothesis is proven false with a small p-value, nor is it proven true with a large one. All we can do is provide the weight of the evidence.

There is no right or wrong for which hypothesis is considered the null and which is the alternative. If you are using a Neyman-Pearson framework, it is a matter of what you want as the default decision. For instance, when investigating a treatment effect we often think of "no effect" as the null hypothesis. However, in clinical development one might use a clinically meaningful effect as the null hypothesis (default decision), and only if there is sufficient evidence against this hypothesis would it be decided that the drug is not efficacious. Under a Fisherian framework one would test all possible hypotheses to see the evidence against no treatment effect as well as the evidence against a clinically meaningful effect.
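
As a small sketch of that last point, with hypothetical numbers and a normal approximation, the same estimate can be confronted with both null hypotheses and the evidence against each reported:

```python
# Hypothetical estimate and standard error, confronted with two nulls.
from scipy.stats import norm

theta_hat, se = 0.8, 0.5  # estimated treatment effect and its standard error
delta = 2.0               # a clinically meaningful effect (hypothetical)

# Evidence against H0: theta = 0 (one-sided, looking for theta > 0)
p_no_effect = norm.sf((theta_hat - 0.0) / se)
# Evidence against H0: theta >= delta (one-sided, looking for theta < delta)
p_meaningful = norm.cdf((theta_hat - delta) / se)

print(p_no_effect, p_meaningful)  # roughly 0.05 and 0.01 with these numbers
```

With these numbers, the evidence against a clinically meaningful effect is considerably stronger than the evidence against no effect at all.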

Geoffrey Johnson
  • All of that sounds logical and clear, yet I think it misses the question. The question is whether $H_0$ and $H_1$ should not be exchanged based on philosophical considerations underlying use of statistics in scientific enquiry (or practical enquiry). Only after that comes a second question of whether exchanging $H_0$ and $H_1$ is feasible and if not, what else can be done. – Richard Hardy Aug 21 '21 at 12:42
  • My first paragraph indicates you most certainly can exchange $H_0$ and $H_1$ based on philosophical considerations, it simply narrows the scope of the inquiry. – Geoffrey Johnson Aug 21 '21 at 13:40
  • Thank you for your helpful answer. – Richard Hardy Aug 27 '21 at 05:39
3

I think this is exactly what the two one-sided tests (TOST) procedure does. TOST concedes that there might be some small effect but shows, with some level of confidence, that the effect is below the threshold of causing us to care. Perhaps there is a bit of autocorrelation, but an autocorrelation of $0.01$ might be effectively zero. If you truly want to show the value to be zero, not merely close to zero, with some confidence (credibility...), I cannot see a way to do it without going Bayesian and using a prior with $P(0)>0$. If you want to be frequentist, then I think the best you can do is to bound the value within a range.

(I do not have enough experience with Bayesian methods to have much of an opinion of using a prior that puts $P(\text{what we want})>0$, but that sure sounds like rigging the test.)

$$\text{TOST}\\ H_0: \vert\theta\vert\ge d\\ H_a: \vert\theta\vert<d$$

In this way, we flip the null and alternative hypotheses to show that the value of interest, $\theta$, is less than our tolerance for difference from zero, $d$.

There are equivalences between TOST and power calculations, so I think this satisfies your requirement for controlling power that you mentioned in your comment to Lewian.
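
Here is a hand-rolled sketch of such a TOST for a lag-1 residual autocorrelation, using the crude large-sample approximation $\mathrm{SE}(r_1)\approx 1/\sqrt{n}$ (an assumption of this illustration rather than an exact result):

```python
# TOST for the lag-1 autocorrelation theta of a series of residuals:
# reject both H0: theta <= -d and H0: theta >= d  =>  conclude |theta| < d.
import numpy as np
from scipy.stats import norm

def tost_autocorr(resid, d=0.1, alpha=0.05):
    resid = np.asarray(resid) - np.mean(resid)
    n = len(resid)
    r1 = np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2)  # lag-1 autocorrelation
    se = 1.0 / np.sqrt(n)                                     # crude approximation
    p_lower = norm.sf((r1 + d) / se)   # H0: theta <= -d vs Ha: theta > -d
    p_upper = norm.cdf((r1 - d) / se)  # H0: theta >= d  vs Ha: theta < d
    return r1, max(p_lower, p_upper) < alpha

rng = np.random.default_rng(3)
print(tost_autocorr(rng.normal(size=500), d=0.1))  # i.i.d. noise: usually rejects
```

Rejecting both one-sided nulls at level $\alpha$ supports $\vert\theta\vert<d$; how small a $d$ we can support depends on the sample size.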

Dave
  • Thanks! This seems to answer my follow-up question regarding a feasible alternatives to the unsatisfactory state of having $H_0$ and $H_1$ flipped (the main question). – Richard Hardy Aug 21 '21 at 13:01
  • Dave and @RichardHardy Two brief points: (1) We can also combine inferences from tests for difference with inferences from tests for equivalence in a manner that directly centers power and effect size in inferential conclusions—[see my answer here](https://stats.stackexchange.com/a/360997/44269), and (2) TOST is just one way to perform equivalence tests, one that sacrifices power compared to the complementary test for difference, and which is effectively a simpler alternative to uniformly most powerful tests *a la* Wellek. – Alexis Aug 24 '21 at 17:17
  • I am also glad I am not the only one thinking about equivalence tests here. :) Gratis cite for UMP tests for equivalence: Wellek, S. (2010). *Testing Statistical Hypotheses of Equivalence and Noninferiority* (Second Edition). Chapman and Hall/CRC Press. – Alexis Aug 24 '21 at 17:18
  • I definitely see the similarity between a diagnostic test and a TOST, but they aren't exactly the same. In a TOST we do in fact set the alternative hypothesis to be what we hope to demonstrate. We are hopeful to reject the null in favor of the alternative. In a diagnostic test we are hopeful for a failure to reject. – Geoffrey Johnson Aug 26 '21 at 14:48
  • @GeoffreyJohnson I would argue that to be a misuse of null-hypothesis significance testing that arises from a misunderstanding of frequentist inference. – Dave Aug 26 '21 at 14:49
  • @Dave, which part? Is it the hopefulness for a failure to reject? – Geoffrey Johnson Aug 26 '21 at 14:53
  • Doing something in hopes for a failure to reject seems to be a misuse of NHST: $H_0: \theta=\theta_0$ vs $H_a: \theta\ne\theta_0$ with a large $p$-value does not suggest that $\theta=\theta_0$. If anything, one would want to guess that $\theta = \widehat\theta$. – Dave Aug 26 '21 at 14:56
  • I agree whole heartedly. Your last comment gets at the core of the question in the original post, and this is precisely what is being done in a diagnostic test. However, as Henry pointed out a failure to reject means we have not brought sufficient evidence against $H_0$, and so it is retained. One could choose to decide or take action as if $\theta=\theta_0$ and $\hat{\theta}$ was not an exceedingly rare event. – Geoffrey Johnson Aug 26 '21 at 15:04
2

I don't think your premise is accurate regarding model testing. All the diagnostic tests for models that I am familiar with stipulate the model assumption as the null hypothesis and test for a departure from this that would falsify the assumption. Even if we are talking to someone who is a skeptic of the model assumptions, the usual approach would be to show them that when we subject the model to diagnostic tests there is no evidence of any breach of the model assumptions; via tests where those assumptions are taken as the null hypothesis.

The problem with setting the null hypothesis as a violation of the model is that this is not a simple hypothesis --- it is a complex composite hypothesis that must stipulate the type and degree of the violation of the assumptions (which would then raise the question of sensitivity analysis for the stipulated degree).

So, I am not convinced that there is any incongruity in the first place to resolve.

Ben
  • Thank you. My experience agrees with yours, but my question is more about what we want than what we have or are used to having (like your *the usual approach*). I am trying to separate the two to avoid a *sour grapes* type of problem, but it does not seem I am quite succeeding. – Richard Hardy Aug 24 '21 at 11:35
  • Yes, I think it is not quite succeeding. Cutting all that away, it is certainly true that the classical hypothesis testing method is *asymmetric*, so there are definitely concerns that are sometimes expressed about that property of the test method. – Ben Aug 26 '21 at 18:30
0

Hello Mr. Hardy, I read the page but cannot comment, so I am posting an answer instead.

For me, having autocorrelation in the residuals is a good thing (from a usage standpoint): it gives me some "assurance" that information about the next error term can be "known" from the previous ones.

Many tests were designed to help solve real issues, and they work widely. I cannot see any problem; rather than just predicting whether the model under $H_0$ or $H_1$ will be bad or good, I personally would still test both.

But different data call for different approaches. I feel you are trying to break things that are good for some issues but bad for others. Saying "every $H_0$ model is good and every $H_1$ model is bad" is like saying "all fruits are red" while clearly an orange is orange.

Guessing which field is more appropriate for solving an issue usually saves initial time (like many statistical tests for me; they are good in their own ways).

It is always about helping in some way, and I hope this helps.

user2120