
Background

In computer science, mathematics, and sometimes other fields, “esoteric” examples can be not only entertaining but also helpful for illustrating certain concepts, for example:

  • Bogosort and Slowsort are very inefficient sorting algorithms that can be used to understand properties of algorithms, in particular when compared to other sorting algorithms.

  • Esoteric programming languages demonstrate how far-reaching the concept of a programming language is and help to appreciate good programming languages.

  • The Weierstraß function and the Dirichlet function serve primarily to illustrate common misconceptions about the concept of continuity.

I am currently preparing some teaching material on hypothesis tests and think that a test with a very low power (but no other flaws) would help to illustrate the concept of statistical power. (Of course, I still have to decide myself whether a given example is didactically useful for my audience or just confusing.)

Actual Question

Are there any statistical tests with an intentionally low power? More specifically:

  • The test fits in the general framework of hypothesis tests, i.e., it works with a null hypothesis, has assumptions, and returns a (correct) p value.
  • It is not intended/proposed for serious application.
  • It has a very low power (due to an intentional design flaw, not due to a small sample size or effect size).

If you can give a fundamental argument that such a test cannot exist, I would also consider this a valid answer to my question. If, on the other hand, a plethora of such tests exists, I am interested in the most didactically efficient one, i.e., it should be easily accessible and have a striking effect.

Note that I am not asking for a general selection of statistical mistakes (cherry picking, etc.) or similar.

What I found so far

Internet searches returned nothing for me.

Every attempt to construct something like this either ended up at some (useful) existing test or in a format that is not that of a regular test. For example, I thought about a test of whether a population has a positive median that returns yes only if all sample values are positive; but that test does not return a p value and thus does not fit the usual test framework. If I just count the positive and negative signs as a test statistic (and compute the p values accordingly), I end up with the sign test, which is a reasonable test.

Wrzlprmft
    Being more mathematical, "esoteric" examples (which abound) tend to be specific counterexamples to popular misunderstandings; a number of textbooks contain such examples. As it stands, your question is essentially a "big list" type question and so is too broad (though you should note that several users have concluded the question is unclear); if you can clarify your question and narrow its scope it may fit the site better. – Glen_b Feb 17 '19 at 04:00
  • @Glen_b: I am not specifically looking for a list of statistical mistakes or similar; this would be indeed too broad and the Internet is full of those already. So far, I failed to find a single instance of what I am looking for (and nothing was proposed as a comment or answer to this question, when open), so I expect this question to yield a short list at best. Still, I narrowed it down further to test with a low power (which is honestly the only thing I can conceive while staying within the framework for regular tests). – Wrzlprmft Feb 17 '19 at 10:54
  • As for the unclarity, I tried to make a few points more restrictive or prominent, but it would probably help if one of those who think it is unclear could articulate their confusion. – Wrzlprmft Feb 17 '19 at 10:55
  • The question would admit any example, so any answer that provided an example would be a correct answer. Thanks for the edits, though. While I still think it's overly broad I think the scope is better and I will reopen. – Glen_b Feb 17 '19 at 11:40
  • I find [this question](https://stats.stackexchange.com/questions/391580/checking-if-a-coin-is-fair) about whether some number of observations of the pattern 'heads-tails-tails' indicates an unfair coin a good example. The p value is low but the strength of proof is small. – Sextus Empiricus Feb 17 '19 at 12:03
  • Low power compared to what? Lehmann gave an example of a generalized likelihood-ratio test that had lower power under any alternative hypothesis than under the null. – Scortchi - Reinstate Monica Feb 17 '19 at 17:03
  • @Glen_b: I added some criterion that allows to compare answers. This way, this should not be less broad than other questions asking for didactic explanations or examples. – Wrzlprmft Feb 17 '19 at 18:19
  • @MartijnWeterings: (I read that example more thoroughly now and deleted my previous comment.) Thanks. That indeed is an example of what I am looking for. You may want to turn this into an answer (possibly generalising or extremifying it). – Wrzlprmft Feb 17 '19 at 18:21
  • @Scortchi: *Low power compared to what?* – Well, to any reasonable test for the same scenario. — *Lehmann gave an example of a likelihood-ratio test that had lower power under any alternative hypothesis than under the null.* – I cannot find the specific test you are alluding to (Lehmann’s œuvre seems to be focused on likelihood ratio tests). Moreover, I fail to make sense of the notion of “power under the null”. – Wrzlprmft Feb 17 '19 at 18:25
  • Any of the silly estimators to which you apply Rao-Blackwellization could be used as a test statistic. For example, there's the first observation in the sample, used as an estimator of the mean. When Rao-Blackwellized, you obtain the sample mean. I had to do many exercises like this in class. Anyway, this statistic could be used instead of the sample mean in something like a $t$ test. But no, I can't think of anything directly in the form you're looking for, or I'd be writing an answer, not a comment. But there must be something, illustrating failure of a general method for test construction. – user54038 Feb 17 '19 at 18:31
  • @Wrzlprmft: It seems as if you're just asking for examples of inadmissible tests, of which there are indeed a plethora. An easy way to find a quite useless test - one whose power doesn't depend at all on the parameter value - would be to use any non-constant function of an ancillary statistic as the test statistic. For example to use the sample range in a test concerning location parameter. I think it would help to describe more precisely the lesson you want to impart. – Scortchi - Reinstate Monica Feb 17 '19 at 21:06
  • I'll dig out the Lehmann paper when I'm at a computer. The power of a test under the null is just the size of the test. – Scortchi - Reinstate Monica Feb 17 '19 at 21:09
  • An example test used in a class I was a student in (many years ago) was "roll a fair 20-sided die and reject if you roll a 1" (as part of a discussion of power curves). This of course completely ignores the data, but is a "valid" test in that it doesn't have a higher than desired type I error rate (which was 5% in the context the example was given in). – Glen_b Feb 17 '19 at 23:15
  • Lehmann (1950), "Some principles of the theory of testing statistical hypotheses", *Ann. Math. Statist.*, **21**, 1. He credits Stein for the example (which I'd forgotten), but still deserves the credit for the famous "worse than useless" verdict on the test. BTW, you ought to be able to generate worse-than-useless tests from your "reasonable" tests by changing the test statistic from $T$ to $-T$. As @user54038 suggested, what's interesting is "failure of a general method of test construction", or of intuition at least. – Scortchi - Reinstate Monica Feb 17 '19 at 23:54
  • See Romano & Siegel (1986), *Counterexamples in Probability And Statistics*, Ch. 10, for ideas. – Scortchi - Reinstate Monica Feb 20 '19 at 11:43
  • The signed rank test is indeed more powerful than the sign test, and used in the same circumstances, so that is good to land on. They aren't tests, but the Holm (and Holm-Sidak) adjustment for multiple comparisons preserves more power than Dunn's Bonferroni (and the Sidak) adjustment, and the Benjamini-Hochberg adjustment for multiple comparisons has yet more power, and does not suffer from some of the conceptual incoherence of the former. Understanding how these methods evolved is instructive: https://www.youtube.com/watch?v=oONHlua2gBY. – Alexis Feb 21 '19 at 18:20
  • The Conover-Iman test is an obscure *post hoc* pairwise test following rejection of the Kruskal-Wallis omnibus test, but has strictly *more* power than the comparable and much better known Dunn's test (also *post hoc* to Kruskal-Wallis). – Alexis Feb 21 '19 at 18:23
  • [Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality) fits the bill, but I doubt many would agree with me. It's very popular in academic literature, but virtually powerless in practice. – Aksakal Feb 21 '19 at 19:02
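Glen_b's 20-sided-die test from the comments is easy to simulate. A minimal sketch (the trial count is an arbitrary choice): a test that rejects with probability 1/20 regardless of the data has size 0.05, and its power against *every* alternative is also 0.05, because the decision never consults the sample.

```python
import random

random.seed(1)

def die_test():
    """Reject H0 iff a fair 20-sided die shows 1; the data are never consulted."""
    return random.randint(1, 20) == 1

n_trials = 200_000
# Size: rejection rate when H0 is true.
size = sum(die_test() for _ in range(n_trials)) / n_trials
# "Power": rejection rate under any alternative is distributed identically,
# since the decision never depends on the sample.
power = sum(die_test() for _ in range(n_trials)) / n_trials

print(f"size  ~ {size:.3f}")   # close to 0.05
print(f"power ~ {power:.3f}")  # also close to 0.05: power equals size
```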

2 Answers


There's a little-remarked-on corollary to the Neyman–Pearson lemma (proof in Geisser (2006), *Modes of Parametric Statistical Inference*, Ch. 4.4): the test $\phi$ given by $$ \phi(x) = \begin{cases} 0\ & \text{when $f_0(x) < kf_1(x)$} \\ 1\ & \text{when $f_0(x) > kf_1(x)$} \end{cases} $$ with $k$ chosen such that $$ \operatorname{E}\phi(X)=\alpha $$ under the null, is the least powerful level-$\alpha$ test of the null hypothesis $H_0:$ density $f_0$ vs $H_1:$ density $f_1$ from data $x$.

From this result you can derive uniformly least powerful, locally least powerful, uniformly least powerful similar, & least powerful "totally biased" tests (I mean those with lower power under any alternative than under the null). If you already have a uniformly most powerful, &c. test, simply multiply your test statistic by -1 to maintain the partitioning of the sample space it induces while reversing the ordering of the partitions.
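As a concrete sketch of the sign-flipping trick (my illustration, not part of the original answer): for a single observation $X \sim N(\mu, 1)$ and $H_0: \mu = 0$ vs $H_1: \mu > 0$, the UMP test rejects when $X$ exceeds the upper-$\alpha$ normal quantile; negating the statistic keeps the size at $\alpha$ but drives the power *below* $\alpha$ for every $\mu > 0$, i.e., a totally biased test.

```python
from statistics import NormalDist

nd = NormalDist()          # standard normal
alpha = 0.05
z = nd.inv_cdf(1 - alpha)  # upper-alpha critical value

def power_ump(mu):
    """UMP test of H0: mu=0 vs H1: mu>0 from one N(mu,1) draw: reject iff X > z."""
    return 1 - nd.cdf(z - mu)

def power_flipped(mu):
    """Same partition of the sample space, reversed ordering: reject iff -X > z."""
    return nd.cdf(-z - mu)

for mu in (0.0, 0.5, 1.0, 2.0):
    print(mu, round(power_ump(mu), 4), round(power_flipped(mu), 4))
```

At $\mu = 0$ both tests reject with probability $\alpha$; for any $\mu > 0$ the flipped test's rejection probability falls below $\alpha$ while the UMP test's rises above it.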


Perhaps, as @user54038 suggests, "failure of a general method of test construction" might be more interesting. Lehmann (1950), "Some principles of the theory of testing statistical hypotheses", Ann. Math. Statist., 21, 1, attributes the following example to Stein:

Let $X$ be a random variable capable of taking on the values $0, \pm 1, \pm 2$ with probabilities as indicated:

$$ \begin{array}{r c c c c c} & -2 & 2 & -1 & 1 & 0 \\ \hline \text{Hypothesis $H$:} & \frac{\alpha}{2} & \frac{\alpha}{2} & \frac{1}{2} - \alpha & \frac{1}{2} - \alpha & \alpha\\ \hline \text{Alternatives:} & pC & (1-p)C & \frac{1-C}{1-\alpha}\left(\frac{1}{2}-\alpha\right) & \frac{1-C}{1-\alpha}\left(\frac{1}{2}-\alpha\right) & \alpha\frac{1-C}{1-\alpha}\\ \end{array} $$ Here $\alpha$ and $C$ are constants with $0 < \alpha \leq \frac{1}{2}$ and $\frac{\alpha}{2-\alpha}< C <\alpha$, and $p$ ranges over the interval $[0,1]$.

It is desired to test the hypothesis $H$ at significance level $\alpha$. The likelihood ratio test rejects when $X=\pm2$, and hence its power is $C$ against each alternative. Since $C<\alpha$, this test is literally worse than useless, for a test with power $\alpha$ can be obtained without observing $X$ at all, simply by the use of a table of random numbers.

Note that it's the generalized likelihood ratio test he's considering, with $p$ in the role of a nuisance parameter to be maximized over. So when $X=-2$ or $X=2$, $\hat p=1$ or $\hat p=0$ respectively, & the likelihood ratio comes to $\frac{2C}{\alpha}$ in either case; for any other value of $X$ it's the lower value of $\frac{1-C}{1-\alpha}$.
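The arithmetic in Stein's example can be checked with exact rational arithmetic. The sketch below picks the arbitrary values $\alpha = 1/20$ and $C = 1/25$ (any pair satisfying the stated constraints would do) and confirms that both rows are probability distributions, that the GLR test rejects exactly when $X = \pm 2$, and that its power is $C < \alpha$ under every alternative:

```python
from fractions import Fraction as F

# Arbitrary choices satisfying Stein's constraints:
# 0 < alpha <= 1/2 and alpha/(2 - alpha) < C < alpha.
alpha, C = F(1, 20), F(1, 25)
assert alpha / (2 - alpha) < C < alpha

def null_pmf(x):
    return {-2: alpha / 2, 2: alpha / 2,
            -1: F(1, 2) - alpha, 1: F(1, 2) - alpha, 0: alpha}[x]

def alt_pmf(x, p):
    scale = (1 - C) / (1 - alpha)
    return {-2: p * C, 2: (1 - p) * C,
            -1: scale * (F(1, 2) - alpha), 1: scale * (F(1, 2) - alpha),
            0: alpha * scale}[x]

support = [-2, -1, 0, 1, 2]
assert sum(null_pmf(x) for x in support) == 1
for p in (F(0), F(1, 3), F(1)):
    assert sum(alt_pmf(x, p) for x in support) == 1

# The GLR statistic sup_p f_p(x)/f_H(x) equals 2C/alpha at x = +/-2
# (attained at p = 1 or p = 0 respectively) and (1-C)/(1-alpha) elsewhere;
# since the former is larger, the level-alpha GLR test rejects iff X = +/-2.
assert 2 * C / alpha > (1 - C) / (1 - alpha)

size = null_pmf(2) + null_pmf(-2)                  # = alpha
power = lambda p: alt_pmf(2, p) + alt_pmf(-2, p)   # = C for every p
print(size, power(F(1, 2)))  # alpha and C, with C < alpha
```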

Scortchi - Reinstate Monica

(Related to the comment by @Scortchi)

Suppose $X \sim N(\mu, 1)$ and we want to test the hypothesis

\begin{align*} H_0&: \mu = 0 \\ H_1&: \mu \neq 0 \end{align*}

For the sake of esotericism, let's augment our data with an independent "coin flip" $Z \sim Bernoulli(p)$ where $p$ is known and no smaller than the significance level $\alpha$ (i.e. $p \in [\alpha, 1]$). Consider rejection regions of the form:

$$R = \left\{(X, Z) \ \middle| \ z = 1 \ \wedge \ |x| > \Phi^{-1}\left(1 - \frac{\alpha}{2p}\right) \right\}$$

By construction, this is a valid test of size $\alpha$.

\begin{align*} P(X\in R \ | \ \mu=0) &= P\left(Z=1 \ , \ |X| > \Phi^{-1}\left(1 - \frac{\alpha}{2p}\right)\right) \\ &= P(Z=1)P\left(|X| > \Phi^{-1}\left(1 - \frac{\alpha}{2p}\right)\right) \\ &= p\cdot\frac{\alpha}{p} = \alpha \end{align*}

The power of this test, however, can never exceed $p$. For instance, suppose that our observed data are $(x, z) = (1000000, 0)$. It is obvious that the null hypothesis should be rejected, but since our coin "shows tails" we fail to reject. Setting $p=\alpha$ leads to an even sillier example in which the rejection region doesn't depend on $X$ at all, yet is still a valid rejection region with size $\alpha$.
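A quick simulation confirms both claims (the values $\alpha = 0.05$ and $p = 0.4$ are arbitrary illustrative choices): the size comes out at $\alpha$, while even an enormous effect cannot push the power above $p$.

```python
import random
from statistics import NormalDist

random.seed(2)
nd = NormalDist()
alpha, p = 0.05, 0.4                      # arbitrary choices with p >= alpha
crit = nd.inv_cdf(1 - alpha / (2 * p))    # upper alpha/(2p) normal quantile

def reject(mu):
    """One draw of (X, Z); reject iff the coin shows heads AND |X| > crit."""
    x = random.gauss(mu, 1)
    z = random.random() < p               # Bernoulli(p) coin flip
    return z and abs(x) > crit

n = 200_000
size = sum(reject(0.0) for _ in range(n)) / n
power_huge = sum(reject(1000.0) for _ in range(n)) / n

print(f"size            ~ {size:.3f}")        # close to alpha = 0.05
print(f"power at mu=1e3 ~ {power_huge:.3f}")  # capped near p = 0.4
```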

A similar problem could be given as homework by changing the intersection to a union in the rejection region. The resulting test is uniformly less powerful than the one without $Z$, but more reasonable in the sense that its power is not bounded away from 1.

knrumsey