21

Is there an example where two different defensible tests with proportional likelihoods would lead one to markedly different (and equally defensible) inferences, for instance, where the p-values are orders of magnitude apart, but the power against the alternatives is similar?

All the examples I see are very silly, comparing a binomial with a negative binomial, where the p-value of the first is 7% and of the second 3%, which are "different" only insofar as one makes binary decisions at arbitrary significance thresholds such as 5% (which, by the way, is a pretty low standard for inference) and does not even bother to look at power. If I change the threshold to 1%, for instance, both lead to the same conclusion.
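(For concreteness, the standard version of that example is 9 heads in 12 flips, testing $\theta = 0.5$ against $\theta > 0.5$, with either $n = 12$ fixed or flipping until the 3rd tail; a quick check in R, assuming that setup:)

    1 - pbinom(8, size = 12, prob = 0.5)  # binomial: P(9 or more heads in 12) ~ 0.073
    1 - pnbinom(8, size = 3, prob = 0.5)  # neg. binomial: P(9+ heads before the 3rd tail) ~ 0.033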

I've never seen an example where it would lead to markedly different and defensible inferences. Is there such an example?

I'm asking because I've seen so much ink spent on this topic, as if the Likelihood Principle were something fundamental in the foundations of statistical inference. But if the best examples available are silly ones like the one above, the principle seems completely inconsequential.

Thus, I'm looking for a very compelling example, where if one does not follow the LP the weight of evidence would be overwhelmingly pointing in one direction given one test, but, in a different test with proportional likelihood, the weight of evidence would be overwhelmingly pointing in an opposite direction, and both conclusions look sensible.

Ideally, one could demonstrate we can have arbitrarily far apart, yet sensible, answers, such as tests with $p =0.1$ versus $p= 10^{-10}$ with proportional likelihoods and equivalent power to detect the same alternative.

PS: Bruce's answer does not address the question at all.

  • 5
    When performing significance testing, one can always change the decision by changing the threshold. Could you therefore explain what you mean by "markedly," "silly," or "compelling"? BTW, you seem to be reading the [Wikipedia article](https://en.wikipedia.org/wiki/Likelihood_principle). – whuber Nov 23 '18 at 20:18
  • 2
    Welcome to CV, @statslearner. Can you give an example of one or more *specific* approaches to inference that do not use the likelihood principle which you would like to see contrasted? – Alexis Nov 23 '18 at 20:20
  • 1
    @whuber ideally I would like to see that you can construct arbitrarily different answers such as, if you want to use p-values, something like $p=0.5$ versus $p=10^{-5}$, and both computations would still seem defensible. –  Nov 23 '18 at 20:21
  • @Alexis the only example I'm aware of is using p-values. MLE and Bayes follow the LP. –  Nov 23 '18 at 20:22
  • 3
    I cannot follow that comment because $p=10^5$ makes no sense. Regardless, have you considered just changing the numbers given in the Wikipedia example? – whuber Nov 23 '18 at 20:22
  • @whuber it's $10^{-5} = 0.00001$. Yes, but the inferences just get closer, not further apart. –  Nov 23 '18 at 20:24
  • 6
    The significant difference with practical implications is the processing of stopping rules: under the LP they do not matter, outside the LP they do. Check Berger & Wolpert (1987) for details. – Xi'an Nov 23 '18 at 20:52
  • 1
    @Xi'an I'm reading the paper, but also found no example where p-values differ by at least two orders of magnitude. Can you construct such example? –  Nov 23 '18 at 21:02
  • For $n$ observations $X_i$ from an exponential distribution with mean $\mu,$ you might consider comparison of the minimum observation $X_{1:n} = X_{(1)}$ with the sample mean $\bar X,$ where the latter is MLE. [This](https://math.stackexchange.com/questions/3005259/pivots-for-exponential-distribution/3005813#3005813) recent post on another site explores CIs for rate $\lambda = 1/\mu,$ where CIs based on $\bar X$ are very much shorter. Wikipedia on 'exponential distribution' has some of the distribution theory. – BruceET Nov 23 '18 at 23:43
  • @BruceET I'm not sure I understand, are the likelihood ratios proportional in this example? Could you elaborate as an answer? –  Nov 24 '18 at 02:09
  • The minimum has an exponential distribution; the mean has a gamma distribution; both distributions are expressible in terms of chi-squared distributions. For example, if you are testing $H_0: \mu \le \mu_0$ vs $H_a: \mu > \mu_0,$ then it is pretty clear in which tail of the dist'ns the critical values must lie. – BruceET Nov 24 '18 at 03:14
  • @BruceET do they have the same degrees of freedom? I'm not sure your example falls into the LP. But if does, can you get answers that differ by orders of magnitude? –  Nov 24 '18 at 03:58
  • 1
    We may be dealing with semantical differences as to what a LR test is. [NIST](https://www.itl.nist.gov/div898/handbook/apr/section2/apr233.htm) gives a fairly broad definition, the (fragmentary) Wikipedia article on 'likelihood ratio tests' takes a slightly different point of view. // If you look at the minimum of exponentials in terms of chi-sq, that's 2df as in my earlier link; if you look at the mean in terms of chi-sq, that's $2n$ df. // The difference is large; 'orders of magnitude' depends on what you're looking at. In the link, the avg lengths of the CIs is about an order of magnitude. – BruceET Nov 24 '18 at 05:34
  • My example with exponential mean (or rate) is only a suggestion, not intended as **the** answer. Maybe someone has a suggestion that you will find more elegant. – BruceET Nov 24 '18 at 05:38
  • @BruceET if they have different degrees of freedom, the likelihood principle does not apply. –  Nov 24 '18 at 07:59
  • 1
    @BruceET You need to show that the two likelihoods are proportional to each other (differ only by a constant), otherwise the likelihood principle does not apply. –  Nov 24 '18 at 09:00
  • @Xi'an I was reading some of your blog posts, and I take that you also think the LP is meaningless in practice... maybe you could sketch an answer? –  Nov 29 '18 at 18:33
  • @MartijnWeterings do you think a p-value of 7% versus a p-value of 3% that came from procedures with different power to capture the alternative provide very different evidence? If you give me an example where we have a p-value of $10^{-5}$ versus a p-value of $0.1$ and where the power function is similar, yet the likelihoods are proportional and both inferences are defensible, then I would say "wait... that's a real puzzle." –  Dec 02 '18 at 20:56
  • @MartijnWeterings I rewrote some passages, it might be clearer now. –  Dec 02 '18 at 21:00
  • @MartijnWeterings if it's easy then that's the answer I'm looking for. I need it because all examples I saw so far were both artificial and inconsequential. –  Dec 02 '18 at 21:03
  • @MartijnWeterings the technical condition I'm looking for is "demonstrate answers can be arbitrarily far apart", but I'm happy with a simple extreme example. If you want I can make this as precise as needed, for instance, give me an example where $p_{1} = 10^{-5}$, $p_{2} = 0.1$, power = 80% on both. –  Dec 02 '18 at 21:12
  • @MartijnWeterings I rewrote again, hope it's clear now. –  Dec 02 '18 at 21:33
  • @MartijnWeterings look, considering you think it's easy to come up with an example, if you provide any mathematical example that adheres to the mathematical conditions of what I described, I will accept your answer regardless of how crazy your proposed procedure is, so don't worry about the "defensible". –  Dec 02 '18 at 21:40
  • @MartijnWeterings "Why does the power need to be the same?" otherwise the evidential value of the tests is not the same, this is stats 101. –  Dec 02 '18 at 21:42
  • @MartijnWeterings the LP says nothing about what happens when you violate the LP, it just says that you shouldn't. We are talking about two procedures that violate the LP. The power restriction is there to make sure the two tests have the same evidential value; otherwise you are just trading off one type of error against another, and thus there is no real puzzle or paradox in reaching different decisions (you just chose a different trade-off). –  Dec 02 '18 at 22:07
  • @MartijnWeterings are the likelihoods proportional in both cases? –  Dec 02 '18 at 22:09
  • @MartijnWeterings you can make discrete examples where this is not a problem. But if it is indeed (almost) impossible as you claim, then this shows the LP is inconsequential, since it would be (almost) impossible to have two tests with overwhelming evidence in different directions; you are just trading off the type I and type II errors of different alternatives. –  Dec 02 '18 at 22:13
  • @MartijnWeterings the likelihood of the statistics of your example do not seem the same. –  Dec 02 '18 at 22:17
  • @MartijnWeterings the statistics (how you measure the discrepancy) do not have the same *exact* pdf, if they had you would not be able to obtain different answers. –  Dec 02 '18 at 22:19
  • @MartijnWeterings if the two *decision procedures* have the *same exact* likelihood under the null hypothesis they lead to the same p-values by mathematical necessity! You are probably conflating the pdf of something else with the likelihood of the *test* itself (including how you measure extreme). –  Dec 02 '18 at 22:29
  • @MartijnWeterings anyway, feel free to write an answer; if the powers are not exactly equal, at least very similar. If you can show this is impossible, that's even better, because that's what I suspected. –  Dec 02 '18 at 22:31

5 Answers

11

Think about a hypothetical situation when a point null hypothesis is true but one keeps sampling until $p<0.05$ (this will always happen sooner or later, i.e. it will happen with probability 1) and then decides to stop the trial and reject the null. This is an admittedly extreme stopping rule but consider it for the sake of the argument.

This moronic procedure will have 100% Type I error rate, but there is nothing wrong with it according to the Likelihood Principle.

I'd say this does count as "really" mattering. You can of course choose any $\alpha$ in this argument. Bayesians can use a fixed cut-off on the Bayes factor if they please; the same logic applies. The main lesson here is that you cannot adhere to the LP and also have an error-rate guarantee. There is no free lunch.
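A minimal R sketch of this procedure (the cap on the sample size is hypothetical, just to keep the simulation finite; the rejection rate keeps climbing toward 1 as the cap grows):

    one_trial <- function(n_max = 1000) {
      s <- 0
      for (n in 1:n_max) {
        s <- s + rnorm(1)                          # one more observation under the null
        if (abs(s / sqrt(n)) > 1.96) return(TRUE)  # z-test p < 0.05: stop and reject
      }
      FALSE                                        # never "significant" within the cap
    }
    set.seed(1)
    mean(replicate(1000, one_trial()))             # far above the nominal 0.05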

amoeba
  • 93,463
  • 28
  • 275
  • 317
  • 4
    I was thinking of this example as well. But I did not mention it because it is indeed moronic. But actually, it is what happens in practice indirectly and informally. – Sextus Empiricus Dec 04 '18 at 21:40
  • To pair this with my misunderstanding of what is claimed with the likelihood principle discussed in my answer, supposing you used a $z$ statistic, then the $P( |z_{\text{final}}| > 1.96) = 1$...which means your stopping rule certainly affects the distribution of the final observed statistic, even though each individual observation follows the same distribution. So my confusion about the claims made about the likelihood principle is how someone can claim that the stopping rule does not alter the likelihood of the final statistic. – Cliff AB Dec 05 '18 at 01:14
  • @CliffAB But who claims that it does not? Stopping rule clearly affects the likelihood _conditioned on the stopping rule_. I don't think this is under debate. – amoeba Dec 05 '18 at 08:13
  • @CliffAB are you suggesting that the p-value is a statistic? How would you express $\mathcal{L}(p|x)$? And does it matter that the statistic is altered? The point is that you *do* have equal likelihood functions *for the raw data* in at least some (natural) representation (namely binomial and negative binomial). Of course you can always alter the statistic such that the two different methods have different likelihood functions, and then claim that the LP is not violated because the likelihood functions aren't the same. – Sextus Empiricus Dec 05 '18 at 10:26
  • 1
    What are the 2 statistics and their likelihoods in your example? In the neg. binomial vs binomial case we have: 1) statistic 1, the number of trials until 3 heads, with a negative binomial likelihood; 2) statistic 2, the number of heads in $n$ trials, with a binomial likelihood. In your example, I don't see what the two statistics are and whether they have proportional likelihoods. –  Dec 05 '18 at 19:14
  • 1
    In your example it would probably be the "number of trials until p<0.05", which I highly doubt is proportional to the binomial, so I'm not sure your example is valid, Amoeba. –  Dec 05 '18 at 19:17
  • @statslearner2 why can't the statistic be $S,N$ a vector with the number of successes and the total number of trials? – Sextus Empiricus Dec 05 '18 at 20:45
  • 1
    I don't think the likelihood principle says "there is nothing wrong with it." The likelihood principle filters out bad procedures. The fact that the procedure does not obey the likelihood principle is not the same as it being *endorsed* by the likelihood principle. A Bayesian analysis of this sequential testing problem, which of course does obey the likelihood principle, has perfectly fine properties, because it will not implement the "moronic" procedure you describe. – guy Dec 05 '18 at 23:28
  • Continuing on this, your claim about Bayes factors is also not a criticism of the likelihood principle or Bayes. A Bayesian analysis of sequential testing *does not use a fixed cutoff* for the Bayes factor. The cutoff for the associated test statistic grows like $\sqrt {\log n}$ if we assign prior probability $\pi$ to the null. – guy Dec 05 '18 at 23:32
  • To clarify on my last comment, I don't mean that we impose externally that the cutoff should grow like $\sqrt{\log n}$, I mean that this occurs naturally just from doing the math and computing the decision rule we get when we reject if the probability that the null is false exceeds (say) 0.5. – guy Dec 05 '18 at 23:42
  • @guy Thanks for your comments. I am not sure I fully understand what you are saying about BF in this scenario. Can you point me to some resource that discusses this? (That said, I see that my answer is unsatisfactory; I will either edit it or eventually delete.) – amoeba Dec 07 '18 at 22:00
  • 3
    @amoeba consider $\theta \sim N(0,\tau^{-1})$ under the alternative or $\theta = 0$ under the null, with $Y_i \sim N(\theta,1)$. It is easy to show that the log of the Bayes factor is roughly $\frac 1 2 [\log(\tau / n) + Z_n^2]$ where $Z_n$ is the usual $Z$ test statistic. Rejecting when the Bayes factor is larger than $1$ is then equivalent to rejecting when $|Z_n| > O(\sqrt{\log n})$. Under the null, this is not guaranteed to happen in the sequential testing setting (c.f. the law of iterated logarithm); hence, the Bayesian procedure will not fall victim to the problem you described. – guy Dec 07 '18 at 23:11
  • Hi amoeba, did you have a chance to read my previous comments? I don't think your example is valid, in the sense of providing two tests with proportional likelihoods that lead to different answers. What would be the two tests and their likelihoods in your example? –  Dec 11 '18 at 23:01
  • 1
    @statslearner2 I had little time over the last days. I keep thinking about this issue. I might prefer to delete this answer eventually. – amoeba Dec 14 '18 at 14:39
  • @amoeba, I remember a discussion somewhere on CV (probably in comments to a thread) on how come Bayesians see no problem (roughly speaking) with such moronic procedures. If I remember correctly, you were partaking in the discussion on the side of asking the question. Could you give a link to it? I think the problem was not fully resolved there (in my perception at least), so I was going to ask the same question properly. Your current answer is relevant in that regard, though it only provides a case and not an explanation why it is so / why the Bayesians see no problem. – Richard Hardy Jan 08 '19 at 08:01
  • @amoeba, ping... – Richard Hardy Jan 16 '19 at 14:27
4

Disclaimer: I believe this answer is at the core of the entire argument, so it is worth discussing, but I haven't fully explored the issue. As such, I welcome corrections, refinements and comments.

The most important aspect concerns sequentially collected data. For example, suppose you observed binary outcomes and saw 10 successes and 5 failures. The likelihood principle says that you should come to the same conclusion about the probability of success regardless of whether you collected data until you had 10 successes (negative binomial) or ran 15 trials, of which 10 were successes (binomial).
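(A quick R check of that equivalence with these numbers, 10 successes and 5 failures: the two likelihoods differ only by a constant factor in $p$, which is exactly the situation the likelihood principle addresses.)

    p <- seq(0.05, 0.95, by = 0.05)
    lik_binom  <- dbinom(10, size = 15, prob = p)   # 10 successes in 15 fixed trials
    lik_nbinom <- dnbinom(5, size = 10, prob = p)   # 5 failures before the 10th success
    unique(round(lik_binom / lik_nbinom, 12))       # one constant ratio: 1.5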

Why is this of any importance?

Because according to the likelihood principle (or at least, a certain interpretation of it), it's totally fine to let the data influence when you're going to stop collecting data, without having to alter your inference tools.

Conflict with Sequential Methods

The idea of using your data to decide when to stop collecting data, without altering your inferential tools, flies completely in the face of traditional sequential analysis methods. The classic example is the methods used in clinical trials. In order to reduce potential exposure to harmful treatments, data are often analyzed at interim points before the trial is complete. If the trial hasn't finished yet, but the researchers already have enough data to conclude that the treatment works or is harmful, medical ethics tells us we should stop the trial: if the treatment works, it is ethical to stop and start making the treatment available to non-trial patients; if it is harmful, it is ethical to stop so that trial patients are no longer exposed to it.

The problem is now we've started to do multiple comparisons, so we've increased our Type I error rate if we do not adjust our methods to account for the multiple comparisons. This isn't quite the same as traditional multiple comparisons problems, as it's really multiple partial comparisons (i.e., if we analyze the data once with 50% of the data collected and once with 100%, these two samples clearly are not independent!), but in general the more comparisons we do, the more we need to change our criteria for rejecting the null hypothesis to preserve the type I error rate, with more comparisons planned requiring more evidence to reject the null.

This puts clinical researchers in a dilemma: do you check your data frequently, but then raise the evidence required to reject the null, or do you check infrequently, increasing your power but potentially acting suboptimally with regard to medical ethics (i.e., possibly delaying the product's arrival to market or exposing trial patients unnecessarily long to a harmful treatment)?

It is my (perhaps mistaken) understanding that the likelihood principle appears to tell us that it doesn't matter how many times we check the data; we should make the same inference. This basically says that all the approaches to sequential trial design are completely unnecessary: just use the likelihood principle and stop once you've collected enough data to make a conclusion. Since you don't need to alter your inference methods to adjust for the number of analyses you've planned, there is no trade-off dilemma between the number of checks and power. Bam, the whole field of sequential analysis is solved (according to this interpretation).

Personally, what is very confusing about this to me is a fact that is well known in the sequential design field, but fairly subtle: the likelihood of the final test statistic is largely altered by the stopping rule; basically, the stopping rules change the probability in a discontinuous manner at the stopping points. Here is a plot of such a distortion; the dashed line is the PDF of the final test statistic under the null if data are only analyzed after all data are collected, while the solid line gives the distribution of the test statistic under the null if you check the data 4 times with a given rule.
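A minimal R sketch of such a stopping rule (hypothetical details: 4 equally spaced looks, stopping at the first $|z| > 1.96$) reproduces this kind of distortion:

    final_z <- function(looks = c(25, 50, 75, 100)) {
      x <- rnorm(max(looks))               # all data generated under the null
      for (k in looks) {
        z <- sum(x[1:k]) / sqrt(k)         # z statistic at the k-th interim look
        if (abs(z) > 1.96) break           # stopping rule: halt at the first rejection
      }
      z                                    # the final observed statistic
    }
    set.seed(1)
    sims <- replicate(10000, final_z())
    hist(sims, breaks = 60, freq = FALSE)  # solid line: mass piles up just beyond +/-1.96
    curve(dnorm(x), add = TRUE, lty = 2)   # dashed line: fixed-sample N(0,1) reference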

With that said, it's my understanding that the likelihood principle seems to imply that we can throw out all we know about Frequentist sequential design and forget about how many times we analyze our data. Clearly, the implications of this, especially for the field of clinical trial design, are enormous. However, I haven't wrapped my mind around how one justifies ignoring how stopping rules alter the likelihood of the final statistic.

Some light discussion can be found here, mostly on the final slides.

Cliff AB
  • 17,741
  • 1
  • 39
  • 84
  • 2
    +1. I find it conceptually easier to think about a hypothetical situation when the null hypothesis is true but one keeps sampling until $p<0.05$ (this will _always_ happen sooner or later, i.e. it will happen with probability 1) and _then_ decides to stop the trial. This moronic procedure will have 100% Type I error rate, even though it complies with the LP. – amoeba Dec 04 '18 at 20:20
  • @amoeba: I agree that your example is pretty straightforward (+1). The goal of my answer is to emphasize why there is even a discussion. I think the answer is that *if* the implications and interpretations of the LP were correct, it would mean that clinical trials would no longer have to choose between maximal power and unnecessary exposure, which would be an absolutely huge gain. In general it would also free researchers from needing to guess the proper sample size in advance, which would greatly improve the utility of statistical tests. – Cliff AB Dec 04 '18 at 23:38
  • Well, I think the whole framework of frequentist testing is inconsistent with the LP, and that's just how it is. One uses frequentist testing if one wants a guarantee on the error rates. Turns out that this is inconsistent with LP. See also Lindley's paradox and all that. Well, tough. I used to be excited about these matters, but now I am not anymore. There is no free lunch; one has to make some choices. Note that a lot of Bayesian procedures [violate LP as well](https://stats.stackexchange.com/questions/194448). – amoeba Dec 05 '18 at 08:18
  • *"the likelihood of the final test statistic is largely altered by the stopping rule"* The pdf is changed, and also the likelihood (but only by a constant), but you may still end up with a likelihood functions that are the same up to a constant of proportionality. E.g. the binomial distribution and the negative binomial distribution for $k$ successes and $n$ trials have both a likelihood $\mathcal{L}(p|n,k)$ that is proportional to $\propto p^kp^{n-k}$ – Sextus Empiricus Dec 05 '18 at 10:33
3

Outline of LR tests for exponential data.

Let $X_1, X_2, \dots, X_n$ be a random sample from $\mathsf{Exp}(\text{rate} =\lambda),$ so that $E(X_i) = \mu = 1/\lambda.$ For $x > 0,$ the density function is $f(x) = \lambda e^{-\lambda x}$ and the CDF is $F(x) = 1 - e^{-\lambda x}.$

1. Test statistic is sample minimum.

Let $V = X_{(1)} = \min_n (X_i).$ Then $V \sim \mathsf{Exp}(n\lambda).$ As an outline of the proof, $$P(V > v) = P(X_1 > v, \dots, X_n > v) = \left[e^{-\lambda v}\right]^n= e^{-n\lambda v},$$ so that $P(V \le v) = 1 - e^{-n\lambda v},$ for $v > 0.$

To test $H_0:\mu \le \mu_0$ against $H_a: \mu > \mu_0,$ at level $\alpha = 5\%,$ we regard $V$ as a single observation from its exponential distribution. We find that the log likelihood ratio indicates rejection when $V > c,$ where $P(V > c\, |\, \mu = \mu_0) = 0.05.$

For the specific case in which $n = 100$ and $\mu_0 =10,\, \lambda_0 = 0.1,$ we have exponential rate $n/\mu_0 = 100/10 = 10,$ so that $c = 0.2996$ from R, where the exponential distribution is parameterized by the rate.

qexp(.95, 10)
[1] 0.2995732
1 - pexp(0.2996, 10)
[1] 0.04998662

Accordingly, the power against the alternative $\mu_a = 100$ (rate $n/\mu_a = 1$) is about 74%.

1 - pexp(0.2996, 1)
[1] 0.7411146

2. Test statistic is the sample mean.

Oxford U. class notes (second page) show that the likelihood ratio test of $H_0: \mu \le \mu_0$ against $H_a: \mu > \mu_0$ at the 5% level of significance rejects for $\bar X > c,$ where $P(\bar X > c\, |\, \mu = \mu_0) = 0.05.$ Furthermore, one can show using moment generating functions that $\bar X \sim \mathsf{Gamma}(n, n\lambda).$

For the specific case in which $n = 100$ and $\mu_0 =10,\, \lambda_0 = 0.1,$ we have $\bar X \sim \mathsf{Gamma}(100, 10),$ so that $c = 11.7.$

qgamma(.95, 100, 10)
[1] 11.69971
1 - pgamma(11.7, 100, 10)
[1] 0.04997338

Accordingly, power against the alternative $\mu_a = 14$ is about 95.6%.

1 - pgamma(11.7, 100, 100/14)
[1] 0.9562513

Clearly, for purposes of testing hypotheses about the exponential mean $\mu,$ the information in the sufficient statistic $\bar X$ is much greater than the information in the sample minimum.
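(As the comments below point out, the two statistics do not give proportional likelihoods in $\lambda$: based on $V$ alone the likelihood is $\propto \lambda e^{-n\lambda v}$, while based on $\bar X$ it is $\propto \lambda^n e^{-n\lambda \bar x}$, so the likelihood principle does not connect the two tests. A short R sketch with hypothetical observed values, chosen so both likelihoods peak at $\lambda = 0.1$, makes the difference in shape visible:)

    n <- 100; v <- 0.1; xbar <- 10
    lambda <- seq(0.001, 0.4, length.out = 400)
    lik_min  <- n * lambda * exp(-n * lambda * v)           # V ~ Exp(n * lambda)
    lik_mean <- dgamma(xbar, shape = n, rate = n * lambda)  # Xbar ~ Gamma(n, n * lambda)
    plot(lambda, lik_min / max(lik_min), type = "l")        # broad curve
    lines(lambda, lik_mean / max(lik_mean), lty = 2)        # sharp spike: not proportional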

BruceET
  • 47,896
  • 2
  • 28
  • 76
  • I don't think this addresses the question at all. Are the two likelihoods proportional? You first need to show the likelihoods of the two experiments are proportional, otherwise the likelihood principle does not apply. Second, in this example the two tests lead to the same conclusion, so it's even more underwhelming than the example of the binomial versus negative binomial. –  Nov 24 '18 at 08:59
  • I just checked the document; the likelihoods are **not** proportional, since the first likelihood has $v$ in the exponent and the other has $\sum x_i$. Thus the likelihood principle does not apply here, and it is fine for the two tests to lead to different conclusions according to the likelihood principle. –  Nov 24 '18 at 09:04
  • 2
    Bruce, just to clarify what the liklihood principle states: it says that if you have two experiments where the likelihoods differ only by a constant, then you should derive the same conclusion from them. This happens in the binomial versus negative binomial case, where they differ only in the binomial coefficient part (constant). Your example shows two tests where their likelihoods do not differ only by a constant, so the LP does not apply. –  Nov 24 '18 at 09:08
  • @statslearner2 the likelihood function for observing a sample $x_1,...,x_n$ is: $$f(x_1,...,x_n) = \prod_{i=1}^n \lambda e^{-\lambda x_i}$$ This is the same whether you select the minimum or the mean as the criterion to perform the test. The violation that occurs here can be seen as the type in which the definition of 'extreme cases' is different and the integration to compute the p-value is done differently. – Sextus Empiricus Dec 03 '18 at 13:22
3

Violation by different pdf functions $f(x,\theta)$ and $g(x,\theta)$

This case is an example of a 'violation' because the probability density functions $f(x,\theta)$ and $g(x,\theta)$ are intrinsically different. Even when $f$ and $g$ differ, they may still relate to the likelihood principle, because at a fixed measurement $x$ they give the same function of $\theta$ up to scaling. The difference opens up a possibility for "violations".


The coin flip with or without optional stopping rule

The coin flip with or without an optional stopping rule is a typical example. The pdf is binomial or negative binomial; these are different pdf functions, and they lead to different calculations of p-values and confidence intervals, but for a fixed sample/measurement they lead to the same likelihood function (up to scaling).

$$\begin{array}{rcrl} f_{\text{Negative Binomial}}(n|k,p) &=& {{n-1}\choose{k-1}}&p^k(1-p)^{n-k} \\ f_{\text{Binomial}}(k|n,p) &=& {{n}\choose{k}}&p^k(1-p)^{n-k} \end{array}$$


More extreme example

Consider some measurement of $X$ which is distributed as

$$\mathcal{L}(\theta | x) = f(x|\theta) = \begin{cases} 0 & \text{ if } \quad x < 0 \\ a & \text{ if }\quad 0 \leq x < 1 \\ (1-a) \theta \exp(-\theta (x-1)) & \text{ if }\quad x \geq 1 \end{cases}$$

where $a$ is some known parameter that depends on the type of experiment, and $\theta$ is some parameter that may be unknown and could be inferred from the measurement $x$.

For any given $x$ and $a$, the likelihood function is proportional to the same function, which is independent of $a$:

  • If $x<1$ then $\mathcal{L}(\theta | x) \propto 1$
  • If $x\geq 1$ then $\mathcal{L}(\theta | x) \propto \theta \exp(-\theta (x-1))$

But despite the identical likelihood function, the p-value can vary widely depending on the experiment (i.e. the value of $a$). For instance, when you measure $x=2$ and test $H_0:\theta = 1$ against $H_a:\theta < 1$, the p-value is

$$P(X>2|\theta = 1) = \frac{(1-a)}{\exp(1)} $$
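A quick numerical sketch, with two hypothetical values of $a$, shows how far apart these p-values can get while the likelihood function stays the same:

    p_value <- function(a) (1 - a) / exp(1)  # P(X > 2 | theta = 1) from above
    p_value(0.01)    # ~0.364
    p_value(0.999)   # ~0.00037: orders of magnitude apart, identical likelihood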


Intuition: The reason for violation in these cases is that p-values and hypothesis tests are not solely based on the likelihood function for the particular observed value $x$.

The p-value is not calculated from the likelihood $\mathcal{L}(\theta|x)$ with $x$ fixed, but from the pdf $f(x|\theta)$ with $\theta$ fixed, which is a different slice. Confidence intervals, p-values, and hypothesis tests are different things from the information in likelihood ratios.

p-values are not really evidence: The p-value relates to the type I error, which is a measure relating to an ensemble of measurements rather than to a single measurement. This type I error or p-value is not the same as the 'evidential meaning' in Birnbaum's 'foundations of statistical evidence'. This relates a lot to the problems with p-values and scientists searching for outcomes that are merely statistically significant rather than for important effects.

Do we need examples where inferences are markedly different? The extreme case above is a contrived example. Such a case, or anything with a similarly extreme difference, does not, of course, occur easily in practice. More often the difference will be small, as in the cases that you refer to as silly.

To ask for examples where the likelihood principle 'really matters', or where two different inferences lead to extremely different results, is a bit of a loaded question, at least when the intention behind the question relates to some philosophical argument. It is loaded because it presupposes that principles that matter should lead to extremely divergent results. In many practical cases, however, the differences are small (p-values differing by less than an order of magnitude). I believe it is not strange for two different, but both plausible, methods to give more or less similar results, and I would not consider the likelihood principle 'less violated' when the differences are only small.

Sextus Empiricus
  • 43,080
  • 1
  • 72
  • 161
  • Regarding Case 1: I think choosing a different test statistic can (should?) be seen as changing the likelihood function. – amoeba Dec 04 '18 at 20:35
  • 2
    @MartijnWeterings yes it is choosing different test statistics; what matters is the likelihood of the statistic, not of the data. Otherwise I can take a sequence of 100 flips and compute several statistics: the number of runs of heads, the number of alternations of heads and tails. None of this violates the LP. –  Dec 05 '18 at 19:15
  • You need to pick two statistics that will have proportional likelihoods, such as the number of trials until 3 success or the number of successes in n trials etc. –  Dec 05 '18 at 19:25
1

Here is an example adapted from Statistical decision theory and Bayesian analysis by James O. Berger (Second edition page 29).

Say that two species of wasps can be distinguished by the number of notches on the wings (call this $x$) and by the number of black rings around the abdomen (call this $y$). The distributions of the characters in the two species (labelled $H_0$ and $H_1$) are as follows:

Table adapted from Statistical decision theory and Bayesian analysis by James O. Berger.

Say that we find a specimen with 1 notch on the wings and 1 ring around the abdomen. The weight of evidence is 100 times bigger in favor of $H_1$ against $H_0$ for both characters.

Now if someone wanted to set up a test for $H_0$ at the 5% level, the decision rule would be, for the first character, "accept $H_0$ if there is 1 notch on the wing, otherwise reject it", and for the second character, "accept $H_0$ if there are 3 rings around the abdomen, otherwise reject it". There are many other possibilities, but these are the most powerful tests at this level. Yet, they lead to different conclusions for the two characters.


Note: one could of course set up a test with the rule "accept $H_0$ if there are 1 or 3 rings around the abdomen, otherwise reject it". The question is whether we prefer a test at the 5% level with type II risk 0, or a test at the 4.9% level with type II risk 0.00001. The difference is so small that we would probably not care, but as I understand it, this is the core of the argument for the likelihood principle: it is not a good idea to make the result depend on something that seems irrelevant.


The likelihood functions are proportional, and yet the p-value of $x = 1$ is 0.95, and that of $y = 1$ is 0.001 (assuming that we reject $H_0$ with events of the form $y \leq \alpha$). It is obvious from the structure of the table that I could have chosen any number smaller than 0.001. Also, the type II risk of the rejection is 0, so it looks like there is nothing “wrong” here.

Still, I admit that this example is somewhat contrived and not completely honest because it plays with the difficulty of arranging tests with discrete data. One could find equivalent examples with continuous data but they would be even more contrived. I agree with the OP that the likelihood principle has almost no practical value; I interpret it as a principle to guarantee some consistency within the theory.

gui11aume
  • 13,383
  • 2
  • 44
  • 89