93

A recent question on the difference between confidence and credible intervals led me to start re-reading Edwin Jaynes' article on that topic:

Jaynes, E. T., 1976. 'Confidence Intervals vs Bayesian Intervals,' in Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, W. L. Harper and C. A. Hooker (eds.), D. Reidel, Dordrecht, p. 175.

In the abstract, Jaynes writes:

...we exhibit the Bayesian and orthodox solutions to six common statistical problems involving confidence intervals (including significance tests based on the same reasoning). In every case, we find the situation is exactly the opposite, i.e. the Bayesian method is easier to apply and yields the same or better results. Indeed, the orthodox results are satisfactory only when they agree closely (or exactly) with the Bayesian results. No contrary example has yet been produced.

(emphasis mine)

The paper was published in 1976, so perhaps things have moved on. My question is, are there examples where the frequentist confidence interval is clearly superior to the Bayesian credible interval (as per the challenge implicitly made by Jaynes)?

Examples based on incorrect prior assumptions are not acceptable as they say nothing about the internal consistency of the different approaches.

Dikran Marsupial
  • 23
    Under rather mild assumptions, (a) Bayesian estimation procedures are admissible and (b) all, or almost all, admissible estimators are Bayesian with respect to some prior. Thus it's no surprise that a Bayesian confidence interval "yields the same or better results." Note that my statements (a) and (b) are part of the *frequentist* analysis of rational decision theory. Where frequentists part company with Bayesians is not over the mathematics or even the statistical procedures, but concerns the meaning, justification, and correct use of a prior for any particular problem. – whuber Sep 03 '10 at 18:48
  • 1
    So, does the above comment imply that the answer to the OP's question is 'No such examples can be constructed.'? Or perhaps, some pathological example exists which violates the assumptions behind admissibility? –  Sep 03 '10 at 19:44
  • 1
    @Srikant: Good question. I think the place to begin investigating is a situation where there are non-Bayes admissible estimators--not necessarily a "pathological" one, but at least one that provides some opportunity to find a "contrary example." – whuber Sep 03 '10 at 22:29
  • 3
    I would add some clarity to the "incorrect prior assumptions..." by stating that the Bayesian answer and the frequentist answer must make use of *the same information*, otherwise you are just comparing answers to two different questions. Great question though (+1 from me) – probabilityislogic Jan 21 '11 at 12:26
  • I think I have seen such an example in the book named "All of Statistics" by Larry Wasserman, in which he gives an example where using a Bayesian CI is _not_ the sensible thing to do. However, it is a pathological example. – suncoolsu Jan 27 '11 at 20:00
  • 3
    pathology or not, it would probably be the first of its kind. I am very keen to see this example, for these "pathologies" usually have a good learning element to them – probabilityislogic Jan 30 '11 at 13:18
  • @suncoolsu - is the example in Wasserman where the supposed "coverage" drops to zero? – probabilityislogic Jan 30 '11 at 14:21
  • @probabilityislogic. I think so, but I need to check. Will get back to you on this soon! – suncoolsu Jan 30 '11 at 16:24
  • @suncoolsu - this example of Wasserman is not an example of "defective Bayes". Because $\theta\sim N(0,1)$, the coverage for $\theta<2$ is good - note that $Pr(\theta<2)\approx 0.977$ - so the supposed "poor coverage" is obtained only in a small fraction of possibilities, if the prior is true. If you were to average this coverage taken with respect to the posterior of $\theta$, it would be about 95%, because most of the posterior probability would be in the $\theta<2$ range. (more later) – probabilityislogic Jan 30 '11 at 23:15
  • @suncoolsu - I would say that it is a good example of the non-robust properties of conjugate priors though. Because if the true value of $\theta$ is say $4$, but your prior says $\theta\sim N(0,1)$, then almost surely, the prior and the data will come into conflict. If the prior is conjugate, then what you are basically saying is that the prior information is *just as cogent as the data*. If the prior was instead $\theta\sim St(0,1,10)$ (T with 10 df), then because the likelihood is normal, you are saying that the data is more cogent than the prior...(more still) – probabilityislogic Jan 30 '11 at 23:20
  • ...cont'd... and that in a case of conflict, you want the data to "win". If the situation was reversed (Student likelihood and normal prior), then if the data conflicts with the prior, the prior would "win". See [This post](http://stats.stackexchange.com/questions/6493/mildly-informative-prior-distributions-for-scale-parameters/6506#6506) for some links to how this works. I would suspect that the coverage would be better for large $\theta$ (but possibly worse for small $\theta$) if a student-t distribution was used as prior instead of normal. – probabilityislogic Jan 30 '11 at 23:25
  • I think I will put this in more detail, as an answer, because it is a prime example of what Jaynes talks about in his paper. Wasserman shows a problem, shows that the Bayesian way gives an apparently counter-intuitive result, warns about the "danger of Bayesian methods", without any investigation as to *why* the Bayesian solution gives the result it does. Secondly the *Frequentist Confidence Interval is not given in the equivalent problem!* I will show in my answer, that the example can be formulated in equivalent frequentist terms *which give exactly the same answer as the Bayesian one!* – probabilityislogic Jan 31 '11 at 00:24

9 Answers

60

I said earlier that I would have a go at answering the question, so here goes...

Jaynes was being a little naughty in his paper in that a frequentist confidence interval isn't defined as an interval where we might expect the true value of the statistic to lie with high (specified) probability, so it isn't unduly surprising that contradictions arise if they are interpreted as if they were. The problem is that this is often the way confidence intervals are used in practice, as an interval highly likely to contain the true value (given what we can infer from our sample of data) is what we often want.

The key issue for me is that when a question is posed, it is best to have a direct answer to that question. Whether Bayesian credible intervals are worse than frequentist confidence intervals depends on what question was actually asked. If the question asked was:

(a) "Give me an interval where the true value of the statistic lies with probability p", then it appears a frequentist cannot actually answer that question directly (and this introduces the kind of problems that Jaynes discusses in his paper), but a Bayesian can, which is why a Bayesian credible interval is superior to the frequentist confidence interval in the examples given by Jaynes. But this is only becuase it is the "wrong question" for the frequentist.

(b) "Give me an interval where, were the experiment repeated a large number of times, the true value of the statistic would lie within p*100% of such intervals" then the frequentist answer is just what you want. The Bayesian may also be able to give a direct answer to this question (although it may not simply be the obvious credible interval). Whuber's comment on the question suggests this is the case.

So essentially, it is a matter of correctly specifying the question and properly interpreting the answer. If you want to ask question (a), use a Bayesian credible interval; if you want to ask question (b), use a frequentist confidence interval.

Dikran Marsupial
  • 2
    Well said, especially about what question a CI actually answers. In the Jaynes' article however, he does mention that CI's (and most frequentist procedures) are designed to work well "In the long-run" (e.g. how often do you see $n \rightarrow \infty$ or "for large n the distribution is approximately..." assumptions in frequentist methods?), but there are many such procedures that can do this. I think this is where frequentist techniques (consistency,bias,convergence,etc.etc.) can be used to assess various Bayesian procedures which are difficult to decide between. – probabilityislogic Jan 21 '11 at 12:13
  • 2
    "Jaynes was being a little naughty in his paper..." I think the point that Jaynes was trying to make (or the point that I took from it) is that Confidence Intervals are used to answer question a) in a large number of cases (I would speculate that anyone who *only has frequentist training* will use CI's to answer question a) and they will think they are an appropriate frequentist answer) – probabilityislogic Jan 21 '11 at 12:19
  • 3
    yes, by "a little naughty" I just meant that Jaynes was making the point in a rather mischeiviously confrontational (but also entertaining) manner (or at least that is how I read it). But if he hadn't then it probably wouldn't have had any impact. – Dikran Marsupial Jan 21 '11 at 12:28
28

This is a "fleshed out" example given in a book written by Larry Wasserman All of statistics on Page 216 (12.8 Strengths and Weaknesses of Bayesian Inference). I basically provide what Wasserman doesn't in his book 1) an explanation for what is actually happening, rather than a throw away line; 2) the frequentist answer to the question, which Wasserman conveniently does not give; and 3) a demonstration that the equivalent confidence calculated using the same information suffers from the same problem.

In this example, he states the following situation:

  1. An observation $X$ with sampling distribution $(X|\theta)\sim N(\theta,1)$
  2. A prior distribution $\theta\sim N(0,1)$ (he actually uses a general $\tau^2$ for the variance, but his diagram specialises to $\tau^2=1$)

He then shows that the Bayesian 95% credible interval in this set-up eventually has 0% frequentist coverage as the true value of $\theta$ becomes arbitrarily large. For instance, he provides a graph of the coverage (p218), and checking by eye, when the true value of $\theta$ is 3 the coverage is about 35%. He then goes on to say:

...What should we conclude from all this? The important thing is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods... (p217)

He then moves on without any dissection or explanation of why the Bayesian method performed apparently so badly. Further, he does not give an answer from the frequentist approach, just a broad-brush statement about "the long run" - a classic political tactic (emphasise your strength and the other's weakness, but never compare like for like).

I will show how the problem as stated (with $\tau=1$) can be formulated in frequentist/orthodox terms, and then show that the result using confidence intervals gives precisely the same answer as the Bayesian one. Thus any defect in the Bayesian interval (real or perceived) is not corrected by using confidence intervals.

Okay, so here goes. The first question I ask is what state of knowledge is described by the prior $\theta\sim N(0,1)$? If one was "ignorant" about $\theta$, then the appropriate way to express this is $p(\theta)\propto 1$. Now suppose that we were ignorant, and we observed $Y\sim N(\theta,1)$, independently of $X$. What would our posterior for $\theta$ be?

$$p(\theta|Y)\propto p(\theta)p(Y|\theta)\propto \exp\Big(-\frac{1}{2}(Y-\theta)^2\Big)$$

Thus $(\theta|Y)\sim N(Y,1)$. This means that the prior distribution given in Wasserman's example is equivalent to having observed an iid copy of $X$ equal to $0$. Frequentist methods cannot make use of a prior, but the problem can be thought of as having made two observations from the sampling distribution, one equal to $0$ and one equal to $X$. The two problems are entirely equivalent, and we can give the frequentist answer to the question.

Because we are dealing with a normal distribution with known variance, the mean is a sufficient statistic for constructing a confidence interval for $\theta$. The mean is equal to $\overline{x}=\frac{0+X}{2}=\frac{X}{2}$ and has a sampling distribution

$$(\overline{x}|\theta)\sim N(\theta,\frac{1}{2})$$

Thus a $100(1-\alpha)\%$ CI is given by:

$$\frac{1}{2}X\pm Z_{\alpha/2}\frac{1}{\sqrt{2}}$$

But, using the results of Wasserman's example 12.8, the posterior $100(1-\alpha)\%$ credible interval for $\theta$ is:

$$cX\pm \sqrt{c}Z_{\alpha/2},$$

where $c=\frac{\tau^{2}}{1+\tau^{2}}$. Plugging in $\tau^{2}=1$ gives $c=\frac{1}{2}$, and the credible interval becomes:

$$\frac{1}{2}X\pm Z_{\alpha/2}\frac{1}{\sqrt{2}}$$

This is exactly the same as the confidence interval! So any defect in the coverage exhibited by the Bayesian method is not corrected by using the frequentist confidence interval. [If the frequentist chooses to ignore the prior, then to be a fair comparison the Bayesian should also ignore this prior and use the ignorance prior $p(\theta)\propto 1$; the two intervals will still be equal - both $X \pm Z_{\alpha/2}$.]
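
To check this equivalence numerically, and to see how the (identical) intervals behave as the true $\theta$ grows, here is a minimal simulation sketch (Python; my own illustration of the calculation above, not code from Wasserman's book):

```python
import numpy as np
from scipy.stats import norm

# Sketch: for tau^2 = 1 the credible interval c*X +/- sqrt(c)*z (c = 1/2) and the
# confidence interval from the augmented sample {0, X} are the same interval, and
# their frequentist coverage drops together as the true theta grows.
z = norm.ppf(0.975)   # two-sided 95%
c = 0.5               # tau^2 / (1 + tau^2) with tau^2 = 1

rng = np.random.default_rng(0)
for theta_true in [0.0, 1.0, 2.0, 3.0, 4.0]:
    X = rng.normal(theta_true, 1.0, size=200_000)          # repeated experiments

    cred_lo, cred_hi = c * X - np.sqrt(c) * z, c * X + np.sqrt(c) * z   # credible
    ci_lo, ci_hi = X / 2 - z / np.sqrt(2), X / 2 + z / np.sqrt(2)       # confidence

    assert np.allclose(cred_lo, ci_lo) and np.allclose(cred_hi, ci_hi)
    coverage = np.mean((ci_lo <= theta_true) & (theta_true <= ci_hi))
    print(f"true theta = {theta_true}: coverage of the (identical) interval = {coverage:.2f}")
```

Both intervals cover well for $\theta$ near $0$ and increasingly poorly as $\theta$ grows, which is the coverage behaviour Wasserman plots.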

So what the hell is going on here? The problem is basically one of non-robustness of the normal sampling distribution, because the problem is equivalent to having already observed an iid copy, $X=0$. If you have observed $0$, then this is extremely unlikely to have occurred if the true value is $\theta=4$ (the probability that $X\leq 0$ when $\theta=4$ is 0.000032). This explains why the coverage is so bad for large "true values": they effectively make the implicit observation contained in the prior an outlier. In fact you can show that this example is basically equivalent to showing that the arithmetic mean has an unbounded influence function.

Generalisation. Now some people may say "but you only considered $\tau=1$, which may be a special case". This is not true: any value of $\tau^2=\frac{1}{N}$ (with $N=1,2,3,\dots$; the limit $N \to 0$, i.e. $\tau^2\to\infty$, recovers the ignorance prior) can be interpreted as observing $N$ iid copies of $X$ which were all equal to $0$, in addition to the $X$ of the question. The confidence interval will have the same "bad" coverage properties for large $\theta$. But this becomes increasingly unlikely if you keep observing values of $0$ (and no rational person would continue to worry about large $\theta$ when they keep seeing $0$).
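
And a sketch of the general claim (again my own check, not from the book): with $\tau^2=1/N$ the credible interval coincides with the confidence interval computed from the sample augmented by $N$ pseudo-observations equal to $0$.

```python
import numpy as np
from scipy.stats import norm

# Sketch: a N(0, 1/N) prior acts like N extra observations equal to 0.
# Compare the credible interval c*X +/- sqrt(c)*z, c = tau^2/(1 + tau^2),
# with the CI from the augmented sample of N zeros plus the observed X.
z = norm.ppf(0.975)
X = 2.7                                   # arbitrary observed value, for illustration

for N in [1, 2, 5, 10]:
    tau2 = 1.0 / N
    c = tau2 / (1.0 + tau2)
    credible = (c * X - np.sqrt(c) * z, c * X + np.sqrt(c) * z)

    sample = np.append(np.zeros(N), X)    # N pseudo-zeros plus the real datum
    xbar = sample.mean()
    se = 1.0 / np.sqrt(N + 1)             # known unit variance
    confidence = (xbar - z * se, xbar + z * se)

    print(N, np.allclose(credible, confidence))   # prints True for every N
```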

probabilityislogic
  • 1
    Thanks for the analysis. AFAICS this is just an example of a problem caused by an incorrect (informative) prior assumption and says nothing about the internal consistency of the Bayesian approach? – Dikran Marsupial Jan 31 '11 at 08:25
  • 1
    Nope, the prior is not necessarily incorrect, unless one didn't actually observe a value of $0$ prior to conducting the experiment (or obtain some equivalent knowledge). It basically means that, as the true $\theta$ becomes arbitrarily large, the probability of observing these implicit observations becomes arbitrarily small (like getting an "unlucky sample"). – probabilityislogic Jan 31 '11 at 21:22
  • you can see by noting that the sample consists of an observation at $0$ and another one at $X$. $0$ is fixed (because it has been observed), but $X$ will be "close" to $\theta$ in most cases. So as $\theta$ becomes large, the sample average gets further and further away from both $X$ and $0$, and because the variance is fixed, the width of the CI is fixed, so it will eventually not contain either $X$ or $0$, and hence not be near either of the two likely values of $\theta$ (for one of them is an outlier when they become far apart, for fixed $\theta$) – probabilityislogic Jan 31 '11 at 21:27
  • You made a mistake in the description of the confidence interval which should be: $$X\pm Z_{\alpha/2}$$ and this does *not* coincide with the credible interval $$cX\pm c Z_{\alpha/2}$$ This is true for any value of $\tau > 0$ for which $c = \frac{\tau^2}{\tau^2+1} <1$ – Sextus Empiricus Jan 09 '20 at 12:43
  • @sextus empiricus - this is only true if you ignore the data implicit in the prior (ie set $\tau^2\to\infty$). To make the problems equivalent in terms of information being used, the CI procedure needs to add the pseudo data points prior to calculating the statistics. When you do this, the intervals coincide. – probabilityislogic Jan 09 '20 at 21:07
  • You seem to be stating that the information/data that is implicitly creating the prior will give an equivalent result in a frequentist approach. But what if this data $Y$ and $X$ were sampled for i.i.d $\theta_Y, \theta_X$ instead of $\theta_Y=\theta_X$? If you have discovered, from earlier observations/estimates of $\theta_1, \theta_2,...,\theta_k$, that $\theta\sim N (0, \tau^2)$ then it is not correct/confidence to augment new observed data/sample (to estimate a new $\theta_{k+1}$) with 'artificial' data (it would mean that the success rate for the CI is not independent from $\theta_{k+1}$) – Sextus Empiricus Jan 09 '20 at 22:34
  • @sextus empiricus - you are talking about a different problem now. This problem with multiple $\theta_k$ is not the example I consider here. There is only one single value $\theta$ (ie same as freq problem). The pdf describes the uncertainty for its value. – probabilityislogic Jan 10 '20 at 01:07
  • @probabiltyislogic why do you consider only that flavour of Wasserman's problem where credible interval and confidence intervals coincide? Is it the practical situation that the prior can always be replaced by data+uninformative prior? I believe that this is often not the case. (a practical case of the problem that I was talking about is for instance when $\theta$ is a person's IQ and $X$ is an IQ-test result; often those tests consider confidence intervals instead of credible intervals and MLE instead of maximum posterior probability when expressing predictions of IQ) – Sextus Empiricus Jan 10 '20 at 08:13
  • 1
    @sextus empiricus - I only consider this case because that's what was in the paper I was discussing - I did not want to create a 'straw man' by talking about a different problem. If you can come up with a example that you think shows bayes is worse you should post it. – probabilityislogic Jan 10 '20 at 11:21
  • @probabilityislogic Both Wasserman in Figure 12.1 of 'All of Statistics' and Jaynes in 'confidence intervals vs bayesian intervals' describe cases where they *don't* coincide. Sure if you use a noninformative prior in the Bayesian method (as Jaynes showed) or if you augment the sample data with biased data in the frequentist method (as you showed), then the two methods coincide. But both Jaynes and Wasserman describe cases where you (for whatever reason) do *not* do this....... – Sextus Empiricus Jan 10 '20 at 14:53
  • ....The disadvantage/advantage of the Bayesian/frequentist treatment is that the bias improves/reduces the accuracy/precision depending on the bias correctness/incorrectness. What Jaynes claims is that the Bayesian method is better (when smartly using prior information/knowledge) or at least the same (when using an uninformed prior), and as a bonus it is also easier to compute and more intuitive. But the problem is that one might abuse the method and use priors wrongly, making the method subjectively inaccurate (on the opposite side the frequentist method is subjectively overly conservative).... – Sextus Empiricus Jan 10 '20 at 14:58
  • ....I believe that this contrast/difference between advantages and disadvantages of using prior information is the point that Wasserman wishes to describe. (that you can make the frequentist method similar by adding bias to the sampled data is, I believe, besides the point). – Sextus Empiricus Jan 10 '20 at 15:02
  • @probabilityislogic Very pedagogic analysis and discussion. Have you published this in some journal or preprint server? If so, could you share the reference? If not, I warmly wish you will. – pglpm Sep 06 '21 at 12:31
11

The problem starts with your sentence:

Examples based on incorrect prior assumptions are not acceptable as they say nothing about the internal consistency of the different approaches.

Yeah well, how do you know your prior is correct?

Take the case of Bayesian inference in phylogeny. The probability of at least one change is related to evolutionary time (branch length t) by the formula

$$P=1-e^{-\frac{4}{3}ut}$$

with u being the rate of substitution.

Now you want to make a model of the evolution, based on comparison of DNA sequences. In essence, you try to estimate a tree that models the amount of change between the DNA sequences as closely as possible. The P above is the chance of at least one change on a given branch. Evolutionary models describe the chances of change between any two nucleotides, and from these evolutionary models the estimation function is derived, either with p as a parameter or with t as a parameter.

You have no sensible knowledge and you choose a flat prior for p. This inherently implies an exponentially decreasing prior for t. (It becomes even more problematic if you want to set a flat prior on t. The implied prior on p is strongly dependent on where you cut off the range of t.)
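
To make the implied prior concrete, here is a small simulation sketch (my own illustration in Python; the substitution rate $u=1$ is an arbitrary value chosen purely for the example): draw $p$ from a flat prior, invert $P=1-e^{-\frac{4}{3}ut}$, and look at the resulting distribution of $t$.

```python
import numpy as np

# Sketch: a flat prior on p implies an exponentially decreasing prior on t
# under p = 1 - exp(-(4/3) u t).  u = 1.0 is an arbitrary illustrative rate.
u = 1.0
rng = np.random.default_rng(1)
p = rng.uniform(0.0, 1.0, size=1_000_000)
t = -np.log(1.0 - p) * 3.0 / (4.0 * u)    # invert the relation p -> t

# Change of variables gives the implied density (4/3) u exp(-(4/3) u t),
# an exponential distribution with mean 3/(4u); the empirical mean agrees.
print(t.mean(), 3.0 / (4.0 * u))
```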

In theory, t can be infinite, but when you allow an infinite range, the area under its density function equals infinity as well, so you have to define a truncation point for the prior. Now when you choose the truncation point sufficiently large, it is not difficult to prove that both ends of the credible interval rise, and at a certain point the true value is not contained in the credible interval any more. Unless you have a very good idea about the prior, Bayesian methods are not guaranteed to be equal to or superior to other methods.

Reference: Joseph Felsenstein, Inferring Phylogenies, chapter 18.

On a side note, I'm getting sick of that Bayesian/Frequentist quarrel. They're both different frameworks, and neither is the Absolute Truth. The classical examples pro Bayesian methods invariably come from probability calculation, and not one frequentist will contradict them. The classical arguments against Bayesian methods invariably involve the arbitrary choice of a prior. And sensible priors are definitely possible.

It all boils down to the correct use of either method at the right time. I've seen very few arguments/comparisons where both methods were applied correctly. Assumptions of any method are very much underrated and far too often ignored.

EDIT: to clarify, the problem lies in the fact that the estimate based on p differs from the estimate based on t in the Bayesian framework when working with uninformative priors (which is in a number of cases the only possible solution). This is not true in the ML framework for phylogenetic inference. It is not a matter of a wrong prior, it is inherent to the method.

Joris Meys
  • 3
    It is possible to be interested in the differences between Bayesian and frequentist statistics without it being a quarrel. It is important to know the flaws as well as benefits of one's preferred approach. I specifically excluded priors as that is not a problem with the framework, per se, but just a matter of GIGO. The same thing applies to frequentist statistics, for example by assuming an incorrect parametric distribution for the data. That wouldn't be a criticism of frequentist methodology, just the particular method. BTW, I have no particular problem with improper priors. – Dikran Marsupial Sep 03 '10 at 20:52
  • 3
    Jaynes first example: Not one statistician in his right mind will ever use an F-test and a T-test on that dataset. Apart from that, he compares a two-tailed test to P(b>a), which is not the same hypothesis tested. So his example is not fair, which he essentially admits later on. Next to that, you can't compare "the frameworks". What are we talking about then? ML, REML, LS, penalized methods,...? intervals for coefficients, statistics, predictions,...? You can as well ask whether Lutheran service is equivalent or superior to Shiite services. They talk about the same God. – Joris Meys Sep 03 '10 at 22:22
  • Could you clarify what is your data and what are the parameters you would be estimating in your model? I am a bit confused on this point. Also, could you please use $$ instead of $ to center the formula? The font size is very small right now. –  Sep 03 '10 at 22:23
  • @Srikant: The example in Felsensteins book is based on a Jukes-Cantor model for DNA evolution. Data is DNA sequences. You want to estimate the probability of change in your sequence, which is related to your branch length based on the mentioned formula. Branch lengths are defined as time of evolution : the higher the chance for changes, the more time that passed between the ancestor and the current state. Sorry, but I can't summarize the whole theory behind ML and Bayesian phylogenetic inference in just one post. Felsenstein needed half a book for that. – Joris Meys Sep 03 '10 at 22:29
  • I guess I just wanted you to clarify what variables in your equation was data and which ones were the parameter as it was not clear from your post especially to someone like me who is an outsider. I am still lost but I guess I would need to read the book to find out more. –  Sep 03 '10 at 22:44
  • @Srikant : I tried to clarify a bit more. Actually, P is the parameter that is used in the likelihood function for optimization, and the formula merely gives its relation to t, which can alternatively be used in the likelihood function. Sorry I can't be more clear. If Phylogeny interests you, I can surely recommend Felsensteins book, it's a gem. http://www.sinauer.com/detail.php?id=1775 – Joris Meys Sep 03 '10 at 23:04
  • It isn't clear to me why it is a problem that a flat prior on p implies an exponentially decreasing prior on t. If that is inconsistent with biological knowledge, it simply means that a flat prior on p does not reflect actual prior knowledge. I also don't see why it is a problem to use an improper flat prior on t (other than I would have thought it inconsistent with prior knowledge; the branch time can't be say a billion years, if it were we wouldn't be here yet, so it is inappropriate to use a flat prior). Note that flat priors don't necessarily imply ignorance. – Dikran Marsupial Sep 03 '10 at 23:45
  • @Dikran : it's not a problem. It is a fact. The problem is that p and t are strictly related, and hence should give exactly the same model. That happens in an ML approach, but that doesn't happen in the Bayesian approach. In Felsensteins example, a truncation of the t-prior at 700 or larger makes that the credible interval doesn't cover the true value any more. In this particular case, i.e. the lack of prior knowledge, Bayesian inference just isn't feasible. There is no sensible "uninformative" prior that can be used. – Joris Meys Sep 04 '10 at 00:08
  • @Dikran : Regarding the flat t-prior: the prior gets truncated. When truncated at 5(!), most of the mass of the prior on p is concentrated around the maximum p-value. With larger truncation values, this effect is even more pronounced. The point is-again- that it's impossible to find a sensible prior when you have no prior knowledge in phylogenetic inference. – Joris Meys Sep 04 '10 at 00:11
  • Joris, I think you are missing the point, a flat prior is not necessarily non-informative. It is completely reasonable for the same state of knowledge/ignorance to be expressed by a flat prior on p and (say) a flat prior on log(t) (which is a very common Jeffreys prior) rather than a flat prior on t. Does the book investigate ideas of MAXENT and transformation groups for this problem? There isn't enough detail in your example, but from what I can tell even a truncated flat prior on t is likely to be inconsistent with prior knowledge about t. – Dikran Marsupial Sep 04 '10 at 08:39
  • @Joris, also in your original comment you suggest the flat prior on t must be truncated, because otherwise the area under the density function is infinite. This is not true, there are plenty of problems where improper priors work very well, so there is not necessarily a need to truncate the flat prior. – Dikran Marsupial Sep 04 '10 at 08:49
  • @Dikran : Guess you are missing the point : using the same uninformative prior gives two different models with Bayesian statistics on the same dataset. Not so with ML. The Bayesian can be very biased due to the nature of the model and the incompatibility of that model with infinite priors. You don't have to believe me. Felsenstein is the authority on phylogenetic inference, and his book explains you better than I will be able to. Reference in a previous comment. – Joris Meys Sep 05 '10 at 20:02
  • @Joris, as I said a flat prior is NOT NECESSARILY UNINFORMATIVE. Consider this: if two priors give different results, then they must logically represent a different state of prior knowledge (see early chapters of Jaynes' book that set out desiderata for Bayesian inference). Therefore the "flat p" prior and "flat t" prior cannot both be uninformative. Felsenstein may be an expert on phylogenetic inference, but it is possible that he is not an expert on Bayesian inference. If he states that two priors giving different results are both uninformative, he is at odds with Jaynes (who certainly was). – Dikran Marsupial Sep 05 '10 at 23:09
  • @Dikran : The point is not whether a flat prior is uninformative. The point is that a satisfying uninformative prior cannot be defined due to the nature of the model. Hence rendering the whole method unusable if you don't have prior information, and thus leading to the conclusion that Bayesian inference in this case is inferior to the ML approach. Felsenstein never said a flat prior was uninformative. He just illustrated why an uninformative prior cannot be determined, using the example of a flat prior. – Joris Meys Sep 06 '10 at 16:03
  • @Joris - it may be that an uninformative prior cannot be constructed in this case, but nothing that you have written so far establishes that to be the case. What does Felsenstein write about MAXENT and transformation groups (the two main techniques used to determine an uninformative prior for a particular problem)? If he has not investigated those methods, how can he know an uninformative prior is impossible? It looks to me that a flat prior on p corresponds to a flat prior on log(t), which is a well known Jeffreys prior. Can you demonstrate that the flat log(t) prior is informative? – Dikran Marsupial Sep 06 '10 at 16:22
  • I was recently given a copy of Felsenstein's book. In chapter 18 he does not say why you can't use an improper flat prior on 0-infinity. Neither does he mention MaxEnt or transformation groups in his criticism of uninformative priors. While the rest of the book may be very good, this suggests inadequate scholarship on that particular issue. Caveat lector - just because something appears in a text book or journal paper, doesn't mean that it is correct. – Dikran Marsupial Nov 19 '10 at 13:06
  • @Dikran: entropy maximization without testable information gets only one constraint: probabilities sum up to one. Most often the uniform distribution is taken there. I don't take it as granted, but I do agree with Felsenstein's calculations and reasoning. So we disagree, like more people in that field. Felsenstein is far from accepted by everybody, and I'm not accepting everything he says. But on this point, I follow him. Sometimes a Bayesian approach is not superior to another one. And the case he describes is one such case according to me. YMMV. – Joris Meys Nov 19 '10 at 13:32
  • I am not suggesting a Bayesian approach is any better than a frequentist one - horses for courses. In this case it is probably transformation groups that hold the key. It is quite possible that a prior on branch length that is invariant to the units used is equivalent to a flat prior on the probability of a change - in which case Felsensteins criticism is badly misguided. Uninformative priors are not necessarily flat and it is inappropriate to criticize uninformative priors without mentioning the standard procedures for finding them! Not that this means Bayesian is better, of course. – Dikran Marsupial Nov 19 '10 at 17:32
  • This is a very poor example of the "inferiority" of Bayesian methods, of exactly the same type Jaynes speaks of in his 1976 paper. You need to write down the *numerical/mathematical equation* that the ML (or other frequentist method) solves, *and the corresponding Bayesian method and its numerical answer!* You have written down the model, but no solution to the estimation of anything to do with it! The rest of your answer would be greatly improved if you wrote down what the frequentist answer using ML actually is. – probabilityislogic Jan 19 '11 at 06:03
  • @probabilityislogic : I gave the references. This is a discussion site, not a scientific journal. Please read the comments and the reference I gave for more information. and before you call it a poor example. – Joris Meys Jan 19 '11 at 09:10
  • @joris meys - I understand that you did give a reference, but your discussion does not talk about *how* the confidence interval solution is superior to the Bayesian credible interval. This means that basically the confidence interval needs to be *uncalculable* using Bayesian methods. By showing the Bayesian solution which gives the same interval, you can show what prior information was implicitly contained in the procedure to generate the confidence interval. – probabilityislogic Jan 20 '11 at 07:17
  • @probabilityislogic : the whole discussion revolves around Felsenstein's claim that it is impossible to put a prior without making impossible assumptions about either time or mutation rate. Remember we're talking about phylogenetic trees. This concept makes for quite a different framework, as it's not a classical equational model in a space of real numbers. I'd suggest you read the chapter of his book to see his argument on how under certain conditions the Bayesian approach can be proven to be wrong. I'd like to stress this is ONE example. It doesn't say anything about Bayesian methods in general. – Joris Meys Jan 20 '11 at 09:33
  • @probabilityislogic : To show the difference in nature of the problem : you talk about confidence intervals. Now try to define a confidence interval around a phylogenetic tree... – Joris Meys Jan 20 '11 at 09:34
  • @Joris Meys - I do appreciate the reference to the book (but it seems as though without a link, I am to buy his book in order to read your reference), which is where all the arguments are. The equation you presented for the model is simple enough ($0<P<1$, $t>0$, $u>0$, with a relation between each); in fact it could be expressed as $P=Pr(Y<\dots)$ … – probabilityislogic Jan 20 '11 at 23:17
  • Apologies (again), I wrote the fraction incorrectly (today is just not my day!). So it should be that you can write $P=Pr(Y<\dots)$ … – probabilityislogic Jan 21 '11 at 03:08
  • @probabilityislogic - I have Felsenstein's book, unfortunately his reasoning is faulty as he seems to think that all flat priors are uninformative and vice-versa, and thus takes the fact that two flat priors on different parameterisations of the same thing give different conclusions as an indication there is a problem. The premise is wrong, and the conclusion unsurprising to anyone familiar with the idea of transformation groups. Essentially an uninformative prior on branch length should be insensitive to the choice of units, which would give a prior that was flat on a logarithmic scale. – Dikran Marsupial Jan 21 '11 at 10:11
  • @Joris, can you give a specific page number? – Dikran Marsupial Jan 21 '11 at 10:16
  • comment removed - whatever... – Joris Meys Jan 21 '11 at 10:20
  • @Dikran : I'll look it up tonight. It's where he demonstrates the effect of the truncation on the t prior. Actually, it's almost a page big, you should have seen it when you read the chapter. It's pretty much the center of his story... – Joris Meys Jan 21 '11 at 10:21
  • @probabilityislogic : the whole point Felsenstein makes is that t and u are linked. Meaning that a flat prior on t gives a greatly biased prior on u and vice versa. You'll have to use a prior that favours certain values for either of them in order to have a prior that actually makes **biological** sense. So you need to know at least something about either the transformation rate or the mutation time to use eg mrBayes in phylogeny. – Joris Meys Jan 21 '11 at 10:29
  • @Joris, it is a while since I read the chapter in question, but IIRC Felsenstein's problem was that a flat prior on branch length is biologically implausible. I agree, but a flat prior on branch length is not necessarily an uninformative prior. Felsenstein seems to think (incorrectly) that only flat priors are uninformative, and hence isn't aware of other choices that may be uninformative and biologically plausible. I should point out though that if you have knowledge of what is and what isn't biologically plausible, then you are not entirely uninformed, and neither should be your prior! – Dikran Marsupial Jan 21 '11 at 10:29
  • @Joris "the whole point Felstenstein makes is that t and u are linked. Meaning that a flat prior on t gives a greatly biased prior on u and vice versa." It may be that this bias is what you get if you make a minimally informative prior that includes the prior knowledge that the units of measurement should have no effect on the conclusion (transformation groups). – Dikran Marsupial Jan 21 '11 at 10:32
  • @joris I can understand what you are trying to say, in setting a prior you are describing a *state of knowledge*, just as if you are setting a sampling distribution. Now in the uniform prior on $P$ you are describing a *state of knowledge* that it is possible for "no change" and "one or more changes" to occur on a given branch. Probability theory tells you how to *coherently* transform this into *the same state of knowledge* about $t$, given your knowledge about the relationship between $P$ and $t$. So a "flat" prior for $t$ necessarily is describing a *different state of knowledge*. – probabilityislogic Jan 21 '11 at 10:55
  • That the solutions are different is no more and no less surprising than if you used a different model between P and t. – probabilityislogic Jan 21 '11 at 10:56
  • I'm a bit curious, how does the ML solution work for $t$ if you just plug in $P$ into your likelihood. The derivative will be (by chain rule) $\frac{dL}{dt}=\frac{dL}{dP}\frac{dP}{dt}=0$ but from the function for $P$ this means $\frac{dP}{dt}=\frac{4u}{3}e^{-\frac{4}{3}ut}$, so setting $u \rightarrow 0$ and $t\rightarrow\infty$ such that $P$ is unchanged (and equal to $P_{MLE}$) would solve the ML equation? Or is there something about $u$ which is not stated in the information? – probabilityislogic Jan 21 '11 at 16:08
  • @Dikran : the graph about the truncation of T is shown on page 305 (fig 18.7) – Joris Meys Jan 21 '11 at 22:55
  • @probabilityislogic : we're talking about trees. The likelihood of the tree is the multiplication of all likelihoods at each site (node) of the tree, which is defined as the sum over all possible nucleotides that may have existed at the interior nodes of the tree, of the probabilities of each scenario of events. And that probability is defined by a model which involves T (or u), the Jukes-Cantor model being the most easy one. As said, phylogeny does not fit into any classical framework. – Joris Meys Jan 21 '11 at 23:00
  • @probabilityislogic : There have been numerous frameworks built up by now around bayesian posterior probabilities as alternative for bootstrap support values, but most of the studies conclude - rightfully - that both cannot be compared. And for the estimates of the prior both birth-death processes (data-based) as theoretical distributions for branch lengths have been used extensively. Bayesian applications like mrBayes can reduce calculation time significantly, but discussion remains whether they perform better or worse, each side of the argument bringing "proof" for the claim. – Joris Meys Jan 21 '11 at 23:04
  • @probabilityislogic : But again, most studies rightfully conclude that they can't be compared. And I still follow Felsenstein that, in case no further knowledge is available, the risk of bias is far larger with a Bayesian than with an ML estimate for a phylogenetic tree. If you dive into the literature on phylogeny (and check the papers that are not online as well, science didn't start in 1998), you'll see that this controversy has been debated heavily for the past 50 years. You and @Dikran might disagree, but the comments here are far from the right place to discuss this properly. Cheers – Joris Meys Jan 21 '11 at 23:10
  • @Joris, Figure 18.7 on page 305 just shows that using an informative (not uninformative) prior, the maximum likelihood estimate lies outside the Bayesian credible interval. There is nothing in the least surprising about that. As has already been pointed out, a flat prior on branch length is unlikely to be uninformative (transformation groups), especially when needlessly truncated (it is possible to use improper priors). – Dikran Marsupial Jan 22 '11 at 13:25
  • 1
    I think something which has perhaps been overlooked in the discussion above (including by me) is that the ML solution is exactly equal to the maximum of the joint posterior density using a uniform prior (so $p(\theta|X)\propto p(X|\theta)$, where $\theta$ is the vector of parameters). So you *cannot* claim that ML is good and Bayes is not, because ML is mathematically equivalent to a Bayesian solution (flat prior, and 0-1 loss function). You need to find a solution which *cannot* be produced using Bayesian methods. – probabilityislogic Jan 30 '11 at 12:55
11

Keith Winstein,

EDIT: Just to clarify, this answer describes the example given in Keith Winstein's answer on the king with the cruel statistical game. The Bayesian and frequentist answers both use the same information, which is to ignore the information on the number of fair and unfair coins when constructing the intervals. If this information is not ignored, the frequentist should use the integrated Beta-Binomial likelihood as the sampling distribution in constructing the confidence interval, in which case the Clopper-Pearson confidence interval is not appropriate, and needs to be modified. A similar adjustment should occur in the Bayesian solution.

EDIT: I have also clarified the initial use of the Clopper-Pearson interval.

EDIT: alas, my alpha is the wrong way around, and my Clopper-Pearson interval is incorrect. My humblest apologies to @whuber, who correctly pointed this out, but who I initially disagreed with and ignored.

The CI using the Clopper-Pearson method is very good

If you only get one observation, then the Clopper-Pearson interval can be evaluated analytically. Suppose the coin comes up as "success" (heads); then you need to choose $\theta$ such that

$$[Pr(Bi(1,\theta)\geq X)\geq\frac{\alpha}{2}] \cap [Pr(Bi(1,\theta)\leq X)\geq\frac{\alpha}{2}]$$

When $X=1$ these probabilities are $Pr(Bi(1,\theta)\geq 1)=\theta$ and $Pr(Bi(1,\theta)\leq 1)=1$, so the Clopper-Pearson CI implies that $\theta\geq\frac{\alpha}{2}$ (and the trivially always true $1\geq\frac{\alpha}{2}$) when $X=1$. When $X=0$ these probabilities are $Pr(Bi(1,\theta)\geq 0)=1$ and $Pr(Bi(1,\theta)\leq 0)=1-\theta$, so the Clopper-Pearson CI implies that $1-\theta \geq\frac{\alpha}{2}$, or $\theta\leq 1-\frac{\alpha}{2}$, when $X=0$. So for a 95% CI we get $[0.025,1]$ when $X=1$, and $[0,0.975]$ when $X=0$.
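
As a quick numerical check of these two intervals (a sketch of my own, using SciPy's beta-quantile form of the Clopper-Pearson limits; not part of the original answer):

```python
from scipy.stats import beta

# Sketch: Clopper-Pearson 95% interval for a single binomial observation (n = 1),
# via the standard beta-quantile form of the interval limits.
def clopper_pearson(x, n, alpha=0.05):
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

print(clopper_pearson(1, 1))   # (0.025, 1.0), the X = 1 case above
print(clopper_pearson(0, 1))   # (0.0, 0.975), the X = 0 case above
```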

Thus, one who uses the Clopper-Pearson confidence interval will never ever be beheaded. Upon observing the interval, it is basically the whole parameter space. But the C-P interval is doing this by giving 100% coverage to a supposedly 95% interval! Basically, the frequentist "cheats" by giving a 95% confidence interval more coverage than he/she was asked to give (although who wouldn't cheat in such a situation? if it were me, I'd give the whole [0,1] interval). If the king asked for an exact 95% CI, this frequentist method would fail regardless of what actually happened (perhaps a better one exists?).

What about the Bayesian interval? (specifically the Highest Posterior Density (HPD) Bayesian interval)

Because we know a priori that both heads and tails can come up, the uniform prior is a reasonable choice. This gives a posterior distribution of $(\theta|X)\sim Beta(1+X,2-X)$. Now all we need to do is create an interval with 95% posterior probability. Similar to the Clopper-Pearson CI, the cumulative Beta distribution is analytic here also, so that $Pr(\theta \geq \theta^{e} | x=1) = 1-(\theta^{e})^{2}$ and $Pr(\theta \leq \theta^{e} | x=0) = 1-(1-\theta^{e})^{2}$. Setting these to 0.95 gives $\theta^{e}=\sqrt{0.05}\approx 0.224$ when $X=1$ and $\theta^{e}= 1-\sqrt{0.05}\approx 0.776$ when $X=0$. So the two credible intervals are $(0,0.776)$ when $X=0$ and $(0.224,1)$ when $X=1$.
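
These closed forms are easy to verify numerically (again a small sketch of my own, using the quantiles of the Beta posterior):

```python
import numpy as np
from scipy.stats import beta

# Sketch: 95% one-sided credible limits from the Beta(1 + X, 2 - X) posterior
# (uniform prior, one observation), matching the closed forms above.
alpha = 0.05
theta_e1 = beta.ppf(alpha, 2, 1)        # X = 1: interval (theta_e, 1]
theta_e0 = beta.ppf(1 - alpha, 1, 2)    # X = 0: interval [0, theta_e)
print(theta_e1, np.sqrt(0.05))          # both ~0.2236
print(theta_e0, 1 - np.sqrt(0.05))      # both ~0.7764
```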

Thus the Bayesian will be beheaded for his HPD credible interval in the case when he gets the bad coin and the bad coin comes up heads, which will occur with a chance of $\frac{1}{10^{12}+1}\times\frac{1}{10}\approx 0$.

First observation, the Bayesian Interval is smaller than the confidence interval. Another thing is that the Bayesian would be closer to the actual coverage stated, 95%, than the frequentist. In fact, the Bayesian is just about as close to the 95% coverage as one can get in this problem. And contrary to Keith's statement, if the bad coin is chosen, 10 Bayesians out of 100 will on average lose their head (not all of them, because the bad coin must come up heads for the interval to not contain $0.1$).

Interestingly, if the CP-interval for 1 observation was used repeatedly (so we have N such intervals, each based on 1 observation), and the true proportion was anything between $0.025$ and $0.975$, then coverage of the 95% CI will always be 100%, and not 95%! This clearly depends on the true value of the parameter! So this is at least one case where repeated use of a confidence interval does not lead to the desired level of confidence.

For a genuine 95% confidence interval, by definition there should be some cases (i.e. at least one) in which the observed interval does not contain the true value of the parameter. Otherwise, how can one justify the 95% tag? Would it not be just as valid or invalid to call it a 90%, 50%, 20%, or even 0% interval?

I do not see how simply stating "it actually means 95% or more" without a complementary restriction is satisfactory. This is because the obvious mathematical solution is the whole parameter space, and the problem is trivial. Suppose I want a 50% CI? If it only bounds the false positives, then the whole parameter space is a valid CI using only this criterion.

Perhaps a better criterion is (and this is what I believe is implicit in the definition by Keith) "as close to 95% as possible, without going below 95%". The Bayesian interval would have a coverage closer to 95% than the frequentist (although not by much), and would not go under 95% in coverage ($\text{100%}$ coverage when $X=0$, and $100\times\frac{10^{12}+\frac{9}{10}}{10^{12}+1}\text{%} > \text{95%}$ coverage when $X=1$).

In closing, it does seem a bit odd to ask for an interval of uncertainty, and then evaluate that interval using the true value which we were uncertain about. A "fairer" comparison, for both confidence and credible intervals, to me seems to be the truth of the statement of uncertainty given with the interval.

probabilityislogic
  • In your first main paragraph you seem to have confused $\alpha$ and $1-\alpha$. Where does the value of 10^12+1 come in? What do you mean by "beheaded"?? This text looks like it is in need of proofreading and revision. – whuber Jan 19 '11 at 18:30
  • $10^{12}$ is for the trillion fair coins, and 1 is for the unfair coin. And I haven't confused $\alpha$ and $1-\alpha$ the Clopper Pearson interval listed [here][1] – probabilityislogic Jan 20 '11 at 02:43
  • [sorry typo] $10^{12}$ (TeX fixed) is for the trillion fair coins, and 1 is for the unfair coin; one over this is a rough approximation to the probability of having the "bad" coin. Beheaded is the consequence of giving the wrong confidence interval. And I haven't confused $\alpha$ and $1-\alpha$ in the Clopper-Pearson interval listed on the wiki page (search binomial proportion confidence interval). What happens is one part of the C-P interval is a tautology, $1 \geq \frac{\alpha}{2}$, when one has 1 observation. The side "flips" when X=1 to X=0, which is why there is $1-\theta$ and $\theta$. – probabilityislogic Jan 20 '11 at 02:52
  • Do you mean @Keith Winstein's answer? – whuber Jan 20 '11 at 23:00
  • @whuber, yes I do mean keith winstein's answer. – probabilityislogic Jan 21 '11 at 05:39
9

Frequentist confidence intervals bound the rate of false positives (Type I errors), and guarantee their coverage will be bounded below by the confidence parameter, even in the worst case. Bayesian credibility intervals don't.

So if the thing you care about is false positives and you need to bound them, confidence intervals are the approach that you'll want to use.

For example, let's say you have an evil king with a court of 100 courtiers and courtesans and he wants to play a cruel statistical game with them. The king has a bag of a trillion fair coins, plus one unfair coin whose heads probability is 10%. He's going to perform the following game. First, he'll draw a coin uniformly at random from the bag.

Then the coin will be passed around a room of 100 people and each one will be forced to do an experiment on it, privately, and then each person will state a 95% uncertainty interval on what they think the coin's heads probability is.

Anybody who gives an interval that represents a false positive -- i.e. an interval that doesn't cover the true value of the heads probability -- will be beheaded.

If we wanted to express the *a posteriori* probability distribution function of the coin's weight, then of course a credibility interval is what does that. The answer will always be the interval [0.5, 0.5] irrespective of outcome. Even if you flip zero heads or one head, you'll still say [0.5, 0.5] because it's a heck of a lot more probable that the king drew a fair coin and you had a 1/1024 day getting ten tails in a row, than that the king drew the unfair coin.
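
To put a number on "a heck of a lot more probable" (a small sketch of my own; it assumes, as in the example above, that each person's experiment is ten flips):

```python
# Sketch: posterior probability that the coin is the unfair one (P(heads) = 0.1)
# after the most suspicious outcome, zero heads in ten flips.
n_fair, n_unfair = 10**12, 1
like_fair   = 0.5**10          # P(0 heads in 10 flips | fair coin)
like_unfair = 0.9**10          # P(0 heads in 10 flips | unfair coin)
post_unfair = n_unfair * like_unfair / (n_unfair * like_unfair + n_fair * like_fair)
print(post_unfair)             # ~3.6e-10: the fair coin is still overwhelmingly more probable
```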

So this is not a good idea for the courtiers and courtesans to use! Because when the unfair coin is drawn, the whole room (all 100 people) will be wrong and they'll all get beheaded.

In this world where the most important thing is false positives, what we need is an absolute guarantee that the rate of false positives will be less than 5%, no matter which coin is drawn. Then we need to use a confidence interval, like Blyth-Still-Casella or Clopper-Pearson, that works and provides at least 95% coverage irrespective of the true value of the parameter, even in the worst case. If everybody uses this method instead, then no matter which coin is drawn, at the end of the day we can guarantee that the expected number of wrong people will be no more than five.
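
A minimal check of that guarantee (a sketch of my own, using Clopper-Pearson only and again assuming ten flips per courtier): the exact coverage stays at or above 95% for either coin the king might draw.

```python
from scipy.stats import beta, binom

# Clopper-Pearson 95% interval via the standard beta-quantile form.
def clopper_pearson(x, n, alpha=0.05):
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

n = 10                                    # assumed number of flips per person
intervals = [clopper_pearson(x, n) for x in range(n + 1)]
for theta in [0.5, 0.1]:                  # fair coin, unfair coin
    coverage = sum(binom.pmf(x, n, theta)
                   for x in range(n + 1)
                   if intervals[x][0] <= theta <= intervals[x][1])
    print(f"theta = {theta}: exact coverage = {coverage:.4f}")   # >= 0.95 in both cases
```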

So the point is: if your criterion requires bounding false positives (or equivalently, guaranteeing coverage), you gotta go with a confidence interval. That's what they do. Credibility intervals may be a more intuitive way of expressing uncertainty, they may perform pretty well from a frequentist analysis, but they are not going to provide the guaranteed bound on false positives you'll get when you go asking for it.

(Of course if you also care about false negatives, you'll need a method that makes guarantees about those too...)

Keith Winstein
  • 6
    Food for thought, however the particular example is unfair as the frequentist approach is allowed to consider the relative costs of false positives and false negatives, but the Bayesian approach isn't. The correct thing to do according to Bayesian decision theory is to give an interval of [0,1] as there is no penalty associated with false negatives. Thus in a like-for-like comparison of frameworks, none of the Bayesians would ever be beheaded either. The issue about bounding false positives though gives me a direction in which to look for an answer to Jaynes' challenge. – Dikran Marsupial Sep 04 '10 at 09:10
  • 1
    Note also that if the selected coin is flipped often enough, then eventually the Bayesian confidence interval will be centered on the long run frequency of heads for the particular coin rather than on the prior. If my life depended on the interval containing the true probability of a head I wouldn't flip the coin just once! – Dikran Marsupial Sep 04 '10 at 09:57
  • 1
    Having thought about this a bit more, this example is invalid as the criterion used to measure success is not the same as that implied by the question posed by the king. The problem is in the "no matter which coin is drawn", a clause that is designed to trip up any method that uses the prior knowledge about the rarity of the biased coin. As it happens, Bayesians can derive bounds as well (e.g. PAC bounds) and if asked would have done so, and I suspect the answer would be the same as the Clopper-Pearson interval. To be a fair test, the same information must be given to both approaches. – Dikran Marsupial Sep 06 '10 at 08:29
  • 1
    Dikran, there needn't be "Bayesians" and "Frequentists." They're not incompatible schools of philosophy of which one may subscribe to only one! They are mathematical tools whose efficacy can be demonstrated in the common framework of probability theory. My point is that IF the requirement is an absolute bound on false positives no matter the true value of the parameter, THEN a confidence interval is the method that accomplishes that. Of course we all agree on the same axioms of probability and the same answer can be derived many ways. – Keith Winstein Sep 07 '10 at 04:55
  • I agree with the first point, it is a matter of "horses for courses", but examples which show where the boundaries lie are interesting and provide insight into the "courses" best suited to each "horse". However, the examples must be fair, so that the criterion for success matches the question as posed (Jaynes is perhaps not completely immune to that criticism, which I will address in my answer which I will post later). – Dikran Marsupial Sep 07 '10 at 19:47
    The confidence interval only provides a bound on the *expected* number of false positives; it is not possible to put an absolute bound on the number of false positives for a particular sample (neglecting a trivial interval of [0,1]). A Bayesian would determine an interval such that the probability of more than five beheadings is less than some threshold value (e.g. 10^-6). This seems at least as useful as a bound on the expected number of beheadings and has the advantage of being a (probabilistic) bound on what happens to the actual sample of courtiers. I'd say this one was a clear draw. – Dikran Marsupial Sep 07 '10 at 19:57
  • Confidence intervals, in my opinion, are *completely and utterly useless* UNLESS the experiment is to be repeated a moderate number of times (10 or more). Because whether or not an $\alpha$ level CI contains the true parameter is basically a $Bernoulli(\alpha)$ random variable which has been "mixed up" so that we don't know whether we have observed a "success" or a "failure". Also, in this problem it is impossible to give an "exact" CI, because $10^{12}$ times it's 0.5 and 1 time it's 0.1. Show me 95% of this set? It doesn't exist! Wouldn't you just give the set of two numbers {0.5,0.1}? – probabilityislogic Jan 19 '11 at 06:18
  • 1
    The question as posed is a bit ambiguous, because it does not stated clearly what *information* the 100 people have. Do they know the distribution in the bag? for if they do, they "experiment" is useless, one would just give the interval $[0.1,0.5]$ or even just the two values $0.1$ and $0.5$ (does give required $\text{100%} \geq \text{95%}$ coverage). If we only know that there are a bag of coins to be drawn from, the Bayesian would specify the whole [0,1] interval, because false positives is *all* that matters in this question (and the *size* of the interval does not). – probabilityislogic Jan 27 '11 at 13:24
  • I would have thought the above argument holds just as well for the frequentist. The argument above (as far as I can tell) does not invoke any specifically Bayesian or Frequentist principles (although it does invoke the principle of *sanity*). – probabilityislogic Jan 27 '11 at 14:04
  • A confidence interval does not bound the rate of false positives - see my answer below for a counter-example to back up my claim. – probabilityislogic Jan 31 '11 at 07:11
  • Hi -- yes, a confidence interval's coverage probability is bounded below by the confidence parameter. So a 95% confidence interval will have coverage of at least 95%, irrespective of the true value of the parameter. A credibility interval does not make this guarantee, and can have coverage lower than its probability -- it can even have 0% coverage for some values of the parameter, as in the "king" example. See http://stats.stackexchange.com/questions/2272/whats-the-difference-between-a-confidence-interval-and-a-credible-interval for a fuller explanation. – Keith Winstein Feb 02 '11 at 05:37
  • @Keith - if what you say is true, then you should point out the mistake I have made in my answer (relating to Wasserman's example). Because the CI in that case does not have the 95% coverage for all values of the parameter. So if you are correct, then logically, I must have made a mistake somewhere in the calculations. – probabilityislogic Feb 05 '11 at 11:25
5

In this answer I aim to describe the difference between confidence intervals and credible intervals in an intuitive way.

I hope that this may help to understand:

  • why/how credible intervals are better than confidence intervals.
  • on which conditions the credible interval depends, and when it is not always better.

Credible intervals and confidence intervals are constructed in different ways and can be different

see also: The basic logic of constructing a confidence interval and If a credible interval has a flat prior, is a 95% confidence interval equal to a 95% credible interval?

In the question by probabilityislogic, an example from Larry Wasserman is given, which was also mentioned in the comments by suncoolsu.

$$X \sim N(\theta,1) \quad \text{where} \quad \theta \sim N(0,\tau^2)$$

We can view each experiment, with random values for $\theta$ and $X$, as a draw of a joint variable. This is plotted below for 20,000 simulated cases with $\tau=1$.

Wasserman example

This experiment can be considered as a joint random variable where both the observation $X$ and the underlying unobserved parameter $\theta$ have a multivariate normal distribution.

$$f(x,\theta) = \frac{1}{2 \pi \tau} e^{-\frac{1}{2} \left((x-\theta)^2+ \frac{1}{\tau^2}\theta^2\right)}$$

Both the $\alpha \%$-confidence interval and $\alpha \%$-credible interval draw boundaries in such a way that $\alpha \%$ of the mass of the density $f(\theta,X)$ falls inside the boundaries. How do they differ?

  • The credible interval draws boundaries by evaluating the $\alpha \%$ mass in a horizontal direction, such that for every fixed $X$ an $\alpha \%$ of the mass of the conditional density $$\theta_X \sim N(cX,c) \quad \text{with} \quad c=\frac{\tau^2}{\tau^2+1}$$ falls in between the boundaries.

  • The confidence interval draws boundaries by evaluating the $\alpha \%$ mass in a vertical direction, such that for every fixed $\theta$ an $\alpha \%$ of the mass of the conditional density $$X_\theta \sim N(\theta,1)$$ falls in between the boundaries (a small numerical sketch of the two intervals follows this list).
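
To make the contrast concrete, here is a minimal numerical sketch of my own (assuming $\tau = 1$ and a single hypothetical observation $X = 2$; the shrinkage factor is called cc, as in the code further below):

# minimal sketch: the two 95% intervals for one hypothetical observation X = 2 (tau = 1)
tau <- 1
cc  <- tau^2/(tau^2+1)           # shrinkage factor, also the posterior variance
za  <- qnorm(0.975)
X   <- 2                         # hypothetical observed value

X + c(-1, 1)*za                  # confidence interval, from X | theta ~ N(theta, 1): approx (0.04, 3.96)
cc*X + c(-1, 1)*za*sqrt(cc)      # credible interval, from theta | X ~ N(cc*X, cc): approx (-0.39, 2.39)
# the credible interval is narrower (by a factor sqrt(cc)) and shifted towards theta = 0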

What is different?

The confidence interval is restricted in the way that it draws the boundaries. It places these boundaries by considering the conditional distribution $X_\theta$ and will cover $\alpha \%$ of cases independently of what the true value of $\theta$ is (this independence is both the strength and weakness of the confidence interval).

The credible interval makes an improvement by including information about the marginal distribution of $\theta$, and in this way it is able to make smaller intervals without giving up the average coverage, which is still $\alpha \%$. (But it becomes less reliable, or fails, when the additional assumption about the prior is not true.)

In the example the credible interval is narrower by a factor $\sqrt{c}$, with $c = \frac{\tau^2}{\tau^2+1}$, and the coverage, despite the smaller intervals, is maintained by shifting the intervals a bit towards $\theta = 0$, which has a larger probability of occurring (this is where the prior density concentrates).

Conclusion

We can say* that, if the assumptions are true, then for a given observation $X$ the credible interval will always perform at least as well as the confidence interval. The flip side is the disadvantage of the credible interval (and the advantage of the confidence interval): the conditional coverage probability $\alpha \%$ varies with the true value of the parameter $\theta$. This is especially detrimental when the assumptions about the prior distribution of $\theta$ are not trustworthy.

*see also the two methods in this question The basic logic of constructing a confidence interval. In the image of my answer it is illustrated that the confidence interval can place the boundaries, with respect to the posterior distribution for a given observation $X$, at different 'heights'. So it may not always be optimally selecting the shortest interval, and for each observation $X$ it may be possible to decrease the length of the interval by shifting the boundaries while enclosing the same $\alpha \%$ amount of probability mass.

For a given underlying parameter $\theta$ the roles are reversed and it is the confidence interval that performs better (smaller interval in vertical direction) than the credible interval. (although this is not the performance that we seek because we are interested in the intervals in the other direction, intervals of $\theta$ given $X$ and not intervals of $X$ given $\theta$)


About the exception

Examples based on incorrect prior assumptions are not acceptable

This exclusion of incorrect assumptions makes it a bit of a loaded question. Yes, given certain conditions, the credible interval is better than the confidence interval. But are those conditions practical?

Both credible intervals and confidence intervals make statements about some probability, for example that in $\alpha \%$ of the cases the parameter is correctly estimated. However, that "probability" is only a probability in the mathematical sense and relates to the specific case in which the underlying assumptions of the model are very trustworthy.

If the assumptions are uncertain then this uncertainty should propagate into the computed uncertainty/probability $\alpha \%$. So credible intervals and confidence intervals are in practice only appropriate when the assumptions are sufficiently trustworthy that the propagation of errors can be neglected. Credible intervals might in some cases be easier to compute, but the additional assumptions make credible intervals (in some way) more difficult to apply than confidence intervals, because more assumptions are being made and this will influence the 'true' value of $\alpha \%$.
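
As a rough illustration of how an untrustworthy prior assumption distorts the nominal $\alpha \%$ (a sketch of my own, with an arbitrarily chosen mismatch, not part of the Wasserman example), one can simulate the case where the credible interval is built assuming $\tau = 1$ while $\theta$ is actually drawn with $\tau = 3$:

# sketch: realised coverage when the credible interval assumes tau = 1 but theta is drawn with tau = 3
set.seed(2)
n_sim       <- 10^5
tau_true    <- 3                 # actual spread of theta
tau_assumed <- 1                 # prior scale used to construct the credible interval
za          <- qnorm(0.975)

theta <- rnorm(n_sim, 0, tau_true)
X     <- rnorm(n_sim, theta, 1)

cc <- tau_assumed^2/(tau_assumed^2 + 1)
mean(theta > cc*X - za*sqrt(cc) & theta < cc*X + za*sqrt(cc))   # credible interval: well below 0.95
mean(theta > X - za & theta < X + za)                           # confidence interval: still about 0.95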


Additional:

This question relates a bit to Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?

See in the image below the expression of conditional probability/chance of containing the parameter for this particular example

Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?

The $\alpha \%$ confidence interval will correctly estimate/contain the true parameter $\alpha \%$ of the time, for each parameter $\theta$. But for a given observation $X$ the $\alpha \%$ confidence interval will not estimate/contain the true parameter $\alpha \%$ of the time. (Type I errors will occur at the same rate $\alpha \%$ for different values of the underlying parameter $\theta$. But for different observations $X$ the type I error rate will be different. For some observations the confidence interval may be more/less often wrong than for other observations.)

The $\alpha \%$ credible interval will correctly estimate/contain the true parameter $\alpha \%$ of the time, for each observation $X$. But for a given parameter $\theta$ the $\alpha \%$ credible interval will not estimate/contain the true parameter $\alpha \%$ of the time. (Type I errors will occur at the same rate $\alpha \%$ for different values of the observation $X$. But for different underlying parameters $\theta$ the type I error rate will be different. For some underlying parameters the credible interval may be more/less often wrong than for other underlying parameters.)


Code for computing both images:

# parameters
set.seed(1)
n <- 2*10^4
perc <- 0.95
za <- qnorm(0.5+perc/2,0,1)

# model
tau <- 1
theta <- rnorm(n,0,tau)
X <- rnorm(n,theta,1)

# plot scatter diagram of the joint distribution
plot(theta,X, xlab=expression(theta), ylab = "observed X",
     pch=21,col=rgb(0,0,0,0.05),bg=rgb(0,0,0,0.05),cex=0.25,
     xlim = c(-5,5),ylim=c(-5,5)
    )

# confidence interval
t <- seq(-6,6,0.01)
lines(t,t-za*1,col=2)
lines(t,t+za*1,col=2)

# credible interval
obsX <- seq(-6,6,0.01)
lines(obsX*tau^2/(tau^2+1)+za*sqrt(tau^2/(tau^2+1)),obsX,col=3)
lines(obsX*tau^2/(tau^2+1)-za*sqrt(tau^2/(tau^2+1)),obsX,col=3)

# adding contours for joint density
conX <- seq(-5,5,0.1)
conT <- seq(-5,5,0.1)
ln <- length(conX)

z <- matrix(rep(0,ln^2),ln)
for (i in 1:ln) {
  for (j in 1:ln) {
    z[i,j] <- dnorm(conT[i],0,tau)*dnorm(conX[j],conT[i],1)
  }
}
contour(conT,conX,-log(z), add=TRUE, levels = 1:10 )

legend(-5,5,c("confidence interval","credible interval","log joint density"), lty=1, col=c(2,3,1), lwd=c(1,1,0.5),cex=0.7)
title(expression(atop("scatterplot and contourplot of", 
                      paste("X ~ N(",theta,",1)   and   ",theta," ~ N(0,",tau^2,")"))))




# express the coverage (success rate) as a function of X and theta
# Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?
layout(matrix(c(1:2),1))
par(mar=c(4,4,2,2),mgp=c(2.5,1,0))
pX <- seq(-5,5,0.1)
pt <- seq(-5,5,0.1)
cc <- tau^2/(tau^2+1)

plot(-10,-10, xlim=c(-5,5),ylim = c(0,1),
     xlab = expression(theta), ylab = "chance of containing the parameter")
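# coverage as a function of theta, with X | theta ~ N(theta, 1):
#   the credible interval cc*X ± za*sqrt(cc) contains theta iff X lies in [theta/cc - za/sqrt(cc), theta/cc + za/sqrt(cc)]
#   the confidence interval X ± za contains theta iff X lies in [theta - za, theta + za] (constant 95%)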
lines(pt,pnorm(pt/cc+za/sqrt(cc),pt,1)-pnorm(pt/cc-za/sqrt(cc),pt,1),col=3)
lines(pt,pnorm(pt+za,pt,1)-pnorm(pt-za,pt,1),col=2)
title(expression(paste("for different values ", theta)))

legend(-3.8,0.15,
       c("confidence interval","credible interval"),
       lty=1, col=c(2,3),cex=0.7, box.col="white")


plot(-10,-10, xlim=c(-5,5),ylim = c(0,1),
     xlab = expression(X), ylab = "chance of containing the parameter")
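# coverage as a function of X, with theta | X ~ N(cc*X, cc):
#   the credible interval cc*X ± za*sqrt(cc) has constant 95% coverage
#   the confidence interval X ± za has coverage that varies with X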
lines(pX,pnorm(pX*cc+za*sqrt(cc),pX*cc,sqrt(cc))-pnorm(pX*cc-za*sqrt(cc),pX*cc,sqrt(cc)),col=3)
lines(pX,pnorm(pX+za,pX*cc,sqrt(cc))-pnorm(pX-za,pX*cc,sqrt(cc)),col=2)
title(expression(paste("for different values ", X)))


text(0,0.3, 
     c("95% Confidence Interval\ndoes not imply\n95% chance of containing the parameter"),
     cex= 0.7,pos=1)

library(shape)
Arrows(-3,0.3,-3.9,0.38,arr.length=0.2)
Sextus Empiricus
  • 43,080
  • 1
  • 72
  • 161
  • When I write *"So it may not always be optimally selecting the shortest interval, and for each observation $X$ it may be possible to decrease the length of the interval by shifting the boundaries while enclosing the same α% amount of probability mass."* It must be noted that this α% is variable as a function of X for the confidence interval... – Sextus Empiricus Jan 09 '20 at 17:17
  • ....So if you use the same variability you can always make the intervals shorter or at least the same size. But when you require a constant α% as a function of X, as with a typical credible interval, then it might be possible that the credible interval is *not* smaller than the confidence interval for *every* X. That means that the credible interval does not always dominate the confidence interval. (I have no clear example in mind, but I imagine it must be possible) – Sextus Empiricus Jan 09 '20 at 17:19
  • Just on your comment on the incorrect prior assumptions - if we relax this, then we should also consider that the model $p(X|\theta)$ may be "wrong" too. But this usually is not helpful to anyone - the solution is usually an implicit version of "change the model" (e.g. non-parametric tests, etc) – probabilityislogic Jan 10 '20 at 11:35
  • @probabilityislogic When one constructs a confidence interval one uses the model $p(X \, \vert \, \theta)$. When one constructs a credible interval one also has an *additional* model/assumption/belief for the marginal distribution $p(\theta)$. Indeed, for *both* assumptions/models we should be considering how trustworthy they are and by how much the errors in the assumptions propagate into the idealistic expressions of Bayesian/frequentist probability. Luckily the expression for $p(X \, \vert \, \theta)$ is often very reasonable, but the $p(\theta)$ is not always so clear. – Sextus Empiricus Jan 10 '20 at 11:50
  • I disagree here - often the likelihood is where the real problems are (e.g. constant variance assumption). Why is there a huge literature on "outliers" and "robustness" if likelihoods are reasonable? Additionally, the 'problem' with the prior can be easily fixed, by using a t-distribution with low df instead of normal. For large "true values" of $\theta$ the prior would be ignored with the posterior concentrating around $X$ rather than $cX$. – probabilityislogic Jan 11 '20 at 03:43
  • @probabilityislogic you are right that the likelihood is not always the least problematic. I should have stated that sometimes $p (\theta) $ is the biggest problem, sometimes it is the $p (X\,\vert\,\theta) $, sometimes it is both. But besides that, it's probably not what makes people choose, rightly or wrongly, for the frequentist method (the essential difference is in how they draw interval boundaries and choose to make the probability that the interval is correct depend on other parameters; as illustrated in the two graphs that I made based on the figure from Wasserman). – Sextus Empiricus Jan 11 '20 at 08:34
  • @probabilityislogic I agree with you that one can mock the *"a 95% Confidence Interval (CI) does not imply a 95% chance of containing the mean"* as Jaynes does in the article. It is often not the probability that is interesting (unless one does the test many times on a large ensemble such that focussing on frequency of success makes sense, e.g. quality testing or evaluating stocks, or when the loss function depends on the true $\theta$ and not on the observed $X$). However the creation of a statement about posterior probability is not a real solution when the prior is not correct. – Sextus Empiricus Jan 11 '20 at 08:45
0

are there examples where the frequentist confidence interval is clearly superior to the Bayesian credible interval (as per the challenge implicitly made by Jaynes).

Here is an example: the true $\theta$ equals $10$ but the prior on $\theta$ is concentrated about $1$. I am doing statistics for a clinical trial, and $\theta$ measures the risk of death, so the Bayesian result is a disaster, isn't it? More seriously, what is "the" Bayesian credible interval? In other words: what is the selected prior? Maybe Jaynes proposed an automatic way to select a prior, I don't know!
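
As a purely hypothetical numerical sketch of that scenario (numbers invented for illustration, using a conjugate normal prior concentrated about 1; as the comments below point out, this is exactly the kind of "wrong prior" example the question excludes):

# hypothetical numbers: true theta = 10, prior N(1, 0.1^2), observations x_i | theta ~ N(theta, 1)
set.seed(3)
theta_true <- 10
x  <- rnorm(20, theta_true, 1)
m0 <- 1; s0 <- 0.1                                   # prior concentrated about 1
n  <- length(x)

post_prec <- 1/s0^2 + n                              # conjugate normal-normal update (known variance 1)
post_mean <- (m0/s0^2 + n*mean(x))/post_prec
post_mean + c(-1, 1)*qnorm(0.975)/sqrt(post_prec)    # 95% credible interval: roughly (2.3, 2.7), nowhere near 10
mean(x)   + c(-1, 1)*qnorm(0.975)/sqrt(n)            # 95% confidence interval: centred on the sample mean near 10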

Bernardo proposed a "reference prior" to be used as a standard for scientific communication [and even a "reference credible interval" (Bernardo - objective credible regions)]. Assuming this is "the" Bayesian approach, the question now is: when is an interval superior to another one? The frequentist properties of the Bayesian interval are not always optimal, but neither are the Bayesian properties of "the" frequentist interval (by the way, what is "the" frequentist interval?).

Stéphane Laurent
  • 17,425
  • 5
  • 59
  • 101
  • I am speculating, but I suspect this answer is bound to get the same treatment that others have. Someone will simply argue this is an issue of poor choice of prior and not of some inherent weakness of Bayesian procedures, which in my view partially tries to evade a valid criticism. – cardinal Apr 06 '12 at 19:42
  • @cardinal's comment is quite right. The prior here is off by an order of magnitude, making the criticism very weak. Prior information matters to frequentists too; what one knows _a priori_ should determine e.g. what estimates and test statistics are used. If these choices are based on information that's wrong by an order of magnitude, poor results should be expected; being Bayesian or frequentist doesn't come into it. – guest Apr 06 '12 at 20:14
  • My "example" was not the important part of my answer. But what is a good choice of prior? It is easy to imagine a prior whose support contains the true parameter but whose posterior does not, so the frequentist interval is superior? – Stéphane Laurent Apr 07 '12 at 13:14
  • Cardinal and guest are correct, my question explicitly included "Examples based on incorrect prior assumptions are not acceptable as they say nothing about the internal consistency of the different approaches." for a good reason. Frequentist tests can be based on incorrect assumptions as well as Bayesian ones (the Bayesian framework states the assumptions more explicitly); the question is whether the *framework* has weaknesses. Also if the true value was in the prior, but not the posterior, that would imply that the observations ruled out the possibility of the true value being correct! – Dikran Marsupial Apr 10 '12 at 10:33
  • @cardinal it isn't evading criticism of Bayesian methods, of course choice of prior is an issue. It just isn't the issue that is relevant to this particular question. The difficulty of performing the integrals is another weakness of Bayesian methods. Horses for courses, the trick is to know which horse for which course, hence my interest in the question. – Dikran Marsupial Apr 10 '12 at 10:36
  • 1
    Maybe I should edit my answer and delete my "example" - this is not the serious part of my answer. My answer mainly was about the meaning of "the" Bayesian approach. What do you call the Bayesian approach ? This approach requires the choice of a subjective prior or it uses an automatic way to select a noninformative prior ? In the second case it is important to mention the work of Bernardo. Secondly you have not defined the "superiority" relation between intervals: when do you say an interval is superior to another one ? – Stéphane Laurent Apr 10 '12 at 17:47
  • Note that the prior being off by an order of magnitude doesn't matter so long as the tails of the prior are "fatter" than the tails of the likelihood. For example, if you have $x_i \mid \mu \sim N(\mu,1)$ for $i=1,\dots,n$ and you set your prior as $\mu \sim \text{Cauchy}(m,v)$. Then the posterior mean cannot be more than some fixed distance away from the sample mean. Further, the distance tends to zero as $|m-\overline{x}|\to\infty$ - i.e. as our prior guess becomes more in conflict with the data. – probabilityislogic Oct 02 '12 at 08:43
  • The problem you speak of is more about prior specification than an error. We want the prior to accurately describe what information you have. The above example is one where we consider the likelihood function to be more reliable than the prior. – probabilityislogic Oct 02 '12 at 08:51
0

Are there any examples where Bayesian credible intervals are obviously inferior to frequentist confidence intervals

I'm going to say "any paper in experimental science".

There's an XKCD cartoon that has made the rounds here before, which I've edited slightly:

Okay, the stick figure on the left is nuts, and the one on the right is saner. But I want to focus on a different question: if this experiment were published, what would you want to see in the paper?

You don't want the opinion of either of these guys. What you want is the information in the first panel, so you can form your own opinion. That's what the confidence interval tells you: the Universe—which we expect to lie to us about 5% of the time—just told us that the answer is somewhere in here.

The confidence interval isn't what you really want to know; what you really want to know is something like the credible interval. But the confidence interval is what you want the paper to tell you: it's a concise summary of the result of this particular experiment.

The calculation of the confidence interval still incorporates assumptions that may be wrong, invalidating it. But they're assumptions about the reliability of the equipment, the quality of the randomization, and other things that the experimenter can be expected to know better than you. Human bias can still creep in, but it's unavoidable that you have to trust the experimenter about these sorts of things.

If you want to make a decision on the basis of this data, then you shouldn't treat the confidence interval as a credible interval, as the guy on the left does. You probably should do a Bayesian analysis. Proponents of Bayesianism often talk about winning bets, because Bayesian inference is good for that. But not everything is about winning bets.

benrg
  • 101
  • 1
  • I don't see how the rest of the answer substantiates the claim "any paper in experimental science". Most often in experimental science, what you really want to know is what you can infer from the outcome of the particular experiment that you actually performed, and that is summarised by the credible interval. Very rarely do we really want a statement about what we would expect to see if we performed the experiment a large number of times. – Dikran Marsupial Feb 28 '22 at 22:11
  • @DikranMarsupial I said the same thing in the paragraph beginning "That isn't what you really want to know." You want a credible interval that you calculated yourself from your own priors and the confidence interval in the paper. You don't want a credible interval that reflects the experimenters' biases instead of yours. If it's approved by a theorist and you're just a layperson then that's different, but that's more like popular science reporting than the peer-reviewed literature. – benrg Mar 01 '22 at 01:22
  • Bayesian credible intervals do not necessarily contain experimenter's biases (and frequentist analyses are not necessarily free of them). You can't necessarily compute a credible interval from your own priors and a frequentist confidence interval. – Dikran Marsupial Mar 01 '22 at 06:24
-1

The second example in this thread compares a frequentist confidence interval to two different posterior intervals based on two different non-informative priors. Despite using all the information in the likelihood, both credible intervals can be considered inferior because: i) neither credible interval provides a long-run guarantee of covering the unknown fixed true parameter; ii) it is not obvious which non-informative prior one should choose when constructing the posterior if the experimenter truly has no prior knowledge; iii) the posterior probability statements are not verifiable statements about the actual fixed parameter, the hypothesis, nor the experiment.

Both the credible interval and the confidence interval attempt to address the request, "Give me a set of plausible true values of the parameter, given the observed data." In his answer to the original post, Dikran Marsupial provides the following:

(a) "Give me an interval where the true value of the statistic lies with probability p", then it appears a frequentist cannot actually answer that question directly (and this introduces the kind of problems that Jaynes discusses in his paper), but a Bayesian can, which is why a Bayesian credible interval is superior to the frequentist confidence interval in the examples given by Jaynes. But this is only because it is the "wrong question" for the frequentist.

(b) "Give me an interval where, were the experiment repeated a large number of times, the true value of the statistic would lie within p*100% of such intervals" then the frequentist answer is just what you want. The Bayesian may also be able to give a direct answer to this question (although it may not simply be the obvious credible interval). Whuber's comment on the question suggests this is the case.

Dikran Marsupial's response is wrong for two reasons. The first is that neither the credible interval nor the confidence interval is a set of statistic values. Each is a set in the parameter space. Secondly, if we ignore this mistake and consider both the confidence and credible interval as residing in the parameter space, it is misleading in (a) to suggest a Bayesian approach can provide "an interval where the true parameter lies with 100p% probability." Under a Bayesian approach it is more transparent to say "a set of values that has 100p% belief units (Bayesian probability)." We must make it clear this is not a verifiable statement about the actual fixed parameter, the hypothesis, nor the experiment. The confidence interval for a single observed experimental result is considered plausible due to its long-run performance over repeated experiments. This coverage probability is a statement about the experiment in relation to the unknown fixed true parameter. If the prior distribution is chosen in such a way that the posterior is dominated by the likelihood, Bayesian belief is more objectively viewed as a type of confidence based on frequency probability of the experiment.

Geoffrey Johnson
  • 2,460
  • 3
  • 12
  • "neither the credible interval nor the confidence interval is a set of statistic values." my answer makes no such statement. As for "belief units (Bayesian probability)": pedantry - under a Bayesian framework, that *is* a probability. – Dikran Marsupial Mar 01 '22 at 06:43