
I'm trying to understand when credible intervals are useful.

Are there examples of real-world situations where credible intervals are the better tool compared to confidence intervals? Note that by "useful" I mean maximizing some concrete real-world objective (so not, for instance, obtaining posterior intervals for one's subjective beliefs, which can be useful but is not what I'm looking for).

Thanks

Opt
    You may well find your answer if you search this site for "credible." – rolando2 Feb 10 '13 at 13:16
  • When I Google "What is the point of a confidence interval", many of the answers are variations on "to provide a range of likely values for a parameter". This is what credible intervals are better at. – fblundun Dec 17 '21 at 20:53

4 Answers


The problem with comparing credible sets and confidence intervals is that it is not an apples-to-apples or even an apples-to-oranges comparison. It is an apples-to-tractors comparison. The two are substitutes for one another only in certain circumstances.

The primary use of a confidence interval is in scientific research. Although businesses use them, their value there is lessened because it is often difficult to choose an action based on a range. Applied business statistics tends to favor point estimates for practical reasons, even if intervals are included in reports. When included, they mostly serve as warnings.

Credible sets tend to see less use in Bayesian work because the entire posterior is reported, along with the marginals. When no graph of the posterior is provided, credible sets give a descriptive feel for the data, but they do not have the same usefulness as confidence intervals because they mean something different.

There are four cases where you will tend to see a credible set used instead of a confidence interval, but I am not certain that most of them are practical. It happens, but not often.

The first one has already been mentioned. There are times when a confidence interval appears to produce a pathological interval. I am less happy with this use. It is important to remember that confidence procedures produce valid intervals at least $100(1-\alpha)\%$ of the time upon infinite repetition, but the price of that guarantee may occasionally be total nonsense. I am not sure that is a good reason to discard a Frequentist method.

Rare or near-universal events are a typical example. If a high enough percentage of a population is doing or not doing something, then the sample may suggest that everybody or nobody is doing it. Because Frequentist intervals are built around point estimates, and such a sample has zero variance, the interval collapses to a point. I find it disturbing to abandon a method because it sometimes produces a result that others may not accept. The virtue of a Frequentist method is that all information comes from the data; it just happens that the data did not contain enough information.
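To make this concrete, here is a minimal sketch of the pathology, using a hypothetical survey in which all 50 respondents answer "yes" (the numbers are illustrative). The Wald interval, built around the point estimate, collapses to zero width, while a Jeffreys-prior credible interval still has a range:

```python
import numpy as np
from scipy import stats

n, x = 50, 50          # hypothetical survey: all 50 respondents say "yes"
p_hat = x / n          # point estimate is exactly 1; the sample variance is 0

# Wald 95% interval: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)
se = np.sqrt(p_hat * (1 - p_hat) / n)
print((p_hat - 1.96 * se, p_hat + 1.96 * se))   # (1.0, 1.0): a zero-width "interval"

# Bayesian alternative: Jeffreys Beta(1/2, 1/2) prior -> Beta(x + 1/2, n - x + 1/2) posterior
posterior = stats.beta(x + 0.5, n - x + 0.5)
print(posterior.interval(0.95))                 # roughly (0.95, 1.0): still has a range
```

(Other frequentist constructions, such as Clopper-Pearson, do not collapse here; the collapse is a feature of intervals built directly around the point estimate.)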

That is not the sum total of all pathologies, however. Other pathologies may encourage the use of a Bayesian method because an appropriate Frequentist method may exist but cannot be found. For example, the sample mean coordinate of points drawn from a donut centered on $(0,0,0)$ should be near $(0,0,0)$, but there is no donut there: that is where the donut hole is. A range built around an unsupported point may encourage a Bayesian alternative if information about the shape cannot be included in the non-Bayesian solution for some reason.
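A small simulation of the donut example, with radii assumed purely for illustration: the mean of points sampled on a torus centered at $(0,0,0)$ sits near the origin, yet no sampled point is anywhere near it. (Uniform sampling in the two angles is not exactly uniform over the surface, but it is enough for the point being made.)

```python
import numpy as np

rng = np.random.default_rng(0)
R, r = 2.0, 0.5                            # major and minor radii of the donut (assumed)
theta = rng.uniform(0, 2 * np.pi, 10_000)  # angle around the central axis
phi = rng.uniform(0, 2 * np.pi, 10_000)    # angle around the tube

# Points on a torus centered at (0, 0, 0)
x = (R + r * np.cos(phi)) * np.cos(theta)
y = (R + r * np.cos(phi)) * np.sin(theta)
z = r * np.sin(phi)
pts = np.column_stack([x, y, z])

print(pts.mean(axis=0))                    # near (0, 0, 0) ...
print(np.linalg.norm(pts, axis=1).min())   # ... but every point is at least R - r = 1.5 away
```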

The second reason has a partial Frequentist analog: the case of outside information. In the general case, where there is outside research on a parameter of interest, both a Bayesian prior and a Frequentist meta-analysis produce usable intervals. The difficulty arises when the outside knowledge is not contained in data, per se, but in theory or engineering judgment.

Some knowledge is supported by theory and by observations in unrelated studies but should logically hold. For example, consider a well-engineered object whose state should range between 0 and 1. If it reaches 0, it terminates. The next value is $x_{t+1}=\beta x_t+\epsilon$, with $0<\beta<1$. The process can have the value 1 only at $t=0$; $x_t$ may go up or down, but it can never reach 1 again, and it stops at 0. Furthermore, because the object is well engineered, $\beta=0.9999999\pm 0.00000001$. Of course, we could have deceived ourselves about the true tolerance. That is the rub when using a Bayesian method.
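Here is a hedged sketch of how that tolerance might enter a conjugate analysis; the noise scale, the series length, and the Gaussian treatment of $\epsilon$ (which ignores the boundary behavior at 0 and 1) are assumptions made purely for illustration. The tight prior swamps a short data record, so the credible interval reproduces the engineering tolerance, while the data-only estimate is orders of magnitude less precise:

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true, sigma = 0.9999999, 0.01      # noise scale assumed for illustration

# Simulate a short run of x_{t+1} = beta * x_t + eps, starting from x_0 = 1
x = [1.0]
for _ in range(20):
    x.append(beta_true * x[-1] + rng.normal(0, sigma))
x = np.array(x)
xt, xt1 = x[:-1], x[1:]

# Data-only (least squares) estimate of beta and its standard error
beta_ls = (xt * xt1).sum() / (xt ** 2).sum()
se_ls = sigma / np.sqrt((xt ** 2).sum())

# Conjugate update with the engineering prior beta ~ N(mu0, tau0^2)
mu0, tau0 = 0.9999999, 0.00000001
post_prec = 1 / tau0 ** 2 + (xt ** 2).sum() / sigma ** 2
post_mean = (mu0 / tau0 ** 2 + (xt * xt1).sum() / sigma ** 2) / post_prec
post_sd = np.sqrt(1 / post_prec)

print(f"least squares: {beta_ls:.6f} +/- {1.96 * se_ls:.6f}")       # width on the order of 1e-2
print(f"posterior:     {post_mean:.9f} +/- {1.96 * post_sd:.9f}")   # width ~1e-8, the tolerance
```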

In the case of the well-engineered product, confidence intervals are too conservative and too wide. It can then be trivially true that a 95% interval covers the parameter at least 95% of the time: because the prior information was excluded from its construction, the interval may be so wide that it covers the parameter nearly 100% of the time.

The third case happens when something is a one-off event instead of a repeating event. Interestingly, you can create a case where a confidence interval is the valid interval for one party, and a credible set is the valid interval for another party with the same data.

Consider a manufacturing firm that produces some product that fails from time to time. It wants to guarantee that at least 99% of the time, it can recover from failure based on an interval. A confidence interval provides that guarantee. However, the party buying a product that failed may want an interval that has a 99% chance of being the correct interval to fix the problem as this will not repeat, and it must only work this one time. They are concerned about the data they have and the one event they are experiencing. They do not care about the product’s efficacy for the other customers of the firm.

The fourth case may have no real-world analogs, but it has to do with the type of loss being experienced. Most Frequentist procedures are minimax procedures: they minimize the maximum amount of risk that you are exposed to. That is also true of confidence procedures. Most Bayesian interval estimates minimize average loss. If your concern is minimizing your average loss from using an interval built from a non-representative sample, then you should use a credible set. If you are concerned with taking the smallest possible largest risk, then you should use a confidence interval.

But getting back to the apples and tractors: these cases do not arise that often. Frequentist procedures overtook the pre-existing Bayesian paradigm because they work in most settings for most problems. Bayesian procedures are clearly superior in some cases, but not necessarily Bayesian intervals.

The real-world cases for Bayesian credible sets are things like search and rescue, because they can be quickly and easily updated and can use knowledge without prior research. Bayesian methods can also be superior when significant amounts of data are missing, because they treat a missing data point as they do a parameter. That can prevent a pathological interval created by information loss, because the impact of the missing data can be marginalized out.

This is a personal guess, based on the observation that Bayesian methods are comparatively not in heavy use, but I am not convinced that an interval holds the same value on the Bayesian side of the coin.

Frequentist methods are built around points. Bayesian methods are built around distributions. Distributions carry more information than a single point. Bayesian methods can split inference and probability from actions taken based on those probabilities.

If an interval would be helpful, a loss function can be applied to the posterior, and boundaries for the interval can be discovered. In that case, it is a formalism to support a proper action given the data.
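One common version of that formalism is the shortest interval containing a fixed posterior mass, i.e. the highest-posterior-density (HPD) interval. A minimal sketch, using draws from a hypothetical Beta posterior:

```python
import numpy as np
from scipy import stats

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior draws."""
    s = np.sort(samples)
    k = int(np.ceil(mass * len(s)))           # draws each candidate interval must contain
    widths = s[k - 1:] - s[: len(s) - k + 1]  # width of every window of k consecutive draws
    i = widths.argmin()
    return s[i], s[i + k - 1]

# Hypothetical posterior for a proportion: Beta(3, 9)
draws = stats.beta(3, 9).rvs(size=100_000, random_state=0)
print(hpd_interval(draws))   # the shortest 95% credible interval
```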

I suspect that this specific use rarely happens outside risk management, where ranges are essential, and I do not know how often it happens even there.

Confidence intervals carry more information than point estimates. Credible sets are an information reduction technique.

A confidence interval of $7\pm 3$ does not give the same information as a credible set of $[6,7]\cup[7.5,9]$ for the same data.

Dave Harris

A classic example is when you have tested a drug against a placebo in a randomized clinical trial of 1 year's duration with 1000 patients in each group. An adverse event that people were concerned could be a side effect of the treatment occurred in 0 patients in the treatment group and 0 patients in the placebo group. We have the rates at which these events occurred in the placebo groups of previous similar studies in the same population, where they were also very rare but did sometimes occur.

What can you say about the odds ratio (or rate ratio or hazard ratio)? A frequentist analysis would conclude that we do not really have an estimate, and the confidence interval (say, for the log odds ratio) is something like $(-\infty, \infty)$.

In contrast, a sensible Bayesian analysis will do something more informative, as long as we have at least some weak prior information about the likely placebo rate and the possible size of a treatment effect. With a plausible level of prior information, a Bayesian analysis in this kind of scenario would already suggest that extreme odds ratios are not very likely.
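Below is a sketch of such an analysis. The prior numbers are illustrative stand-ins, not values from any actual trial: a Beta prior encodes "the event was historically very rare in placebo groups," a normal prior on the log odds ratio encodes "plausible effect sizes," and the zero-event likelihood is folded in by importance weighting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m = 200_000

# Illustrative priors only: placebo rate rare (mean ~0.2%), log odds ratio ~ N(0, 1)
p_pbo = stats.beta(1, 500).rvs(m, random_state=rng)
log_or = rng.normal(0.0, 1.0, m)

# Treatment rate implied by the placebo rate and the log odds ratio
odds_trt = (p_pbo / (1 - p_pbo)) * np.exp(log_or)
p_trt = odds_trt / (1 + odds_trt)

# Likelihood of 0/1000 events in each arm, used as importance weights on the prior draws
w = (1 - p_pbo) ** 1000 * (1 - p_trt) ** 1000
w /= w.sum()

# Posterior summary of the odds ratio by weighted resampling
idx = rng.choice(m, size=m, replace=True, p=w)
print(np.percentile(np.exp(log_or[idx]), [2.5, 50, 97.5]))
# extreme odds ratios carry little posterior mass
```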

For a contrasting case, see the TGN1412 example (see e.g. pages 2 and 92–94 here, or Senn, S. (2008). Lessons from TGN1412 and TARGET: Implications for observational studies and meta-analysis. Pharmaceutical Statistics, 7(4):294–301), where 6 out of 6 patients with an adverse event on the test drug compared with 0 out of 2 placebo patients with an event is not statistically significant at the one-sided 2.5% level (Fisher's exact test). A sensible Bayesian analysis, however, suggests that we can be pretty sure the side effects were due to the drug.

Björn
  • The confidence interval for a rate ratio would not preclude incorporating historical data (meta-analysis). Even if the data are sparse, one can construct sensible confidence intervals for each rate, as well as for their ratio, by inverting the CDF of the maximum likelihood estimator. If constructed objectively, even using historical data, the posterior can be viewed as a crude approximate frequentist testing/confidence procedure. The choice of interpreting a credible interval comes down to what one wants to measure, the experimenter or the experiment. – Geoffrey Johnson Dec 16 '21 at 17:57
  • Even if the Fisher's exact test is not statistically significant at the one-sided 2.5% level we should still report and interpret the p-value and provide a confidence interval. This would show there may very well be an effect and we simply were not able to detect it at a specific significance level. – Geoffrey Johnson Dec 16 '21 at 17:58

Björn's answer suggests a frequentist confidence procedure can neither handle sparse data nor incorporate historical data. To illustrate this, Björn provides the TGN1412 example,

(see e.g. pages 2 and 92 to 94 here or Senn, S. (2008). Lessons from TGN1412 and TARGET: Implications for observational studies and meta-analysis. Pharmaceutical Statistics, 7(4):294–301.), where 6 out of 6 patients with an adverse event on a test drug compared with 0 out of 2 placebo patients with an event.

Using only the data provided above (while assuming equal exposure for all subjects and that a subject can experience only 1 event of interest), the figure below depicts confidence curves (one-sided p-values) testing hypotheses regarding the population-level adverse event rate $p$ for the active and placebo treatments. It also identifies the one-sided 97.5% confidence limits. This is formed by inverting the CDF of a binomial distribution based on the $\hat{p}_{pbo}=0$ and $\hat{p}_{act}=1$ point estimates. The estimated rate ratio is $\hat{p}_{pbo}/\hat{p}_{act}=0$ and a conservative upper 97.5% confidence limit is the ratio of the individual confidence limits, $0.84/0.54=1.56$. Notice the point and interval estimate $0(0,1.56)$ for the rate ratio is not $0(-\infty,\infty)$.

[Figure: confidence curves (one-sided p-values) for the adverse event rate under the active and placebo treatments, with the one-sided 97.5% confidence limits marked and the Bayesian posterior densities overlaid.]

This figure also shows Bayesian posterior densities (credible intervals of all levels) for the adverse event rate for each treatment, based on an arbitrary uniform prior in each group. As estimators the posterior means are biased towards 0.5, as comparison with the observed point estimates shows. Note also that the upper credible limit for the placebo event rate is noticeably lower than the confidence limit. This credible limit may not have good coverage probability in repeated experiments, calling into question whether we should feel confident in its performance for this experimental result. Based on $100,000$ Monte Carlo simulations, the two-sided equal-tailed $95\%$ credible interval for the incidence rate ratio is $(0.0096, 0.85)$. Viewing the prior as a user-defined weight function that smooths the likelihood, the posterior densities can be seen as approximate p-value functions. The choice of interpreting a credible interval comes down to what one wants to measure using probability: the experimenter or the experiment.
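These numbers can be reproduced in a few lines. For 0 of 2 and 6 of 6 events, inverting the binomial CDF has closed forms, and the credible interval follows from simulating the two Beta posteriors implied by the uniform priors:

```python
import numpy as np
from scipy import stats

# One-sided 97.5% confidence limits by inverting the binomial CDF:
# the placebo upper limit solves (1 - p)^2 = 0.025; the active lower limit solves p^6 = 0.025
upper_pbo = 1 - 0.025 ** (1 / 2)    # ~0.84
lower_act = 0.025 ** (1 / 6)        # ~0.54
print(upper_pbo / lower_act)        # conservative upper limit for the rate ratio, ~1.56

# Uniform priors give posteriors Beta(1 + 0, 1 + 2) (placebo) and Beta(1 + 6, 1 + 0) (active)
rng = np.random.default_rng(0)
p_pbo = stats.beta(1, 3).rvs(100_000, random_state=rng)
p_act = stats.beta(7, 1).rvs(100_000, random_state=rng)
print(np.percentile(p_pbo / p_act, [2.5, 97.5]))   # approximately (0.0096, 0.85)
```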

Based on these data and a uniform prior distribution, a strict posterior decision rule would lead one to conclude that the unknown fixed true rate ratio is smaller than $1$. Both methods can incorporate relevant historical data. If the historical and current data are encoded through the likelihood, however, it is not clear what arbitrary user-defined weight function (prior) one should then choose to smooth the likelihood into posterior intervals.


Addendum: Per Björn's request, we can also look at the scenario where both groups have zero observed events. Just as before, the credible intervals are worrisomely shorter than the confidence intervals, and the posterior means are the result of biased estimators.

[Figure: confidence curves and posterior densities for the adverse event rate in each treatment group when zero events are observed in both groups.]

The challenge now is to construct a point and interval estimate for the incidence rate ratio. The maximum likelihood estimate is $\frac{\hat{p}_{pbo}}{\hat{p}_{act}}=\frac{0}{0}$, which we could define to be equal to $1$. However, to construct conservative upper and lower confidence limits as before would produce values of the form $\frac{c}{0}$.

The Bayesian analysis of the rate ratio avoids this trouble because of the uniform prior distributions on each rate. This is equivalent to incorporating hypothetical experimental evidence: consider the scenario where each treatment group had recruited $2$ additional subjects and $1$ subject in each group experienced the event of interest. This of course does not match the actual observed experiment, but it does provide conservative point estimates (conservative in the sense that the adverse event rate is not underestimated).

This same examination of hypothetical experimental evidence can be performed by referencing the exact binomial sampling distribution, which is presented in the figure below. Under this hypothetical scenario, a conservative $95\%$ confidence interval can be constructed by using the ratios of confidence limits for the individual rates, producing $\bigg(\frac{\hat{p}^L_{pbo}}{\hat{p}^U_{act}},\frac{\hat{p}^U_{pbo}}{\hat{p}^L_{act}}\bigg)=\Big(\frac{0.006}{0.53},\frac{0.81}{0.003}\Big)=(0.011,270)$. Another approach would be to invert the cumulative distribution function for the maximum likelihood estimator of the rate ratio while profiling the nuisance parameter $p_{act}$. Based on $100,000$ Monte Carlo simulations, the two-sided equal-tailed $95\%$ credible interval for the rate ratio is $(0.068, 68.25)$.

[Figure: the exact binomial sampling distributions under the hypothetical scenario with the added pseudo-subjects.]
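A hedged reproduction of these numbers: with the pseudo-subjects added, the hypothetical data are 1 of 4 placebo and 1 of 8 active events; the individual limits follow from the usual beta-quantile form of the Clopper-Pearson interval, and the credible interval comes from the Beta(1, 3) and Beta(1, 7) posteriors for the actual zero-event data.

```python
import numpy as np
from scipy import stats

def clopper_pearson(x, n, level=0.95):
    """Exact (Clopper-Pearson) confidence limits for a binomial proportion."""
    a = (1 - level) / 2
    lower = stats.beta(x, n - x + 1).ppf(a) if x > 0 else 0.0
    upper = stats.beta(x + 1, n - x).ppf(1 - a) if x < n else 1.0
    return lower, upper

lo_pbo, up_pbo = clopper_pearson(1, 4)    # ~ (0.006, 0.81)
lo_act, up_act = clopper_pearson(1, 8)    # ~ (0.003, 0.53)
print(lo_pbo / up_act, up_pbo / lo_act)   # conservative ratio limits, ~ (0.011, 270)

# Credible interval for the ratio from the zero-event posteriors under uniform priors
rng = np.random.default_rng(0)
p_pbo = stats.beta(1, 3).rvs(100_000, random_state=rng)   # 0 of 2 placebo events
p_act = stats.beta(1, 7).rvs(100_000, random_state=rng)   # 0 of 6 active events
print(np.percentile(p_pbo / p_act, [2.5, 97.5]))          # approximately (0.068, 68.25)
```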

If we instead investigate the difference in incidence rates then no hypothetical experimental evidence is needed when constructing confidence limits based on the binomial CDF. If a subject can experience more than 1 event or we have varying exposure for each subject (or both) then a Poisson or Negative Binomial model should be used instead.

Treating fixed population-level parameters as random variables gives the appearance that more uncertainty is being accounted for, but often leads to credible limits (approximate confidence limits) that are too short.

Geoffrey Johnson

If you are concerned about the recovery rate of a certain disease, a credible interval is what you need when you want to say

There is a 95% chance that the recovery rate is between X and Y.

You cannot say this using a confidence interval. With a 95% confidence interval, you can only say

There is a 95% chance that the next set of sampled patients has a recovery rate between X and Y (crossed out so as not to confuse this with the sample-generation scenario, which ASSUMES my sample distribution is the true population distribution, so that there is a 95% chance the next sample drawn from the population falls within the interval X to Y)

If we draw $N$ sets of samples and calculate a confidence interval for each set, 95% of those intervals cover the true recovery rate, but I do not know whether a particular interval X to Y contains the true recovery rate or not. In other words, I am only 95% confident that the true recovery rate falls within the confidence interval X to Y that I calculated from my data.
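That repeated-sampling statement can be checked directly by simulation: draw many datasets from a known recovery rate, compute a confidence interval for each, and count how often the true rate is covered. A minimal sketch with an assumed true rate of 0.3 and Clopper-Pearson intervals (one concrete choice of confidence procedure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_true, n, reps = 0.3, 100, 10_000   # assumed true recovery rate, patients per study

x = rng.binomial(n, p_true, size=reps)        # one simulated study per replication
lower = stats.beta.ppf(0.025, x, n - x + 1)   # Clopper-Pearson 95% limits
upper = stats.beta.ppf(0.975, x + 1, n - x)
lower = np.where(x == 0, 0.0, lower)          # edge cases where the beta quantile is undefined
upper = np.where(x == n, 1.0, upper)

coverage = np.mean((lower <= p_true) & (p_true <= upper))
print(coverage)   # at least 0.95: the guarantee is about the procedure, not any one interval
```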

Raymond Kwok
  • Your characterization of confidence intervals is incorrect. You describe something closer to a prediction interval for the range of recovery rates of the next 100 patients. That will be *much* wider than a confidence interval for the recovery rate. – whuber Jan 22 '22 at 16:03
  • how would you have rewritten "There is a 95% chance that the next 100 patients have a recovery rate between X and Y"? – Raymond Kwok Jan 22 '22 at 16:13
  • what about "There is a 95% chance that the next 100 patients have an ***averaged*** recovery rate between X and Y"? – Raymond Kwok Jan 22 '22 at 16:15
  • Same problem: that would be a prediction interval for the average rate of the next 100 patients. I recommend you review the concept of confidence intervals before guessing again. We have some good threads about the subtleties. See https://stats.stackexchange.com/questions/26450 for instance. – whuber Jan 22 '22 at 16:21
  • Thank you. @whuber – Raymond Kwok Jan 22 '22 at 16:56