How to calculate confidence intervals for a generic (non-normal) pdf

Question

In particular, this is a question about distributions with some skew and/or where the mean, median and mode might all be different.

For any pdf, it's possible to find multiple different domains within the support which contain exactly 95% (or another other %ge) of the integral. For distributions like the normal distribution, it's very intuitive that you want to find $b$ such that $\int _{\mu -b}^{\mu +b}f(x)dx = 0.95$ in order to find the 95% confidence interval.

For other distributions, it's less clear that you want to do this, because the mean $\mu$ of the distribution holds a less special status (in the normal case, it is the mean, median and mode).

It would seem to me, that really, one is trying to find (a,b) in a way that minimises (b-a) subject to the constraint that $\int_{a}^{b}f(x)dx=0.95$

In the case of the normal distribution, this will boil down to exactly what I wrote above.

My questions are:

Is this right?
Presumably this has to be done numerically. Is there a known algorithm for doing this? (naively I'd just vary a and find the corresponding b for every a (for many a there will exist no b) and then find the (a,b) pair that minimises (b-a) )
Is there an open source implementation of this in python? Ideally I would pass two vectors representing a pdf, one which tells it at which points on the x-axis I've sampled the pdf and one which gives the density at each of these points.

I'm interested, for example, of finding the 95% confidence interval of the beta distribution for arbtirary $(\alpha, \beta)$. Of course it would be ideal if one could do this analytically but I strongly suspect this won't be possible.

I appreciate that things get even more complicated for multi-modal distributions, I'm happy with answers that only work for unimodal distributions for the moment.

I think this is what you are looking for: https://stats.stackexchange.com/questions/148439/what-is-a-highest-density-region-hdr. On your points (1) It is almost correct because minimizing $(b-a)$ works only for unimodal distribution; (2) See this: https://stats.stackexchange.com/a/383626/239481; (3) Just search for HDI in python - it must be there for some bayesian statistics related package as it is used extensively there (called HPDI). — Dayne, Dec 20 '20 at 11:56
A little more explanation on (1): minimizing $(b-a)$ is forcing the interval to be continuous. For a multimodal distribution it maybe that a discontiguous set has smaller range and yet area under $f(x)$ for this set is 0.95. To get such sets, we find the highest horizontal bar such that area under $f(x)$ between pair(s) of intersection points is 0.95 — Dayne, Dec 20 '20 at 12:01
Thanks, useful pointers. Indeed the problem is substantially more complex for multimodal distributions. Surprisingly, googling "Highest Density Interval Python" has not been hugely fruitful (I found a few implementations that use raw data rather than the pdf itself)... but that's OK, at least this gives me confidence I'm not wasting my time implementing this (for unimodal cases) myself. — gazza89, Dec 20 '20 at 13:28
For python implementation you might find this helpful: https://stackoverflow.com/questions/22284502/highest-posterior-density-region-and-central-credible-region — Dayne, Dec 20 '20 at 13:31
kinda. The pymc3 example uses raw data and the scipy.stats.ppf example finds the central confidence interval rather than the HDI, so neither are quite what I'm after. I guess a somewhat inelegant solution would be to sample a tonne of data from from pdf and then pass to pymc3.stats.hpd — gazza89, Dec 20 '20 at 13:37

How to calculate confidence intervals for a generic (non-normal) pdf

0 Answers0