I'm trying to estimate an underlying distribution from a few data points. I'm using a Bayesian updating technique, and this works pretty well (see my other question). However, I currently use a discrete set of hypotheses $P(a|i)$ with weights or priors $P(i)$. The data variable is $a$; my hypotheses are enumerated by $i$.

Now I'd like to go to a smooth set of hypotheses, for example $P(a|\theta)$ or $P(a|\mu,\sigma)$ with some prior $P(\mu,\sigma)$. Then the expression for $P(a)$ becomes a marginalization integral:

$$P(a) = \int_\theta P(a|\theta) P(\theta)\,d\theta$$

This is easy if $P(\theta)$ is simple (e.g. flat), but how do I represent it after a knowledge update (especially in a program)? The updated prior no longer has a nice functional form. I could discretize it again, but that limits my accuracy (and while it may still be feasible in 2D, it gets worse as I add more parameters).

I could sample random $(\mu,\sigma)$ points, but I only see how that gets me through one iteration: it lets me calculate $P(a)$ for a given $a$. I want to use it to update my prior in light of new data $a^*$:

$$P(\theta)\bigg|_\mathrm{new} = P(\theta|a^*) = \frac{P(a^*|\theta)}{P(a^*)} P(\theta)$$

Since the sum over random points would appear in $P(a^*)$, and after the first iteration also in $P(\theta)$, this would quickly explode computationally. I'm sure there are well-known techniques for representing $P(\theta)$ and performing the integration, and it would be great if someone could point me toward them.
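For concreteness, here is a minimal sketch of the discretized version of this update that I have in mind, assuming Gaussian hypotheses $P(a|\mu,\sigma)$ on a flat grid prior (the grid ranges and the data values are made up for illustration):

```python
import numpy as np

# Grid of hypotheses (mu, sigma); each new datum a* reweights the grid
# via Bayes' rule, so no closed-form posterior is needed.
mus = np.linspace(-5.0, 5.0, 101)
sigmas = np.linspace(0.1, 3.0, 60)
MU, SIG = np.meshgrid(mus, sigmas, indexing="ij")
prior = np.full(MU.shape, 1.0 / MU.size)        # flat P(theta) to start

def update(prior, a_star):
    # Likelihood P(a* | mu, sigma) at every grid point (Gaussian density).
    lik = np.exp(-0.5 * ((a_star - MU) / SIG) ** 2) / (SIG * np.sqrt(2 * np.pi))
    post = lik * prior
    return post / post.sum()                    # divide by P(a*) (grid sum)

for a in [0.3, -0.1, 0.5, 0.2]:
    prior = update(prior, a)

# Posterior mean of mu, which drifts toward the sample mean of the data.
print((MU * prior).sum())
```

This is exactly the approach whose accuracy and scaling with dimension worry me.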

jdm

1 Answer

It turns out you've stumbled on one of the fundamentally hard, important, and well-studied questions of Bayesian statistics. I know of two good options.

The first is to use a conjugate prior. A prior is conjugate to the distribution from which your data are drawn when it comes from a family of distributions such that the posterior is guaranteed to be in that same family. For instance, the Beta distribution is conjugate to the Bernoulli, so if you use a Beta prior on Bernoulli data, you'll end up with an easily updated Beta posterior.
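For the Beta–Bernoulli pair, the update is just bookkeeping on the two Beta parameters; a minimal sketch (the data values are made up):

```python
# Beta(alpha, beta) prior on the Bernoulli success probability.
alpha, beta = 1.0, 1.0          # Beta(1, 1) is the flat prior

data = [1, 0, 1, 1, 0, 1]       # observed Bernoulli draws

# Conjugate update: each success adds 1 to alpha, each failure to beta.
for a in data:
    if a == 1:
        alpha += 1
    else:
        beta += 1

posterior_mean = alpha / (alpha + beta)
print(alpha, beta, posterior_mean)  # Beta(5, 3) posterior, mean 0.625
```

The catch is that conjugate pairs only exist for certain likelihood families, so this option depends on what you assume generated your data.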

The second is known as Markov chain Monte Carlo, or MCMC. MCMC methods allow you to sample approximately from a distribution when you know only the relative probability (or density, in the continuous case) of events, not the absolute probability; that is, you have a non-normalized PMF or PDF. This is useful because it lets you avoid the often-difficult normalizing integral in the denominator of the Bayes' rule update. (If that last sentence doesn't make sense, let me know and I'll explain.) There are packages that allow you to set up generative models and approximately sample from posterior distributions, such as PyMC for Python.
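To illustrate, here is a bare-bones random-walk Metropolis sampler (one of the simplest MCMC methods) targeting an unnormalized log posterior; the data, proposal width, and burn-in length are all made up for illustration:

```python
import math
import random

random.seed(0)

# Data assumed drawn from Normal(mu, 1); we want samples of mu from the
# posterior under a flat prior, using only the *unnormalized* density.
data = [1.8, 2.1, 2.4, 1.9, 2.2]

def log_unnorm_posterior(mu):
    # Log likelihood up to an additive constant; the flat prior adds nothing.
    return -0.5 * sum((x - mu) ** 2 for x in data)

mu, samples = 0.0, []
for step in range(20000):
    proposal = mu + random.gauss(0.0, 0.5)          # random-walk proposal
    log_accept = log_unnorm_posterior(proposal) - log_unnorm_posterior(mu)
    if random.random() < math.exp(min(0.0, log_accept)):  # Metropolis rule
        mu = proposal
    if step >= 2000:                                # discard burn-in
        samples.append(mu)

print(sum(samples) / len(samples))  # close to 2.08, the data mean
```

Note that the acceptance ratio only involves the unnormalized density, so $P(a^*)$ never has to be computed, which is exactly what sidesteps the explosion you describe.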

jlimahaverford
  • Just as an addendum for OP, IMO [this](http://videolectures.net/mlss09uk_murray_mcmc/) is an excellent resource for learning about approximate inference via MCMC. – jtobin Aug 28 '15 at 13:28
  • Just a small correction. MCMC doesn't give you approximate samples. It gives you exact samples (after a possibly small burn-in time). And there is a third option, which is approximate inference methods (VB-EM, expectation propagation, and others). – Guillaume Dehaene Aug 28 '15 at 17:22
  • Thank you, but I think the samples are (almost surely) approximate. Burning in $n$ samples is equivalent to sampling from the distribution $A^n x$, where $x$ is a delta vector denoting your starting point and $A$ is the model's transition matrix (operator). In the limit, multiplication by this matrix is projection onto the eigenvector with eigenvalue 1, but before this limit it is not, unless $x$ is drawn from this eigenspace originally, which for a delta vector means that $x$ can only transition to itself. More importantly, in many cases it is difficult to know how long to burn in. – jlimahaverford Aug 28 '15 at 17:33