
A typical introduction to the Metropolis--Hastings algorithm, and hence to Markov chain Monte Carlo techniques in general, starts with the following assumptions on some probability distribution $P(x)$ which we'd like to sample from.

  • We are perfectly able to evaluate $P(x)$, or perhaps more accurately some multiple $C \cdot P(x)$ for some fixed $C$, though I don't quite see why you wouldn't just normalise and put $C = 1$.
  • However, it is very hard to sample from $P(x)$; so hard, that we'd rather sample from some other distribution thousands of times for the sake of getting a sample of $P(x)$ via Markov chain methods.

I'd like to see an actual, real-life example in which such a scenario comes up. How can it be easy to evaluate $P(x)$ but hard to sample from it? I'd be interested in seeing a real-life situation involving say weather, diseases, neurons, air flow, whatever it is, that leads to a situation in which MCMC methods are a natural thing to ask for.

3 Answers


The reason it is such a useful method comes down to Bayes' formula:

$$p(x \vert \text{data}) = \frac{p(\text{data} \vert x) p(x)}{p(\text{data})}$$

Typically, we call $p(\text{data} \vert x)$ the likelihood function and $p(x)$ the prior distribution, and in your notation $C = p(\text{data})$ is the marginal likelihood of the data. As statisticians or machine learners or whatever, we want to know what values $x$ can reasonably take under this model formulation. If you have a model of the above form and you want to estimate the mean of $x$, the variance of $x$, or some other function of $x$, you need to be able to integrate with respect to $p(x \vert \text{data})$. The problem is that for nearly all choices of likelihood and prior, the product is not a known distribution, and the required integrals are infeasible to compute numerically once $x$ has more than a few dimensions.
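To make this concrete, here is a minimal sketch (my own toy example with made-up data and a deliberately non-conjugate prior, not something from the text above) of a posterior whose unnormalised density is trivial to evaluate pointwise, while the normalising constant $p(\text{data})$ has no closed form:

```python
import numpy as np
from scipy import stats

y = np.array([1.2, 0.8, 1.9, 0.4, 1.1])   # made-up observations

def unnormalised_posterior(x):
    """C * p(x | data): normal likelihood times a non-conjugate logistic prior."""
    likelihood = np.prod(stats.norm.pdf(y, loc=x, scale=1.0))
    prior = stats.logistic.pdf(x, loc=0.0, scale=1.0)
    return likelihood * prior

# Pointwise evaluation is trivial...
print(unnormalised_posterior(1.0))
# ...but p(data) = integral of this function over x has no closed form here.
# In one dimension you could still integrate numerically; with dozens of
# parameters that quadrature becomes hopeless, which is where MCMC comes in.
```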

For interesting applications of MCMC to (Bayesian) models, you could look at

Forgottenscience

First, let me summarize the principle. Let $\boldsymbol{y}$ be the data (the sample) and let $p(\boldsymbol{y}|\boldsymbol{\theta})$ be the distribution of $\boldsymbol{y}$, parameterized by $\boldsymbol{\theta}$. Take a prior $p(\boldsymbol{\theta})$ for $\boldsymbol{\theta}$. After observing the sample, whose joint distribution $p(\boldsymbol{y}|\boldsymbol{\theta})$, viewed as a function of $\boldsymbol{\theta}$, is also denoted $L(\boldsymbol{\theta}|\boldsymbol{y})$, the knowledge about $\boldsymbol{\theta}$ is updated using Bayes' theorem: \begin{equation} p(\boldsymbol{\theta}|\boldsymbol{y})=\dfrac{L(\boldsymbol{\theta}|\boldsymbol{y})\,p(\boldsymbol{\theta})}{p(\boldsymbol{y})}, \end{equation} where $p(\boldsymbol{y})=\int{L(\boldsymbol{\theta}|\boldsymbol{y})\,p(\boldsymbol{\theta})}\,d\boldsymbol{\theta}$.

Now, take an example. Suppose we have $y\sim N(\mu, 1)$ and take the prior for $\mu$ to be $N(0, 10^6)$. This is a conjugate prior, so the posterior distribution for $\mu$ is also normal, with an analytical form. Bingo!!!
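For concreteness, here is the standard conjugate result, assuming $n$ i.i.d. observations $y_1,\dots,y_n$ from $N(\mu,1)$ rather than a single $y$: the prior precision is $10^{-6}$ and each observation contributes precision $1$, so \begin{equation*} \mu \,|\, \boldsymbol{y} \sim N\!\left(\dfrac{n\bar{y}}{n+10^{-6}},\ \dfrac{1}{n+10^{-6}}\right). \end{equation*} Both parameters are available in closed form, so we can sample from this posterior directly, with no MCMC needed.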

What happens if we take another prior, such as a logistic distribution? That is not a conjugate prior.

In general, the problem lies in the denominator, which is typically not tractable, i.e., there is no closed form for $p(\boldsymbol{y})=\int{L(\boldsymbol{\theta}|\boldsymbol{y})\,p(\boldsymbol{\theta})}\,d\boldsymbol{\theta}$.

In other words, we only know $L(\boldsymbol{\theta}|\boldsymbol{y})\,p(\boldsymbol{\theta})$, i.e., we only know the posterior $p(\boldsymbol{\theta}|\boldsymbol{y})$ up to a multiplicative constant.

In summary:

  1. If we know $p(\boldsymbol{\theta}|\boldsymbol{y})$ analytically, for example when it is a common distribution, just sample from it directly. (Of course we can use MCMC too.)

  2. If we only know $p(\boldsymbol{\theta}|\boldsymbol{y})$ up to a multiplicative constant, we cannot sample from it directly (since we do not know the full density).

The MH algorithm is a solution, for the following reason.

Recall that the acceptance probability is calculated as \begin{equation*} \alpha(\boldsymbol{\theta}^k,\boldsymbol{\theta}^{'})=\text{min}\left(\dfrac{p(\boldsymbol{\theta}^{'}|\mathbf{y})q(\boldsymbol{\theta}^k|\boldsymbol{\theta}^{'})}{p(\boldsymbol{\theta}^k|\mathbf{y})q(\boldsymbol{\theta}^{'}|\boldsymbol{\theta}^k)},1\right). \end{equation*}

It is important to note that although $p(\boldsymbol{\theta}|\mathbf{y})$ is known only up to the multiplicative constant $1/p(\boldsymbol{y})$, this constant cancels in the ratio $\dfrac{p(\boldsymbol{\theta}^{'}|\mathbf{y})}{p(\boldsymbol{\theta}^k|\mathbf{y})}$, so the acceptance probability is well defined and computable without ever knowing $p(\boldsymbol{y})$.
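As an illustration of that cancellation, here is a minimal random-walk Metropolis sketch (a toy model of my own, not from this answer: normal likelihood, logistic prior, made-up data). It only ever evaluates $L(\boldsymbol{\theta}|\boldsymbol{y})\,p(\boldsymbol{\theta})$, so $p(\boldsymbol{y})$ never appears; because the proposal is symmetric, the $q$ terms cancel as well:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = np.array([1.2, 0.8, 1.9, 0.4, 1.1])   # made-up observations

def log_unnorm_posterior(theta):
    """log L(theta | y) + log p(theta); the constant p(y) is deliberately absent."""
    log_lik = np.sum(stats.norm.logpdf(y, loc=theta, scale=1.0))
    log_prior = stats.logistic.logpdf(theta, loc=0.0, scale=1.0)
    return log_lik + log_prior

n_iter, step = 10_000, 0.8
theta = 0.0
samples = np.empty(n_iter)
for k in range(n_iter):
    proposal = theta + step * rng.standard_normal()   # symmetric proposal: q-terms cancel
    log_alpha = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
    if np.log(rng.uniform()) < log_alpha:             # accept with probability min(1, ratio)
        theta = proposal
    samples[k] = theta

# Posterior summaries from the chain, discarding a burn-in period
print(samples[2000:].mean(), samples[2000:].std())
```

Working on the log scale avoids underflow when the likelihood is a product of many small densities; the ratio in the acceptance probability then becomes a difference of log unnormalised posteriors.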

TrungDung
  • Dear @TDT, why can't we just normalise $p(\mathbf{y})$ to deduce that multiplicative constant? My intuition might be leading me astray, but I'm finding it difficult to see how that would be more resource-intensive than the calculations involved in MCMC algorithms. – Mr. Palomar Nov 19 '20 at 10:50
  • @Mr.Palomar: In numerical analysis only a handful of integral types can be done analytically; for the rest we have to use numerical approximation. Suppose you tried to do that for $p(y)$: is it easy, and would the result resemble a common distribution that we can easily sample from? In general the MH algorithm is a solution that covers many cases (I do not say a universal solution, but something like that); for a particular problem we may have a better one. – TrungDung Nov 19 '20 at 10:58

The most common situation I'm familiar with is fitting some kind of Bayesian (regression) model to data. It's usually very easy to write down the likelihood (= the sampling distribution of the data for given parameter values), as well as some prior distributions for the model parameters. The posterior distribution is proportional to their product, but it is usually not possible to get a closed-form solution for the denominator (the denominator requires integrating out all model parameters, of which you may have dozens or hundreds).

Thus, outside some very specific circumstances, Bayesian models have no closed-form posterior distribution that one could summarize analytically. However, MCMC sampling is usually very feasible. The homepage of one popular MCMC sampler (Stan) has a list of case studies that illustrate use cases of MCMC sampling. Even for something simple like the baseball batting rates example, try getting the results they produce without MCMC sampling; I assure you it will be tough. I'm not 100% sure it isn't possible in that particular case, but for the more complex examples it definitely will be impossible.
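As a rough sketch of why that denominator is hopeless (my own illustration with made-up data, not one of the Stan case studies): the unnormalised log-posterior of a Bayesian logistic regression with, say, 50 coefficients takes a few lines to evaluate pointwise, yet $p(\boldsymbol{y})$ would be a 50-dimensional integral over all of them at once.

```python
import numpy as np
from scipy import stats
from scipy.special import expit

rng = np.random.default_rng(1)
n, p = 200, 50                       # 50 coefficients -> a 50-dimensional integral
X = rng.standard_normal((n, p))      # made-up covariates
y = rng.integers(0, 2, size=n)       # made-up binary outcomes

def log_unnorm_posterior(beta):
    """Log-likelihood of the logistic model plus independent N(0, 1) priors."""
    probs = expit(X @ beta)
    log_lik = np.sum(y * np.log(probs) + (1 - y) * np.log1p(-probs))
    log_prior = np.sum(stats.norm.logpdf(beta, loc=0.0, scale=1.0))
    return log_lik + log_prior

# Evaluating the unnormalised posterior at a point is cheap...
print(log_unnorm_posterior(np.zeros(p)))
# ...but any posterior summary (means, intervals) involves integrals over all
# 50 coefficients at once, which is exactly what MCMC sidesteps.
```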

Björn