
Clarifying the Metropolis-Hastings Algorithm:

1) Metropolis-Hastings:

In Bayes' Law: P(theta | Data) = [P(Data | theta) * P(theta)] / P(Data)

In the continuous case:

P(Data) = integral of [ P(Data | theta) * P(theta) ] over all values of theta

Suppose we say "integral of [ P(Data | theta) * P(theta) ]" = Z.

Z is some constant. Then:

P(theta | Data) = [P(Data | theta) * P(theta)] / Z

For initialization, take a random sample from P(theta | Data). This random sample is divided by Z (as a formality). Call this sample x-i.

Now, define a "candidate distribution" g - e.g. g is a normal distribution with a mean of 0 and variance of 1000. Sample a number from g; call this sample x-star.

Then: A(x-i, x-star) = min[ 1, (P(Data | theta = x-star) * P(theta = x-star) / Z) / (P(Data | theta = x-i) * P(theta = x-i) / Z) ]

(notice that "Z" gets cancelled out)

Now, define "U" as a uniform distribution between 0 and 1. Take a sample from "U".

Conclusion : IF U < A( x-i, x-star) THEN x-i-new = x-star ELSE x-i-new = x-i

Go to the next iteration: replace x-i with x-i-new, take a new value of x-star, and repeat everything until convergence.
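For concreteness, here is a minimal Python sketch of this loop (the data, prior, and tuning constants below are made up purely for illustration). Note one deviation from the description above: instead of drawing x-star from a fixed candidate distribution g, the sketch centers the proposal on the current value (the usual random-walk choice, also discussed in the answer below); with a fixed g, the acceptance ratio would additionally need the density of g.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (hypothetical): Normal(theta, 1) likelihood for the data,
# Normal(0, 10^2) prior on theta.
data = rng.normal(loc=3.0, scale=1.0, size=50)

def log_unnormalized_posterior(theta):
    # log[P(Data | theta) * P(theta)] -- the constant Z is never needed
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    log_prior = -0.5 * theta**2 / 10.0**2
    return log_lik + log_prior

n_iter = 10_000
proposal_sd = 0.5      # tuning parameter of the proposal
theta = 0.0            # initialization: some point in parameter space
samples = np.empty(n_iter)

for i in range(n_iter):
    # Proposal centered on the current value (symmetric random walk)
    theta_star = theta + rng.normal(scale=proposal_sd)
    # log A(theta, theta_star): Z cancels, only the unnormalized posterior appears
    log_A = min(0.0, log_unnormalized_posterior(theta_star)
                     - log_unnormalized_posterior(theta))
    if np.log(rng.uniform()) < log_A:   # accept with probability A
        theta = theta_star
    samples[i] = theta                  # the chain records the current value either way

# Discard warmup draws and summarize
print(samples[2000:].mean(), samples[2000:].std())
```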

My questions:

  • Why is the "candidate distribution" (g) chosen as a normal distribution with a large variance?

  • I have heard that the Metropolis-Hastings algorithm is considered an improvement over previous sampling techniques because it can "exploit" high-probability/high-density regions of the posterior distribution. Is this true?

  • Just to clarify: does the Markov Chain in this algorithm refer to the process of accepting or rejecting the newly sampled point? Can acceptance and rejection be considered the two states of this Markov Chain?

  • In this video : https://www.youtube.com/watch?v=a_08GKWHFWo , the author explains that there might be situations where you cannot use the Metropolis-Hastings algorithm to sample from the joint distribution, and you might have to use the Gibbs Sampler. I didn't quite understand this point. If I understood it correctly, the answer over here ("When would one use Gibbs sampling instead of Metropolis-Hastings?") suggests that the Metropolis-Hastings algorithm can be used for any problem where the Gibbs Sampler can be used; however, the Gibbs Sampler can be faster in lower-dimensional problems and in problems where the conditional distributions can be easily obtained through conjugate priors. Can someone please help me understand this?

Many thanks!

stats_noob

1 Answer


Previous questions have already asked about the details of the algorithm, e.g. this one, and Wikipedia also describes it decently. Where your understanding is not quite right:

  • If we knew $Z$, this would all be a lot easier. We typically use the Metropolis-Hastings algorithm when we do not know $Z$.
  • "For initialization," we cannot "take a random sample from : $P(\theta | \text{data} )$." If we knew how to do this, we would just keep drawing samples from this distribution (instead of using the Metropolis-Hastings algorithm to do it). Thus, for initialization, we typically take some random point in parameter space that makes sense (e.g. randomly picked in some sense in the middle of the parameter space). Even if the distribution of interest lies somewhere else, we hope the MH algorithm will eventually get there. To check this does not go wrong/have too much of an effect, it's good (not just for that reason) to have multiple chains started with different initializations.
  • We typically center the proposal distribution (which you call $g$) on the current parameter value $\theta_i$ to propose a new $\theta_{i+1}^*$. I.e. it is often a multivariate normal (because $\theta$ is usually not one-dimensional, but rather a vector) centered at the current $\theta_i$. The covariance matrix (i.e. both the standard deviation for each dimension and the correlations between proposals across dimensions) usually gets tuned during some initial warmup steps; some initial hundreds or thousands of samples are used only for that purpose and then thrown away. Ideally, our proposal distribution would locally resemble the shape of the target distribution we are trying to sample from. That is hard to achieve in some cases, but a multivariate normal is, after some tuning, at least capable of approximating various situations decently.
  • We initially have to start with some distribution, and one way to go is to make it a normal with a large variance. You are aiming for roughly the right acceptance rate for samples (about 23% for various theoretical reasons). So, if your variance is very large and you get hardly any acceptances, you can start decreasing the variance to see whether that gets better. I.e. the multivariate normal with a large variance is just a starting point.
  • Making proposals with a normal distribution centered on the previous value also satisfies (some of) the requirements for the Markov chain to converge to the posterior of interest (it is of course not guaranteed that it will).
  • Because we, in general, do not know $Z$, $A(\theta_i, \theta_{i+1}^*)$ cannot involve $Z$. That is exactly why the MH algorithm is so useful: it works without $Z$. What we do is look at $A(\theta_i, \theta_{i+1}^*) = \min\left(1, \frac{\text{likelihood}(\text{data}|\theta_{i+1}^*) \times \text{prior}(\theta_{i+1}^*)}{\text{likelihood}(\text{data}|\theta_i) \times \text{prior}(\theta_i)}\right)$ (note the proposed value in the numerator and the current value in the denominator). Then, as you say, we compare a uniform random number $U$ with this: if $A$ is greater than or equal to $U$, we set $\theta_{i+1} = \theta_{i+1}^*$, and $\theta_{i+1} = \theta_{i}$ otherwise.
  • I'm not sure what people used before MH, so I am not sure what it is an improvement over. But, yes, it does some sensible things: it always moves to a proposal with a higher value of $\text{likelihood}(\text{data}|\theta) \times \text{prior}(\theta)$ than the current parameter value, but it will also sometimes move towards lower values (otherwise you would just move towards the maximum-a-posteriori estimate and not explore the full distribution).
  • The sequence of points $\theta_1, \theta_2, \ldots$ is what is referred to as a Markov Chain and each value $\theta_i$ is the $i$th state of the chain.
  • The MH algorithm as described cannot really deal with discrete parameters (though of course you can extend it to do so). Additionally, it will sometimes be pretty inefficient (or in some situations completely unreliable). In some of these situations a Gibbs sampler can be better. Gibbs sampling (when possible) is often more efficient (i.e. better samples in less time) than MH, even though MH remains theoretically possible. Gibbs samplers exploit situations where you know how to sample parts of your parameter vector $\theta$ directly from the conditional posterior, having fixed the remainder of the parameter vector; that usually happens when you work with a setup that uses conjugacy (see the sketch after this list). And there are many interesting hybrid samplers (e.g. Metropolis-Hastings-within-Gibbs) for situations where you can use Gibbs for part of your parameter space but not for the rest.
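To make the conjugacy point concrete, here is a minimal Gibbs sketch (data and hyperparameters are made up for illustration) for a normal model with unknown mean $\mu$ and variance $\sigma^2$. With a normal prior on $\mu$ and an inverse-gamma prior on $\sigma^2$, both full conditionals are available in closed form, so each parameter can be drawn directly in turn and every draw is "accepted":

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): Normal with unknown mean mu and variance sigma2
y = rng.normal(loc=3.0, scale=2.0, size=100)
n, ybar = len(y), y.mean()

# Semi-conjugate priors (made-up hyperparameters):
# mu ~ Normal(m0, s0_sq), sigma2 ~ Inverse-Gamma(a0, b0)
m0, s0_sq = 0.0, 100.0
a0, b0 = 2.0, 2.0

n_iter = 5_000
mu, sigma2 = 0.0, 1.0        # initial values
draws = np.empty((n_iter, 2))

for i in range(n_iter):
    # Full conditional of mu given sigma2 is Normal (by conjugacy)
    prec = 1.0 / s0_sq + n / sigma2
    mean = (m0 / s0_sq + n * ybar / sigma2) / prec
    mu = rng.normal(loc=mean, scale=np.sqrt(1.0 / prec))

    # Full conditional of sigma2 given mu is Inverse-Gamma (by conjugacy);
    # draw via 1 / Gamma(shape=a, rate=b), i.e. scale = 1/b in NumPy
    a = a0 + n / 2.0
    b = b0 + 0.5 * np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b)

    draws[i] = mu, sigma2

# Posterior means after discarding warmup draws
print(draws[1000:].mean(axis=0))
```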
Björn