Clarifying the Metropolis-Hastings Algorithm:
1) Metropolis-Hastings:
By Bayes' theorem: P(theta | Data) = [P(Data | theta) * P(theta)] / P(Data)
In the continuous case:
P(Data) = integral of [ P(Data | theta) * P(theta) ] d theta, over all values of theta
Suppose we call this integral Z.
Z is some constant. Then:
P(theta | Data) = [P(Data | theta) * P(theta)] / Z
For initialization, pick an arbitrary starting value of theta (we cannot sample from P(theta | Data) directly; getting around that is the whole point of the algorithm). Call this value x-i.
Now, define a "candidate distribution" g, e.g. a normal distribution with a mean of 0 and a variance of 1000. Sample a perturbation from g and add it to x-i; call the result x-star. (Equivalently, x-star is drawn from a normal centered at x-i, so g is symmetric.)
Then: A(x-i, x-star) = min[ 1, (P(Data | theta = x-star) * P(theta = x-star) / Z) / (P(Data | theta = x-i) * P(theta = x-i) / Z) ]
(notice that Z cancels out, so the normalizing integral never has to be computed; because g is symmetric, its terms cancel as well)
Now, define "U" as a uniform distribution between 0 and 1. Take a sample from "U".
Conclusion : IF U < A( x-i, x-star) THEN x-i-new = x-star ELSE x-i-new = x-i
Go to the next iteration: replace x-i with x-i-new, draw a new x-star, and repeat everything until convergence.
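To make the steps above concrete, here is a minimal random-walk Metropolis-Hastings sketch in Python. The model is my own toy assumption (data drawn as Normal(theta, 1) with a Normal(0, 10^2) prior on theta), not something from the video, and I work with the log of the unnormalized posterior to avoid numerical underflow, which does not change the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)   # toy data set (my assumption)

def log_unnormalized_posterior(theta):
    """log[ P(Data | theta) * P(theta) ]; Z is never needed."""
    log_likelihood = -0.5 * np.sum((data - theta) ** 2)   # Normal(theta, 1) likelihood
    log_prior = -0.5 * (theta / 10.0) ** 2                # Normal(0, 10^2) prior
    return log_likelihood + log_prior

n_iter = 10_000
proposal_sd = 1.0        # spread of the candidate distribution g (a tuning choice)
x_i = 0.0                # arbitrary starting value (initialization)
samples = np.empty(n_iter)

for t in range(n_iter):
    x_star = x_i + rng.normal(loc=0.0, scale=proposal_sd)   # perturb x-i using g
    # log of A(x-i, x-star); g is symmetric, so its terms cancel
    log_A = min(0.0, log_unnormalized_posterior(x_star)
                   - log_unnormalized_posterior(x_i))
    u = rng.uniform()                 # sample from U(0, 1)
    if np.log(u) < log_A:             # accept: move to x-star
        x_i = x_star
    samples[t] = x_i                  # record current state (unchanged if rejected)

print("posterior mean estimate:", samples[2000:].mean())  # drop burn-in
```

With these toy settings the chain should concentrate near the sample mean of the data; the proposal_sd value trades off step size against acceptance rate, which is part of what my first question below is about.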
My questions:
Why is the "candidate distribution" (g) chosen as a normal distribution with a large variance?
I have heard that the Metropolis-Hastings algorithm is said to be an improvement over earlier sampling techniques because it can "exploit" high-probability/high-density regions of the posterior distribution. Is this true?
Just to clarify: does the Markov chain in this algorithm refer to the process of accepting the newly sampled point or rejecting it? Can accepting and rejecting be considered the two states of this Markov chain?
In this video: https://www.youtube.com/watch?v=a_08GKWHFWo , the author explains that there might be situations where you cannot use the Metropolis-Hastings algorithm to sample from the joint distribution and might have to use the Gibbs sampler instead. I didn't quite understand this point. If I understood it correctly, the answer to "When would one use Gibbs sampling instead of Metropolis-Hastings?" suggests that the Metropolis-Hastings algorithm can be used for any problem where the Gibbs sampler can be used, but that the Gibbs sampler can be faster in lower-dimensional problems and in problems where the full conditional distributions are easily obtained through conjugate priors. Can someone please help me understand this? (A sketch of the "easy conditionals" case I have in mind follows below.)
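For reference, here is my understanding of the "easily obtained conditionals" case as a minimal sketch. The target (a bivariate normal with correlation rho) is my own toy choice, not from the video; for this target each full conditional is a known univariate normal, so Gibbs sampling needs no accept/reject step at all:

```python
import numpy as np

# Gibbs sampling a bivariate normal with zero means, unit variances,
# and correlation rho (toy example, my assumption). Each full conditional
# is a known univariate normal, so every draw is accepted.
rng = np.random.default_rng(0)
rho = 0.8
n_iter = 10_000
x, y = 0.0, 0.0
samples = np.empty((n_iter, 2))

for t in range(n_iter):
    # x | y ~ Normal(rho * y, 1 - rho^2)
    x = rng.normal(loc=rho * y, scale=np.sqrt(1 - rho ** 2))
    # y | x ~ Normal(rho * x, 1 - rho^2)
    y = rng.normal(loc=rho * x, scale=np.sqrt(1 - rho ** 2))
    samples[t] = x, y
```

If I have this right, the point of comparison is that Metropolis-Hastings could also sample this target but would waste some proposals on rejections, whereas Gibbs accepts every draw.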
Many thanks!