
I'm investigating a method for automatic checking of Markov chain Monte Carlo methods, and I would like some examples of mistakes that can occur when constructing or implementing such algorithms. Bonus points if the incorrect method was used in a published paper.

I'm particularly interested in cases where the error means that the chain has the incorrect invariant distribution, though other types of errors (e.g. chain not ergodic) would be of interest as well.

An example of such an error would be failing to output a value when Metropolis-Hastings rejects a proposed move.
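To make that error concrete, here is a minimal Python sketch (purely illustrative, not from any particular paper; all names are my own): on rejection the chain must output the current state again, not silently skip it, otherwise accepted states are over-represented.

```python
import math
import random

def mh_normal(n_samples, keep_rejections, seed=1):
    """Random-walk Metropolis-Hastings targeting a standard normal."""
    rng = random.Random(seed)
    log_target = lambda x: -0.5 * x * x
    x, out = 0.0, []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, 1.0)
        if math.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop               # accept: move to the proposal
            out.append(x)
        elif keep_rejections:
            out.append(x)          # reject: output the CURRENT state again
        # buggy variant: a rejection produces no output at all
    return out

good = mh_normal(10000, keep_rejections=True)
bad = mh_normal(10000, keep_rejections=False)
print(len(good), len(bad))  # the buggy chain silently drops samples
```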

Simon Byrne
  • One of my favorite examples is the [Harmonic mean estimator](http://www.stat.wisc.edu/~newton/papers/publications/RafteryFinal.pdf) because it has nice asymptotic properties but it fails to work in practice. [Radford Neal](http://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/) discusses this in his blog: "The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe". This method has been widely implemented in applications. –  Jun 19 '12 at 11:39
  • [Another one](http://www.cs.utoronto.ca/~radford/chib-letter.html) courtesy of Prof. Neal. – Cyan Jun 19 '12 at 17:02
  • @Cyan For Neal to be taken seriously I think he should have found a journal that would accept his article rather than just submit it on the internet. I can easily believe that he is right and the referees and author are incorrect. Although it is difficult to get papers published that contradict published results and the JASA rejection is discouraging, I think he should have tried several other journals until he succeeded. You need an impartial and independent referee to add credibility to your findings. – Michael R. Chernick Jun 19 '12 at 23:19
  • One should always take Prof. Neal seriously! ;o) Seriously it is a shame that results like this are difficult to get published, and unfortunately modern academic culture doesn't seem to value that sort of thing, so it is understandable if it isn't a high priority activity for him. Interesting question, I am very interested in the answers. – Dikran Marsupial Jun 21 '12 at 16:18
  • On [Darren Wilkinson's blog](http://darrenjw.wordpress.com/2012/06/04/metropolis-hastings-mcmc-when-the-proposal-and-target-have-differing-support/) you have a very well described example of something that goes wrong and how to fix it. Quite a nice piece of work. – gui11aume Jun 21 '12 at 16:33
  • @Michael: *You need an [impartial] and independent referee to add credibility to your findings.* If only there were more of those in the world! :) (There are plenty of referees.) I'm not sure Prof. Neal needs too much independent refereeing to add to his credibility, especially regarding this topic. He *is* unabashedly opinionated. (Also, it seems *JASA* was the most reasonable, and maybe, politically, the *only* venue that such a piece should appear in, given it was a comment on an article that appeared there.) – cardinal Jun 21 '12 at 16:36
  • @cardinal I see why it is most appropriate as a letter to JASA since it critiques a particular paper. But I do think that given that the paper was rejected by JASA there is reason to wonder about the credibility of the article, and a published paper would carry more weight than an article submitted on the internet without peer review. – Michael R. Chernick Jun 21 '12 at 17:19
  • @Michael: Perhaps. Having been on all sides of similar situations, including in Prof. Neal's position, on many occasions, my anecdotal observations are that paper rejection carries very, very little information content in most cases, as do many acceptances. Peer review is orders of magnitude more noisy than people care to admit and, often, as may be the case here, there are *partial* and *interested* (i.e., not independent) parties and interests at play. That said, I didn't intend my original comment to take us so far afield of the topic at hand. Thanks for sharing your thoughts on the matter. – cardinal Jun 21 '12 at 17:38
  • @cardinal I have many times been on both sides of the refereeing process also as author, referee and AE. Referees and editors can certainly be biased in comments and decisions as most humans are prone to be. When the article has controversy attached it is helpful to have an impartial peer supporting your position. – Michael R. Chernick Jun 21 '12 at 18:03
  • This appears to be a good question, but the question could be more useful (to non-experts) if it were revised for clarity. For example, the terms "ergodic" and "invariant" are not clear; "ergodic" is not mentioned in [Gelman et al BDA](http://goo.gl/Bk8wd) although it appears frequently in the more recent [Handbook of MCMC](http://goo.gl/M0rdE). Could you provide a definition and perhaps a visual example or description of how you would identify or test for an ergodic chain or invariant distribution? – David LeBauer Jun 21 '12 at 22:06
  • I am also confused by "failing to output a value when MH rejects a proposed move". Doesn't this mean that the algorithm was implemented incorrectly? Similarly, if it took the proposed value with the wrong probability, the implementation would be wrong, and it wouldn't be MH. My understanding now is "errors in implementation of MCMC algorithm" as opposed to just checking that assumptions related to convergence, correlation, etc are met. So, to be clear, you are looking for incorrect implementations. Would a valid answer be the (potentially infinite) set of incorrectly implemented algorithms? – David LeBauer Jun 21 '12 at 22:21
  • Neal's marginal likelihood example and Wilkinson's MH example are the sort of thing I had in mind (i.e. algorithms that *seem* correct, but aren't). Do you want to make them in to answers? – Simon Byrne Jun 22 '12 at 09:43
  • @David: Yes, the algorithm would be incorrect, that was the point. But it is not an uncommon error for people to make when first learning MCMC methods (due to its apparent similarity to rejection sampling). – Simon Byrne Jun 22 '12 at 09:48
  • The harmonic mean example is great as well (although the example Neal gives doesn't use MCMC, it would probably be used for more complex problems). – Simon Byrne Jun 22 '12 at 09:54
  • @Cyan: I do not think the criticism of Sid Chib's method by Radford Neal qualifies as an error in an MCMC algorithm. The whole issue of label switching is a problem with a very slowly converging MCMC algorithm, see e.g. [Berkhof, van Mechelen and Gelman (2002)](http://www.stat.columbia.edu/~gelman/research/published/A13n210.pdf). – Xi'an Nov 15 '12 at 06:22

3 Answers


Darren Wilkinson on his blog gives a detailed example of a common mistake in random walk Metropolis-Hastings. I recommend reading it in full, but here is the tl;dr version.

If the target distribution has positive support in one dimension (as for Gamma distributions and the like), it is tempting to reject straight away any proposal that takes a negative value in that dimension. The mistake is to discard these proposals as if they had never happened and to evaluate the Metropolis-Hastings (MH) acceptance ratio only for the remaining ones. This is wrong because it amounts to using a non-symmetric proposal density.

The author suggests applying one of two fixes.

  1. Count the "negative" proposals as rejections, outputting the current value again (and lose a bit of efficiency).

  2. Use the correct MH ratio in that case, which is

$$ \frac{\pi(x^*)}{\pi(x)} \frac{\Phi(x)}{\Phi(x^*)}, $$

where $\pi$ is the target density and $\Phi$ is the normalization constant of the truncated random walk proposal $\phi$, i.e. $\Phi(x) = \int_0^{\infty} \phi(y-x)dy$.
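As an illustration of fix 2 (a Python sketch of my own, not Wilkinson's code), take an Exponential(1) target with a Gaussian random walk truncated to the positive half-line; here $\Phi(x)$ reduces to the standard normal CDF evaluated at $x/\sigma$.

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_mh(n, sigma=1.0, seed=42):
    """MH for an Exp(1) target with a Gaussian random walk truncated to (0, inf).

    The proposal density from x is phi(y - x) / Phi(x), where
    Phi(x) = P(N(x, sigma^2) > 0), so the MH ratio becomes
    pi(x*) / pi(x) * Phi(x) / Phi(x*).
    """
    rng = random.Random(seed)
    log_pi = lambda x: -x                  # Exp(1) density, up to a constant
    Phi = lambda x: norm_cdf(x / sigma)    # normalising constant of the proposal
    x, out = 1.0, []
    for _ in range(n):
        prop = x + rng.gauss(0.0, sigma)
        while prop <= 0.0:                 # draw from the truncated proposal
            prop = x + rng.gauss(0.0, sigma)
        log_ratio = (log_pi(prop) - log_pi(x)
                     + math.log(Phi(x)) - math.log(Phi(prop)))
        if math.log(rng.random()) < log_ratio:
            x = prop
        out.append(x)
    return out

samples = truncated_mh(20000)
print(sum(samples) / len(samples))  # should be close to E[Exp(1)] = 1
```

Dropping the $\Phi(x)/\Phi(x^*)$ correction from `log_ratio` reproduces the bug described above: the chain then over-weights the region near zero.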

gui11aume
  • +1 Interesting example. I was also thinking about other issues with MH related to the acceptance rate. I think the [0.234](http://faculty.wcas.northwestern.edu/~lchrist/course/Gerzensee_2011/The%20Annals%20of%20Applied%20Probability%201997%20Roberts.pdf) optimal rate has been overused. –  Jun 22 '12 at 13:44
  • @Procrastinator you know the MCMC literature very well. Is this your domain of expertise? – gui11aume Jun 22 '12 at 14:30
  • Thanks for your comment. I like Bayesian statistics, then I need to carry the MCMC cross ;). –  Jun 22 '12 at 14:34

1. Marginal Likelihood and Harmonic mean estimator

The marginal likelihood is defined as the normalising constant of the posterior distribution

$$p({\bf x})=\int_{\Theta}p({\bf x}\vert\theta)p(\theta)d\theta.$$

The importance of this quantity comes from the role it plays in model comparison via Bayes factors.

Several methods have been proposed for approximating this quantity. Raftery et al. (2007) propose the Harmonic mean estimator, which quickly became popular due to its simplicity. The idea consists of using the relation

$$\dfrac{1}{p({\bf x})}=\int_{\Theta}\dfrac{p(\theta\vert{\bf x})}{p({\bf x}\vert\theta)}d\theta.$$

Therefore, if we have a sample from the posterior, say $(\theta_1,...,\theta_N)$, this quantity can be approximated by

$$\dfrac{1}{p({\bf x})}\approx\dfrac{1}{N}\sum_{j=1}^N \dfrac{1}{p({\bf x}\vert\theta_j)}.$$

This approximation is related to the concept of Importance Sampling.

By the law of large numbers, this estimator is consistent, as discussed in Neal's blog. The problem is that the $N$ required for a good approximation can be huge. See Neal's blog or Robert's blog 1, 2, 3, 4 for some examples.
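A tiny Python experiment (a toy model of my own, not from the papers cited above) exhibits the failure: one observation $x=0$ from $N(\theta,1)$ with a diffuse $N(0,100^2)$ prior. The posterior is available in closed form, so we can draw exact posterior samples; any error is the estimator's own.

```python
import math
import random

random.seed(0)

s = 100.0   # prior standard deviation (diffuse)
x = 0.0     # single observation from N(theta, 1)

# True marginal likelihood: x ~ N(0, 1 + s^2)
true_ml = math.exp(-0.5 * x * x / (1.0 + s * s)) / math.sqrt(2 * math.pi * (1.0 + s * s))

# Exact posterior: N(x * s^2 / (s^2 + 1), s^2 / (s^2 + 1))
post_mean = x * s * s / (s * s + 1.0)
post_sd = math.sqrt(s * s / (s * s + 1.0))

# Harmonic mean of the likelihood over exact posterior draws
N = 10000
inv_lik = []
for _ in range(N):
    theta = random.gauss(post_mean, post_sd)
    lik = math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)
    inv_lik.append(1.0 / lik)

harmonic_mean_est = N / sum(inv_lik)
print(true_ml, harmonic_mean_est)
```

The estimator overshoots the true value badly: $1/p({\bf x}\vert\theta)$ has infinite variance under the posterior here, because the posterior almost never visits the low-likelihood regions that dominate the integral.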

Alternatives

There are many alternatives for approximating $p({\bf x})$. Chopin and Robert (2008) present some Importance sampling-based methods.

2. Not running your MCMC sampler long enough (especially in the presence of multimodality)

Mendoza and Gutierrez-Peña (1999) derive the reference prior/posterior for the ratio of two normal means and present an example of the inferences obtained with this model on a real data set. Using MCMC methods, they obtain a sample of size $2000$ from the posterior of the ratio of means $\varphi$, which is shown below.

[Figure: histogram of the MCMC sample from the posterior of $\varphi$]

From this sample they obtain the HPD interval $(0.63, 5.29)$ for $\varphi$. However, an inspection of the expression of the posterior distribution shows that it has a singularity at $0$, so the posterior should actually look like this (note the singularity at $0$):

[Figure: the true posterior density of $\varphi$, with a singularity at $0$]

This can only be detected by running the MCMC sampler long enough or by using an adaptive method. The HPD interval obtained with one of these methods is $(0,7.25)$, as has already been reported. Its length is significantly larger, which has important implications when it is compared with frequentist/classical intervals.
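Issue 2 can be reproduced with a few lines of Python (a generic bimodal toy of my own, not the Mendoza and Gutierrez-Peña model): a short random-walk chain with a small step size, started in one mode of an equal mixture of $N(-5,1)$ and $N(5,1)$, never visits the other mode, and every summary computed from it looks deceptively clean.

```python
import math
import random

def log_target(x):
    """Equal mixture of N(-5, 1) and N(5, 1), up to a constant."""
    a = math.exp(-0.5 * (x + 5.0) ** 2)
    b = math.exp(-0.5 * (x - 5.0) ** 2)
    return math.log(0.5 * a + 0.5 * b)

rng = random.Random(3)
x, chain = -5.0, []
for _ in range(2000):                 # a short run with a small step size
    prop = x + rng.gauss(0.0, 0.5)
    if math.log(rng.random()) < log_target(prop) - log_target(x):
        x = prop
    chain.append(x)

mean = sum(chain) / len(chain)
print(mean, max(chain))  # mean sits near -5, not the true mean 0
```

Standard convergence diagnostics on this single chain would look fine; only a much longer run, multiple dispersed starting points, or an adaptive method reveals the missing mode.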

3. Some other issues, such as assessing convergence, the choice of starting values, and poor behaviour of the chain, are covered in this discussion by Gelman, Carlin and Neal.

4. Importance Sampling

A method for approximating an integral consists of multiplying and dividing the integrand by a density $g$, with the same support, that we can simulate from

$$I=\int f(x)dx = \int \dfrac{f(x)}{g(x)}g(x)dx.$$

Then, if we have a sample from $g$, $(x_1,...,x_N)$, we can approximate $I$ as follows

$$I\approx \dfrac{1}{N}\sum_{j=1}^N \dfrac{f(x_j)}{g(x_j)}.$$

A possible issue is that the tails of $g$ should be at least as heavy as those of $f$; otherwise the $N$ required for a good approximation can be huge. See the following toy example in R.

# Integrating a Student's t density with 1 d.f. (the integral is 1).
# First, a normal importance function: its tails are much lighter than
# the target's, so the estimate is poor even with N = 10,000,000 draws.
x1 = rnorm(10000000)
mean(dt(x1, df = 1) / dnorm(x1))

# Now a Student's t with 2 d.f. importance function: its tails are
# heavier than the target's, so N = 1,000 already works well.
x2 = rt(1000, df = 2)
mean(dt(x2, df = 1) / dt(x2, df = 2))

A very clear case (connected with the marginal likelihood approximation mentioned in the first answer) where true convergence matters is the problem of label switching in mixture models, coupled with the use of Chib's (1995) estimator. As pointed out by Radford Neal (1999), if the MCMC chain does not converge properly, in the sense that it fails to explore some of the modes of the target distribution, Chib's Monte Carlo approximation does not reach the right numerical value.

Xi'an