13

I am reading a 1991 conference paper by Geyer, cited below. In it he seems to allude to a method that can use MCMC for MLE parameter estimation.

This excites me because I have coded BFGS algorithms, genetic algorithms (GAs) and all sorts of horrible, hand-wavy, lucky-dip methods for finding the global optima needed to extract parameter estimates from maximum-likelihood objectives.

The reason it excites me is that if we can guarantee convergence of the MCMC to a fixed point (e.g. a sufficient criterion would be satisfying detailed balance), then we can obtain parameter estimates without running a numerical optimiser on the likelihood.
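For reference, the detailed-balance condition I mean is the standard one: with target density $\pi$ and transition kernel $P$,

$$\pi(x)\,P(x \to y) = \pi(y)\,P(y \to x) \quad \text{for all } x, y,$$

which is sufficient (though not necessary) for $\pi$ to be a stationary distribution of the chain, since integrating both sides over $x$ gives $\int \pi(x)\,P(x \to y)\,dx = \pi(y)$.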

The conclusion, therefore, is that this provides a generic method to obtain the global optimum, modulo the constraints imposed above and in the paper. There are a number of MCMC algorithms, e.g. HMC, that are well suited to high-dimensional problems, and I would assume they would outperform traditional gradient-descent methods.

Question

  1. Am I correct that this paper provides a theoretical basis for using MCMC to obtain maximum-likelihood parameter estimates?

  2. Can one use an MCMC algorithm, in the circumstances outlined in the paper, to extract parameter estimates from the MLE, bypassing the need for methods like genetic algorithms, BFGS, etc.?

Paper

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proc. 23rd Symp. Interface, 156–163.

Abstract

Markov chain Monte Carlo (e.g., the Metropolis algorithm and Gibbs sampler) is a general tool for simulation of complex stochastic processes useful in many types of statistical inference. The basics of Markov chain Monte Carlo are reviewed, including choice of algorithms and variance estimation, and some new methods are introduced. The use of Markov chain Monte Carlo for maximum likelihood estimation is explained and its performance is compared with maximum pseudolikelihood estimation.

Note: Sections 1-6 are boring and you probably know them already if you got this far. In Section 7 he gets to the interesting bit, what he terms “Monte Carlo Maximum Likelihood”.

More resources

control+f for “Geyer”

Alexander McFarlane
  • For your reference, `R` package `glmm` [here](https://cran.r-project.org/web/packages/glmm/index.html) uses Monte Carlo to approximate the likelihood in GLMMs. The package is written by Geyer's student. In addition, `R` package `mcemGLM` [here](https://cran.r-project.org/web/packages/mcemGLM/index.html) estimates MLEs for GLMMs using Monte Carlo EM. The package is written by a student in the same department as Geyer. – Greenparker Apr 26 '16 at 16:23
  • This is very promising. I have always felt that this area of statistics sucked. I mean it seems so backwards that some of the brightest minds in the world are dropping imaginary lemmings to walk to various minima (i.e. Monte Carlo GAs) to solve these problems – Alexander McFarlane Apr 26 '16 at 16:26
  • [This](http://www.jstor.org/stable/2680750?seq=1#page_scan_tab_contents) paper by Booth and Hobert is considered seminal in the field. Also see [this](http://users.stat.umn.edu/~galin/rev.pdf). Not directly related to your question, but still in the same neighborhood. – Greenparker Apr 26 '16 at 16:30
  • Just for curiosity: if your goal is to optimize a function, why don't you look at modern methods for global, nonconvex stochastic optimization, as opposed to an MCMC paper from 1991? – lacerbi Apr 26 '16 at 16:30
  • @lacerbi because I'm a theoretical physics postgrad and I didn't even know that entire field existed (thanks!) and secondly because my problem at hand required distribution fitting. I know MCMC really well and I know MLE really well and I just had a feeling they might have a crossover that could be useful hence the paper I discovered – Alexander McFarlane Apr 26 '16 at 16:33
  • @lacerbi Fair point, but I will also add that the two packages I mentioned were up on CRAN only last year, so this area has not really gone through an aging process yet. – Greenparker Apr 26 '16 at 16:34
  • Look at my answers [here](https://stats.stackexchange.com/questions/196947/what-optimization-maximization-minimization-methods-exist-contours-with-lots-o/196964#196964), for example, and [here](http://stats.stackexchange.com/questions/193306/optimization-when-cost-function-slow-to-evaluate/193391#193391). There are good references. If you describe the problem at hand in more detail, people here can point you towards more specific algorithms. MCMC is not a good idea for doing optimization. I mean, it can do the job, but it's not the best tool for the job. – lacerbi Apr 26 '16 at 16:37
  • Wait, if you are fitting a distribution, why don't you do MCMC on the parameters of the distribution? You wouldn't need to do MLE. Perhaps it's easier if you explicitly describe your goal. – lacerbi Apr 26 '16 at 16:40
  • Thanks the question is focused on parameter finding specifically. However, like anyone genuinely interested in this topic I also welcome information about related areas as you have kindly provided. I will look in detail at all the responses in a few weeks time now as I have 7 exams starting this week and didn't expect such a good response so quickly! – Alexander McFarlane Apr 26 '16 at 16:42
  • @Greenparker: I think what you are getting at is very off target for what I understand the OP is interested in. In those problems, the issue is that the target function is *intractable*, not multi-modal. – Cliff AB Apr 29 '16 at 20:11
  • OP must confess that he hasn't had time to check if they are or are not (exams) – Alexander McFarlane Apr 29 '16 at 20:34

2 Answers

6

If I understand correctly, you are excited about MCMC in the case of multi-modal target functions. Your reasoning is that MCMC methods search the global parameter space, rather than just shooting up the closest mode and stopping.

While theoretically true, in practice MCMC often behaves much like hill-climbing methods: once a chain finds a local mode, it tends to stay around that mode. Unlike hill-climbing methods, there is a positive probability that it will leave the mode, so in theory it will explore the global space if left to run long enough. However, for most samplers this probability is so small that it is unreasonable to run the chain long enough to have any assurance that the sampler will properly explore the global space.
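To make the mode-sticking concrete, here is a toy sketch of a random-walk Metropolis chain on a two-mode target (the target, step size and chain length are illustrative choices only, not anything from Geyer's paper). Started in one mode with a modest step size, the chain essentially never visits the other:

```python
import numpy as np

# Toy illustration of mode-sticking: random-walk Metropolis on a mixture of
# two well-separated Gaussians (modes near -10 and +10).
rng = np.random.default_rng(0)

def log_target(x):
    # log of 0.5*N(-10, 1) + 0.5*N(+10, 1), up to an additive constant
    return np.logaddexp(-0.5 * (x + 10.0) ** 2, -0.5 * (x - 10.0) ** 2)

def rw_metropolis(x0, n_steps, step):
    x, chain = x0, np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.normal()
        # Metropolis accept/reject on the log scale
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
        chain[i] = x
    return chain

chain = rw_metropolis(x0=-10.0, n_steps=20_000, step=1.0)
# Started at the left mode, the chain essentially never reaches the mode at
# +10, so any "optimum" read off the samples is only a local answer.
print("fraction of samples near +10:", np.mean(chain > 0))
```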

Of course, there are samplers that try to remedy this by occasionally taking outlier steps (i.e. seeing whether the chain can escape the local mode). But I don't think these samplers are going to be at all competitive, as optimizers, with standard optimization methods for exploring multi-modal surfaces (particle swarm, etc.).

Cliff AB
  • W.r.t. escaping local minima, there is a family of MCMC routines (e.g. [this](http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00765.x/full)) based on Hamiltonian principles (from physics) that seem reasonably competent at navigating these multi-modal spaces. Looking at your profile, I appreciate this is your area of research, and in fact my question comes in a similar light to your [social "ramblings"](http://cliffstats.weebly.com/various-ramblings). I am not familiar with the methods, but as an expert do you think that the MCMC method described above would have any merit at all? – Alexander McFarlane Apr 29 '16 at 19:01
  • @AlexanderMcFarlane: not sure I would call myself an "expert" on MCMC, but rather have had some professional exposure (see r-nimble.org, a project I worked on for a while). So take my advice with a grain of salt. That said, I wouldn't use generic MCMC methods, such as MH random walks, for what you want. Samplers that aggressively try to explore the limits of the probability space may have more luck (paywall for your link, so no comment on whether it meets the criteria). – Cliff AB Apr 29 '16 at 19:15
-1

In general, MCMC doesn't converge to a fixed point; it converges to the stationary distribution of the Markov chain. The individual draws keep changing, but, loosely, the distribution they are drawn from becomes fixed.

MCMC methods generally suffer from issues similar to other optimisation methods. For example, it is easy to design chains that rarely escape from local modes. There is a whole literature of tricks to solve such problems for various models.

That said, and in response to your second question, here's a quick and dirty way MCMC can be used for parameter estimation (a minimal sketch follows the list):

  1. Run the chain, generating parameter samples.
  2. Get the likelihood under each sample of the parameters.
  3. Compare the MCMC samples' likelihoods to your favourite MLE.
  4. If any of the MCMC samples do better, it wasn't really a global MLE.
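A toy walk-through of those steps (the normal-sample data, the sampler settings and the use of BFGS are all illustrative assumptions, not prescriptions):

```python
import numpy as np
from scipy import optimize, stats

# Toy data: estimate the mean and log-sd of a normal sample.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=200)

def log_lik(theta):
    mu, log_sigma = theta
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# Step 1: run a random-walk Metropolis chain over the parameters.
theta, lls = np.array([0.0, 0.0]), []
for _ in range(5_000):
    prop = theta + 0.05 * rng.normal(size=2)
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(theta):
        theta = prop
    lls.append(log_lik(theta))      # Step 2: likelihood under each sample.

# Your favourite MLE, here from BFGS on the negative log-likelihood.
mle = optimize.minimize(lambda t: -log_lik(t), x0=[0.0, 0.0], method="BFGS")

# Steps 3-4: if the chain ever beats the optimiser, the optimiser's answer
# was not the global MLE.
print("best MCMC log-likelihood:", max(lls))
print("BFGS log-likelihood     :", -mle.fun)
```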
conjectures