Let
- $d\in\mathbb N$ with $d>1$
- $\ell>0$
- $\sigma_d^2:=\frac{\ell^2}{d-1}$
- $f\in C^2(\mathbb R)$ be positive with $$\int f(x)\:{\rm d}x=1$$ and $g:=\ln f$
- $Q_d$ be a Markov kernel on $(\mathbb R^d,\mathcal B(\mathbb R^d))$ with $$Q_d(x,\;\cdot\;)=\mathcal N_d(x,\sigma_d^2I_d)\;\;\;\text{for all }x\in\mathbb R^d,$$ where $I_d$ denotes the $d$-dimensional identity matrix
Now, let $$\pi_d(x):=\prod_{i=1}^df(x_i)\;\;\;\text{for }x\in\mathbb R^d$$ and let $\left(X^{(d)}_n\right)_{n\in\mathbb N_0}$ denote the Markov chain generated by the Metropolis-Hastings algorithm with proposal kernel $Q_d$ and target density $\pi_d$ (with respect to the $d$-dimensional Lebesgue measure $\lambda^d$). Moreover, let $$U^{(d)}_t:=\left(X^{(d)}_{\lfloor dt\rfloor}\right)_1\;\;\;\text{for }t\ge0.$$ In the paper *Weak convergence and optimal scaling of random walk Metropolis algorithms* (Roberts, Gelman and Gilks, 1997), the authors show (assuming that $g'$ is Lipschitz continuous and satisfies some moment conditions) that if $X^{(d)}_0$ is distributed according to $\pi_d$, then $U^{(d)}$ converges (in the Skorohod topology) as $d\to\infty$ to the solution $U$ of $${\rm d}U_t=\frac{h(\ell)}2g'(U_t)\,{\rm d}t+\sqrt{h(\ell)}\,{\rm d}W_t,$$ where $W$ is a standard Brownian motion, $U_0\sim f\lambda^1$ and $h(\ell):=2\ell^2\Phi\left(-\frac{\ell\sqrt I}2\right)$ with $I:=\mathbb E_f\left[g'(X)^2\right]$ and $\Phi$ the standard normal distribution function.
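For concreteness, here is a minimal sketch of this setup (my own illustration, not from the paper), taking $f$ to be the standard normal density so that $g'(x)=-x$ and $I=1$; the helper name `rwm_chain` is hypothetical:

```python
import numpy as np

def rwm_chain(d, ell, n_steps, rng):
    """Hypothetical sketch of the random walk Metropolis chain targeting
    pi_d(x) = prod_i f(x_i), with f the standard normal density."""
    sigma = ell / np.sqrt(d - 1)              # sigma_d, since sigma_d^2 = ell^2/(d-1)
    x = rng.standard_normal(d)                # X_0 ~ pi_d, i.e. we start in stationarity
    chain = np.empty((n_steps + 1, d))
    chain[0] = x
    log_pi = lambda z: -0.5 * np.sum(z**2)    # log pi_d up to an additive constant
    for n in range(n_steps):
        y = x + sigma * rng.standard_normal(d)             # proposal Y ~ N_d(x, sigma_d^2 I_d)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):  # MH accept/reject step
            x = y
        chain[n + 1] = x
    return chain

rng = np.random.default_rng(0)
d, ell = 50, 2.38
chain = rwm_chain(d, ell, n_steps=10 * d, rng=rng)

# The rescaled first coordinate U_t^{(d)} = (X_{floor(d t)})_1 on a grid of times t:
t_grid = np.linspace(0, 10, 101)
U = chain[np.floor(d * t_grid).astype(int), 0]
```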
As discussed here, the authors conclude that the "optimal" choice for $\ell$ is the one maximizing $h(\ell)$. It's clear to me that $f\lambda^1$ is an invariant measure for $U$ (though it's still not clear to me whether we need additional assumptions on $f$ to ensure that $U_t$ converges weakly to $f\lambda^1$, and I would be happy about any comment on that). So it's sensible to maximize $h(\ell)$, since this means we're "moving faster in time" (towards the invariant measure).
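For the invariance claim, here is the sanity check I have in mind (a formal computation via the stationary Fokker-Planck equation, assuming enough smoothness and decay of $f$): with drift $b:=\frac{h(\ell)}2g'$ and diffusion coefficient $h(\ell)$, a stationary density $p$ has to satisfy $$0=\frac{\rm d}{{\rm d}x}\left(\frac{h(\ell)}2p'(x)-\frac{h(\ell)}2g'(x)p(x)\right),$$ and plugging in $p=f$ gives $$\frac{h(\ell)}2\left(f'(x)-\frac{f'(x)}{f(x)}f(x)\right)=0$$ identically, so $f$ is indeed stationary. Whether $\mathcal L(U_t)$ is also attracted to this stationary distribution as $t\to\infty$ is exactly the part where I suspect additional assumptions on $f$ are needed.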
However, why does this mean that this choice of $\ell$ is optimal for the Metropolis-Hastings algorithm? First of all, it is assumed that $X^{(d)}_0$ is distributed according to $\pi_d$, which means we start in stationarity. If I understand correctly, the "optimality" we're looking for is (among other criteria) with respect to the convergence of the total variation distance between $\mathcal L\left(X^{(d)}_n\right)$ and $\pi_d$. But if we start in stationarity, then $\mathcal L\left(X^{(d)}_n\right)=\pi_d$ for all $n$, so that distance is $0$ for every $n$.
My next problem is that the process $U^{(d)}$ is not the chain generated by the Metropolis-Hastings algorithm: it is sped up in time and shrunk in space. While I see that this is necessary to obtain a (nontrivial) diffusion limit, I don't understand why we're able to draw conclusions about the original chain from it.
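One heuristic link to the original chain that I can see (my own reading, not a claim from the paper) goes through the expected squared jump distance of a single coordinate: since $U^{(d)}$ runs through $d$ steps of the chain per unit of time and converges to a diffusion with speed $h(\ell)$, one would expect $$d\cdot\mathbb E\left[\left(\left(X^{(d)}_{n+1}\right)_1-\left(X^{(d)}_n\right)_1\right)^2\right]\approx h(\ell)$$ for large $d$, so that maximizing $h(\ell)$ maximizes the per-step movement of the original chain in stationarity. A rough numerical check, reusing the hypothetical `rwm_chain` sketch from above (standard normal $f$, so $I=1$):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
d, ell = 100, 2.38
chain = rwm_chain(d, ell, n_steps=20000, rng=rng)  # rwm_chain as sketched above

# d times the per-step expected squared jump of the first coordinate ...
esjd = d * np.mean(np.diff(chain[:, 0]) ** 2)
# ... should be close to the speed h(ell) = 2 ell^2 Phi(-ell sqrt(I)/2), here with I = 1.
h = lambda l: 2 * l**2 * norm.cdf(-l / 2)
print(esjd, h(ell))  # the two numbers should roughly agree (around 1.3)

# Maximizing h recovers the well-known optimal scaling constants:
res = minimize_scalar(lambda l: -h(l), bounds=(0.1, 10), method="bounded")
print(res.x)                     # approximately 2.38, the maximizer of h
print(2 * norm.cdf(-res.x / 2))  # approximately 0.234, the corresponding acceptance rate
```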
In Section 4.1.4 of the *Handbook of Markov Chain Monte Carlo*, the authors consider several notions for comparing Markov chains. A great answer would explain why the conclusion of the paper optimizes these notions for the original chain $X^{(d)}$.
EDIT: Please take note of my related question.