
Let

  • $d\in\mathbb N$ with $d>1$
  • $\ell>0$
  • $\sigma_d^2:=\frac{\ell^2}{d-1}$
  • $f\in C^2(\mathbb R)$ be positive with $$\int f(x)\:{\rm d}x=1$$ and $g:=\ln f$
  • $Q_d$ be a Markov kernel on $(\mathbb R^d,\mathcal B(\mathbb R^d))$ with $$Q_d(x,\;\cdot\;)=\mathcal N_d(x,\sigma_d^2I_d)\;\;\;\text{for all }x\in\mathbb R^d,$$ where $I_d$ denotes the $d$-dimensional identity matrix

Now, let $$\pi_d(x):=\prod_{i=1}^df(x_i)\;\;\;\text{for }x\in\mathbb R^d$$ and $\left(X^{(d)}_n\right)_{n\in\mathbb N_0}$ denote the Markov chain generated by the Metropolis-Hastings algorithm with proposal kernel $Q_d$ and target density $\pi_d$ (with respect to the $d$-dimensional Lebesgue measure $\lambda^d$). Moreover, let $$U^{(d)}_t:=\left(X^{(d)}_{\lfloor dt\rfloor}\right)_1\;\;\;\text{for }t\ge0.$$ In the paper Weak convergence and optimal scaling of random walk Metropolis algorithms (Roberts, Gelman and Gilks), the authors show (assuming that $g'$ is Lipschitz continuous and satisfies some moment conditions) that $U^{(d)}$ converges (in the Skorohod topology) as $d\to\infty$ to the solution $U$ of $${\rm d}U_t=\frac{h(\ell)}2g'(U_t){\rm d}t+\sqrt{h(\ell)}{\rm d}W_t,$$ where $W$ is a standard Brownian motion, with $U_0\sim f\lambda^1$.
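To make the setup concrete, here is a small simulation sketch (not from the paper) of the random walk Metropolis chain for the special case where $f$ is the standard normal density, so that $g'(x)=-x$ and $I=1$; the choices of $d$, $\ell$ and the number of steps are purely illustrative:

```python
import numpy as np

# Sketch: random walk Metropolis on pi_d(x) = prod_i f(x_i) with f the standard
# normal density, so g'(x) = -x and I = E[g'(X)^2] = E[X^2] = 1.
# Proposal: N(x, sigma_d^2 I_d) with sigma_d^2 = ell^2 / (d - 1).
rng = np.random.default_rng(0)

d, ell, n_steps = 100, 2.38, 20_000   # illustrative choices, not prescribed by the paper
sigma = ell / np.sqrt(d - 1)

x = rng.standard_normal(d)            # start the chain in stationarity
accepts = 0
for _ in range(n_steps):
    y = x + sigma * rng.standard_normal(d)
    # log pi_d(y) - log pi_d(x) for the standard normal product target
    log_alpha = 0.5 * (x @ x - y @ y)
    if np.log(rng.random()) < log_alpha:
        x = y
        accepts += 1

rate = accepts / n_steps
# The limit theorem predicts an asymptotic acceptance rate of
# 2 * Phi(-ell * sqrt(I) / 2), which is roughly 0.234 for ell near 2.38.
print(f"empirical acceptance rate: {rate:.3f}")
```

For $d=100$ the empirical acceptance rate should already be close to the asymptotic prediction $2\Phi(-\ell\sqrt I/2)$.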

Now, they conclude that the "optimal choice" for $\ell$ is obtained by maximizing $$h(\ell):=2\ell^2\Phi\left(-\frac{\ell\sqrt I}2\right),$$ where $\Phi$ denotes the cumulative distribution function of the standard normal distribution and $$I:=\int\left|g'\right|^2\:{\rm d}(f\lambda^1)<\infty.$$ Why? In which sense (e.g. total variation distance or variance) does this optimize the Metropolis-Hastings algorithm?
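For what it's worth, the maximization itself is a one-line numerical exercise. A sketch, taking $I=1$ for illustration (e.g. $f$ standard normal); the bounds passed to the optimizer are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Sketch: numerically maximize h(ell) = 2 ell^2 Phi(-ell sqrt(I) / 2),
# taking I = 1 for illustration.
I = 1.0

def h(ell):
    return 2 * ell**2 * norm.cdf(-ell * np.sqrt(I) / 2)

res = minimize_scalar(lambda ell: -h(ell), bounds=(0.1, 10.0), method="bounded")
ell_hat = res.x
acc_rate = 2 * norm.cdf(-ell_hat * np.sqrt(I) / 2)

print(f"optimal ell ~ {ell_hat:.3f}")       # classical value, about 2.38 / sqrt(I)
print(f"acceptance rate ~ {acc_rate:.3f}")  # the famous ~0.234, independent of I
```

Note that the acceptance rate $2\Phi(-\hat\ell\sqrt I/2)$ at the maximizer does not depend on $I$, which is where the well-known "0.234 rule" comes from.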

I've read that $h(\ell)$ is called the "speed function/measure" of the diffusion $U$. I would be very happy about a reference for that topic.

0xbadf00d

1 Answer


The Langevin diffusion is a process $(X_t)_{t \ge 0}$ satisfying the SDE \begin{align*} d X_t = -\nabla h(X_t)\, dt + \sqrt{2}\, dW_t, \end{align*}
where $(W_t)_{t \ge 0}$ is a standard Brownian motion in $\mathbb R^d$. Under mild conditions on $h$, the above has a unique solution, which is a Markov process. Moreover, the distribution of $X_t$ can be shown to converge to the distribution with density $\pi(x) \propto \exp(-h(x))$ as $t \to \infty$. I am going to write this as $\mathcal L(X_t) \to \exp(-h)\, dx$.

In the paper you consider, they have the 1-d dynamic: \begin{align*} d V_t = d B_t + \frac{f'(V_t)}{2 f(V_t)} dt = d B_t + \frac12 g'(V_t) dt \end{align*} where $g = \log f$. Let us define $\tilde V_t = V_{\alpha t}$. Then, \begin{align*} d \tilde V_t &= \alpha^{1/2} \, d B_{t} + \frac\alpha 2 g'(\tilde V_{t}) dt \end{align*} (This uses the fact that $(B_{\alpha t})$ has the same distribution as $(\sqrt{\alpha} B_t)$.) Taking $\alpha = 2$, we have that $\tilde V_t$ satisfies the standard Langevin dynamics with $h = -g$, hence $$\mathcal L(\tilde V_t) \to \exp(g) dx = f dx \quad \text{as} \quad t \to \infty.$$

Now, as they argue in the paper, $U_t = V_{h(\ell) t}$. This is easy to see by the same argument as above: setting $\alpha = h(\ell)$ shows that $U_t$ satisfies the desired SDE.

In short, $U_t = \tilde V_{\frac12 h(\ell) t}$, and $\tilde V_t$ converges in distribution to the desired law. So $\frac12 h(\ell)$ looks like a step size: the larger it is, the faster you move along the process $\tilde V_t$ for a unit step in time (say $\Delta t = 1$).
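This "speed" interpretation can be checked numerically. A sketch for the special case $f$ standard normal, where $g'(u) = -u$ and the limiting SDE is an Ornstein-Uhlenbeck process with lag-$s$ autocorrelation $e^{-h s/2}$ in stationarity; the discretization parameters are illustrative:

```python
import numpy as np

# Sketch: for f the standard normal density, g'(u) = -u, so the limiting SDE
# dU_t = (h/2) g'(U_t) dt + sqrt(h) dW_t is an Ornstein-Uhlenbeck process whose
# stationary autocorrelation at lag s is exp(-h s / 2): a larger "speed" h
# means the process decorrelates faster in real time.
rng = np.random.default_rng(2)

def lag1_autocorr(h, dt=0.01, n_steps=200_000):
    """Euler-Maruyama path; returns the empirical autocorrelation at lag s = 1."""
    u = np.empty(n_steps)
    u[0] = rng.standard_normal()       # start in (approximate) stationarity
    for k in range(n_steps - 1):
        u[k + 1] = u[k] - 0.5 * h * u[k] * dt + np.sqrt(h * dt) * rng.standard_normal()
    lag = int(1 / dt)
    return np.corrcoef(u[:-lag], u[lag:])[0, 1]

slow, fast = lag1_autocorr(h=1.0), lag1_autocorr(h=4.0)
print(f"h=1: {slow:.3f} (theory ~ {np.exp(-0.5):.3f}); "
      f"h=4: {fast:.3f} (theory ~ {np.exp(-2.0):.3f})")
```

The larger speed decorrelates the chain faster at the same physical lag, which is exactly why maximizing $h(\ell)$ is the natural notion of "optimal" here.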

EDIT: There is a lot of interesting recent activity on the convergence of Langevin dynamics and MH algorithm. I will try to cite them once I get a chance.

EDIT2: Some recent developments:

passerby51
  • Do you have a reference for $\mathcal L(X_t) \to \exp(-h) dx$ at hand? – 0xbadf00d Dec 27 '18 at 11:40
  • In your equation for ${\rm d}\tilde V_t$, the $B_t$ is different from the previous $B_t$ isn't it? Actually it should be $\tilde B_t:=\alpha^{-1/2}B_{\alpha t}$ (which is a Brownian motion). – 0xbadf00d Dec 28 '18 at 12:11
  • And I don't get why $U_t = \tilde V_{\frac12 h(\ell) t}$. As you wrote before $U_t=V_{h(\ell)t}$. – 0xbadf00d Dec 28 '18 at 13:43
  • I'm still interested in an answer, especially for a reference for $\mathcal L(X_t) \to \exp(-h) dx$. – 0xbadf00d Jan 16 '19 at 11:19
  • @0xbadf00d, sorry for the delay. I have added some recent references. Hopefully, looking at what they cite you can find multiple references for that. – passerby51 Jan 17 '19 at 04:37
  • Regarding your other question: (1) yes, it is a different process, it just has the same distribution as a Brownian motion; (2) I have defined $\tilde V_t = V_{2t}$, though this is not entirely necessary. It is just to map back to the standard way of writing the Langevin dynamic. – passerby51 Jan 17 '19 at 04:40
  • I've taken a look into the papers. It seems like we need some curvature dimension condition in order to prove the convergence or am I missing something? – 0xbadf00d Feb 15 '19 at 17:41
  • @0xbadf00d, those paper care about the rate of convergence, so need more assumptions. For the convergence alone, one needs less. See e.g. "Exponential convergence of Langevin distributions and their discrete approximations" By Roberts and Tweedie. – passerby51 Feb 15 '19 at 20:20
  • Could you write down what we need to assume? Couldn't find a less restrictive assumption in the paper. – 0xbadf00d Feb 15 '19 at 20:48
  • @0xbadf00d, look at Theorem 2.1 in Roberts and Tweedie. – passerby51 Feb 16 '19 at 00:32
  • (1) Can we say anything about uniqueness? (2) Theorem 2.1 in Roberts and Tweedie imposes assumptions that ensure convergence in total variation. However, even without any further assumption, your $\pi$ is an invariant measure. Are we able to show weak convergence of $\mathcal L(X_t)$ to $\pi$ in general? – 0xbadf00d Feb 24 '19 at 19:53
  • I got some further open questions and asked a separate question. You might want to take a look: https://stats.stackexchange.com/q/394742/222528. Thank you in advance for your effort. – 0xbadf00d Feb 27 '19 at 18:30