Let
- $d\in\mathbb N$ with $d>1$
- $\ell>0$
- $\sigma_d^2:=\frac{\ell^2}{d-1}$
- $f\in C^2(\mathbb R)$ be positive with $$\int f(x)\:{\rm d}x=1$$ and $g:=\ln f$
- $Q_d$ be a Markov kernel on $(\mathbb R^d,\mathcal B(\mathbb R^d))$ with $$Q_d(x,\;\cdot\;)=\mathcal N_d(x,\sigma_d^2I_d)\;\;\;\text{for all }x\in\mathbb R^d,$$ where $I_d$ denotes the $d$-dimensional identity matrix
Now, let $$\pi_d(x):=\prod_{i=1}^df(x_i)\;\;\;\text{for }x\in\mathbb R^d$$ and let $\left(X^{(d)}_n\right)_{n\in\mathbb N_0}$ denote the Markov chain generated by the Metropolis-Hastings algorithm with proposal kernel $Q_d$ and target density $\pi_d$ (with respect to the $d$-dimensional Lebesgue measure $\lambda^d$). Moreover, let $$U^{(d)}_t:=\left(X^{(d)}_{\lfloor dt\rfloor}\right)_1\;\;\;\text{for }t\ge0.$$ In the paper *Weak convergence and optimal scaling of random walk Metropolis algorithms* (Roberts, Gelman and Gilks, 1997), the authors show (assuming that $g'$ is Lipschitz continuous and satisfies some moment conditions) that if $X^{(d)}_0$ is distributed according to $\pi_d$, then $U^{(d)}$ converges (in the Skorohod topology) as $d\to\infty$ to the solution $U$ of $${\rm d}U_t=\frac{h(\ell)}2g'(U_t)\,{\rm d}t+\sqrt{h(\ell)}\,{\rm d}W_t,$$ where $W$ is a standard Brownian motion, $U_0\sim f\lambda^1$ and $h(\ell):=2\ell^2\Phi\left(-\frac{\ell\sqrt I}2\right)$ with $I:=\mathbb E_f\left[g'(X)^2\right]$ and $\Phi$ the standard normal distribution function.
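For concreteness, here is a minimal sketch of this setup (my own illustration, not from the paper), taking $f$ to be the standard normal density so that $g'(x)=-x$ and $I=1$; the helper name `rwm_chain` is hypothetical:

```python
import numpy as np

def rwm_chain(d, ell, n_steps, rng):
    """Hypothetical sketch of the random walk Metropolis chain targeting
    pi_d(x) = prod_i f(x_i), with f the standard normal density."""
    sigma = ell / np.sqrt(d - 1)              # sigma_d, since sigma_d^2 = ell^2/(d-1)
    x = rng.standard_normal(d)                # X_0 ~ pi_d, i.e. we start in stationarity
    chain = np.empty((n_steps + 1, d))
    chain[0] = x
    log_pi = lambda z: -0.5 * np.sum(z**2)    # log pi_d up to an additive constant
    for n in range(n_steps):
        y = x + sigma * rng.standard_normal(d)             # proposal Y ~ N_d(x, sigma_d^2 I_d)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):  # MH accept/reject step
            x = y
        chain[n + 1] = x
    return chain

rng = np.random.default_rng(0)
d, ell = 50, 2.38
chain = rwm_chain(d, ell, n_steps=10 * d, rng=rng)

# The rescaled first coordinate U_t^{(d)} = (X_{floor(d t)})_1 on a grid of times t:
t_grid = np.linspace(0, 10, 101)
U = chain[np.floor(d * t_grid).astype(int), 0]
```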
As discussed here, the authors conclude that the "optimal" choice for $\ell$ is the one maximizing $h(\ell)$. It's clear to me that $f\lambda^1$ is an invariant measure for $U$ (though it's still not clear to me whether we need additional assumptions on $f$ to ensure that $U_t$ converges weakly to $f\lambda^1$, and I would be happy about any comment on that). So it's sensible to maximize $h(\ell)$, since this means we're "moving faster in time" (towards the invariant measure).
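For the invariance claim, here is the sanity check I have in mind (a formal computation via the stationary Fokker-Planck equation, assuming enough smoothness and decay of $f$): with drift $b:=\frac{h(\ell)}2g'$ and diffusion coefficient $h(\ell)$, a stationary density $p$ has to satisfy $$0=\frac{\rm d}{{\rm d}x}\left(\frac{h(\ell)}2p'(x)-\frac{h(\ell)}2g'(x)p(x)\right),$$ and plugging in $p=f$ gives $$\frac{h(\ell)}2\left(f'(x)-\frac{f'(x)}{f(x)}f(x)\right)=0$$ identically, so $f$ is indeed stationary. Whether $\mathcal L(U_t)$ is also attracted to this stationary distribution as $t\to\infty$ is exactly the part where I suspect additional assumptions on $f$ are needed.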
However, why does this mean that this choice of $\ell$ is optimal for the Metropolis-Hastings algorithm? First of all, it is assumed that $X^{(d)}_0$ is distributed according to $\pi_d$, which means we start in stationarity. If I understand correctly, the "optimality" we're looking for is (among other criteria) with respect to the convergence of the total variation distance between $\mathcal L\left(X^{(d)}_n\right)$ and $\pi_d$. But if we start in stationarity, then $\mathcal L\left(X^{(d)}_n\right)=\pi_d$ for all $n$, so that distance is $0$ for every $n$.
My next problem is that the process $U^{(d)}$ is not the chain generated by the Metropolis-Hastings algorithm: it is sped up in time and shrunk in space. While I see that this is necessary to obtain a (nontrivial) diffusion limit, I don't understand why we're able to draw conclusions about the original chain from it.
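One heuristic link to the original chain that I can see (my own reading, not a claim from the paper) goes through the expected squared jump distance of a single coordinate: since $U^{(d)}$ runs through $d$ steps of the chain per unit of time and converges to a diffusion with speed $h(\ell)$, one would expect $$d\cdot\mathbb E\left[\left(\left(X^{(d)}_{n+1}\right)_1-\left(X^{(d)}_n\right)_1\right)^2\right]\approx h(\ell)$$ for large $d$, so that maximizing $h(\ell)$ maximizes the per-step movement of the original chain in stationarity. A rough numerical check, reusing the hypothetical `rwm_chain` sketch from above (standard normal $f$, so $I=1$):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
d, ell = 100, 2.38
chain = rwm_chain(d, ell, n_steps=20000, rng=rng)  # rwm_chain as sketched above

# d times the per-step expected squared jump of the first coordinate ...
esjd = d * np.mean(np.diff(chain[:, 0]) ** 2)
# ... should be close to the speed h(ell) = 2 ell^2 Phi(-ell sqrt(I)/2), here with I = 1.
h = lambda l: 2 * l**2 * norm.cdf(-l / 2)
print(esjd, h(ell))  # the two numbers should roughly agree (around 1.3)

# Maximizing h recovers the well-known optimal scaling constants:
res = minimize_scalar(lambda l: -h(l), bounds=(0.1, 10), method="bounded")
print(res.x)                     # approximately 2.38, the maximizer of h
print(2 * norm.cdf(-res.x / 2))  # approximately 0.234, the corresponding acceptance rate
```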
In Section 4.1.4 of the *Handbook of Markov Chain Monte Carlo*, the authors consider several notions for comparing Markov chains. A great answer would explain why the conclusion of the paper optimizes these notions for the original chain $X^{(d)}$.
EDIT: Please take note of my related question.