
Hello Cross Validated,

I am currently studying sequential Monte Carlo samplers.

My current understanding is as follows:

We want samples from some target distribution that is difficult to sample directly, so we reach it through a sequence of intermediate distributions. Formally, these intermediate targets are embedded as the marginals of a sequence of joint distributions of increasing dimension. As a concrete example, we may want samples from a posterior that we approach through a sequence of annealed bridging distributions.

The idea is to perform Importance Sampling (IS) targeting our desired distribution $\pi_n(X_n)$, building the proposal by perturbing the particles through annealed distributions $\pi^t(X)$, $t \in [0,1]$, with a transition kernel $K(X_{n-1}, X_n)$. The problem with this approach is that a pointwise evaluation of the resulting marginal proposal $\eta_n(X_n)$ would require integrating over every $X_{1:n-1}$. The authors therefore suggest a workaround: introduce an artificial joint target distribution $\tilde{\pi}_n(X_{1:n}) = \displaystyle \frac{\gamma_n(X_n)}{Z_n} \prod_{k=1}^{n-1} L_k(X_{k+1}, X_k)$, where the $L_k$ are backward (auxiliary) kernels, and pair it with the importance distribution $\eta_n(X_{1:n}) = \pi_1(X_1)\prod_{k=1}^{n-1} K_k(X_k, X_{k+1})$.
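
Written out explicitly in this notation, the marginal proposal whose pointwise evaluation causes the trouble is $$\eta_n(X_n) = \int \pi_1(X_1) \prod_{k=1}^{n-1} K_k(X_k, X_{k+1}) \, dX_{1:n-1},$$ an $(n-1)$-fold integral that is rarely available in closed form.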

Question: Why choose $\tilde{\pi}_n(X_{1:n})$ as our target? Why not simply target $\tilde{\pi}_n(X_{1:n}) = \displaystyle\prod_{i=1}^n \pi_i(X_i)$? This also admits $\pi_n(X_n)$ as a marginal distribution!

If there are any ambiguities, I will be happy to clarify, thank you :)!


1 Answer


At the start of Section 2.4, the authors say the following:

For any probability density $\nu$, we use the notation $$\nu K_{i:j}(x_{j}) = \int \nu(x_{i-1}) \prod_{k = i}^{j} K_k(x_{k-1}, x_{k}) \ dx_{(i-1):(j-1)}$$ where $x_{i:j}, i \leq j$, and $X_{i:j}$ respectively denote $(x_i, \dots, x_j)$ and $(X_i, \dots, X_j)$. The algorithm that was discussed above suffers from a major drawback. In most cases, it is impossible to compute the importance distribution $\eta_n(x_n)$ that is given by $$\eta_n(x_n) = \eta_1K_{2:n}(x_n)$$ and hence impossible to compute the importance weights.

That is, for any state $x_n$, the proposal distribution $\eta_n(x_n)$ is difficult to compute because we need to account for every possible initial state $x_1$ and every possible path that could have been taken to get from $x_1$ to $x_n$ in $n - 1$ steps.

The clever solution is to include some auxiliary variables. By replacing $\pi_n(x_n)$, which does not depend on $x_1, \dots, x_{n-1}$, with $\tilde{\pi}_n(x_{1:n})$, which does depend on $x_1, \dots, x_{n-1}$, we get around the problem. Previously, if we tried to calculate the weight update we had to compute $$\frac{w_n(x_n)}{w_{n-1}(x_{n-1})} = \frac{\gamma_n(x_n)}{\gamma_{n-1}(x_{n-1})} \frac{\eta_{n-1}(x_{n-1})}{\eta_{n}(x_{n})} = \frac{\gamma_n(x_n)}{\gamma_{n-1}(x_{n-1})} \frac{\eta_1K_{2:(n-1)}(x_{n-1})}{\eta_1K_{2:n}(x_n)},$$ which involves calculating the difficult integrals $\eta_1 K_{2:(n-1)}(x_{n-1})$ and $\eta_1K_{2:n}(x_n)$. Note that there are no cancellations here.

If we use auxiliary variables, the weight update becomes $$\frac{w_n(x_{1:n})}{w_{n-1}(x_{1:(n-1)})} = \frac{\gamma_n(x_n) L_{n-1}(x_n, x_{n-1})}{\gamma_{n-1}(x_{n-1})} \frac{\eta_{n-1}(x_{1:(n-1)})}{\eta_{n}(x_{1:n})},$$ and now the ratio $$\frac{\eta_{n-1}(x_{1:(n-1)})}{\eta_{n}(x_{1:n})}$$ is easy to compute, because $x_{1:(n-1)}$ and $x_{1:n}$ share the same history up to $x_{n-1}$ and $$\eta_n(x_{1:n}) = \eta_1(x_1)K_2(x_1, x_2) \cdots K_n(x_{n-1}, x_n)$$ involves no integrals. The ratio is simply $$\frac{\eta_{n-1}(x_{1:(n-1)})}{\eta_{n}(x_{1:n})} = \frac{1}{K_{n}(x_{n-1}, x_{n})},$$ where $K_n(x_{n-1}, x_n)$ is the kernel density for moving from $x_{n-1}$ to $x_{n}$ in one step.
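
To make the mechanics concrete, here is a minimal Python sketch (mine, not from the paper) of an SMC sampler built around exactly this incremental weight. The names (`smc_sampler`, `log_gamma`, `log_K`, `log_L`, and the toy Gaussian annealing run at the bottom) are ad hoc; it assumes the initial particles are exact draws from $\pi_1$ so that the initial weights are uniform, and the demo's backward kernel is deliberately the simplest closed-form choice (see the warning quoted at the end of this answer).

```python
import numpy as np

def smc_sampler(n_particles, n_steps, sample_pi1, sample_K, log_K, log_L,
                log_gamma, seed=0, ess_frac=0.5):
    """Sketch of an SMC sampler on the extended space.

    Assumes the initial particles are exact draws from pi_1, so the initial
    weights are uniform.  The incremental weight is the one derived above:
        gamma_n(x_n) L_{n-1}(x_n, x_{n-1}) / (gamma_{n-1}(x_{n-1}) K_n(x_{n-1}, x_n)).
    """
    rng = np.random.default_rng(seed)
    x = sample_pi1(rng, n_particles)
    logw = np.zeros(n_particles)                         # uniform initial weights
    for n in range(2, n_steps + 1):
        x_new = sample_K(rng, n, x)                      # propagate with K_n
        logw += (log_gamma(n, x_new) + log_L(n, x_new, x)
                 - log_gamma(n - 1, x) - log_K(n, x, x_new))
        x = x_new
        w = np.exp(logw - logw.max()); w /= w.sum()      # self-normalised weights
        if 1.0 / np.sum(w**2) < ess_frac * n_particles:  # resample if ESS is low
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x, logw = x[idx], np.zeros(n_particles)
    w = np.exp(logw - logw.max()); w /= w.sum()
    return x, w

# Toy check: anneal from N(0, 3^2) to N(0, 1) via gamma_n(x) = exp(-beta_n x^2 / 2).
# K_n is a symmetric Gaussian random walk and, purely so that its density is
# available in closed form, L_{n-1} is the same random walk reversed (the simple
# choice the quoted warning cautions against; it is fine for this easy toy problem).
# Gaussian normalising constants are omitted since they cancel between log_L and log_K.
betas = np.linspace(1.0 / 9.0, 1.0, 30)
tau = 0.5
x, w = smc_sampler(
    n_particles=5000, n_steps=len(betas),
    sample_pi1=lambda rng, N: rng.normal(0.0, 3.0, size=N),
    sample_K=lambda rng, n, x: x + tau * rng.normal(size=x.shape),
    log_K=lambda n, xp, xn: -0.5 * ((xn - xp) / tau) ** 2,
    log_L=lambda n, xn, xp: -0.5 * ((xp - xn) / tau) ** 2,
    log_gamma=lambda n, x: -0.5 * betas[n - 1] * x ** 2)
print("weighted estimate of E[X^2] under the final target:", np.sum(w * x**2))  # close to 1
```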

The point of adding the auxiliary variables is to encode a particle $x_n$ with the whole history of how $x_n$ was reached. This saves us having to integrate over all possible ways that $x_n$ could have been reached. Fortunately the artificial joint target distribution $\tilde{\pi}_n$ admits $\pi_n$ as a marginal, so we don't lose anything by using this method.
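
To see this explicitly, integrate out the auxiliary variables one at a time starting with $x_1$; each backward kernel $L_k(x_{k+1}, \cdot)$ is a probability density in its second argument, so every integral equals one: $$\int \tilde{\pi}_n(x_{1:n}) \, dx_{1:(n-1)} = \frac{\gamma_n(x_n)}{Z_n} \int \cdots \int \prod_{k=1}^{n-1} L_k(x_{k+1}, x_k) \, dx_1 \cdots dx_{n-1} = \frac{\gamma_n(x_n)}{Z_n} = \pi_n(x_n).$$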

You are free to choose $L_{k-1}(x_k, x_{k-1})$ to be any density that you like, and in particular you can choose $L_{k-1}(x_k, x_{k-1}) = \pi_{k-1}(x_{k-1})$ as you asked in your question. The only restriction on $L_{k-1}$ is that it must be a valid density in $x_{k-1}$: it is never negative and it integrates to $1$. However, the authors would warn against choosing $L_{k-1}(x_k, x_{k-1}) = \pi_{k-1}(x_{k-1})$, because $\pi_{k-1}(x_{k-1})$ may be quite far away from the optimal reverse kernel given in Section 3.3.1: $$L_{k-1}^{opt}(x_k, x_{k-1}) = \frac{\eta_{k-1}(x_{k-1}) K_k(x_{k-1}, x_{k})}{\eta_k(x_k)}.$$ The authors give some ways of approximating this optimal reverse kernel (since using it directly would require calculating the same difficult integrals); a short derivation of why it is optimal follows the quote below. They also warn in Section 3.3.3:

To conclude this subsection, we emphasize that selecting $\{L_n\}$ as close as possible to $\{L^{opt}_n\}$ is crucial for this method to be efficient. It could be tempting to select $\{L_n\}$ in a different way. For example, if we select $L_{n-1} = K_n$ then the incremental importance weight looks like an MH ratio. However, this 'aesthetic' choice will be inefficient in most cases, resulting in importance weights with a very large or infinite variance.
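
To see why closeness to $L^{opt}$ matters, substitute the optimal kernel into the incremental weight above: the forward kernel density cancels and $$\frac{\gamma_k(x_k)\, L_{k-1}^{opt}(x_k, x_{k-1})}{\gamma_{k-1}(x_{k-1})\, K_k(x_{k-1}, x_k)} = \frac{\gamma_k(x_k)\, \eta_{k-1}(x_{k-1})}{\gamma_{k-1}(x_{k-1})\, \eta_k(x_k)},$$ which is exactly the weight of the (intractable) marginal scheme; the paper shows this choice minimises the variance of the importance weights. By contrast, the 'aesthetic' choice $L_{n-1} = K_n$ gives the incremental weight $$\frac{\gamma_n(x_n)\, K_n(x_n, x_{n-1})}{\gamma_{n-1}(x_{n-1})\, K_n(x_{n-1}, x_n)},$$ which does look like an MH acceptance ratio, but it is used directly as an importance weight rather than being capped at one inside an accept/reject step, so nothing prevents it from becoming arbitrarily large; that is where the warning about very large or infinite variance comes from.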

Alex
  • Thank you for the detailed response. I understand the motivation for an extended space, but am unsure about the particular choice of auxiliary target, i.e., why do we choose $\tilde{\pi}_n(X_{1:n}) = \pi_n(X_n) \prod_{i=1}^{n-1} L_i(X_{i+1}, X_i)$ instead of $\tilde{\pi}_n(X_{1:n}) = \prod_{i=1}^n \pi_i(X_i)$? – user May 13 '19 at 04:22
  • Thanks for your comment - yes, I answered a different question than the one you asked. I've edited my reply to give an answer to your question. Let me know if I've missed anything else. – Alex May 13 '19 at 11:21
  • The original answer was helpful, and the updated response cleared up my remaining concerns, thank you very much :). – user May 13 '19 at 15:09
  • Glad I could help! – Alex May 13 '19 at 21:43