
I am stuck on equation 17.13 in the open-source book Deep Learning, page 590.

In short, the question is this: for the importance sampling estimator $$\hat s_q = \frac{1}{n}\sum_{i=1,\,x^{(i)}\sim q}^{n}\frac{p(x^{(i)})f(x^{(i)})}{q(x^{(i)})}$$

its variance can be represented as $$\text{Var}[\hat s_q ] = \text{Var}\!\left[\frac{p(x)f(x)}{q(x)}\right]/n.$$
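To make these two displays concrete, here is a quick numerical check (a toy sketch of my own, not from the book, with the hypothetical choices $p=\mathcal N(0,1)$, $f(x)=x^2$, and proposal $q=\mathcal N(0,2^2)$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup of my own (not from the book): p = N(0,1), f(x) = x^2,
# so the true value is s = E_p[f] = 1; the proposal is q = N(0, 2^2).
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)
f = lambda x: x**2

n, trials = 100, 20_000
x = rng.normal(0.0, 2.0, size=(trials, n))   # x ~ q
w = p(x) * f(x) / q(x)
s_hat = w.mean(axis=1)                       # one estimate per trial

print(s_hat.mean())                  # ~1.0: the estimator is unbiased
print(s_hat.var(), w.var() / n)      # the two agree: Var[s_hat] = Var[pf/q]/n
```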

I am not clear on why the minimum of this variance occurs when $q$ is:

$$q^*(x)=\frac{p(x)|f(x)|}{Z},$$

where $Z$ is the normalization constant, chosen so that $q^*(x)$ sums or integrates to 1, as appropriate.

Any suggestions are highly appreciated. Thanks very much!

Lerner Zhang

2 Answers


An intuitive explanation is that we want $q$ to be large whenever either $p$ or $|f|$ is large. Otherwise, our estimate of $E_p[f]$ might have a lot of error, since we're "missing out" on sampling the most influential regions of the real number line. A proof is below:

So we want to minimize

$$\begin{align} \text{Var}_q\left[ \frac{p(x)f(x)}{q(x)} \right] &= E_q\left[ \left( \frac{p(x)f(x)}{q(x)} \right)^2 \right] - E_q\left[\frac{p(x)f(x)}{q(x)} \right]^2 \\ \end{align}$$

The second term is constant with respect to $q$: indeed, $E_q\left[\frac{p(x)f(x)}{q(x)}\right] = \int q(x)\,\frac{p(x)f(x)}{q(x)}\,dx = \int p(x)f(x)\,dx = E_p[f(x)]$, so it comes out to exactly $E_p[f(x)]^2$ for every valid $q$. We can therefore drop it from the optimization, which leaves us with

$$\begin{align} E_q\left[ \left( \frac{p(x)f(x)}{q(x)} \right)^2 \right] &= \int \frac{p(x)^2 f(x)^2}{q(x)} dx \end{align}$$

We also have the constraint that $\int q(x)\,dx = 1$. Since this is a constrained optimization problem, we can write the Lagrangian:

$$\begin{align} L(q, \lambda) &= \int \frac{p(x)^2 f(x)^2}{q(x)} dx + \lambda \left( \int q(x) dx - 1 \right) \end{align}$$

We want the functional derivative with respect to $q$:

$$\begin{align} \frac{\partial L(q, \lambda)}{\partial q(x)} &= \lim_{\epsilon \rightarrow 0} \frac{\partial}{\partial \epsilon} \left[ \int \frac{p(x)^2 f(x)^2}{q(x) + \epsilon \eta(x)} dx + \lambda \left( \int (q(x)+\epsilon \eta(x)) dx - 1 \right) \right] \end{align}$$ for any arbitrary $\eta$.

This comes out to $$\begin{align} &\lim_{\epsilon \rightarrow 0} \left[ \int -\frac{p(x)^2 f(x)^2}{(q(x) + \epsilon \eta(x))^2} \eta(x)\ dx + \lambda \int \eta(x) dx \right] \\ &= \int -\frac{p(x)^2 f(x)^2}{q(x)^2} \eta(x)\ dx + \lambda \int \eta(x) dx \\ &= \int \eta(x)\left( \lambda -\frac{p(x)^2 f(x)^2}{q(x)^2}\right) dx \\ \end{align}$$

Since we want the derivative to be $0$ for every choice of $\eta$, we must have

$$\begin{align} 0 &=\lambda -\frac{p(x)^2 f(x)^2}{q(x)^2} \\ \lambda &= \frac{p(x)^2 f(x)^2}{q(x)^2} \\ q(x)^2 &= \frac{p(x)^2 f(x)^2}{\lambda} \\ q(x) &= \frac{p(x)|f(x)|}{\sqrt{\lambda}} \end{align}$$

And since $q$ must integrate to $1$, it must be the case that $\sqrt{\lambda} = \int p(x)|f(x)|\,dx = Z$.
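If you want to sanity-check this result numerically, here is a small discretized version of the optimization (my own sketch, not part of the derivation above; it assumes `scipy` is available):

```python
import numpy as np
from scipy.optimize import minimize

# Discretized sanity check of the derivation (my own sketch): minimize
# sum_i p_i^2 f_i^2 / q_i over pmfs q, and compare the numerical optimum
# with the closed form q_i* = p_i |f_i| / Z.
rng = np.random.default_rng(1)
k = 8
p = rng.random(k)
p /= p.sum()                                 # an arbitrary pmf p
f = rng.normal(size=k)                       # arbitrary values of f

objective = lambda q: np.sum(p**2 * f**2 / q)
constraint = {'type': 'eq', 'fun': lambda q: q.sum() - 1.0}
res = minimize(objective, np.full(k, 1.0 / k), method='SLSQP',
               bounds=[(1e-9, 1.0)] * k, constraints=[constraint])

q_star = p * np.abs(f) / np.sum(p * np.abs(f))
print(np.max(np.abs(res.x - q_star)))        # ~0: matches p|f|/Z
```

The numerical optimum should agree with the closed form up to solver tolerance.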

shimao
  • Thanks for your derivation. But how does the first limit become the second line? It seems that the divisor gets squared somewhere... – Lerner Zhang Jan 25 '18 at 01:24
  • It is because the derivative of $1/f(\epsilon)$ is $-\frac{f'(\epsilon)}{f(\epsilon)^2}$ by the chain rule, where in this case, $f(\epsilon) = q(x) + \epsilon \eta(x)$ – shimao Jan 25 '18 at 01:29

The easiest and most intuitive answer [in addition to the earlier one, which is completely to the point!] is that, when $f$ is a positive function, the variance of the resulting optimum is $$\text{var}[\hat s_q ] = \text{var}\left\{\frac{p(x)f(x)}{p(x)f(x)/Z}\right\}\frac{1}{n}=\frac{\text{var}[Z]}{n}=0$$ since $Z$ is a constant. It thus cannot be beaten, and rightly so, since the resulting optimal estimator is $$\hat s_q \stackrel{x^{(i)}\sim q^\star}{=} \dfrac{1}{n}\sum_{i=1}^n\frac{ p(x^{(i)})f(x^{(i)}) }{ \left\{\dfrac{ p(x^{(i)})f(x^{(i)})}{Z}\right\} }=Z,$$ which is indeed perfect since it returns the exact (if unknown) numerical value of the integral!
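Here is a toy numerical illustration of this zero-variance phenomenon (my own sketch, with the hypothetical choices $p=\mathcal N(0,1)$ and $f(x)=x^2$, for which $Z=E_p[x^2]=1$ and $q^\star$ is a chi distribution with a random sign):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy check of the zero-variance claim (my own example): p = N(0,1) and
# f(x) = x^2 >= 0, so Z = E_p[x^2] = 1 and
# q*(x) = p(x) f(x) / Z = x^2 exp(-x^2/2) / sqrt(2*pi),
# i.e. |X| follows a chi distribution with 3 df, with a random sign.
x = np.sqrt(rng.chisquare(df=3, size=10)) * rng.choice([-1.0, 1.0], size=10)

p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f = x**2
q_star = p * f                    # Z = 1 here

print(p * f / q_star)             # every weight equals Z = 1: zero variance
```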

Remark: If $f$ takes positive and negative values, it can be written as $$f(x)=\max(f(x),0)-\max(-f(x),0)\stackrel{\text{def}}{=}f^+(x)-f^-(x)$$ Therefore, if one defines two importance functions, $$q^+(x)=\dfrac{p(x)f^+(x)}{Z^+}\quad\text{and}\quad q^-(x)=\dfrac{p(x)f^-(x)}{Z^-}$$one can produce a zero variance estimator as $$\hat{s}=\frac{1}{n^+}\sum_{i=1}^{n^+} \dfrac{p(x_i)f^+(x_i)}{q^+(x_i)}-\frac{1}{n^-}\sum_{i=1}^{n^-} \dfrac{p(y_i)f^-(y_i)}{q^-(y_i)}$$ based on two samples of sizes $n^+$ and $n^-$ (although $n^+=n^-=1$ is enough).
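Again a toy illustration of the remark (my own sketch, with the hypothetical choices $p=\mathcal N(0,1)$ and $f(x)=x$, for which $q^+$ is a Rayleigh distribution, $q^-$ its mirror image, and $Z^+=Z^-=1/\sqrt{2\pi}$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5

# Toy check of the remark (my own example): p = N(0,1), f(x) = x, so
# f+ = max(x,0), f- = max(-x,0), Z+ = Z- = 1/sqrt(2*pi), and
# q+(x) = p(x) f+(x) / Z+ = x exp(-x^2/2) on x > 0, i.e. Rayleigh(1);
# q- is its mirror image on x < 0.
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x = rng.rayleigh(size=n)                     # draws from q+
y = -rng.rayleigh(size=n)                    # draws from q-

w_plus = phi(x) * x / (x * np.exp(-x**2 / 2))        # all equal Z+
w_minus = phi(y) * (-y) / (-y * np.exp(-y**2 / 2))   # all equal Z-

s_hat = w_plus.mean() - w_minus.mean()
print(w_plus, w_minus)    # each weight is 1/sqrt(2*pi) ~ 0.3989
print(s_hat)              # ~0 = E_p[X], up to float rounding: zero variance
```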

Xi'an