I have been wanting to understand this paper in a deeper way for a long time: *Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm*.

But every time I read about Stein's method or Stein's lemma I get confused, and I am not sure where to go to get some understanding.

To start, I don't understand how the Stein operator $\mathcal{A}_p$ and the function $\phi$ in

$$\mathcal{A}_p\phi(x) = \phi(x) \nabla_x \log p(x)^\top + \nabla_x \phi(x)$$

come from the Wikipedia articles linked above. I don't know what makes this topic so confusing compared to other things, but I think I need a fresh perspective on this, if someone could help explain it to me.

Thanks

Joff

1 Answer

To start, your equation for the Stein operator (equation 1 of the paper) is the same as equation 3.4 of the Wikipedia article. If we write $q$ for your $p$, then $$\nabla_x\log p(x)^\top$$ is Wikipedia's $q'(x)/q(x)$. Writing $f$ for your $\phi$, $${\cal A}_p\phi(x)\equiv {\cal A}_pf(x)=f(x)q'(x)/q(x)+f'(x),$$ as Wikipedia has it.
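As a concrete sanity check (not from the paper, just the standard normal case): if $p(x)=(2\pi)^{-1/2}e^{-x^2/2}$, then $$\nabla_x\log p(x)=-x=p'(x)/p(x),$$ so $${\cal A}_pf(x)=-x\,f(x)+f'(x),$$ which is exactly the classical Stein operator characterising the Gaussian.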

Now, so what? Well, suppose we actually sample $X$ from a distribution $P$ but we want to approximate it by a simpler distribution $Q$ (using Wikipedia's notation). We want to choose (learn) $Q$, but (for some reason, perhaps extra coolness points) we don't want to restrict ourselves to a parametric family $Q_\theta$.

Stein's Lemma says $$E[{\cal A}_pf(X)]=0$$ for all (sufficiently nice) $f$ when $Q=P$. That sort of looks like the condition that the derivative in the 'direction' $f$ is zero when the distance between $Q$ and $P$ is minimised. So we might get the idea of using $E[{\cal A}_pf(X)]$ as a sort of 'gradient' of some distance $d(P,Q)$ in the 'direction' $f$ and doing something like coordinate descent or gradient descent by updating $$q(x)\mapsto q(x)-\epsilon \times E[{\cal A}_pf(X)]\times f(x)$$
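You can see both halves of that numerically. Here is a quick Monte Carlo sanity check (a toy sketch of mine, not from the paper; the standard normal target and the test function $f(x)=\sin x$ are arbitrary illustrative choices):

```python
import numpy as np

# Stein's identity: E[ f(X) d/dx log p(X) + f'(X) ] = 0 when X ~ p.
# Here p = N(0, 1), so d/dx log p(x) = -x, and f(x) = sin(x) is an arbitrary smooth test function.
rng = np.random.default_rng(0)

def stein_term(x):
    return np.sin(x) * (-x) + np.cos(x)   # f(x) * score(x) + f'(x)

x_p = rng.standard_normal(1_000_000)        # samples from p itself (Q = P)
x_q = rng.normal(loc=1.0, size=1_000_000)   # samples from a different q = N(1, 1)

print(stein_term(x_p).mean())   # ~ 0, up to Monte Carlo error
print(stein_term(x_q).mean())   # clearly non-zero: the mismatch Q != P is detectable
```

The second expectation being non-zero is exactly what the 'gradient' heuristic exploits: how far $E[{\cal A}_pf(X)]$ is from zero measures, in the direction $f$, how far $Q$ is from $P$.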

What the paper does is show that this handwaving can be made precise, so you do get gradient descent for the Kullback-Leibler divergence. It also (crucially) shows that the properties holding for all (sufficiently nice) $f$ can be replaced by them holding for $f$ generated by some useful set of transformations of $x$ (a reproducing kernel Hilbert space). This means you get an algorithm that can actually be implemented.
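For concreteness, here is a minimal one-dimensional sketch of the resulting algorithm in Python (my own toy implementation of the paper's particle update with an RBF kernel, not the authors' code; the Gaussian-mixture target, step size, and particle count are arbitrary illustrative choices):

```python
import numpy as np

# Toy 1-d SVGD: particles are pushed by a kernel-smoothed score term plus a repulsive term.
def grad_log_p(x):
    # d/dx log p(x) for the mixture p(x) = 0.5 N(-2, 1) + 0.5 N(2, 1)
    logc = np.stack([-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2])
    w = np.exp(logc - logc.max(axis=0))
    w /= w.sum(axis=0)                       # responsibilities of the two components
    return w[0] * (-(x + 2)) + w[1] * (-(x - 2))

def svgd_step(x, eps=0.1):
    diff = x[:, None] - x[None, :]                 # pairwise differences x_j - x_i
    h = np.median(diff ** 2) / np.log(len(x) + 1)  # median-heuristic bandwidth
    k = np.exp(-diff ** 2 / h)                     # RBF kernel k(x_j, x_i)
    grad_k = -2.0 / h * diff * k                   # d k(x_j, x_i) / d x_j (repulsion)
    phi = (k * grad_log_p(x)[:, None] + grad_k).mean(axis=0)
    return x + eps * phi                           # move every particle along phi

particles = np.random.default_rng(0).normal(size=50)   # initial sample from q_0 = N(0, 1)
for _ in range(500):
    particles = svgd_step(particles)
print(particles.mean(), particles.std())   # should be roughly 0 and sqrt(5) for this mixture
```

Each particle's update $\phi$ is the RKHS version of the 'gradient in the direction $f$' above: a kernel-weighted average of the score $\nabla_x\log p$ pulls particles towards high-density regions, while the kernel-gradient term keeps them spread out.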

Thomas Lumley
  • Thanks so much for your answer. I think you broke me out of my local minimum. I missed the log derivative being equivalent to the Wikipedia article's, which seems obvious now. The expectation of the Stein operator also made some things click. I think I can re-read the paper with a better perspective on what it all means. Thanks – Joff Dec 31 '21 at 04:09