
This answer states that we cannot back-propagate through a random node. So, in the case of VAEs, you have the reparametrisation trick, which shifts the source of randomness to an auxiliary variable other than $z$ (the latent vector), so that $z$ becomes a deterministic, differentiable function of the distribution's parameters and gradients can flow through it. Similarly, this question states that we cannot differentiate a random sampling operation.
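
For concreteness, here is a minimal sketch of the trick as I understand it (PyTorch-style; the parameter names and the quadratic loss are placeholders of my own):

```python
import torch

# Parameters of the sampling distribution q(z) = N(mu, sigma^2)
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

# Reparametrisation: the randomness lives in eps, not in z itself,
# so z is a deterministic, differentiable function of (mu, log_sigma).
eps = torch.randn(())
z = mu + torch.exp(log_sigma) * eps

loss = (z - 1.0) ** 2   # placeholder loss
loss.backward()         # gradients flow through z to mu and log_sigma
```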

Why exactly is this the case? Why is randomness a problem when differentiating and back-propagating? I think this should be made explicit and clear.

2 Answers


Gregory Gundersen wrote a blog post about this in 2018. He explicitly answers the question:

What does a “random node” mean and what does it mean for backprop to “flow” or not flow through such a node?

The following excerpt should answer your questions:

Undifferentiable expectations

Let’s say we want to take the gradient w.r.t. $\theta$ of the following expectation, $$\mathbb{E}_{p(z)}[f_{\theta}(z)]$$ where $p$ is a density. Provided we can differentiate $f_{\theta}(z)$, we can easily compute the gradient:

$$ \begin{align} \nabla_{\theta} \mathbb{E}_{p(z)}[f_{\theta}(z)] &= \nabla_{\theta} \Big[ \int_{z} p(z) f_{\theta}(z) dz \Big] \\ &= \int_{z} p(z) \Big[\nabla_{\theta} f_{\theta}(z) \Big] dz \\ &= \mathbb{E}_{p(z)} \Big[\nabla_{\theta} f_{\theta}(z) \Big] \end{align} $$

In words, the gradient of the expectation is equal to the expectation of the gradient. But what happens if our density $p$ is also parameterized by $\theta$?

$$ \begin{align} \nabla_{\theta} \mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] &= \nabla_{\theta} \Big[ \int_{z} p_{\theta}(z) f_{\theta}(z) \, dz \Big] \\ &= \int_{z} \nabla_{\theta} \Big[ p_{\theta}(z) f_{\theta}(z) \Big] dz \\ &= \int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z) \, dz + \int_{z} p_{\theta}(z) \nabla_{\theta} f_{\theta}(z) \, dz \\ &= \underbrace{\int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z) \, dz}_{\text{What about this?}} + \mathbb{E}_{p_{\theta}(z)} \Big[\nabla_{\theta} f_{\theta}(z) \Big] \end{align}$$

The first term of the last equation is not guaranteed to be an expectation, so we cannot estimate it by sampling alone: Monte Carlo methods require that we can sample from $p_{\theta}(z)$, but not that we can take the gradient of its density. This would not be a problem if we had an analytic expression for $\nabla_{\theta} p_{\theta}(z)$, but such an expression is not available in general.
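
To make the "backprop cannot flow through a random node" point concrete, here is a small sketch (assuming PyTorch; the choice of $f$ and of a Gaussian $p_{\theta}$ is mine):

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)
f = lambda z: (z - 2.0) ** 2  # any differentiable f

# Direct sampling: .sample() runs under no_grad, so the draw is a
# constant as far as autograd is concerned and no gradient can
# reach mu or sigma through it.
z = torch.distributions.Normal(mu, sigma).sample()
print(z.requires_grad)  # False -> backprop stops at the random node

# Reparametrised sampling: z = mu + sigma * eps with eps ~ N(0, 1).
# The random node is now eps; z is a differentiable function of
# (mu, sigma), so gradients flow.
eps = torch.randn(())
z = mu + sigma * eps
f(z).backward()
print(mu.grad, sigma.grad)  # well-defined gradient estimates
```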

Simon

It'd be easier to see with sampling from a categorical distribution. Say you have a categorical distribution with probabilities $$\pi_1, \pi_2, \dots, \pi_K, \quad \pi_i \ge 0, \quad \sum_{i=1}^K \pi_i = 1$$ so that $p(x=i \mid \pi) = \pi_i$. To draw a sample $x$ from this distribution, a standard way is:

$$ u \sim U(0,1), \qquad x = \min \Big\{ i : \sum_{j=1}^{i} \pi_j \ge u \Big\} $$

That is, we draw a value from a uniform distribution and check which "bin" of the categorical distribution's CDF it falls into. This bin lookup is a step function of the probabilities $\pi_i$: an infinitesimal change to $\pi$ either leaves the sampled index unchanged or makes it jump to a different integer, so the derivative is zero almost everywhere and undefined at the jumps. Hence the sampling operation is not differentiable.
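
A minimal sketch of this inverse-CDF sampler (NumPy; the probabilities are an arbitrary example of mine):

```python
import numpy as np

def sample_categorical(pi, u):
    """Return the smallest index i (0-based) whose CDF value reaches u."""
    cdf = np.cumsum(pi)
    return int(np.searchsorted(cdf, u))  # the "bin lookup" step

pi = np.array([0.2, 0.5, 0.3])
u = np.random.uniform()
x = sample_categorical(pi, u)
# x is piecewise constant in pi: nudging pi slightly either leaves x
# unchanged or makes it jump to another integer, so dx/dpi is zero
# almost everywhere and undefined at the jump points.
```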

kublai