
I know it is said that we use the reparameterization trick so we can do back-propagation, because back-propagation cannot be applied to random sampling. However, I don't precisely understand that last part. Why can't we do that? We have a mu and a std, and our z is sampled directly from them. What is the naive method that we don't do?

I mean, what were we supposed to do that didn't work and instead made us use the reparameterization trick? We had to sample from the distribution using the mean and std anyway, and we are doing that now, so what has changed? I don't get why z was considered a random node before, but not now. I'm trying to see how the first way differs from what we do in the reparameterization, and I can't find any difference.

Hossein
  • Some related questions: https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important and https://stats.stackexchange.com/questions/342762/how-do-variational-auto-encoders-backprop-past-the-sampling-step/342815#342815 – Sycorax Sep 30 '19 at 13:36
  • These links are massively helpful, thanks a lot. It would be a good idea to also add them to your accepted answer. – Hossein Sep 30 '19 at 19:08
  • Does this answer your question? [How does the reparameterization trick for VAEs work and why is it important?](https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important) – David Dao Jun 17 '20 at 22:51
  • @DavidDao: That is different. In this question I am asking what the crude way of doing this is, the one that is not done by default (in a technical sense). All the answers to the previous question try to explain why the trick is a must-have, rather than why it is needed, or how it differs, from a technical point of view. The answer I posted shows that it is not technically impossible; rather, it would have a completely different meaning, and that is why we use the reparameterization trick (not because it is impossible, as was pointed to as the sole reason). – Hossein Jun 18 '20 at 05:10
  • In other words, it's not that back-propagation is impossible; it's that it would work, but not in the direction we expect it to. Therefore we re-parameterize this way so that it goes in the direction we need. – Hossein Jun 18 '20 at 05:10

3 Answers


After $\epsilon$ is sampled, it is completely known; we can treat it the same way as any other data (image, text, feature vector) that's input to a neural network. Just like your input data, $\epsilon$ is known and won't change after you sample it.

This means that the expression $z = \mu + \sigma \odot \epsilon$ has no random components after sampling: you know $\mu, \sigma$ because you obtained them from the encoder, and you know $\epsilon$ because you've just sampled it and it is now fixed at a particular value. Consequently, you can backpropagate through $\mu + \sigma \odot \epsilon$ with respect to $\mu, \sigma$, because all of its elements are known and fixed.

By contrast, the expression $z \sim \mathcal{N}(\mu,\sigma^2)$ is not deterministic in $\mu, \sigma$, so you can't write a backprop expression with respect to $\mu, \sigma$ for it. Even though $\mu, \sigma$ are fixed, you can obtain any real number as an output.
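
For concreteness, here is a minimal sketch of that point (my addition, not part of the original answer), using PyTorch-style autograd with made-up encoder outputs: once $\epsilon$ is drawn, $z$ is an ordinary deterministic function of $\mu$ and $\sigma$, so gradients reach them.

```python
import torch

# Hypothetical encoder outputs for a single data point (values are made up).
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_sigma = torch.tensor([0.1, 0.2], requires_grad=True)
sigma = log_sigma.exp()

# Draw epsilon once; after this call it is just a fixed tensor of numbers.
eps = torch.randn_like(sigma)

# Reparameterized sample: deterministic in mu and sigma given the fixed eps.
z = mu + sigma * eps

# Any downstream loss (a dummy one here) backpropagates to mu and log_sigma.
loss = (z ** 2).sum()
loss.backward()

print(mu.grad)         # 2 * z
print(log_sigma.grad)  # 2 * z * eps * sigma
```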

Sycorax
  • Thanks a lot, really appreciate it :) – Hossein Sep 30 '19 at 19:02
  • On second thought, looking at how numpy, for example, implements normal sampling, I'm again confused. In the numpy source code this is nearly exactly what we are doing: [distributions.c#L516](https://github.com/numpy/numpy/blob/master/numpy/random/src/distributions/distributions.c#L516) is `return loc + scale * random_gauss_zig(bitgen_state);`, so it should be backpropagatable! – Hossein Oct 01 '19 at 12:22
  • I don't follow. What would this make backprop-able? Your code snippet is the expression $\mu + \sigma \odot \epsilon$, where `random_gauss_zig(bitgen_state)` $= \epsilon$. – Sycorax Oct 01 '19 at 12:33
  • It was said that normal sampling is not deterministic and thus can't be backpropagated ($z \sim \mathcal{N}(\mu, \sigma^2)$), and that's why we reparameterize the sampling operation the way we do. But as you can see, the normal sampling operation is just the same as what we are doing by hand! – Hossein Oct 01 '19 at 12:46
  • Why are you confused? As you said, the numpy code implements the procedure that we do by hand, with the PRNG taking the place of $\epsilon \sim \mathcal{N}(0,1^2)$. – Sycorax Oct 01 '19 at 12:57
  • I want to know what is wrong. So far I know what the correct way is (reparameterization); what is the wrong way of doing this? Simple normal sampling seems to be just right, so what is the wrong way that we are not doing? I want to write down the wrong way and see it fail. – Hossein Oct 01 '19 at 19:36
  • I don't understand what you're asking. Because a clear explanation won't fit in a comment box, I think it's best to ask a new Question, explaining what you know & understand, what you want to know, and where you're stuck. – Sycorax Oct 01 '19 at 20:22

The answer is simple: Sampling is not differentiable.

You can't write a sampling procedure in derivative form, such as $\frac{\partial z}{\partial \mu}$, without tricks (e.g., reparameterization or the Gumbel-softmax).
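
As a concrete illustration (my addition, assuming PyTorch's `torch.distributions` API rather than anything from the original answer): `sample()` draws a value with no gradient path back to the parameters, while `rsample()` applies the reparameterization trick, so gradients reach $\mu$ and $\sigma$.

```python
import torch
from torch.distributions import Normal

mu = torch.tensor([0.0], requires_grad=True)
sigma = torch.tensor([1.0], requires_grad=True)
dist = Normal(mu, sigma)

# Plain sampling: the draw is detached from mu and sigma.
z = dist.sample()
print(z.requires_grad)  # False -- no gradient path back to the parameters

# Reparameterized sampling: internally z = mu + sigma * eps with eps ~ N(0, 1).
z = dist.rsample()
z.sum().backward()
print(mu.grad, sigma.grad)  # gradients flow to the distribution parameters
```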

hpwww
  • Hi, thanks, but what do you mean by "sampling is not differentiable"? Are you referring to the way sampling is carried out, or something else? Because in numpy, for example, sampling is done exactly the same way we do it in the reparameterization, so I'm having a very hard time understanding this. – Hossein Oct 03 '19 at 01:40
  • Although you have found an excellent explanation, I would like to clarify my idea. Yes, you are right; it depends on what kind of sampling method you use, but that's not the general case. For example, we can sample from a normal distribution via the central limit theorem using an arbitrary $P(x_i)$, and in that case computing gradients could be intractable. – hpwww Oct 03 '19 at 08:46
  • Thanks a lot, that's very interesting. Could you please elaborate on this a bit more in your answer? This is very helpful. – Hossein Oct 03 '19 at 09:42

Thanks to dear God, I finally found the real explanation for this. It was originally posted by Gregory Gundersen; see his post for the full explanation.

TLDR:

... Kingma: This reparameterization is useful for our case since it can be used to rewrite an expectation w.r.t $q_{\phi}(\textbf{z} \mid \textbf{x})$ such that the Monte Carlo estimate of the expectation is differentiable w.r.t. $\phi$.

from "Auto-Encoding Variational Bayes," (Kingma & Welling, 2013).

The issue is not that we cannot backprop through a "random node" in any technical sense. Rather, backpropagating would not compute an estimate of the derivative. Without the reparameterization trick, we have no guarantee that sampling large numbers of $\textbf{z}$ will help us converge to the right estimate of $\nabla_{\theta}$.
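
To spell that out (my addition, using the notation of the quote and writing $\mu_\phi, \sigma_\phi$ for the encoder outputs): with $\mathbf{z} = \mu_\phi + \sigma_\phi \odot \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$, the gradient can be moved inside the expectation, so a Monte Carlo average over reparameterized samples estimates the true gradient:

$$
\nabla_\phi \, \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[ f(\mathbf{z}) \big]
= \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[ \nabla_\phi \, f(\mu_\phi + \sigma_\phi \odot \epsilon) \big]
\approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \, f\big(\mu_\phi + \sigma_\phi \odot \epsilon^{(l)}\big).
$$

Differentiating samples drawn directly as $\mathbf{z} \sim \mathcal{N}(\mu_\phi, \sigma_\phi^2)$ comes with no such guarantee, which is the sense in which plain back-propagation "works, but not in the direction we expect."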

Hossein