When training a Variational Autoencoder, the objective being maximised is the variational (evidence) lower bound:
$$ \mathscr{L}(\boldsymbol{\theta}, \phi; \mathbf{x}^{(i)}) = -D_{KL}\left(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})\,||\,p_{\boldsymbol{\theta}}(\mathbf{z})\right) + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})\right] $$
I feel that I understand the practical calculation of the first term (the KL divergence) reasonably well, since the original paper provides a short derivation (Appendix B) that isn't too difficult to follow.
It's the second term that's giving me trouble. Firstly, it's not clear to me exactly what the subscript on the expectation denotes in this case. Secondly, if our model produces real values, how do we evaluate $p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})$ for even a single sample? For a continuous output distribution, the probability of producing exactly one specific sample is zero, even if we output a full probability distribution.
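For reference, my current understanding (which may be wrong) is that the paper estimates this expectation with Monte Carlo samples drawn from the encoder, so the subscript names the distribution the samples of $\mathbf{z}$ come from:

$$ \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})\right] \approx \frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)}), \qquad \mathbf{z}^{(i,l)} \sim q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}), $$

often with $L = 1$ in practice. But this still leaves the question of how $\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})$ itself is evaluated from the decoder's output.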
I've not been able to find any discussion (that I could understand as such, at least) of how this probability is determined from the model's outputs. But if I look at some implementations (implementation 3, implementation 4), I see that it's calculated simply as the negative log likelihood / cross entropy, or as the MSE between the input and the reconstruction.
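For concreteness, this is roughly the pattern I see (a minimal sketch in PyTorch; the tensor names, shapes, and dummy data are my own, not taken from the linked code):

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 784
x = torch.rand(batch, dim)        # "real" data, here assumed to lie in [0, 1]
recon_x = torch.rand(batch, dim)  # stand-in for the decoder output, e.g. after a sigmoid

# Variant 1: the reconstruction term computed as binary cross entropy
# between the decoder output and the input.
recon_loss_bce = F.binary_cross_entropy(recon_x, x, reduction='sum')

# Variant 2: the reconstruction term computed as mean squared error
# between the decoder output and the input.
recon_loss_mse = F.mse_loss(recon_x, x, reduction='sum')
```

Both of these are presented as the same reconstruction term, which is what prompts the following.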
How can the same expectation, given the same kind of outputs (real-valued, albeit not necessarily bounded in the same way), lead to these different calculations? And how should I approach computing the second term in general?
Additionally, $p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})$ is sometimes referred to as a probability and other times as a "model". A model is very different from a probability; how can I reconcile these two descriptions?