
In a VAE, the encoder learns to output two vectors:

$$\mathbf{\mu} \in \mathbb{R}^{z}, \qquad \mathbf{\sigma} \in \mathbb{R}^{z}$$

which are the mean and standard deviation of the latent distribution; the latent vector $\mathbf{z}$ is then computed as:

$$\mathbf{z} = \mathbf{\mu} + \mathbf{\sigma} \odot \mathbf{\epsilon}$$

where $\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{z \times z})$.
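For concreteness, here is a minimal NumPy sketch of this sampling step (the reparameterization trick); the function and variable names are purely illustrative and not taken from any particular implementation:

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), elementwise."""
    eps = rng.standard_normal(size=mu.shape)  # one standard-normal draw per latent dimension
    return mu + sigma * eps

# Illustrative encoder outputs for a 3-dimensional latent space
rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -0.5])
sigma = np.array([1.0, 0.5, 2.0])
z = reparameterize(mu, sigma, rng)
```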

The KL divergence loss for a VAE for a single sample is defined as (referenced from this implementation and this explanation):

$$\frac{1}{2} \left[ \left(\sum_{i=1}^{z}\mu_{i}^{2} + \sum_{i=1}^{z}\sigma_{i}^{2} \right) - \sum_{i=1}^{z} \left(\log(\sigma_{i}^{2}) + 1 \right) \right]$$

I'm not sure how they arrived at this expression, though. Would anyone care to explain, or point me to the right resources?

YellowPillow

1 Answer


The encoder distribution is $q(z|x)=\mathcal{N}(z|\mu(x),\Sigma(x))$ where $\Sigma=\text{diag}(\sigma_1^2,\ldots,\sigma^2_n)$.

The latent prior is given by $p(z)=\mathcal{N}(0,I)$.

Both are multivariate Gaussians of dimension $n$, for which in general the KL divergence is: $$ \mathfrak{D}_\text{KL}[p_1\mid\mid p_2] = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \text{tr} \{ \Sigma_2^{-1}\Sigma_1 \} + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1)\right] $$ where $p_1 = \mathcal{N}(\mu_1,\Sigma_1)$ and $p_2 = \mathcal{N}(\mu_2,\Sigma_2)$.
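For reference, this general expression translates directly into a small NumPy sketch (illustrative only; `kl_gaussians` is a hypothetical helper, not a library function):

```python
import numpy as np

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """KL[N(mu1, Sigma1) || N(mu2, Sigma2)] for full-covariance Gaussians."""
    n = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (
        np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))  # log(|Sigma2| / |Sigma1|)
        - n
        + np.trace(Sigma2_inv @ Sigma1)
        + diff @ Sigma2_inv @ diff  # quadratic term in the mean difference
    )
```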

In the VAE case, $p_1 = q(z|x)$ and $p_2=p(z)$, so $\mu_1=\mu$, $\Sigma_1 = \Sigma$, $\mu_2=\vec{0}$, $\Sigma_2=I$. Thus:
\begin{align}
\mathfrak{D}_\text{KL}[q(z|x)\mid\mid p(z)] &= \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \text{tr} \{ \Sigma_2^{-1}\Sigma_1 \} + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1)\right]\\
&= \frac{1}{2}\left[\log\frac{|I|}{|\Sigma|} - n + \text{tr} \{ I^{-1}\Sigma \} + (\vec{0} - \mu)^T I^{-1}(\vec{0} - \mu)\right]\\
&= \frac{1}{2}\left[-\log{|\Sigma|} - n + \text{tr} \{ \Sigma \} + \mu^T \mu\right]\\
&= \frac{1}{2}\left[-\log\prod_i\sigma_i^2 - n + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]\\
&= \frac{1}{2}\left[-\sum_i\log\sigma_i^2 - n + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]\\
&= \frac{1}{2}\left[-\sum_i\left(\log\sigma_i^2 + 1\right) + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]
\end{align}
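As a sanity check on the last line, here is a short NumPy sketch that evaluates the closed form and compares it with a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(z) - \log p(z)]$ (the numbers and names are illustrative only):

```python
import numpy as np

def kl_closed_form(mu, sigma2):
    """KL[N(mu, diag(sigma2)) || N(0, I)] from the last line of the derivation."""
    return 0.5 * np.sum(-np.log(sigma2) - 1.0 + sigma2 + mu**2)

# Monte Carlo estimate of E_q[log q(z) - log p(z)] with z ~ q(z|x)
rng = np.random.default_rng(0)
mu = np.array([0.3, -1.2, 0.7])
sigma2 = np.array([0.5, 2.0, 1.3])
z = mu + np.sqrt(sigma2) * rng.standard_normal((200_000, mu.size))
log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (z - mu) ** 2 / sigma2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z ** 2, axis=1)
print(kl_closed_form(mu, sigma2), np.mean(log_q - log_p))  # the two values should roughly agree
```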

user3658307
  • Can you comment on why this looks so different from the univariate case? (https://stats.stackexchange.com/q/7440/141343 bottom of question) – BlackBear Sep 05 '19 at 16:48
  • @BlackBear your univariate example doesn't assume $\sigma_2 = 1$, which is the only difference – Firebug Sep 05 '19 at 16:50
  • @BlackBear as Firebug notes, $p$ is assumed to be multivariate standard normal (most of the time) in VAEs. If you set $n=1$ here and $\sigma_2=1,\,\mu_2=0$ there, they should match. Also, their answer starts from the definition of the KL divergence as an expectation of the log difference, whereas I started from the general expression for two Gaussians for simplicity. – user3658307 Sep 05 '19 at 16:56
  • What is $n$ in the equation? Why is it replaced with 1? – Gergő Horváth Mar 15 '21 at 10:22
  • @GergőHorváth, $n$ is the dimension of the vector z. It is not replaced with 1, it is replaced with $\sum_{i=1}^n 1$. – toliveira Apr 16 '21 at 23:42