
I was studying VAEs and came across the loss function, which includes the following KL divergence term:

$$ \sum_{i=1}^n \sigma^2_i + \mu_i^2 - \log(\sigma_i) - 1 $$

I wanted to intuitively make sense of the KL divergence part of the loss function. It would be great if somebody could help me.

raptorAcrylyc

1 Answer


The KL divergence tells us how well the probability distribution $Q$ approximates the probability distribution $P$: it is the cross-entropy $H(P, Q)$ minus the entropy $H(P)$. Intuitively, you can think of it as a statistical measure of how one distribution differs from another.
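
As a quick numerical sketch of the "cross-entropy minus entropy" view (my own illustration with made-up probability values, not part of the original answer):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])  # "true" distribution P (hypothetical values)
q = np.array([0.3, 0.3, 0.4])  # approximating distribution Q (hypothetical values)

cross_entropy = -np.sum(p * np.log(q))          # H(P, Q)
entropy = -np.sum(p * np.log(p))                # H(P)

kl_from_definition = np.sum(p * np.log(p / q))  # D_KL(P || Q) by definition

print(cross_entropy - entropy)  # same value as kl_from_definition
print(kl_from_definition)
```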

In a VAE, let $X$ be the data we want to model, $z$ the latent variable, $P(X)$ the probability distribution of the data, $P(z)$ the probability distribution of the latent variable, and $P(X|z)$ the distribution of the data generated given the latent variable.

In the case of variational autoencoders, our objective is to infer $P(z|X)$, the probability distribution that projects our data into the latent space. But since we do not have access to $P(z|X)$, we estimate it with a simpler approximating distribution $Q$.

Now, while training our VAE, the encoder should try to learn the simpler distribution $Q(z|X)$ such that it is as close as possible to the actual distribution $P(z|X)$. This is where the KL divergence comes in, as a measure of the difference between two probability distributions. The VAE objective function therefore includes this KL divergence term, which needs to be minimized:

$$ D_{KL}[Q(z|X) \,\|\, P(z|X)] = E_{z \sim Q}[\log Q(z|X) - \log P(z|X)] $$
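
The KL term that actually shows up in the loss in the question is the one taken against the prior, $D_{KL}[Q(z|X) \,\|\, P(z)]$, which falls out of the full ELBO derivation (see the link in the comments below). For the usual choices $P(z) = \mathcal{N}(0, I)$ and $Q(z|X) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, it has the well-known closed form $\frac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - \log \sigma_i^2 - 1)$, which is essentially the sum in the question (up to how the factor of $\frac{1}{2}$ and $\log \sigma_i^2 = 2\log \sigma_i$ are written). Below is a small sketch of my own, assuming those Gaussian choices and hypothetical encoder outputs, that checks the closed form against a Monte Carlo estimate of $E_{z \sim Q}[\log Q(z|X) - \log P(z)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.2])   # hypothetical encoder outputs for one data point
sigma = np.array([0.8, 1.2, 0.5])

# Monte Carlo estimate of E_{z~Q}[log Q(z|X) - log P(z)]
z = mu + sigma * rng.standard_normal((200_000, mu.size))
log_q = -0.5 * (((z - mu) / sigma) ** 2 + np.log(2 * np.pi) + 2 * np.log(sigma)).sum(axis=1)
log_p = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=1)
kl_monte_carlo = np.mean(log_q - log_p)

# Closed-form KL between N(mu, diag(sigma^2)) and the standard normal prior N(0, I)
kl_closed_form = 0.5 * np.sum(sigma ** 2 + mu ** 2 - np.log(sigma ** 2) - 1)

print(kl_monte_carlo, kl_closed_form)  # the two values should agree closely
```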

zoozoo
  • Thanks for the response, but I actually wanted to know how we get from $D_{KL}[Q(z|X) \,\|\, P(z|X)] = E[\log Q(z|X) - \log P(z|X)]$ to the equation mentioned in my question. – raptorAcrylyc Mar 01 '19 at 08:33
  • You may find the full derivation [here](https://stats.stackexchange.com/q/370048). – zoozoo Mar 01 '19 at 08:41