
I'm trying to understand one specific formula in a paper that I'm reading:

https://arxiv.org/pdf/1911.02469.pdf

It concerns Equation 10 of that paper (shown as an image in the original post).

Unfortunately, the authors don't explain what $\sigma$ is in this context. I also tried to find other sources; in other implementations it seems quite common to just use a conventional loss function of the author's liking, without much explanation.

Sandro

1 Answer


Recall that the VAE loss has two components: a reconstruction loss (since an autoencoder's aim is to learn to reconstruct its input) and a KL loss (which measures how much information is lost, i.e. how far we have diverged from the prior). The actual form of the VAE objective (the ELBO, which we aim to maximize) is:

$$ L(\theta , \phi) = \sum_{i=1}^{N} E_{z_{i} \sim q_{\phi}(z|x_{i})} \left[ \log p_{\theta} (x_{i}|z)\right] - KL\left(q_{\phi} (z | x_{i}) \,\|\, p(z)\right) $$ where $(x, z)$ is an input/latent-vector pair, and $q$ and $p$ are the encoder and decoder networks respectively. Since the decoder likelihood $p_{\theta}(x|z)$ is Gaussian, the reconstruction loss becomes the squared difference (L2 distance) between input and reconstruction (the logarithm of a Gaussian density reduces to a squared difference, up to constants).
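To make it concrete where $\sigma^{2}$ would enter this objective, here is a minimal NumPy sketch of a one-sample Monte Carlo estimate of the ELBO with a Gaussian decoder and a fixed observation-noise variance. This is only an illustration; the names `gaussian_vae_elbo`, `decode`, and `sigma2` are mine, not from the paper.

```python
import numpy as np

def gaussian_vae_elbo(x, mu_z, logvar_z, decode, sigma2=1.0, rng=np.random):
    """One-sample Monte Carlo estimate of the ELBO for a VAE whose decoder
    likelihood is p(x|z) = N(decode(z), sigma2 * I) and whose prior is N(0, I)."""
    # Reparameterised sample z ~ q(z|x) = N(mu_z, diag(exp(logvar_z)))
    eps = rng.standard_normal(mu_z.shape)
    z = mu_z + np.exp(0.5 * logvar_z) * eps

    # log p(x|z): the Gaussian log-density is a squared error scaled by 1/(2*sigma2),
    # plus a constant that depends only on sigma2 and the data dimension.
    x_hat = decode(z)
    d = x.size
    log_px_z = -0.5 * np.sum((x - x_hat) ** 2) / sigma2 \
               - 0.5 * d * np.log(2.0 * np.pi * sigma2)

    # KL(q(z|x) || N(0, I)) in closed form; note that sigma2 does not appear here.
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z)

    return log_px_z - kl
```

For a linear decoder as in the paper's analysis, `decode` would simply be something like `lambda z: W @ z + mu`; the point is only that $\sigma^{2}$ lives in the reconstruction term, not in the KL term.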

In the paper, the authors draw intuition from probabilistic PCA (pPCA) to explain when posterior collapse happens. The pPCA model, trained with EM or gradient ascent on the marginal log-likelihood $\log p(x)$, is defined as follows:

\begin{align} p(z) &= \mathcal{N}(0, I) \\ p(x|z) &= \mathcal{N}(Wz + \mu, \sigma ^{2} I) \end{align}

where $x$ and $z$ are the data and the latent variable respectively. The $\sigma ^{2}$ here is the variance of the observation noise. Posterior collapse turns out to appear at stationary points of the log-likelihood. The authors show that $\sigma ^{2}$ affects the stability of these collapsed stationary points in pPCA, and that the situation is similar in the deep extension of pPCA (the deep Gaussian VAE).
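For intuition about why $\sigma^{2}$ matters here, the pPCA posterior is available in closed form (Tipping & Bishop). The helper below is my own illustration, not code from the paper; it shows how a larger $\sigma^{2}$ pulls the posterior toward the prior $\mathcal{N}(0, I)$, which is exactly the collapse direction.

```python
import numpy as np

def ppca_posterior(x, W, mu, sigma2):
    """Exact posterior p(z|x) for pPCA: p(z) = N(0, I), p(x|z) = N(Wz + mu, sigma2 * I).
    Closed form (Tipping & Bishop): p(z|x) = N(M^{-1} W^T (x - mu), sigma2 * M^{-1}),
    with M = W^T W + sigma2 * I."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    M_inv = np.linalg.inv(M)
    mean = M_inv @ W.T @ (x - mu)
    cov = sigma2 * M_inv
    return mean, cov

# As sigma2 grows relative to W^T W, M is dominated by sigma2 * I, so the posterior
# mean shrinks toward 0 and the covariance approaches I: the posterior collapses
# onto the prior.
```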

So, $\sigma ^{2}$ comes from the first (reconstruction) component of the VAE loss, under the assumption that the data distribution has inherent observation noise. You still use the standard normal prior as it is.
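To see the scaling role of $\sigma^{2}$ explicitly (a short derivation of my own, consistent with the $\beta$-VAE connection mentioned in the comments below, not an equation copied from the paper): with $p_{\theta}(x|z) = \mathcal{N}(\hat{x}(z), \sigma^{2} I)$, the per-sample negative ELBO is, up to additive constants,

$$ -L(\theta, \phi) = E_{z \sim q_{\phi}(z|x)}\left[ \frac{1}{2\sigma^{2}} \| x - \hat{x}(z) \|^{2} \right] + KL\left(q_{\phi}(z|x) \,\|\, p(z)\right) + \text{const}. $$

Multiplying through by $2\sigma^{2}$ gives $\| x - \hat{x} \|^{2} + 2\sigma^{2} \, KL(\cdot)$, so a larger observation-noise variance acts like a larger $\beta$ in a $\beta$-VAE, weighting the KL term more heavily.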

Emir Ceyani
  • Thank you for your answer. Do I understand correctly that sigma is only used when drawing the random number from the Gaussian, but not in the KL loss term? Do you know who came up with this idea? – Sandro May 06 '20 at 17:21
  • I have edited my response to make it clearer; I felt like I had confused you. Basically, the noise in the data affects your latent space. You still use your standard normal prior as it is, but you also account for the observation noise (modeled as inherent additive noise). – Emir Ceyani May 06 '20 at 18:18
  • The thing is, you cannot know $\sigma$; it's used to support their claim that a linear VAE (a VAE with a linear encoder) can learn pPCA. They never meant for it to be implemented: https://colab.research.google.com/github/google-research/google-research/blob/master/linear_vae/DontBlameTheELBO.ipynb Yet they also argue that the deep Gaussian VAE loss is a scaled version of the $\beta$-VAE loss, so you may want to look at that paper too. – Emir Ceyani May 06 '20 at 18:20