
Context: https://arxiv.org/pdf/1312.6114.pdf

So if I start with this equation: $$ \mathcal{L}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) \simeq \frac{1}{2} \sum_{j=1}^{J}\left(1+\log \left(\left(\sigma_{j}^{(i)}\right)^{2}\right)-\left(\mu_{j}^{(i)}\right)^{2}-\left(\sigma_{j}^{(i)}\right)^{2}\right)+\frac{1}{L} \sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right) $$ where $\mathbf{z}^{(i, l)}=\boldsymbol{\mu}^{(i)}+\boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)} \quad$ and $\quad \boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(0, \mathbf{I})$
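For concreteness, the reparameterised sampling step $\mathbf{z}^{(i, l)}=\boldsymbol{\mu}^{(i)}+\boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)}$ can be written in a few lines of numpy (a minimal sketch; the sizes are arbitrary toy choices, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

J, L = 4, 10                            # latent dimension J and number of samples L (toy values)
mu = rng.normal(size=J)                 # encoder mean mu^(i) for one data point
sigma = np.abs(rng.normal(size=J))      # encoder std sigma^(i), kept positive

eps = rng.standard_normal(size=(L, J))  # epsilon^(l) ~ N(0, I), one row per sample l
z = mu + sigma * eps                    # z^(i,l) = mu^(i) + sigma^(i) ⊙ epsilon^(l), via broadcasting
print(z.shape)                          # (L, J): L reparameterised samples of the J-dim latent
```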

So I'm looking at this part of the equation in particular: $$\frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right) $$

And I'm looking at it where the decoder is a multivariate Gaussian with a diagonal covariance structure: $$ \begin{aligned} \log p(\mathbf{x} \mid \mathbf{z}) &=\log \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}, \boldsymbol{\sigma}^{2} \mathbf{I}\right) \\ \end{aligned} $$
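To make sure I am reading the reconstruction term correctly, this is how I would estimate $\frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right)$ with such a decoder (a minimal numpy sketch; the linear "decoder" and all sizes are made-up stand-ins, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(1)
D, J, L = 6, 4, 10                    # data dim, latent dim, number of MC samples (toy values)

x = rng.uniform(size=D)               # one data point x^(i), e.g. pixel values in [0, 1]
W = rng.normal(size=(D, J))           # hypothetical linear "decoder" weights (stand-in for the network)
b = rng.normal(size=D)
sigma_dec = 0.5                       # decoder std, i.e. covariance sigma^2 I

z_samples = rng.standard_normal(size=(L, J))   # stand-ins for the z^(i,l) samples

def log_p_x_given_z(x, z):
    """log N(x; mu_dec(z), sigma_dec^2 I) for a Gaussian decoder with diagonal covariance."""
    mu_dec = W @ z + b
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma_dec**2)
            - 0.5 * np.sum((x - mu_dec) ** 2) / sigma_dec**2)

# Monte Carlo estimate of the reconstruction term: (1/L) * sum_l log p(x^(i) | z^(i,l))
recon_term = np.mean([log_p_x_given_z(x, z_l) for z_l in z_samples])
print(recon_term)
```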

In this stackexchange discussion, https://ai.stackexchange.com/questions/27341/in-variational-autoencoders-why-do-people-use-mse-for-the-loss, a few answers and comments talk about how manipulating this particular $\log p(\mathbf{x} \mid \mathbf{z})$ yields something that resembles the MSE.

So, if you are trying to predict e.g. floating-point numbers (in the case of images, these can be the RGB values in the range $[0,1]$), then you can assume $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ is a Gaussian, and then you can equivalently minimise the MSE between the prediction of the decoder and the real image in order to maximise the likelihood. You can easily show this: just replace $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ with the Gaussian pdf, then maximise that with respect to the parameters, and you should end up with something that resembles the MSE.
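To illustrate that claim numerically: with $\sigma$ fixed, the Gaussian log-density and the scaled negative squared error differ only by an additive constant, so they change by the same amount between any two candidate reconstructions (a small numpy check with made-up toy values):

```python
import numpy as np

rng = np.random.default_rng(2)
D, sigma = 5, 0.3
x = rng.uniform(size=D)                                   # "true" pixels in [0, 1]
xhat1, xhat2 = rng.uniform(size=D), rng.uniform(size=D)   # two candidate decoder outputs

def gauss_logpdf(x, xhat):
    # log N(x; xhat, sigma^2 I) = const - sum_i (x_i - xhat_i)^2 / (2 sigma^2)
    return -0.5 * D * np.log(2 * np.pi * sigma**2) - np.sum((x - xhat) ** 2) / (2 * sigma**2)

def scaled_neg_sse(x, xhat):
    return -np.sum((x - xhat) ** 2) / (2 * sigma**2)

# The two objectives differ only by a constant, so their differences between candidates match:
print(gauss_logpdf(x, xhat1) - gauss_logpdf(x, xhat2))
print(scaled_neg_sse(x, xhat1) - scaled_neg_sse(x, xhat2))
```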

And similarly, in the discussion "Loss function autoencoder vs variational-autoencoder or MSE-loss vs binary-cross-entropy-loss", it is mentioned that

If you assume it follows a normal distribution you will end up with a MSE minimization since $p(x \mid z)$ can be reformulated as $p(x \mid \hat{x}) \sim \mathcal{N}(\hat{x}, \sigma)$

My question is, how do we show that this is true?

This is my attempt:

$\mathbf{C} = \sigma^2 \mathbf{I}$ is the covariance matrix, and $\sigma_i^2$ is the $i$th diagonal entry of this matrix;

So then $$\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\mathbf{C}) = \frac{1}{(2 \pi)^{L/2} \sqrt{\det(\mathbf{C})}}\exp\left(-\frac{1}{2} (\mathbf{x}- \boldsymbol{\mu})^T \mathbf{C}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

$$\det(\mathbf{C}) = \prod_i \sigma_i^2$$

$$ \log\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\mathbf{C}) = -\frac{L}{2} \log(2\pi) - \frac{1}{2} \log\left(\prod_i \sigma_i^2\right) - \sum_i \frac{(x_i-\mu_i)^2}{2 \sigma_i^2}$$

$$= -\frac{L}{2} \log(2\pi) - \frac{1}{2} \sum_i \log\sigma_i^2 - \sum_i \frac{(x_i-\mu_i)^2}{2 \sigma_i^2} $$

And from this point, I am not sure what to do next.
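As a sanity check that the expansion up to this point is right, here is a small numerical comparison against scipy (a minimal sketch; the dimension and values are arbitrary toy choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
L_dim = 4                                    # the dimension L in the formulas above (toy value)
x = rng.normal(size=L_dim)
mu = rng.normal(size=L_dim)
sigma2 = rng.uniform(0.5, 2.0, size=L_dim)   # diagonal entries sigma_i^2 of C

# -L/2 log(2 pi) - 1/2 sum_i log sigma_i^2 - 1/2 sum_i (x_i - mu_i)^2 / sigma_i^2
manual = (-0.5 * L_dim * np.log(2 * np.pi)
          - 0.5 * np.sum(np.log(sigma2))
          - 0.5 * np.sum((x - mu) ** 2 / sigma2))

reference = multivariate_normal.logpdf(x, mean=mu, cov=np.diag(sigma2))
print(manual, reference)                     # the two numbers should agree
```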

a12345

1 Answer


Starting from $$ -\frac{L}{2} \log(2\pi) - \frac{1}{2} \sum_i \log\sigma_i^2 - \sum_i \frac{(x_i-\mu_i)^2}{2 \sigma_i^2} $$

$L$, the dimension, is a fixed quantity. $\sigma$ is also a fixed quantity, with $\sigma_i = \sigma_j$ for all $i,j$ (you can choose to treat it as a variable, but this is rarely done, and doesn't lead to MSE). So we can drop the first two terms, since they are constant, and we're left with

$$ -\frac{1}{2\sigma^2}\sum_i{ (x_i-\mu_i)^2} $$

Maximizing this quantity is equivalent to minimizing $\sum_i (x_i-\mu_i)^2$, which is the MSE up to a constant factor of $1/L$.
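A quick numerical way to see this (a minimal numpy sketch; the candidates and $\sigma$ are arbitrary toy choices, not anything from the paper): the candidate with the highest Gaussian log-likelihood is exactly the one with the lowest MSE.

```python
import numpy as np

rng = np.random.default_rng(4)
D, sigma = 8, 0.5
x = rng.uniform(size=D)                      # target x
candidates = rng.uniform(size=(100, D))      # 100 candidate reconstructions mu

def log_lik(mu):
    return -0.5 * D * np.log(2 * np.pi * sigma**2) - np.sum((x - mu) ** 2) / (2 * sigma**2)

def mse(mu):
    return np.mean((x - mu) ** 2)

best_by_loglik = np.argmax([log_lik(m) for m in candidates])
best_by_mse = np.argmin([mse(m) for m in candidates])
print(best_by_loglik == best_by_mse)         # True: the max-likelihood candidate is the min-MSE one
```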

shimao
  • So I know $\sum (x_i-\mu_i)^2$ looks similar to the MSE, but there is one thing I am not understanding. Typically, if $y_i$ is the true label and $\hat{y}_i$ is the estimated label, then $\sum (y_i-\hat{y}_i)^2$ is the MSE. What are the $y_i$ and the $\hat{y}_i$ in $\sum (x_i-\mu_i)^2$? – a12345 Aug 16 '21 at 19:24
  • @a12345 Be careful not to conflate the $i$ in your original attempt, which indexes the dimensions of each data point, with an $i$ which indexes the data points. In this case $y_i^{(j)}$, the $i$th dimension of the $j$th data point, is $x_i^{(j)}$, and $\hat y_i^{(j)}$ is $\mu_i^{(j)}$. – shimao Aug 16 '21 at 19:31
  • Ok, that's interesting, but confusing, because the paper has the following formula: $\mathbf{z}^{(i, l)}=\boldsymbol{\mu}^{(i)}+\boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)}$ with $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(0, \mathbf{I})$ (using notation from the paper). That $\mathbf{z}$ then gets fed to the decoder, which outputs something. I thought that output would be the $\hat{y}_{i}^{(j)}$ (using your notation in the comment) instead of $\mu_{i}^{(j)}$. – a12345 Aug 18 '21 at 04:26