
This post describes how mean squared error loss can be interpreted as maximizing likelihood if $p(y|x)$ is modeled as a Gaussian with fixed $\sigma$.

Since the standard deviation is often learned as part of the model, how does making the standard deviation a parameter change the loss function that maximizes the likelihood? That is, $p(y|x)$ is modeled as a Gaussian whose standard deviation $\sigma$ is a function of $x$, i.e. $\sigma(x)$.

curiousgeorge

2 Answers


It doesn't change anything.

Suppose you have two possible distributions with means $\mu(x)$ and $\nu(x)$ and the same $\sigma^2$ (known or unknown). The likelihood is higher for $\mu(x)$ than $\nu(x)$ if and only if the mean squared error is lower, so Gaussian likelihood with known or unknown $\sigma^2$ will prefer the model with the smaller mean squared error.

To see why, note that the Gaussian log-likelihood is (up to a constant $c$)

$$\ell(\mu(x), \sigma^2) = c -\frac{n}{2}\log\sigma^2 -\frac{1}{2}\sum_i \frac{(y_i-\mu(x_i))^2}{\sigma^2}$$ so for any fixed value of $\sigma$ it is just a decreasing linear function of the MSE.
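As a minimal numerical sketch of this point (assuming a simple linear mean $\mu(x) = b_0 + b_1 x$ and NumPy; the data and candidate fits are made up purely for illustration), the Gaussian log-likelihood ranks candidate fits exactly as the MSE does, whichever fixed $\sigma^2$ you plug in:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

def mse(b0, b1):
    return np.mean((y - (b0 + b1 * x)) ** 2)

def loglik(b0, b1, sigma2):
    # Gaussian log-likelihood with fixed sigma^2:
    # c - (n / (2 * sigma2)) * MSE, i.e. a decreasing linear function of the MSE.
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * n * mse(b0, b1) / sigma2

candidates = [(1.0, 2.0), (0.5, 1.5), (1.2, 2.2)]
for sigma2 in (0.25, 1.0, 4.0):
    by_mse = sorted(candidates, key=lambda c: mse(*c))
    by_loglik = sorted(candidates, key=lambda c: -loglik(*c, sigma2))
    assert by_mse == by_loglik  # same ranking, hence same preferred model
```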

Thomas Lumley

The sort of model you are describing is called a heteroskedastic model. In the univariate case it assumes that

$$ t\vert {\bf x}\sim\mathcal{N}\big(\mu({\bf x}), \sigma^2({\bf x})\big) $$

for some functions $\mu:\mathbb{R}^M\to\mathbb{R}$ and $\sigma:\mathbb{R}^M\to\mathbb{R}$.

Assuming that $\mu$ and $\sigma$ depend on sets of parameters ${\bf w}_\mu$ and ${\bf w}_\sigma$ respectively, we can find the values of ${\bf w}_\mu$ and ${\bf w}_\sigma$ via maximum likelihood. To do so, let $\mathcal{D}=\{({\bf x}_n, t_n) \vert {\bf x}_n\in\mathbb{R}^M, t_n\in\mathbb{R}\}_{n=1}^N$ and ${\bf w}=\{{\bf w}_\mu, {\bf w}_\sigma\}$. Then

$$ \begin{aligned} \hat{\bf w} &= \arg\max_{{\bf w}} p(\mathcal{D}\vert{\bf w})\\ &= \arg\max_{\bf w}\prod_{n=1}^N p(t_n\vert{\bf x}_n,{\bf w})\\ &= \arg\max_{\bf w} \sum_{n=1}^N\log p(t_n\vert {\bf x}_n, {\bf w}) \\ &= \arg\max_{\bf w} \sum_{n=1}^N\log \mathcal{N}\big(t_n\vert\mu({\bf x}_n), \sigma^2({\bf x}_n)\big)\\ &= \arg\max_{\bf w} \sum_{n=1}^N -\frac{1}{2}\left(\log2\pi + \log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({\bf x}_n)}(\mu({\bf x}_n) - t_n)^2\right) \\ &= \arg\min_{{\bf w}} \frac{N}{2}\log 2\pi + \frac{1}{2}\sum_{n=1}^N\left(\log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({{\bf x}_n})}(\mu({\bf x}_n) - t_n)^2\right) \\ &= \arg\min_{{\bf w}} \sum_{n=1}^N\left(\log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({{\bf x}_n})}(\mu({\bf x}_n) - t_n)^2\right) \end{aligned} $$

Thus, denoting $\mathcal L = \sum_{n=1}^N\left(\log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({{\bf x}_n})}(\mu({\bf x}_n) - t_n)^2\right)$ as our generalized loss function, we arrive at a loss that is a function of both the mean and the variance of the Gaussian. Taking the derivatives of $\mathcal L$ with respect to ${\bf w}_\mu$ and ${\bf w}_\sigma$ and setting them to zero (or minimizing $\mathcal L$ numerically), we obtain the model parameters.
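As a rough sketch of how this loss might be minimized in practice (assuming, purely for illustration, that $\mu({\bf x})$ and $\log\sigma^2({\bf x})$ are both linear in a scalar $x$, and using NumPy plus `scipy.optimize.minimize`; the log-variance parameterization is just a convenience that keeps $\sigma^2({\bf x})$ positive):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N = 500
x = rng.uniform(-2.0, 2.0, size=N)
# Simulated heteroskedastic data: both the mean and the noise level depend on x.
t = 0.5 + 1.5 * x + rng.normal(scale=np.exp(0.3 * x), size=N)

def loss(w):
    # w = (w_mu0, w_mu1, w_s0, w_s1); mu(x) and log sigma^2(x) are linear in x.
    mu = w[0] + w[1] * x
    log_sigma2 = w[2] + w[3] * x
    return np.sum(log_sigma2 + (mu - t) ** 2 / np.exp(log_sigma2))

res = minimize(loss, x0=np.zeros(4), method="L-BFGS-B")
w_mu, w_sigma = res.x[:2], res.x[2:]
print("mean weights:", w_mu)             # roughly (0.5, 1.5) for this simulation
print("log-variance weights:", w_sigma)  # roughly (0.0, 0.6), since sigma(x) = exp(0.3 x)
```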

An example of this kind of model is the GARCH model, in which $\mu({\bf x}) = 0$ and the conditional variance is modeled as a function of past observations.
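For concreteness, here is a sketch of the conditional-variance recursion of a GARCH(1,1) model with zero conditional mean (the default coefficient values are arbitrary and would normally be estimated by maximum likelihood):

```python
import numpy as np

def garch11_variance(returns, omega=0.05, alpha=0.1, beta=0.85):
    """Conditional variance of a zero-mean GARCH(1,1) process:
    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]."""
    sigma2 = np.empty(len(returns))
    sigma2[0] = np.var(returns)  # a common choice for initializing the recursion
    for i in range(1, len(returns)):
        sigma2[i] = omega + alpha * returns[i - 1] ** 2 + beta * sigma2[i - 1]
    return sigma2
```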