
This post describes how mean squared error loss can be interpreted as maximizing likelihood if $p(y|x)$ is modeled as a Gaussian with fixed $\sigma$.

Since the standard deviation is often learned as part of the model, how does making the standard deviation a parameter change the loss function that maximizes the likelihood? That is, $p(y|x)$ is modeled as a Gaussian whose standard deviation $\sigma$ is a function of $x$, i.e. $\sigma(x)$.

curiousgeorge

2 Answers


It doesn't change anything.

Suppose you have two possible distributions with means $\mu(x)$ and $\nu(x)$ and the same $\sigma^2$ (known or unknown). The likelihood is higher for $\mu(x)$ than $\nu(x)$ if and only if the mean squared error is lower, so Gaussian likelihood with known or unknown $\sigma^2$ will prefer the model with the smaller mean squared error.

To see why, note that the Gaussian log-likelihood is (up to a constant $c$)

$$\ell(\mu(x), \sigma^2) = c -\frac{n}{2}\log\sigma^2 -\frac{1}{2}\sum_i \frac{(y_i-\mu(x_i))^2}{\sigma^2}$$ so for any fixed value of $\sigma$ it is just a decreasing linear function of the MSE.
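As a minimal numerical sketch of this point (assuming a simple linear mean $\mu(x) = b_0 + b_1 x$ and NumPy; the data and candidate fits are made up purely for illustration), the Gaussian log-likelihood ranks candidate fits exactly as the MSE does, whichever fixed $\sigma^2$ you plug in:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

def mse(b0, b1):
    return np.mean((y - (b0 + b1 * x)) ** 2)

def loglik(b0, b1, sigma2):
    # Gaussian log-likelihood with fixed sigma^2:
    # c - (n / (2 * sigma2)) * MSE, i.e. a decreasing linear function of the MSE.
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * n * mse(b0, b1) / sigma2

candidates = [(1.0, 2.0), (0.5, 1.5), (1.2, 2.2)]
for sigma2 in (0.25, 1.0, 4.0):
    by_mse = sorted(candidates, key=lambda c: mse(*c))
    by_loglik = sorted(candidates, key=lambda c: -loglik(*c, sigma2))
    assert by_mse == by_loglik  # same ranking, hence same preferred model
```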

Thomas Lumley

The sort of model you are describing is called a heteroskedastic model. In the univariate case it assumes that

$$ t\vert {\bf x}\sim\mathcal{N}\big(\mu({\bf x}), \sigma^2({\bf x})\big) $$

for some functions $\mu:\mathbb{R}^M\to\mathbb{R}$ and $\sigma:\mathbb{R}^M\to\mathbb{R}$.

Assuming that $\mu$ and $\sigma$ depend on sets of parameters ${\bf w}_\mu$ and ${\bf w}_\sigma$ respectively, we can find the values of ${\bf w}_\mu$ and ${\bf w}_\sigma$ via maximum likelihood. To do so, let $\mathcal{D}=\{({\bf x}_n, t_n) \vert {\bf x}_n\in\mathbb{R}^M, t_n\in\mathbb{R}\}_{n=1}^N$ and ${\bf w}=\{{\bf w}_\mu, {\bf w}_\sigma\}$. Then

$$ \begin{aligned} \hat{\bf w} &= \arg\max_{{\bf w}} p(\mathcal{D}\vert{\bf w})\\ &= \arg\max_{\bf w}\prod_{n=1}^N p(t_n\vert{\bf x}_n,{\bf w})\\ &= \arg\max_{\bf w} \sum_{n=1}^N\log p(t_n\vert {\bf x}_n, {\bf w}) \\ &= \arg\max_{\bf w} \sum_{n=1}^N\log \mathcal{N}\big(t_n\vert\mu({\bf x}_n), \sigma^2({\bf x}_n)\big)\\ &= \arg\max_{\bf w} \sum_{n=1}^N -\frac{1}{2}\left(\log2\pi + \log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({\bf x}_n)}(\mu({\bf x}_n) - t_n)^2\right) \\ &= \arg\min_{{\bf w}} \frac{N}{2}\log 2\pi + \frac{1}{2}\sum_{n=1}^N\left(\log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({{\bf x}_n})}(\mu({\bf x}_n) - t_n)^2\right) \\ &= \arg\min_{{\bf w}} \sum_{n=1}^N\left(\log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({{\bf x}_n})}(\mu({\bf x}_n) - t_n)^2\right) \end{aligned} $$

Thus, denoting $\mathcal L = \sum_{n=1}^N\left(\log\sigma^2({\bf x}_n) + \frac{1}{\sigma^2({{\bf x}_n})}(\mu({\bf x}_n) - t_n)^2\right)$ as our generalized loss function, we arrive at a loss that is a function of both the mean and the variance of the Gaussian. Taking the derivatives of $\mathcal L$ with respect to ${\bf w}_\mu$ and ${\bf w}_\sigma$ and setting them to zero (or minimizing $\mathcal L$ numerically), we obtain the model parameters.
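As a rough sketch of how this loss might be minimized in practice (assuming, purely for illustration, that $\mu({\bf x})$ and $\log\sigma^2({\bf x})$ are both linear in a scalar $x$, and using NumPy plus `scipy.optimize.minimize`; the log-variance parameterization is just a convenience that keeps $\sigma^2({\bf x})$ positive):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N = 500
x = rng.uniform(-2.0, 2.0, size=N)
# Simulated heteroskedastic data: both the mean and the noise level depend on x.
t = 0.5 + 1.5 * x + rng.normal(scale=np.exp(0.3 * x), size=N)

def loss(w):
    # w = (w_mu0, w_mu1, w_s0, w_s1); mu(x) and log sigma^2(x) are linear in x.
    mu = w[0] + w[1] * x
    log_sigma2 = w[2] + w[3] * x
    return np.sum(log_sigma2 + (mu - t) ** 2 / np.exp(log_sigma2))

res = minimize(loss, x0=np.zeros(4), method="L-BFGS-B")
w_mu, w_sigma = res.x[:2], res.x[2:]
print("mean weights:", w_mu)             # roughly (0.5, 1.5) for this simulation
print("log-variance weights:", w_sigma)  # roughly (0.0, 0.6), since sigma(x) = exp(0.3 x)
```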

An example of this kind of model is the GARCH model, in which $\mu({\bf x}) = 0$ and the conditional variance is modeled as a function of past observations.
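For concreteness, here is a sketch of the conditional-variance recursion of a GARCH(1,1) model with zero conditional mean (the default coefficient values are arbitrary and would normally be estimated by maximum likelihood):

```python
import numpy as np

def garch11_variance(returns, omega=0.05, alpha=0.1, beta=0.85):
    """Conditional variance of a zero-mean GARCH(1,1) process:
    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]."""
    sigma2 = np.empty(len(returns))
    sigma2[0] = np.var(returns)  # a common choice for initializing the recursion
    for i in range(1, len(returns)):
        sigma2[i] = omega + alpha * returns[i - 1] ** 2 + beta * sigma2[i - 1]
    return sigma2
```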