
In Section 5.5 of *Deep Learning* (by Ian Goodfellow, Yoshua Bengio, and Aaron Courville), it states that

Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.

I can't understand why they are equivalent, and the authors do not expand on the point.

Mufei Li

2 Answers


Let the data be $\mathbf{x}=(x_1, \ldots, x_n)$. Write $F(\mathbf{x})$ for the empirical distribution. By definition, for any function $f$,

$$\mathbb{E}_{F(\mathbf{x})}[f(X)] = \frac{1}{n}\sum_{i=1}^n f(x_i).$$

Let the model $M$ have density $e^{f(x)}$ where $f$ is defined on the support of the model. The cross-entropy of $F(\mathbf{x})$ and $M$ is defined to be

$$H(F(\mathbf{x}), M) = -\mathbb{E}_{F(\mathbf{x})}[\log(e^{f(X)})] = -\mathbb{E}_{F(\mathbf{x})}[f(X)] = -\frac{1}{n}\sum_{i=1}^n f(x_i).\tag{1}$$

Assuming $\mathbf{x}$ is a simple random sample, its negative log likelihood is

$$-\log(L(\mathbf{x}))=-\log \prod_{i=1}^n e^{f(x_i)} = -\sum_{i=1}^n f(x_i)\tag{2}$$

by virtue of the properties of logarithms (they convert products to sums). Expression $(2)$ is a constant $n$ times expression $(1)$. Because loss functions are used in statistics only by comparing them, it makes no difference that one is a (positive) constant times the other. It is in this sense that the negative log likelihood "is a" cross-entropy in the quotation.
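
As a quick numerical check of the relationship between $(1)$ and $(2)$, here is a minimal Python sketch (my own illustration; the standard-normal choice for the model density $e^{f(x)}$ and the simulated data are assumptions, not part of the answer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)          # data x_1, ..., x_n

def f(x):
    # log-density of the assumed model M, here a standard normal N(0, 1)
    return -0.5 * (np.log(2 * np.pi) + x**2)

cross_entropy = -np.mean(f(x))    # (1): -E_{F(x)}[f(X)]
neg_log_lik = -np.sum(f(x))       # (2): -sum_i f(x_i)

# (2) is exactly n times (1), so they differ only by a positive constant
assert np.isclose(neg_log_lik, len(x) * cross_entropy)
```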


It takes a bit more imagination to justify the second assertion of the quotation. The connection with squared error is clear, because for a "Gaussian model" that predicts values $p(x)$ at points $x$, the value of $f$ at any such point is

$$f(x; p, \sigma) = -\frac{1}{2}\left(\log(2\pi \sigma^2) + \frac{(x-p(x))^2}{\sigma^2}\right),$$

which is the squared error $(x-p(x))^2$ but rescaled by $1/(2\sigma^2)$ and shifted by a function of $\sigma$. One way to make the quotation correct is to assume it does not consider $\sigma$ part of the "model"; $\sigma$ must be determined somehow independently of the data. In that case differences between mean squared errors are proportional to differences between cross-entropies or log-likelihoods, thereby making all three equivalent for model fitting purposes.

(Ordinarily, though, $\sigma = \sigma(x)$ is fit as part of the modeling process, in which case the quotation would not be quite correct.)
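
To make the affine relationship explicit, here is a small Python sketch (my own; the prediction rule `p`, the simulated data, and the fixed value of $\sigma$ are all assumptions for illustration) verifying that the Gaussian negative log-likelihood is the mean squared error times a positive constant plus a shift that depends only on $\sigma$ and $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
p = lambda x: 0.3 * x        # hypothetical prediction rule p(x)
sigma = 2.0                  # fixed by the user, not estimated from the data

mse = np.mean((x - p(x))**2)

# Gaussian negative log-likelihood, -sum_i f(x_i; p, sigma)
nll = 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - p(x))**2 / sigma**2)

n = len(x)
scale = n / (2 * sigma**2)                       # positive constant
shift = 0.5 * n * np.log(2 * np.pi * sigma**2)   # depends only on sigma and n

# nll = scale * mse + shift, so minimizing either criterion selects the same p
assert np.isclose(nll, scale * mse + shift)
```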

whuber
  • +1, with two suggestions: one could use $g()$ instead of $f()$ to avoid confusion with $F()$. The second is that most estimates of $\sigma^2$ are going to be $k\sum_{i=1}^n \left(x_i - p(x_i)\right)^2$. When you plug this in and add it up you get $-\frac{1}{2}\log\left[\sum_{i=1}^n \left(x_i - p(x_i)\right)^2\right] + h(k)$. Similar to an AIC-type formula... – probabilityislogic Dec 10 '17 at 02:44
  • @probabilityislogic I choose the pair $F$ and $f$ because they *do* represent closely related quantities. – whuber Dec 10 '17 at 16:58
  • Hi, I think this only applies to linear distributions. In nonlinear distribution problems, I think we can still use MSE as a cost function, right? – Lion Lai Feb 01 '18 at 03:46

For readers of the Deep Learning book, I would like to add to the excellent accepted answer that the authors explain their statement in detail in Section 5.5.1, Example: Linear Regression as Maximum Likelihood.

There, they list exactly the constraint mentioned in the accepted answer:

$p(y | x) = \mathcal{N}\big(y; \hat{y}(x; w), \sigma^2\big)$. The function $\hat{y}(x; w)$ gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant $\sigma^2$ chosen by the user.

Then they show that minimizing the MSE corresponds to the maximum likelihood estimate, and thus to minimizing the cross-entropy between the empirical distribution and $p(y|x)$.
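
As a small sanity check of that correspondence, here is a Python sketch (my own; the toy data, the candidate-weight grid, and the fixed $\sigma^2$ are assumptions, not from the book) showing that the same weight $w$ minimizes the MSE and maximizes the Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = 1.5 * x + rng.normal(scale=0.3, size=200)   # toy data for illustration

w_grid = np.linspace(0.0, 3.0, 1001)            # candidate weights for y_hat(x; w) = w * x
sigma2 = 0.5                                    # variance fixed by the user

mse = np.array([np.mean((y - w * x)**2) for w in w_grid])
log_lik = np.array([
    -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (y - w * x)**2 / sigma2)
    for w in w_grid
])

# The weight minimizing the MSE is the weight maximizing the Gaussian log-likelihood
assert np.argmin(mse) == np.argmax(log_lik)
```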

Kilian Batzner