6

I usually see each of the following loss functions discussed in the context of a particular type of problem:

  • Cross entropy loss (KL divergence) for classification problems
  • MSE for regression problems

However, my understanding (see here) is that doing maximum likelihood estimation (MLE) is equivalent to minimizing the negative log-likelihood (NLL), which is equivalent to minimizing the KL divergence and thus the cross entropy.

So:

  • Why isn't KL or CE used also for regression problems?
  • What's the relationship between CE and MSE for regression? Are they one and the same loss under some circumstances?
  • If different, what's the benefit of using MSE for regression instead?

Josh
  • Some related discussion: https://stats.stackexchange.com/questions/378274/how-to-construct-a-cross-entropy-loss-for-general-regression-targets – Sycorax Jul 14 '20 at 20:35

2 Answers

9

The mean squared error is the cross-entropy between the data distribution $p^*(x)$ and your Gaussian model distribution $p_{\theta}$. Note that the standard MLE procedure is:

$$ \begin{align} \max_{\theta} E_{x \sim p^*}[\log p_{\theta}(x)] &= \min_{\theta} \left(- E_{x \sim p^*}[\log p_{\theta}(x)]\right)\\ &= \min_{\theta} H(p^*, p_{\theta}) \\ &\approx \min_{\theta} \frac{1}{n}\sum_{i=1}^n \frac{1}{2} \left(\Vert x_i - \theta_1\Vert^2/\theta_2^2 + \log 2 \pi \theta_2^2\right) \end{align} $$

where $H(p^*, p_{\theta})$ denotes the cross-entropy, we use a Monte Carlo approximation to the expectation, and $\theta_1$ and $\theta_2^2$ are the mean and variance of the Gaussian model. As you stated, this is equivalent to minimizing the KL divergence between the data distribution and your model distribution. Commonly the variance $\theta_2^2$ is fixed, in which case it contributes only a constant and a scale factor and drops out of the objective.
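As a quick numerical sketch of this equivalence (not part of the original answer; the data, grid search, and helper names below are made up for illustration), the Monte Carlo estimate of the cross-entropy under a fixed-variance Gaussian model and the MSE are minimized by the same $\theta_1$:

```python
# Sketch: with the variance theta_2^2 fixed, the Monte Carlo estimate of the
# cross-entropy H(p*, p_theta) (the average Gaussian NLL) and the MSE differ
# only by a constant and a scale factor, so they pick out the same theta_1.
# The data below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)   # samples standing in for p*(x)
sigma2 = 4.0                                     # fixed variance theta_2^2

def cross_entropy_mc(mu, x, sigma2):
    """Monte Carlo estimate of H(p*, p_theta) for a N(mu, sigma2) model."""
    return np.mean(0.5 * ((x - mu) ** 2 / sigma2 + np.log(2 * np.pi * sigma2)))

def mse(mu, x):
    return np.mean((x - mu) ** 2)

grid = np.linspace(0.0, 6.0, 601)                # candidate values for theta_1
mu_ce  = grid[np.argmin([cross_entropy_mc(m, x, sigma2) for m in grid])]
mu_mse = grid[np.argmin([mse(m, x) for m in grid])]
print(mu_ce, mu_mse)                             # same grid point, ~ the sample mean
```

Both objectives land on (approximately) the sample mean; the cross-entropy only adds the constant $\tfrac{1}{2}\log 2\pi\theta_2^2$ and a $1/\theta_2^2$ scaling, neither of which moves the minimizer.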

Some people get confused because certain textbooks introduce the cross-entropy only in terms of the Bernoulli/Categorical distribution (almost all machine learning libraries are guilty of this!), but the cross-entropy applies more generally than to the discrete setting.

Eweler
  • Thanks Eweler. When you said _"this is equivalent to minimizing the KL divergence between the data distribution and your model."_ would you mind elaborating? Sorry not sure I follow that just from the expression that you wrote. – Josh Jul 15 '20 at 03:04
  • Sorry, I was a bit loose there. What I meant was that minimization of the cross-entropy is equivalent to minimizing the KL divergence between the true data distribution $p^*$ and your model distribution $p_{\theta}$. Typically $p_{\theta}$ belongs to some restricted class of functions, so the KL will be nonzero. To show this, note that the cross entropy decomposes as $H(p^*, p_{\theta}) = -E_{x \sim p^*}[\log p_{\theta}(x)] = E_{x \sim p^*}[\log p^*(x) - \log p_{\theta}(x)] - E_{x \sim p^*}[\log p^*(x)] = D_{KL}(p^* \Vert p_{\theta}) + H(p^*)$, and $H(p^*)$ does not depend on $\theta$. – Eweler Jul 15 '20 at 04:22
  • Thanks! So to circle back to some of the questions in the OP: is optimizing the CE loss equivalent to optimizing the MSE loss in regression (or only under a normality assumption)? If they are not equivalent in the general case, what's a good way to think about their relationship? – Josh Jul 15 '20 at 14:07
8

In a regression problem you have pairs $(x_i, y_i)$ and some true model $q$ that characterizes the conditional density $q(y|x)$. Let's say you assume that your model density is

$$f_\theta(y_i|x_i)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\}$$

and you fix $\sigma^2$ to some value.

The mean $\mu_\theta(x_i)$ is then modelled via, e.g., a neural network (or any other model).

Writing out the empirical approximation to the cross entropy, you get:

$$\sum_{i = 1}^n-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\} \right)$$

$$=\sum_{i = 1}^n\left[-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\right) +\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right]$$

If we, e.g., set $\sigma^2 = 1$ (i.e. assume we know the variance; we could also model the variance, in which case our neural network would have two outputs, one for the mean and one for the variance), we get:

$$=\sum_{i = 1}^n\left[-\log\left( \frac{1}{\sqrt{2\pi}}\right) +\frac{1}{2}(y_i-\mu_\theta(x_i))^2\right]$$

Minimizing this is equivalent to the minimization of the $L2$ loss.

So we have seen that minimizing CE with the assumption of normality is equivalent to the minimization of the $L2$ loss.
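For concreteness, here is a minimal numerical check of the identity above (a sketch only; the data and the linear form chosen for $\mu_\theta$ are made up): with $\sigma^2 = 1$, the summed negative log-density equals $\tfrac{1}{2}\sum_i (y_i-\mu_\theta(x_i))^2$ plus the constant $-n\log\frac{1}{\sqrt{2\pi}}$.

```python
# Sketch: with sigma^2 = 1, the summed Gaussian negative log-density equals
# half the sum of squared residuals plus the constant -n*log(1/sqrt(2*pi)),
# so minimizing it over theta is minimizing the L2 loss. Data are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 0.5 + rng.normal(size=n)     # made-up regression pairs (x_i, y_i)

def mu_theta(x, theta):                    # toy model for the conditional mean
    return theta[0] * x + theta[1]

theta = np.array([1.5, 0.0])               # an arbitrary parameter value
resid = y - mu_theta(x, theta)

nll = -np.log(1.0 / np.sqrt(2.0 * np.pi) * np.exp(-0.5 * resid ** 2))
lhs = nll.sum()                                                  # empirical cross-entropy
rhs = -n * np.log(1.0 / np.sqrt(2.0 * np.pi)) + 0.5 * np.sum(resid ** 2)
print(np.allclose(lhs, rhs))                                     # True
```

The constant does not depend on $\theta$, so the minimizer of the empirical cross-entropy is exactly the minimizer of the $L2$ loss.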

Sebastian
  • What if we don't have (identical) normal error terms? – Dave Jul 14 '20 at 19:10
  • the estimate for the mean would not change – Sebastian Jul 14 '20 at 19:15
  • Then what's the point of the Gaussian assumption? In linear regression, we make a Gaussian assumption to do parameter inference, which is less important in neural networks. – Dave Jul 14 '20 at 19:19
  • The goal was just to illustrate that minimizing CE results in the minimization of L2 loss when we assume normality. – Sebastian Jul 14 '20 at 19:22
  • Thanks Sebastian. What do you mean by _"**empirical approximation** to the cross entropy"_ ? – Josh Jul 14 '20 at 19:49
  • That means approximating the expectation with respect to the true data distribution $p^*(x)$ with a Monte Carlo estimate using a set of samples $S$: $ \int dx \, p^*(x) f(x) \approx \frac{1}{\vert S \vert}\sum_{i \in S} f(x_i), \; x_i \sim p^*(x)$. We do this because typically we are unable to evaluate the integral analytically or in reasonable (polynomial) time for most problems. – Eweler Jul 15 '20 at 04:18
  • @Josh as Eweler already pointed out: we imagine that $H(q, f_\theta)$ has some fixed but unknown value (for a fixed $\theta$). With the term $\frac{1}{n}\sum_{i=1}^n-\log(f_\theta(x_i)) \approx \int -\log(f_\theta(x)) q(x)dx$ we approximate this quantity empirically and minimize it as a proxy, because we have no way to minimize the term we actually care about, i.e. the underlying real cross-entropy. This is generally referred to as empirical risk minimization (the risk is the theoretical value). Note that we usually drop the $1/n$ because it is irrelevant to the optimization. – Sebastian Jul 15 '20 at 06:13
  • Thanks - so in relation to _"What's the benefit of using MSE for regression instead?"_, when you say _"minimizing CE with the assumption of normality is equivalent to the minimization of the $L2$ loss"_, what's the relationship then between CE and MSE, e.g. under the assumption of normality? Would the empirical risk be the same and would we be minimizing the same loss? And would that relationship hold when we don't use a normal distribution? – Josh Jul 15 '20 at 13:03
  • Yes exactly, they are minimizing the same loss, or to be more precise their solution is the same (the CE loss differs only by the additive constant $-n\log(\frac{1}{\sqrt{2\pi}})$ and a factor of $\tfrac{1}{2}$, neither of which matters for the optimization). No, without a normal distribution this does not hold: if you substitute the normal distribution with a Laplace distribution, this will result in the minimization of the $L1$ loss (a numerical sketch of this contrast follows these comments). – Sebastian Jul 15 '20 at 13:43
  • Thanks @Sebastian. That makes sense. I suppose the above is equivalent to arguing that minimizing CE is **always** the same as doing MLE, and of course, the MLE solution does **not** have to be the one that minimizes the L2 loss (i.e. it's model dependent) right? – Josh Jul 15 '20 at 16:42
  • Exactly, you got it :) – Sebastian Jul 15 '20 at 16:44
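To round out the Gaussian-vs-Laplace point raised in the comments above (again only a sketch, with made-up data and a location-only model with fixed scale; none of this code is from the thread): under a Gaussian model the NLL/cross-entropy minimizer coincides with the $L2$ minimizer (the sample mean), while under a Laplace model it coincides with the $L1$ minimizer (the sample median).

```python
# Sketch: for a location parameter mu with fixed scale 1, the NLL minimizer is
# approximately the sample mean under a Gaussian model (L2 loss) and the sample
# median under a Laplace model (L1 loss). Data are made up and deliberately
# skewed so that mean and median differ.
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(scale=2.0, size=2001)

def gaussian_nll(mu, y):
    return np.mean(0.5 * (y - mu) ** 2 + 0.5 * np.log(2.0 * np.pi))

def laplace_nll(mu, y):
    return np.mean(np.abs(y - mu) + np.log(2.0))

grid = np.linspace(y.min(), y.max(), 4001)
mu_gauss   = grid[np.argmin([gaussian_nll(m, y) for m in grid])]
mu_laplace = grid[np.argmin([laplace_nll(m, y) for m in grid])]
print(mu_gauss, y.mean())        # approximately equal
print(mu_laplace, np.median(y))  # approximately equal
```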