Cross entropy vs KL divergence: What's minimized directly in practice?

Question

My understanding is that in ML one can establish a connection between these quantities using the following line of reasoning:

Assuming we plan to use ML to make decisions, we choose to minimize our Risk against a well defined loss function that scores those decisions. Since we often don't know the true distribution of the data, we can't directly minimize this Risk (our expected loss), and instead choose to minimize our Empirical Risk i.e. ER (or structural risk, if using regularization). It's empirical because we compute this risk as an average of the loss function on observed data.
If we assume that our model can output probabilities for those decisions, and we are solving a problem that involves hard decisions for which we have some ground truth examples, we can model the optimization of those decisions as minimizing ER with a cross-entropy loss function, and thus model decisions as a problem of classifying data. Under this loss, the ER is actually the same as (not just equivalent to) the negative log likelihood (NLL) of the model for the observed data. So one can interpret minimizing ER as finding an MLE solution for our probabilistic model given the data.
From the above, we can also establish that minimizing the CE is equivalent to minimizing a KL divergence between our model (e.g. Q) for generating decisions and the true model (P) that generates the actual data and decisions. This is apparently a nice result, because one can argue that while we don't know the true data generating (optimal decision making) distribution, we can establish that we are doing "our best" to estimate it, in a KL sense. However, CE is not the same as KL. They measure different things and of course take on different values.

Is the above line of reasoning correct? Or do people e.g. use cross-entropy and KL divergence for problems other than classification? Also, does the "CE ≡ KL ≡ NLL" equivalence relationship (in terms of optimization solutions) always hold?

In either case, what is minimized in practice directly (KL vs the CE) and in what circumstances?

Motivation

Consider the following from a question on this site:

"The KL divergence can depart into a Cross-Entropy of p and q (the first part), and a global entropy of ground truth p (the second part). ... [From the comments] In my own experience ... BCE is way more robust than KL. Basically, KL was unusable. KL and BCE aren't "equivalent" loss functions".

I have read similar statements online: that these two quantities are not the same, and that in practice we use one (or the other) for optimization. Is that actually the case? If so, which quantity is actually evaluated and optimized directly in practice, for what types of problems, and why?

Related questions:

If your data is assumed constant, minimizing them is *equivalent*. — Neil G, Jul 14 '20 at 14:38
Hope [this answer](https://stats.stackexchange.com/a/329578/103153) would be of any use to you. — Lerner Zhang, Jul 14 '20 at 14:48

Sebastian · Accepted Answer · 2020-07-14T18:04:20.997

Let $q$ be the density of your true data-generating process and $f_\theta$ be your model-density.

Then $$KL(q||f_\theta) = \int q(x) \log\left(\frac{q(x)}{f_\theta(x)}\right)dx = -\int q(x) \log(f_\theta(x))dx + \int q(x) \log(q(x)) dx$$

The first term is the cross-entropy $H(q, f_\theta)$ and the second term is the negative (differential) entropy $-H(q)$, so that $KL(q||f_\theta) = H(q, f_\theta) - H(q)$. Note that the second term does NOT depend on $\theta$, so you cannot influence it anyway. Therefore minimizing either the cross-entropy or the KL divergence over $\theta$ is equivalent.
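As a small numerical check of this decomposition, here is a sketch in Python with NumPy. The setup is an assumption made purely for illustration: $q$ is taken to be a Bernoulli(0.3) distribution and $f_\theta$ a Bernoulli($\theta$) model evaluated on a grid of $\theta$ values; the helper names `cross_entropy` and `kl` are hypothetical.

```python
import numpy as np

# Assumed toy setup: q = Bernoulli(0.3) is the "true" distribution,
# f_theta = Bernoulli(theta) is the model. Both are illustrative choices.
q = np.array([0.7, 0.3])  # probabilities of outcomes 0 and 1

def cross_entropy(q, f):
    """H(q, f) = -sum_x q(x) log f(x), in nats."""
    return -np.sum(q * np.log(f))

def kl(q, f):
    """KL(q || f) = sum_x q(x) log(q(x) / f(x)), in nats."""
    return np.sum(q * np.log(q / f))

# Evaluate both losses on a grid of model parameters.
thetas = np.linspace(0.01, 0.99, 99)
ce = np.array([cross_entropy(q, np.array([1 - t, t])) for t in thetas])
kld = np.array([kl(q, np.array([1 - t, t])) for t in thetas])

entropy_q = -np.sum(q * np.log(q))  # H(q), independent of theta
gap = ce - kld                      # should equal H(q) at every theta

best_ce = thetas[np.argmin(ce)]
best_kl = thetas[np.argmin(kld)]
print(np.allclose(gap, entropy_q), best_ce, best_kl)
```

Under these assumptions the two curves differ by the same constant $H(q)$ at every $\theta$, so they share the same minimizer (here the grid point $\theta = 0.3$) and the same gradient with respect to $\theta$.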

Without looking at the formula you can understand it in the following informal way (if you assume a discrete distribution). The entropy $H(q)$ encodes how many bits you need on average if you encode the signal that comes from the distribution $q$ in an optimal way. The cross-entropy $H(q, f_\theta)$ encodes how many bits on average you would need if you encoded the signal that comes from the distribution $q$ using the optimal coding scheme for $f_\theta$. This decomposes into the entropy $H(q)$ + $KL(q||f_\theta)$. The KL divergence therefore measures how many additional bits you need if you use an optimal coding scheme for distribution $f_\theta$ (i.e. you assume your data comes from $f_\theta$ while it is actually generated from $q$). This also explains why it has to be non-negative: you cannot be better than the optimal coding scheme that yields the average bit-length $H(q)$.

This illustrates in an informal way why minimizing the KL divergence is equivalent to minimizing the CE: by minimizing how many more bits you need than the optimal coding scheme (on average), you of course also minimize the total number of bits you need (on average).
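The coding interpretation can be made concrete with a tiny example in bits. The two distributions below are assumptions chosen so that all code lengths are whole numbers: the source $q$ has probabilities (1/2, 1/4, 1/4) and the mismatched model $f$ has (1/4, 1/4, 1/2).

```python
import numpy as np

# Assumed source distribution q and mismatched model f over three symbols.
q = np.array([0.5, 0.25, 0.25])
f = np.array([0.25, 0.25, 0.5])

# Optimal code lengths for a distribution p are -log2 p(x) bits per symbol.
entropy_bits = -np.sum(q * np.log2(q))        # average length with the code built for q
cross_entropy_bits = -np.sum(q * np.log2(f))  # average length with the code built for f
kl_bits = np.sum(q * np.log2(q / f))          # extra bits paid for using the wrong code

print(entropy_bits, cross_entropy_bits, kl_bits)  # 1.5 1.75 0.25
```

With these numbers, the code designed for $f$ costs 1.75 bits per symbol on average instead of the optimal 1.5, and the 0.25-bit overhead is exactly the KL divergence.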

The following post illustrates the idea with the optimal coding scheme: Qualitively what is Cross Entropy

edited Jul 14 '20 at 18:04

answered Jul 14 '20 at 14:43


Thx Sebastian, this is helpful, although my Q (perhaps not completely clear since it's admittedly long) asks more specifically: 1) which of these two quantities KL vs CE is actually _directly_ optimized in practice (e.g. when doing the fwd and bckwd pass which one do we evaluate and why), 2) under what circumstances and types of models we have CE = NLL and 3) possibly related, if the KL / CE loss functions are **only** useful in classification, i.e. problems where we have examples of hard-labels as ground truth (if not, e.g. regression, how would we use these losses to penalize lack of fit?) – Josh Jul 14 '20 at 17:01
1. CE is directly optimized. 2. The empirical approximation to the cross-entropy ALWAYS corresponds to the negative log likelihood. 3. I don't understand what you mean. – Sebastian Jul 14 '20 at 17:13
Thanks. For #3 I'm used to seeing KL and CE used exclusively in classification problems, and MSE (or other types of losses) for e.g. auto-encoders and regression problems, so my question is, why? Isn't the above result KL≡CE≡NLL general enough that it could be a natural loss for non-classification problems as well? And if so, how would one model those problems with KL? Or does this equivalence not work in those cases? (examples in the literature are totally fine) – Josh Jul 14 '20 at 17:56
Also, on #1, why is CE the quantity directly optimized? (and not KL)? – Josh Jul 14 '20 at 18:02
You are absolutely right. You could e.g. use the normal density for $f_\theta$ in a regression problem where $\theta = (\mu, \sigma^2)$; then using the KL divergence as loss would result in the Maximum Likelihood estimate under the assumption of normality (this in turn minimizes the L2 loss). To your second question: when we optimize CE we simultaneously optimize KL, only without needing to additionally estimate a constant (i.e. the entropy) that we cannot influence anyway because it does not depend on $\theta$. – Sebastian Jul 14 '20 at 18:12
Thanks Sebastian. That helps. I left a follow-up to the last question [here](https://stats.stackexchange.com/questions/477152/using-cross-entropy-for-regression-problems). Hopefully that helps us / others dive deeper into the last item. – Josh Jul 14 '20 at 18:30
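
To illustrate the regression point from the comments, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the synthetic linear data, the fixed and known noise level `sigma`, and the hypothetical helpers `gaussian_nll` and `mse`. It checks that the average Gaussian negative log likelihood (the empirical cross-entropy under a normal model) is an affine function of the mean squared error.

```python
import numpy as np

# Synthetic data (assumed): y = 2x + 0.5 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 0.5 + rng.normal(scale=0.3, size=200)

sigma = 0.3  # noise standard deviation, assumed known and held fixed

def gaussian_nll(w, b):
    """Average negative log likelihood of y under N(w*x + b, sigma^2)."""
    mu = w * x + b
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (y - mu) ** 2 / (2 * sigma**2))

def mse(w, b):
    """Mean squared error (L2 loss) of the linear prediction w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

# NLL = MSE / (2 sigma^2) + const, where const does not depend on (w, b).
const = 0.5 * np.log(2 * np.pi * sigma**2)

# Ordinary least squares fit; np.polyfit returns [slope, intercept] for deg=1.
w_ls, b_ls = np.polyfit(x, y, deg=1)

print(np.isclose(gaussian_nll(w_ls, b_ls),
                 mse(w_ls, b_ls) / (2 * sigma**2) + const))
```

Because the two losses differ only by a positive scale and an additive constant when `sigma` is fixed, the least-squares solution is also the maximum likelihood solution under this assumed normal model.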
