
I'm trying to understand some machine learning theory background: specifically, the relationship between cross entropy loss and "negative log likelihood".

To start, I already fully understand these definitions:

  1. Entropy of a probability distribution $p$ with $K$ classes:

$$ H(p) = - \sum_{k=1}^{K} p_k \log p_k $$

  2. Cross entropy between two probability distributions $p$ (ground-truth) and $q$ (predicted):

$$ H(p, q) = - \sum_{k=1}^{K} p_k \log q_k $$
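To make sure I have these straight, here is a minimal NumPy sketch with made-up 3-class distributions:

```python
import numpy as np

# Made-up 3-class distributions, just to ground the two definitions above.
p = np.array([0.7, 0.2, 0.1])   # "ground-truth" distribution
q = np.array([0.5, 0.3, 0.2])   # "predicted" distribution

H_p  = -np.sum(p * np.log(p))   # entropy H(p)
H_pq = -np.sum(p * np.log(q))   # cross entropy H(p, q)

print(H_p, H_pq)                # H(p, q) >= H(p), with equality iff q == p
```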

My specific confusion comes from reading Kevin Murphy's 2021 book "Probabilistic Machine Learning: An Introduction". He says something like the following about Kullback-Leibler (KL) divergence (a paraphrased summary of Sections 4.2 and 6.2):

$KL(p||q) = \sum_{k=1}^{K} p_k \log p_k - \sum_{k=1}^{K} p_k \log q_k$

We recognize the first term as the negative entropy and the second term as the cross entropy. The first term is a constant with respect to our predictions $q$, so we can ignore it.

Let us suppose the $p$ distribution is defined with a delta function $\delta$ like this: $ p(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n)$ .

Then the KL divergence becomes
\begin{align} KL(p||q) &= -H(p) - \frac{1}{N} \sum_{n=1}^{N} \log q(y_n)\\ &= \text{constant} + \text{NLL} \end{align}
This is called the cross-entropy objective, and it is equal to the average negative log likelihood of $q$ on the training set.
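As a sanity check on the decomposition itself, I can verify numerically that $KL(p||q)$ equals the negative entropy plus the cross entropy (same made-up $p$ and $q$ as in my sketch above, repeated here so the snippet runs on its own):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])              # made-up ground-truth distribution
q = np.array([0.5, 0.3, 0.2])              # made-up predicted distribution

kl = np.sum(p * np.log(p / q))             # KL(p || q) from its definition
neg_entropy = np.sum(p * np.log(p))        # first term: -H(p), constant w.r.t. q
cross_entropy = -np.sum(p * np.log(q))     # second term: the cross entropy H(p, q)

print(np.isclose(kl, neg_entropy + cross_entropy))   # True
```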

Questions:

  1. The term $\frac{1}{N} \sum_{n=1}^{N} \log q(y_n)$ mentions one distribution $q$. How can it be a cross-entropy term when cross entropy is defined for two distributions $p$ and $q$?

  2. How does a log-likelihood expression in terms of $N$ training instances ($\frac{1}{N} \sum_{n=1}^{N}$) turn into a cross-entropy expression in terms of $K$ classes ($\sum_{k=1}^{K}$)?

  3. Is the author's use of a delta function $\delta$ just another way of saying a one-hot distribution?

I'm still confused even after reading other posts like this one, this one, and this one.

stackoverflowuser2010

1 Answer

  1. This is because of the delta-distribution assumption. Now $p_k$ is $0$ for every $k$ except one, so those $p_k \log q_k$ terms in the sum over $k$ vanish. For the one remaining term, $p_k = 1$, so there is no need to write the multiplication by $1$; only $\log q$ of the observed class survives.
  2. I think this is answered in (1): the $0$-valued terms are dropped. In effect you have a double sum, over $n$ and over $k$, but the inner sum over $k$ collapses to a single term (see the sketch after this list).
  3. Yes, that’s right. “One-hot” is terminology from the area of digital circuits; the delta function is from mathematics.
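To make (1) and (2) concrete, here is a minimal NumPy sketch; the number of classes, the labels, and the predicted probabilities are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 6                               # made-up number of classes / training points

q = rng.dirichlet(np.ones(K), size=N)     # predicted probabilities q(. | x_n), shape (N, K)
y = rng.integers(0, K, size=N)            # observed labels y_1, ..., y_N
p = np.eye(K)[y]                          # one-hot (delta) ground truth, shape (N, K)

# Double sum over n and k: each inner sum over k has a single nonzero term ...
cross_entropy = -np.mean(np.sum(p * np.log(q), axis=1))

# ... so it equals the average negative log likelihood of the observed labels.
nll = -np.mean(np.log(q[np.arange(N), y]))

print(np.isclose(cross_entropy, nll))     # True
```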
Arya McCarthy
  • Thank you. For 3, I've only seen the term "one-hot" in ML books and websites. Why do you think this author would bring up delta functions in a book about machine learning? And *which* delta function? I know of the Dirac and Kronecker delta functions. Does it matter? – stackoverflowuser2010 May 20 '21 at 19:56
  • For 1, I know that to compute the total cross-entropy loss over a data set, you compute the mean cross-entropy loss over the training instances, like $H_{total}(P, Q) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} p_{k,n} \log q_{k,n}$ for class $k$ and training instance $n$. So the relation between cross-entropy and negative log likelihood is that the true distribution $p$ is implicitly one-hot, so you can remove the summation over $K$? Is that all there is??? – stackoverflowuser2010 May 20 '21 at 20:03
  • 1
    Yes, ML borrowed it from circuits. Why bring up delta functions in an ML book? Because probability is mathematical, and it’s the focus of the book! Which delta depends on the data but is a longer discussion answered elsewhere on this site. To your last question: without going a rabbit hole of nuance, yes. – Arya McCarthy May 20 '21 at 20:12
  • For 1, can you please point me to an understandable reference that explains whatever nuance I'm missing? – stackoverflowuser2010 May 20 '21 at 20:26
  • "Understandable" and "nuance" tend to be opposed to each other; the details are more academic than practical. If you're reading an intro book, it's enough to take the answer as "yes". – Arya McCarthy May 20 '21 at 20:39
  • 2
    @stackoverflowuser2010 perhaps the paper [Why the logistic function? A tutorial discussion on probabilities and neural networks](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.476.1842&rep=rep1&type=pdf) by Michael I. Jordan, 1995, could help. – mhdadk May 20 '21 at 20:46
  • @stackoverflowuser2010 Was your question answered? If not, what can be clarified? – Arya McCarthy May 24 '21 at 03:02
  • @AryaMcCarthy: I understand that computing the MLE is equivalent to finding the `argmin` of the negative log likelihood (NLL), which is equivalent to minimizing the cross-entropy loss. But I find there's a discrepancy with a $\frac{1}{N}$ term, where $N$ is the number of training instances. NLL is defined as $-\sum_{i=1}^{N} \log P(y_i | x_i)$, but total cross entropy is defined as $- \frac{1}{N} \sum_{i=1}^{N} \log q_i = - \frac{1}{N} \sum_{i=1}^{N} \log P(y_i | x_i)$. Why does cross-entropy loss have that $\frac{1}{N}$ term (for computing the mean), while NLL does not? How can they be equivalent? – stackoverflowuser2010 May 25 '21 at 21:05
  • 1
    Don’t forget the word “average” in Murphy’s description, which you quoted in the original question. It’ll fix the discrepancy. – Arya McCarthy May 26 '21 at 02:39
  • @AryaMcCarthy Thanks. – stackoverflowuser2010 May 26 '21 at 20:56
  • I've done further reading online (e.g. http://www.awebb.info/probability/2017/05/18/cross-entropy-and-log-likelihood.html), where I've come across loose terminology stating that the likelihood of a training instance is $\prod_{k=1}^{K} \hat{y}_k^{y_k}$, where $K$ is the number of classes, $\hat{y}$ is the predicted probability, and $y$ is the one-hot true probability. I have not seen that formulation anywhere in any textbook. Is it standard, and if so, do you know of a good reference? – stackoverflowuser2010 Jun 01 '21 at 20:58
  • 1
    It's equivalent—try taking the log of it. – Arya McCarthy Jun 01 '21 at 23:43
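Spelling out that last comment: with a one-hot $y$ (a single $y_c = 1$ for the true class $c$ and zeros elsewhere), taking the log of the per-instance likelihood gives

$$ \log \prod_{k=1}^{K} \hat{y}_k^{y_k} = \sum_{k=1}^{K} y_k \log \hat{y}_k = \log \hat{y}_c, $$

so its negative is exactly the per-instance cross-entropy term $-\log q(y_n)$ from above. Averaging over the $N$ training instances gives the mean NLL; summing instead only rescales the objective by the constant factor $1/N$, which does not change the argmin.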