8

In machine learning, cross-entropy is often used while training a neural network.

While training my neural network, I track both the accuracy and the cross-entropy. The accuracy is pretty low, so I know that my network isn't performing well. But what can I say about my model from the cross-entropy?

Bob Burt

1 Answer

12

Andrew Ng explains the intuition behind using cross-entropy as a cost function in his ML Coursera course, in the logistic regression module, when he introduces the mathematical expression:

$$\text{Cost}\left(h_\theta(x),y\right)=\left\{ \begin{array}{l} -\log\left(h_\theta(x)\right) \quad \quad\quad \text{if $y =1$}\\ -\log\left(1 -h_\theta(x)\right) \quad \;\text{if $y =0$} \end{array} \right. $$

The idea is that, with an activation function whose values lie between zero and one (here a logistic sigmoid, but the same reasoning applies to, for instance, a softmax output in a CNN, where the final layer is a multinomial logistic), the cost for a true label of $1$ ($y=1$) decreases from infinity to zero as $h_\theta(x)\to1$. Ideally we would like $h_\theta(x)$ to be exactly $1$, predicting the true value, so an activation output that gets close to it is rewarded; conversely, the cost tends to infinity as the activation tends to $0$. The opposite holds for $y=0$, thanks to the trick of taking the logarithm of $1-h_\theta(x)$ rather than $h_\theta(x)$.


Here is my attempt at showing this graphically, restricting the two functions to the interval between $0$ and $1$, consistent with the output of a sigmoid:

[Plot of the two cost curves, $-\log\left(h_\theta(x)\right)$ and $-\log\left(1-h_\theta(x)\right)$, for $h_\theta(x)$ between $0$ and $1$]
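If you want to reproduce the plot yourself, a minimal numpy/matplotlib sketch (the variable names are my own, not from the course) would look like:

```python
import numpy as np
import matplotlib.pyplot as plt

# Predicted probabilities h in (0, 1); stay away from the endpoints,
# where the logarithm blows up.
h = np.linspace(1e-4, 1 - 1e-4, 500)

# Cost when the true label is 1: -log(h); when it is 0: -log(1 - h).
plt.plot(h, -np.log(h), label=r"$y=1$: $-\log(h_\theta(x))$")
plt.plot(h, -np.log(1 - h), label=r"$y=0$: $-\log(1-h_\theta(x))$")

plt.xlabel(r"$h_\theta(x)$")
plt.ylabel("cost")
plt.legend()
plt.show()
```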


The two cases can be combined into a single, more succinct expression:

$$\text{Cost}\left(h_\theta(x),y\right)=-y\log\left(h_\theta(x)\right)-(1-y) \log\left(1 - h_\theta(x)\right).$$
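As a quick sanity check, here is a small numpy sketch (the helper name is mine) showing that this single expression reproduces both piecewise branches:

```python
import numpy as np

def binary_cross_entropy(h, y):
    """Per-example cost: -y*log(h) - (1-y)*log(1-h)."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

h = np.array([0.1, 0.5, 0.9, 0.99])   # candidate predictions h_theta(x)

# With y = 1 the cost equals -log(h): small near h = 1, large near h = 0.
print(binary_cross_entropy(h, y=1))   # [2.303 0.693 0.105 0.010]

# With y = 0 it equals -log(1 - h): the mirror image.
print(binary_cross_entropy(h, y=0))   # [0.105 0.693 2.303 4.605]
```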

In the case of a softmax output layer, as in a CNN, the cross-entropy is similarly formulated as

$$\text{Cost}=-\sum_j \,t_j\,\log(y_j)$$

where $t_j$ is the target value for class $j$, and $y_j$ the probability that the output assigns to that class.
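A small numpy sketch of this multiclass version, assuming a one-hot target vector and softmax-normalized outputs (the logits and names below are illustrative, not from the answer):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(t, y):
    """Cost = -sum_j t_j * log(y_j)."""
    return -np.sum(t * np.log(y))

logits = np.array([2.0, 1.0, 0.1])   # raw network outputs for 3 classes
t = np.array([1.0, 0.0, 0.0])        # one-hot target: class 0 is correct
y = softmax(logits)

print(y)                    # predicted class probabilities
print(cross_entropy(t, y))  # small when the true class gets high probability
```

With a one-hot target the cost is just the negative log-probability of the true class, so it is small exactly when the network puts most of its probability mass on the correct label.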

Beyond the intuition, the introduction of cross-entropy is meant to keep the cost function convex for logistic regression (a squared-error cost combined with a sigmoid would not be convex in $\theta$).
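For the logistic-regression case this can be checked directly: with $h_\theta(x)=\sigma(\theta^\top x)$, the second derivative of the per-example cost with respect to $\theta$ works out to $h_\theta(x)\left(1-h_\theta(x)\right)x x^\top$, which is positive semidefinite. A minimal numeric sketch (scalar parameter, my own variable names) verifying that the second derivative stays non-negative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, x, y):
    """Per-example cross-entropy cost for a scalar parameter theta."""
    h = sigmoid(theta * x)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# Numerical second derivative of the cost w.r.t. theta on a grid:
# it stays non-negative, consistent with the closed form h*(1-h)*x**2.
x, y, eps = 2.0, 1.0, 1e-4
for theta in np.linspace(-3, 3, 7):
    d2 = (cost(theta + eps, x, y) - 2 * cost(theta, x, y)
          + cost(theta - eps, x, y)) / eps**2
    print(f"theta={theta:+.1f}  second derivative={d2:.4f}")
```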

user650654
Antoni Parellada
    To the beyond intuition part I would also add that cross entropy also emerges from maximum likelihood estimation for logistic regression model – Łukasz Grad Apr 11 '17 at 21:34
  • Is this only for binary case? .. what about above? – Bob Burt Apr 30 '17 at 03:30
  • @BobBurt I included a link to a page explaining the extrapolation to softmax and the cross-entropy equations that follow. – Antoni Parellada Apr 30 '17 at 06:28
  • @BobBurt This answer explains the relationship between binary and multinomial cross-entropy. https://stats.stackexchange.com/questions/260505/machine-learning-should-i-use-a-categorical-cross-entropy-or-binary-cross-entro/260537#260537 – Sycorax Nov 16 '18 at 00:52
  • The second cost expression should be: $$\text{Cost}\left(h_\theta(x),y\right)=-y\log\left(h_\theta(x)\right)-(1-y)\log\left(1-h_\theta(x)\right).$$ – user650654 Jan 15 '20 at 07:12
  • @user650654 Thank you. Can you please edit the answer accordingly? – Antoni Parellada Jan 16 '20 at 12:45