58

I am playing with convolutional neural networks using Keras + TensorFlow to classify categorical data. I have a choice of two loss functions: categorical_crossentropy and sparse_categorical_crossentropy.

I have a good intuition about the categorical_crossentropy loss function, which is defined as follows:

$$ J(\textbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \text{log}(\hat{y}_i) + (1-y_i) \text{log}(1-\hat{y}_i) \right] $$

where,

  • $\textbf{w}$ refers to the model parameters, e.g. the weights of the neural network
  • $y_i$ is the true label
  • $\hat{y}_i$ is the predicted label

Both labels use the one-hot encoding scheme.

Questions:

  • How does the above loss function change in sparse_categorical_crossentropy?
  • What is the mathematical intuition behind it?
  • When to use one over the other?
kedarps
  • Does https://stackoverflow.com/questions/37312421/tensorflow-whats-the-difference-between-sparse-softmax-cross-entropy-with-logi answer your questions? – Mark L. Stone May 10 '18 at 17:32
  • @MarkL.Stone it answers the question partially. I am looking for a mathematical intuition as to how sparsity affects the cost function. – kedarps May 11 '18 at 14:03
  • @kedarps they are mathematically identical, although sparse CE has the restriction that the labels $y_i$ are hard (0 or 1). Also depending on the implementation, sparse CE is possibly cheaper in terms of computation. – shimao May 12 '18 at 16:08

4 Answers

65

Both categorical cross entropy and sparse categorical cross entropy have the same loss function, which you have mentioned above. The only difference is the format in which you specify $Y_i$ (i.e. the true labels).

If your $Y_i$'s are one-hot encoded, use categorical_crossentropy. Examples (for a 3-class classification): [1,0,0], [0,1,0], [0,0,1]

But if your $Y_i$'s are integers (class indices), use sparse_categorical_crossentropy. Examples for the above 3-class classification problem: [1], [2], [3]

The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross entropy is that it saves memory as well as computation time, because it simply uses a single integer for a class, rather than a whole vector.
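
As a minimal sketch (assuming TensorFlow 2.x / tf.keras; the probabilities below are made up for illustration), the two losses give the same value and only the label format differs:

```python
import numpy as np
import tensorflow as tf

# Predicted class probabilities for two samples in a 3-class problem
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]], dtype=np.float32)

# One-hot labels -> categorical_crossentropy
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]], dtype=np.float32)
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_onehot, y_pred).numpy())    # ~0.29

# Integer class indices -> sparse_categorical_crossentropy
y_int = np.array([0, 1])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce(y_int, y_pred).numpy())      # same value as above
```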

skadaver
  • What does the sparse refer to in sparse categorical cross-entropy? I thought it was because the data was sparsely distributed among the classes. – nid May 19 '20 at 11:44
  • It's sparse because instead of using 10 values to store the one correct class (in the case of MNIST), it uses only one value. – Amit Portnoy Jun 29 '20 at 18:21
  • It seems it is more than just a matter of data format; take a look at [this](https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy#answer-41923). – Ali Asgari Mar 11 '21 at 05:57
  • Can't we use simple categorical_crossentropy if Yi's are integers? – Thunder Apr 23 '21 at 02:59
5

I have no better answer than the links, and I too encountered the same question. I just want to point out that the formula for the loss function (cross entropy) seems to be a little bit erroneous (and might be misleading). One should probably drop the 2nd term in the bracket to have simply $$J(\textbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} y_i \text{log}(\hat{y}_i).$$ Sorry for writing my comment here, but I haven't got enough reputation points to be able to comment...
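
A quick numerical check (a sketch added here for illustration, assuming tf.keras; the probabilities are made up) shows that Keras's categorical_crossentropy matches the single-term formula, whereas keeping the second term gives a different number:

```python
import numpy as np
import tensorflow as tf

y_true = np.array([[0.0, 1.0, 0.0]])    # one-hot label, true class is index 1
y_pred = np.array([[0.2, 0.7, 0.1]])    # predicted probabilities

# Single-term cross entropy: -sum_i y_i log(y_hat_i)
single_term = -np.sum(y_true * np.log(y_pred))
print(single_term)                      # ~0.357

# Two-term formula applied element-wise over the classes (one reading of the
# question's formula) -- this gives a different value
two_term = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(two_term)                         # ~0.228

# Keras agrees with the single-term version
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())  # ~0.357
```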

phunc20
  • This is the standard technical definition of entropy, but I believe it's not commonly used as a loss function because it's not symmetric between 0-1 labels. In fact, if the true $y_i$ is 0, this would calculate the loss to also be zero, regardless of the prediction. The OP's version corrects for this asymmetry. – Joey F. Nov 22 '18 at 05:23
  • @JoeyF. Yes, I did notice that when I read the original post. There seems to be quite some discussion on the Internet about this issue ([e.g.](https://datascience.stackexchange.com/questions/20296/cross-entropy-loss-explanation/20301)). I tried to come up with an illustrative example in a short time but failed. What others said about the gradient also involves the label-0 $\hat{y}_i$. However, I still believe those who named it cross entropy did it for a reason, and note that since the $\hat{y}_i$ sum to $1$, maximizing any one of them automatically minimizes the rest. – phunc20 Nov 22 '18 at 07:02
3

The formula which you posted in your question refers to binary_crossentropy, not categorical_crossentropy. The former is used when you have only two classes (a single output unit). The latter refers to a situation where you have multiple classes, and its formula looks as follows:

$$J(\textbf{w}) = -\sum_{i=1}^{N} y_i \text{log}(\hat{y}_i).$$

This loss works, as skadaver mentioned, on one-hot encoded values, e.g. [1,0,0], [0,1,0], [0,0,1].

The sparse_categorical_crossentropy is a little bit different: it works on integers, that's true, but these integers must be the class indices, not actual values. This loss computes the logarithm only for the output index that the ground truth points to. So when the model output is, for example, [0.1, 0.2, 0.7] and the ground truth is 3 (if indexed from 1), the loss computes only the logarithm of 0.7. This doesn't change the final value, because in the regular version of categorical crossentropy the other terms are immediately multiplied by zero (due to the one-hot encoding). Thanks to that, it computes the logarithm once per instance and omits the summation, which leads to better performance. The formula might look like this:

$$J(\textbf{w}) = -\text{log}(\hat{y}_y),$$ where the subscript $y$ is the true class index.
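
As a rough sketch of that point (assuming tf.keras and 0-based class indices; the numbers are just illustrative), the manual lookup-and-log gives the same per-sample value as Keras:

```python
import numpy as np
import tensorflow as tf

y_pred = np.array([[0.1, 0.2, 0.7]])    # model output (softmax probabilities)
y_true = np.array([2])                  # ground-truth class index (0-based)

# Manual computation: take the probability at the true index and negate its log
manual = -np.log(y_pred[0, y_true[0]])
print(manual)                           # ~0.357

# Keras computes the same per-sample value
print(tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred).numpy())
```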

2

By the nature of your question, it sounds like you have 3 or more categories. However, for the sake of completeness I would like to add that if you are dealing with binary classification, using binary cross entropy might be more appropriate.

Furthermore, be careful to choose the loss and metric properly, since a mismatch can lead to unexpected and weird behaviour in the performance of your model.
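
For example (a minimal, hypothetical compile-time sketch assuming tf.keras; the layer sizes and input shape are made up), a two-class problem would typically use a single sigmoid output with binary_crossentropy and a matching metric:

```python
import tensorflow as tf

# Hypothetical binary classifier: one sigmoid output unit
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    loss="binary_crossentropy",      # binary loss for two classes
    optimizer="adam",
    metrics=["binary_accuracy"],     # metric matches the loss/label format
)
```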