
I understand that for multi-class classification the correct loss to use is categorical cross-entropy. However, when performing mixup as a regularisation technique, two samples $(X_1, y_1)$ and $(X_2, y_2)$ are combined to create a new sample $(X_{new}, y_{new}) = \lambda(X_1, y_1) + (1-\lambda)(X_2, y_2)$, which effectively gives the new sample two labels with different weights.
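For concreteness, here is a minimal sketch of how such a mixed sample can be constructed (the function name and `alpha` value are illustrative; sampling $\lambda$ from a Beta distribution follows the original mixup paper):

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2):
    """Illustrative mixup of two samples with one-hot labels y1, y2."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient lambda
    x_new = lam * x1 + (1.0 - lam) * x2      # convex combination of inputs
    y_new = lam * y1 + (1.0 - lam) * y2      # soft label; entries sum to 1
    return x_new, y_new
```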

My question is: should I be using categorical cross-entropy, because we are classifying non-mixed samples during evaluation, or binary cross-entropy, because the training has effectively become a multi-label classification problem?

Edit: just to clarify, this is a multi-class classification problem where all 100 classes are mutually exclusive; however, during training mixup can cause a sample to be labelled with two classes, where class $i$ has label weight $\lambda$ and class $j$ has label weight $1 - \lambda$. The two losses I am comparing are specifically `keras.losses.BinaryCrossentropy` and `keras.losses.CategoricalCrossentropy`. During evaluation, samples can only be labelled with one class.
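For reference, a minimal sketch of how the two losses behave on such a mixed label, assuming TensorFlow's Keras and an illustrative 4-class example (the numbers are hypothetical):

```python
import numpy as np
import tensorflow as tf

# A mixup label with lambda = 0.7 on class 0 and 0.3 on class 2.
y_true = np.array([[0.7, 0.0, 0.3, 0.0]], dtype=np.float32)

exact = np.array([[0.7, 0.0, 0.3, 0.0]], dtype=np.float32)  # matches the soft label
wrong = np.array([[0.4, 0.2, 0.2, 0.2]], dtype=np.float32)  # does not match

cce = tf.keras.losses.CategoricalCrossentropy()
bce = tf.keras.losses.BinaryCrossentropy()

# Both losses accept soft labels; both are smallest when the prediction
# equals the soft label (CCE's floor is the entropy of y_true, not 0).
print(float(cce(y_true, exact)), float(cce(y_true, wrong)))
print(float(bce(y_true, exact)), float(bce(y_true, wrong)))
```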

Avelina
  • The new sample is a convex combination of the two inputs. If the input labels match, then the mixed label is either 0 or 1. If the labels don't match, then the label is either $\lambda$ or $1-\lambda$. In any of these four cases, the BCE loss works because it achieves its minimum when the model predicts the correct label exactly, regardless of whether the label is 0, 1, or in between (a numeric check of this appears after these comments). – Sycorax Jun 29 '21 at 17:48
  • @Sycorax perfect explanation, thank you! Additionally, should the output layer be using sigmoid activation as opposed to softmax? On one hand sigmoid is the 'standard' for multi-label with BCE, however I feel softmax may be more suited since the sample labels will always sum to exactly 1. – Avelina Jun 29 '21 at 18:00
  • Both sum to 1. For a binary outcome, we can write $P(A) + P(A^c)=P(y=1)+P(y=0)=1$. For binary events, the difference in outputs between sigmoid and softmax is that a sigmoid output solely gives $P(A)=P(y=1)$, while a softmax output gives both $P(y=0)$ and $P(y=1)$. More broadly, you can show that for 2 classes, sigmoid is a special case of softmax (see the second sketch after these comments). – Sycorax Jun 29 '21 at 19:24
  • @Sycorax yes I completely understand that for the 2 class case, however I have 100 classes, not just 2. – Avelina Jun 29 '21 at 21:20
  • Can you [edit] your post to clarify the two losses that you’re comparing? And are the classes mutually exclusive? – Sycorax Jun 29 '21 at 21:30
  • @Sycorax added additional information. – Avelina Jun 29 '21 at 21:51
  • The documentation says that `keras.losses.BinaryCrossentropy` is for the case of 2 classes ("Use this cross-entropy loss for binary (0 or 1) classification applications.") but you have 100. The documentation for `keras.losses.CategoricalCrossentropy` says "Use this crossentropy loss function when there are two or more label classes." Does this answer your question? – Sycorax Jun 29 '21 at 21:55
  • @Sycorax that's what it says; however, it can be used with more than 2 classes. I looked at the source code, and when there is more than 1 output logit it simply computes BCE for each logit and returns the mean. There are also dozens of online tutorials which use BCE for multi-class multi-label classification in Keras. – Avelina Jun 29 '21 at 22:00
  • I'm surprised that the source code is doing that. I guess my question to you is "what is the negative log-likelihood that you want to minimize?" It's not necessarily the case that Keras will implement a loss for the likelihood that you care about. I could see a case for either one, or some third option, depending on how you're thinking about your data. – Sycorax Jun 29 '21 at 22:01
  • For instance, this concept is developed in the context of pixel intensities here https://stats.stackexchange.com/questions/206925/is-it-okay-to-use-cross-entropy-loss-function-with-soft-labels and here https://stats.stackexchange.com/questions/490062/can-we-derive-cross-entropy-formula-as-maximum-likelihood-estimation-for-soft-la – Sycorax Jun 30 '21 at 21:48
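
A quick numeric check of the first comment above: per-class binary cross-entropy against a soft target $\lambda$ is minimised when the predicted probability equals $\lambda$ itself. A minimal sketch in NumPy:

```python
import numpy as np

# Per-class BCE against a soft target lambda:
#   f(p) = -lam * log(p) - (1 - lam) * log(1 - p)
# Its derivative vanishes at p = lam, so the minimiser is the soft label.
lam = 0.7
p = np.linspace(0.01, 0.99, 99)        # grid of candidate predicted probabilities
bce = -lam * np.log(p) - (1.0 - lam) * np.log(1.0 - p)
print(p[np.argmin(bce)])               # ~0.7, i.e. p = lambda
```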
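And a check of the sigmoid/softmax relationship mentioned above: for two classes, softmax over the logits $[z, 0]$ reduces to the sigmoid of $z$.

```python
import numpy as np

# For two classes, softmax over logits [z, 0] equals sigmoid(z):
#   e^z / (e^z + e^0) = 1 / (1 + e^-z)
z = 1.5
logits = np.array([z, 0.0])
softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-z))
print(softmax[0], sigmoid)             # both ~0.8176
```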

0 Answers