
Most of the classification models I've encountered so far perform classification using cross-entropy (CE) loss.

For example, if we have 2 possible classes and the ground-truth (GT) class is 1, then:

the CE loss will be $-\log{\frac{e^{x_1}}{e^{x_1} + e^{x_2}}}$,

where $x_1, x_2$ are the activations of the neurons for classes 1 and 2, respectively.

Whereas if we trained the two neurons separately using binary cross-entropy (BCE) loss, then

the sum of the two BCE losses would be $-\log{\frac{1}{1+e^{-x_1}}} - \log{\left(1 - \frac{1}{1 + e^{-x_2}}\right)}$.

According to the YOLO object detection papers, their classification is performed using BCE loss (logistic regression) on each of the neurons, rather than using CE loss with Softmax.
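To make the comparison concrete for $N$ classes as well, here is a minimal pure-Python sketch (my own illustration with made-up helper names, not code from the YOLO papers), assuming raw activations x and a one-hot GT list y:

>>> import math
>>> sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
# Softmax CE: a single loss term that couples all neurons through the normalization
>>> softmax_ce = lambda x, y: -math.log(math.exp(x[y.index(1)]) / sum(math.exp(v) for v in x))
# Sum of BCE: one independent logistic-regression loss per neuron
>>> sum_bce = lambda x, y: sum(-math.log(sigmoid(v) if t == 1 else 1.0 - sigmoid(v)) for v, t in zip(x, y))

With x = [-2, 2] and y = [1, 0], these reduce to the two-class expressions above and reproduce the numbers in the session below.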

I first thought that it would be pretty straightforward to show that the two expressions are equal, but failed. It actually turned out to be very straightforward to show that they are not equal, using the following Python code, which computes each loss value when the GT class is 1:

>>> import math
# sum of BCE losses: x1 is rewarded for being high, while x2 is rewarded for being low
>>> bce_loss = lambda x1,x2 : -math.log(1.0/(1.0+math.exp(-x1))) -math.log(1.0 - 1.0/(1.0+math.exp(-x2)))
# Softmax CE loss: the softmax probability of class 1 is rewarded for being high
>>> softmax_loss = lambda x1,x2 : -math.log(math.exp(x1) / (math.exp(x1) + math.exp(x2)))

# For example, if the activations of our model are x1=-2, x2=2 (such that our classifier
#  is going to be wrong), then our loss is going to be rather high. Yet we can
#  immediately tell that the two loss values aren't equal:
>>> bce_loss(-2,2)
4.2538560220859445
>>> softmax_loss(-2,2)
4.0181499279178094

So there is a difference between the CE loss and the sum of BCE losses. The question is: when and why is either of them preferable to the other?
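For what it's worth, in the two-class case the softmax probability of class 1 can be rewritten as a sigmoid of the logit difference, $\frac{e^{x_1}}{e^{x_1} + e^{x_2}} = \frac{1}{1 + e^{-(x_1 - x_2)}}$, so the CE loss depends only on the difference $x_1 - x_2$, whereas the sum of BCE losses penalizes each activation separately. A quick sanity check using the lambdas defined above (the helper ce_via_sigmoid is just for illustration):

# CE loss equals a single BCE-style loss applied to the logit difference x1 - x2
>>> ce_via_sigmoid = lambda x1,x2 : -math.log(1.0/(1.0+math.exp(-(x1-x2))))
>>> math.isclose(ce_via_sigmoid(-2,2), softmax_loss(-2,2))
True

In particular, the CE loss is invariant to adding the same constant to both activations, while the sum of BCE losses is not.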

  • Are you sure that these two "losses" correspond to the same probability model? There are lots of "cross-entropy losses," and they do not imply the same underlying model. See: https://stats.stackexchange.com/questions/378274/how-to-construct-a-cross-entropy-loss-for-general-regression-targets It's not a question of "more favorable," it's a question of what your model is. Note also that these functions are not losses *per se* because they do not involve the target variable $y$. – Sycorax Nov 18 '21 at 15:13
  • @Sycorax thank you for your answer. I'm not sure I fully understand what you mean. In the two cases there are two neurons, where in one case they are trained using CE with Softmax and in the other case each is trained separately using BCE with Sigmoid. In both cases, when one is trained to be 0, the other one is trained to be 1 and vice versa. If you reduce the softmax fraction by $e^{x_1}$ then you get kind of a Sigmoid, but apparently it behaves differently from BCE with Sigmoid. My question is why everybody uses Softmax whereas e.g. YOLO uses multiple BCEs? Is it just a heuristic choice? – SomethingSomething Nov 19 '21 at 21:45
  • BTW, in the case of two classes 1 neuron is enough. The second neuron is trained to be exactly the opposite of the first one, so actually it is redundant. But suppose that we have N classes... – SomethingSomething Nov 19 '21 at 21:49
  • Cross-entropy of what probability model? Because these two expressions are clearly not equal in general, I'm not sure why you'd expect them to produce the same result. So, it seems plausible that the two expressions arise from two different purposes -- perhaps two different probability models, such as the binary cross entropy of some event compared to the multi-label cross-entropy. That's what I mean. – Sycorax Nov 20 '21 at 22:21
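Following up on the comments about the underlying probability model, here is a minimal sketch (my own illustration, using an arbitrary example activation vector x for $N=3$ classes): softmax turns the activations into a single categorical distribution whose probabilities sum to 1, whereas independent sigmoids give one Bernoulli probability per class, which need not sum to 1:

>>> x = [-2.0, 2.0, 0.5]
# Softmax: one categorical distribution over mutually exclusive classes
>>> softmax_probs = [math.exp(v) / sum(math.exp(u) for u in x) for v in x]
>>> round(sum(softmax_probs), 6)
1.0
# Independent sigmoids: one Bernoulli probability per class
>>> sigmoid_probs = [1.0 / (1.0 + math.exp(-v)) for v in x]
>>> round(sum(sigmoid_probs), 3)
1.622

Roughly speaking, the softmax/CE formulation models exactly one label per sample, while the per-class sigmoid/BCE formulation models each class as an independent yes/no decision, which also allows several labels at once.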
