Most of the classification models I've encountered so far are trained with cross-entropy (CE) loss.
For example, if we have 2 possible classes and the ground-truth (GT) class is 1, then:
the CE loss will be $-\log{\frac{e^{x_1}}{e^{x_1} + e^{x_2}}}$,
where $x_1, x_2$ are the activations of the neurons for classes 1 and 2, respectively.
If instead we trained the two neurons separately using BCE loss, then
the sum of the two BCE losses would be: $-\log{\frac{1}{1+e^{-x_1}}} - \log{\left(1 - \frac{1}{1 + e^{-x_2}}\right)}$.
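For reference, this is just the standard per-neuron binary cross-entropy with a sigmoid activation $\sigma(x_i) = \frac{1}{1 + e^{-x_i}}$ and targets $y_1 = 1$, $y_2 = 0$:

$$\mathrm{BCE}(x_i, y_i) = -y_i \log{\sigma(x_i)} - (1 - y_i) \log{(1 - \sigma(x_i))},$$

so the two summands are $-\log{\sigma(x_1)}$ and $-\log{(1 - \sigma(x_2))}$.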
According to the YOLO object detection papers, classification is performed with BCE loss (logistic regression) applied to each neuron independently, rather than with Softmax followed by CE loss.
I first thought it would be pretty straightforward to show that the two expressions are equal, but I failed to do so. In fact, it turned out to be very straightforward to show that they are not equal, using the following Python code, which computes each of the loss values when the GT class is 1:
>>> import math
# Sum of BCE losses: x1 is rewarded for being high, while x2 is rewarded for being low
>>> bce_loss = lambda x1, x2: -math.log(1.0 / (1.0 + math.exp(-x1))) - math.log(1.0 - 1.0 / (1.0 + math.exp(-x2)))
# Softmax + CE loss: the softmax probability of class 1 is rewarded for being high
>>> softmax_loss = lambda x1, x2: -math.log(math.exp(x1) / (math.exp(x1) + math.exp(x2)))
# For example, if the model's activations are x1 = -2, x2 = 2 (so the classifier
# predicts the wrong class), both losses are rather high, yet we can
# immediately see that the two loss values are not equal:
>>> bce_loss(-2,2)
4.2538560220859445
>>> softmax_loss(-2,2)
4.0181499279178094
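For completeness, the same comparison can be reproduced with a deep-learning framework's built-in losses. The snippet below is a minimal sketch assuming PyTorch: CrossEntropyLoss applies Softmax + CE to the raw activations, while BCEWithLogitsLoss applies a sigmoid to each neuron and sums the per-neuron BCE terms when reduction='sum'.

import torch
import torch.nn as nn

# Activations x1 = -2, x2 = 2 for a single sample
logits = torch.tensor([[-2.0, 2.0]])

# Softmax + CE loss: the target is the index of the GT class (class 1 -> index 0)
ce = nn.CrossEntropyLoss()(logits, torch.tensor([0]))

# Sum of BCE losses: one-hot target, a sigmoid is applied to each neuron independently
bce = nn.BCEWithLogitsLoss(reduction='sum')(logits, torch.tensor([[1.0, 0.0]]))

print(ce.item())   # ~4.0181, matches softmax_loss(-2, 2)
print(bce.item())  # ~4.2539, matches bce_loss(-2, 2)

Here BCEWithLogitsLoss treats each output neuron as an independent binary classifier, which is exactly the per-neuron logistic regression described above.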
So there is indeed a difference between CE loss and the sum of BCE losses. The question is: when and why is either of them preferable to the other?