I found this answer, but I don't fully understand it. If I have three labels in a multi-label classification task, does BCE produce three separate outputs? Why shouldn't we use CCE?
In this Facebook work, they claim that, counter-intuitively, Categorical Cross-Entropy loss (i.e. Softmax loss) worked better than Binary Cross-Entropy loss in their multi-label classification problem.
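To make the question concrete, here is a minimal sketch of what the two losses compute for three labels. The logits and targets are made-up values, and the function names (`bce_loss`, `cce_loss`) are hypothetical. For the CCE case I assume the multi-hot target is normalized so that each positive label gets an equal share of the probability mass; that is one common way to apply softmax to multi-label targets, not necessarily the exact setup in the linked work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Subtract the max for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def bce_loss(logits, targets):
    # BCE: one independent sigmoid per label, so three labels give
    # three separate probabilities that need not sum to 1.
    probs = [sigmoid(z) for z in logits]
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p, t in zip(probs, targets)]
    return sum(terms) / len(terms)

def cce_loss(logits, targets):
    # CCE: a single softmax over all labels, so the probabilities
    # compete and always sum to 1.
    # Assumption: the multi-hot target is divided by the number of
    # positive labels so that it forms a valid distribution.
    probs = softmax(logits)
    k = sum(targets)
    return -sum((t / k) * math.log(p) for p, t in zip(probs, targets))

logits = [2.0, 1.5, -1.0]   # made-up raw scores for three labels
targets = [1, 1, 0]         # labels 0 and 1 are both present

sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

print("sigmoid probs:", sig, "sum =", sum(sig))
print("softmax probs:", soft, "sum =", sum(soft))
print("BCE:", bce_loss(logits, targets))
print("CCE:", cce_loss(logits, targets))
```

So yes, with BCE the network has three outputs and each one is treated as its own yes/no problem. With CCE the three outputs share one probability budget, which is why it is usually considered a poor fit when several labels can be true at once, and why the result quoted above is described as counter-intuitive.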