Was just having a discussion with a colleague, and realized I have the following questions about the cross-entropy loss that is typically used in classification problems.
We know that cross-entropy decomposes into the entropy of the ground truth plus the forward KL divergence between the ground truth and the prediction, i.e. $H(p, q) = H(p) + D_{KL}(p \| q)$. Since the entropy term $H(p)$ does not depend on the prediction, minimizing cross-entropy should be equivalent to minimizing the forward KL divergence. Why do we use cross-entropy and not just the forward KL divergence?
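For concreteness, here is a small NumPy sketch of the decomposition I have in mind (the distributions `p` and `q` are made up purely for illustration):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # ground truth distribution (made up)
q = np.array([0.5, 0.3, 0.2])   # predicted distribution (made up)

entropy_p     = -np.sum(p * np.log(p))       # H(p)
cross_entropy = -np.sum(p * np.log(q))       # H(p, q)
kl_pq         =  np.sum(p * np.log(p / q))   # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q); H(p) is a constant w.r.t. q,
# so minimizing one should minimize the other.
print(np.isclose(cross_entropy, entropy_p + kl_pq))  # True
```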
What is the intuition behind using the forward KL rather than the reverse KL (or the symmetric JS divergence)?
I have seen another post discussing this very topic, where my takeaway was that because the ground truth has hard 0s and we want to avoid encountering $\log(0)$ in the loss function, we stick with $\text{true} \times \log(\text{pred})$. Is this indeed the broadly accepted view?
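To make the numerical issue I mean concrete, here is a small NumPy sketch (the one-hot label and prediction values are just for illustration):

```python
import numpy as np

true = np.array([0.0, 1.0, 0.0])   # one-hot ground truth (hard 0s)
pred = np.array([0.1, 0.8, 0.1])   # predicted probabilities (made up)

# KL computed naively needs log of the ground truth, which blows up on the 0s:
kl_naive = np.sum(true * np.log(true / pred))   # nan, since 0 * log(0) -> nan

# Cross-entropy only ever takes log(pred), so the hard 0s are harmless:
ce = -np.sum(true * np.log(pred))               # finite

print(kl_naive, ce)
```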
Thanks!