Was just having a discussion with a colleague, and realized I have the following questions about the cross-entropy loss that is typically used in classification problems.
We know that cross-entropy decomposes into the entropy of the ground truth plus the forward KL divergence between the ground truth and the prediction, i.e. $H(p, q) = H(p) + D_{KL}(p \| q)$. Since the entropy term $H(p)$ does not depend on the prediction, minimizing cross-entropy should be equivalent to minimizing the forward KL divergence. Why do we use cross-entropy and not just the forward KL divergence?
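For concreteness, here is a small NumPy sketch of the decomposition I have in mind (the distributions `p` and `q` are made up purely for illustration):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # ground truth distribution (made up)
q = np.array([0.5, 0.3, 0.2])   # predicted distribution (made up)

entropy_p     = -np.sum(p * np.log(p))       # H(p)
cross_entropy = -np.sum(p * np.log(q))       # H(p, q)
kl_pq         =  np.sum(p * np.log(p / q))   # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q); H(p) is a constant w.r.t. q,
# so minimizing one should minimize the other.
print(np.isclose(cross_entropy, entropy_p + kl_pq))  # True
```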
What is the intuition behind using the forward KL rather than the reverse KL (or the symmetric JS divergence)?
I have seen another post discussing this very topic, where my takeaway was that because the ground truth has hard 0s and we want to avoid encountering $\log(0)$ in the loss function, we stick with $\text{true} \times \log(\text{pred})$. Is this indeed the broadly accepted view?
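To make the numerical issue I mean concrete, here is a small NumPy sketch (the one-hot label and prediction values are just for illustration):

```python
import numpy as np

true = np.array([0.0, 1.0, 0.0])   # one-hot ground truth (hard 0s)
pred = np.array([0.1, 0.8, 0.1])   # predicted probabilities (made up)

# KL computed naively needs log of the ground truth, which blows up on the 0s:
kl_naive = np.sum(true * np.log(true / pred))   # nan, since 0 * log(0) -> nan

# Cross-entropy only ever takes log(pred), so the hard 0s are harmless:
ce = -np.sum(true * np.log(pred))               # finite

print(kl_naive, ce)
```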
Thanks!