My understanding is that in ML one can establish a connection between these quantities using the following line of reasoning:
Assuming we plan to use ML to make decisions, we choose to minimize our risk against a well-defined loss function that scores those decisions. Since we usually don't know the true distribution of the data, we can't directly minimize this risk (our expected loss), and instead minimize the empirical risk (ER), or the structural risk if using regularization. It's empirical because we compute it as the average of the loss function over observed data.
If we assume that our model can output probabilities for those decisions, and we are solving a problem that involves hard decisions for which we have some ground-truth examples, we can model the optimization of those decisions as minimizing ER with a cross-entropy (CE) loss function, and thus treat decision making as a classification problem. Under this loss, the ER is actually the same as (not just equivalent to) the average negative log-likelihood (NLL) of the model on the observed data. So one can interpret minimizing ER as finding an MLE solution for our probabilistic model given the data.
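To make this step concrete, here is a minimal sketch of what I mean, assuming NumPy and a made-up toy problem (the logits, labels, and sizes are hypothetical and chosen only for illustration). It computes the empirical risk under a cross-entropy loss with one-hot targets and, separately, the average negative log-likelihood of the observed labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 5 examples, 3 classes (illustrative values only).
n_examples, n_classes = 5, 3
logits = rng.normal(size=(n_examples, n_classes))
labels = rng.integers(0, n_classes, size=n_examples)

# Model probabilities q(y | x) via a numerically stable softmax.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Empirical risk: mean cross-entropy between one-hot targets and the model.
one_hot = np.eye(n_classes)[labels]
empirical_risk = -(one_hot * np.log(probs)).sum(axis=1).mean()

# Average negative log-likelihood of the observed labels under the model.
avg_nll = -np.log(probs[np.arange(n_examples), labels]).mean()

print(empirical_risk, avg_nll)
```

The two printed numbers should agree up to floating-point error, since the one-hot target simply selects the log-probability of the observed class. The total (unaveraged) NLL differs from the ER only by the constant factor n, which does not change the minimizer.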
From the above, we can also establish that minimizing the CE is equivalent to minimizing the KL divergence between the true model (P) that generates the actual data and decisions and our model (e.g. Q) for generating decisions. This is apparently a nice result, because one can argue that while we don't know the true data-generating (optimal decision-making) distribution, we can establish that we are doing "our best" to estimate it, in a KL sense. However, CE is not the same as KL: they measure different things and in general take on different values.
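As a sanity check on this step, here is a small sketch, again assuming NumPy and made-up distributions (`p`, `q`, and `q2` are hypothetical). It uses the standard identity H(P, Q) = KL(P || Q) + H(P), where H(P) does not depend on the model Q.

```python
import numpy as np

# Hypothetical distributions over 4 outcomes: p plays the role of the
# "true" distribution, q and q2 are two candidate models.
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
q2 = np.array([0.1, 0.2, 0.35, 0.35])

def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i log q_i."""
    return -np.sum(p * np.log(q))

def entropy(p):
    """H(P) = -sum_i p_i log p_i (assumes all p_i > 0)."""
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i log(p_i / q_i) (assumes all p_i, q_i > 0)."""
    return np.sum(p * np.log(p / q))

ce = cross_entropy(p, q)
kl = kl_divergence(p, q)
h = entropy(p)

# Comparing two models: the CE gap and the KL gap are identical,
# because the H(P) term cancels.
ce_gap = cross_entropy(p, q) - cross_entropy(p, q2)
kl_gap = kl_divergence(p, q) - kl_divergence(p, q2)

print(ce, kl + h)
print(ce_gap, kl_gap)
```

So the two quantities differ in value by H(P), but they rank models identically and have the same gradient with respect to Q's parameters whenever P is held fixed. With one-hot targets H(P) is zero (using the convention 0 log 0 = 0), in which case the two values coincide as well.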
Is the above line of reasoning correct? Or do people e.g. use cross-entropy and KL divergence for problems other than classification? Also, does the "CE ≡ KL ≡ NLL" equivalence relationship (in terms of optimization solutions) always hold?
In either case, what is minimized in practice directly (KL vs the CE) and in what circumstances?
Motivation
Consider the following from a question on this site:
"The KL divergence can depart into a Cross-Entropy of p and q (the first part), and a global entropy of ground truth p (the second part). ... [From the comments] In my own experience ... BCE is way more robust than KL. Basically, KL was unusable. KL and BCE aren't "equivalent" loss functions".
I have read similar statements online: that these two quantities are not the same, and that in practice we use one or the other for optimization. Is that actually the case? If so, which quantity is actually evaluated and optimized directly in practice, for what types of problems, and why?
Related questions: