
In the context of (multiclass) classification, I've read papers which imply that NLL is minimized iff the model is well-calibrated (outputting the true probability for each class, not just a confidence score), but why is that?

My intuition tells me that since NLL takes into account only the confidence of the model's predicted class $p_i$, NLL is minimized as long as $p_i$ approaches $1$.

Thus, a model can be overconfident (not well-calibrated) and still minimize NLL.

Can someone elaborate on what I am missing here?

Paris
  • Does this answer your question? [What is "Likelihood Principle"?](https://stats.stackexchange.com/questions/491317/what-is-likelihood-principle) – Arya McCarthy Apr 08 '21 at 14:28
  • @AryaMcCarthy Could you elaborate on why this (probably) answers my question? – Paris Apr 08 '21 at 15:38
  • It is not. I would suggest you look for a question explaining what calibration means in this context. For example, https://stats.stackexchange.com/questions/499696/why-focal-loss-works/508616#508616. But there are many others – jpmuc May 11 '21 at 21:46

1 Answer


Without loss of generality, let's assume binary classification. For simplicity and illustration, let's assume that there is only one feature and that it takes only one value (that is, it's a constant). Since there are effectively no covariates, there is only one parameter to estimate here: the probability $p$ of the positive class. Given data, which in this case effectively consists of only $y$, learning or training becomes identical to the problem of parameter estimation for a binomial distribution, for which any standard statistics textbook contains a derivation like this:

Likelihood $\displaystyle L(p) = {n \choose k} p^k (1-p)^{n-k}$; take its log and set the derivative with respect to $p$ to zero, $\displaystyle \frac{\partial \log L(p)}{\partial p}=0$. Solving it gives $\hat{p} = \frac{k}{n}$.
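Explicitly, the log-likelihood and its derivative are

$$
\log L(p) = \log{n \choose k} + k\log p + (n-k)\log(1-p),
\qquad
\frac{\partial \log L(p)}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0
\;\Longrightarrow\;
\hat{p} = \frac{k}{n}.
$$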

Now, allow $n \rightarrow \infty$, and let the true but unknown probability of the positive class be $\pi$. The likelihood becomes $\displaystyle L(p) = {n \choose n\pi} p^{n\pi} (1-p)^{n(1-\pi)}$. Repeating the same steps as above, which is legitimate despite $n \rightarrow \infty$, gives $\hat{p} = \pi$. Perfect calibration, achieved through likelihood maximization.
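A quick simulation sketch of this (the value $\pi = 0.3$ below is an arbitrary choice for illustration): the MLE $\hat{p} = k/n$ settles at $\pi$, not at $1$.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.3                              # assumed true (unknown) probability of the positive class

for n in [100, 10_000, 1_000_000]:
    y = rng.binomial(1, pi, size=n)   # n Bernoulli(pi) labels; the only "data" in the toy setup
    p_hat = y.mean()                  # the MLE k/n derived above
    print(f"n = {n:>9,d}: p_hat = {p_hat:.4f}")   # approaches pi = 0.3 as n grows
```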

Allowing for covariates means one has to model $p(y=1|x)$ (say, using $1/\left(1+\exp{(-(\beta_0+\beta^T x))}\right)$ as in logistic regression). That model can be misspecified, so the likelihood is maximized only over a particular functional family (say, the logistic family above; this is sometimes called a "parametric restriction"), not over all possible families, which can leave the probabilities miscalibrated. If we allowed all possible functional families for $p(y=1|x)$, the likelihood would be truly maximized and perfect calibration achieved, just as in the toy example above. But that would understandably require infinite data, since it amounts to a parametric model with infinitely many parameters.
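Here is a hypothetical sketch of that effect (the data-generating process and the use of scikit-learn below are my own choices for illustration, not taken from any particular paper): the true $p(y=1|x)$ depends on $x^2$, so a logistic model that is linear in $x$ maximizes the likelihood only within the wrong family and ends up miscalibrated; adding the missing $x^2$ term restores calibration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-3, 3, size=n)

# True conditional probability is NOT logistic-in-x: it depends on x**2
p_true = 1 / (1 + np.exp(-(x**2 - 2)))
y = rng.binomial(1, p_true)

# Likelihood maximized only within the logistic family with a linear predictor in x
p_hat = LogisticRegression(C=1e6).fit(x.reshape(-1, 1), y).predict_proba(x.reshape(-1, 1))[:, 1]

# Compare average predicted probability vs. observed frequency in slices of x: they disagree
for lo, hi in [(-3, -2), (-1, 1), (2, 3)]:
    m = (x >= lo) & (x < hi)
    print(f"x in [{lo}, {hi}): predicted {p_hat[m].mean():.2f}, observed {y[m].mean():.2f}")

# Adding the missing x**2 feature makes the family correct, and calibration is restored
X2 = np.column_stack([x, x**2])
p_hat2 = LogisticRegression(C=1e6).fit(X2, y).predict_proba(X2)[:, 1]
for lo, hi in [(-3, -2), (-1, 1), (2, 3)]:
    m = (x >= lo) & (x < hi)
    print(f"x in [{lo}, {hi}): predicted {p_hat2[m].mean():.2f}, observed {y[m].mean():.2f}")
```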

I think your intuition misses the fact that the likelihood depends on the true probabilities through the exponents above, hence maximizing it brings the estimated probabilities close to the true ones, as opposed to close to $1$.
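To make this concrete, here is a small sketch (again with an arbitrary $\pi = 0.3$): the expected per-sample NLL, $-\big(\pi\log p + (1-\pi)\log(1-p)\big)$, is minimized at $p = \pi$, and pushing $p$ toward $1$ makes it worse. That is exactly the penalty for overconfidence.

```python
import numpy as np

pi = 0.3                                   # assumed true probability of the positive class

def expected_nll(p, pi=pi):
    """Expected per-sample NLL when the true positive rate is pi and the model always predicts p."""
    return -(pi * np.log(p) + (1 - pi) * np.log(1 - p))

p_grid = np.linspace(0.001, 0.999, 999)
print("argmin over the grid:", p_grid[np.argmin(expected_nll(p_grid))])  # ~0.3, i.e. pi
print("NLL at p = pi:       ", expected_nll(pi))    # ~0.611, the entropy of Bernoulli(0.3)
print("NLL at p = 0.99:     ", expected_nll(0.99))  # ~3.23, much larger: overconfidence is penalized
```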

Lei Huang
  • Can you please share the reference to the "calibration - NLL minimization correspondence" statement in your question by the way? – Lei Huang May 11 '21 at 08:25
  • The word “class” is too overloaded in this post. OP may confuse class labels/classification with functional classes. – Cagdas Ozgenc May 11 '21 at 19:16
  • Okay, edited the answer to use "functional family" instead of "functional class". – Lei Huang May 11 '21 at 19:22
  • I suspect, at a deep level, the statement in the question is related to the fact that maximizing likelihood is asymptotically [equivalent](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation#Relation_to_minimizing_Kullback%E2%80%93Leibler_divergence_and_cross_entropy) to minimizing the KL divergence or cross entropy between the true and predicted distributions. – Lei Huang May 11 '21 at 19:37