6

Neural network "classifiers" output probability scores, and when they are trained with crossentropy loss (the common choice) or another proper scoring rule, the expected loss is minimized by predicting the true probabilities of class membership.

However, I have read on Cross Validated and perhaps elsewhere that neural networks are notorious for being overly confident. That is, they will be happy to predict something like $P(1) = 0.9$ when they should be predicting $P(1) = 0.7$, which still favors class $1$ over class $0$ but by less.

If neural networks are optimizing a proper scoring rule like crossentropy loss, how can this be?

All that comes to mind is that the model development steps optimize improper metrics like accuracy. Sure, the model in cross validation is fitted to the training data using crossentropy loss, but the hyperparameters are tuned to get the highest out-of-sample accuracy, not the lowest crossentropy loss.

(But then I figure the model would be less confident in its predictions, not more. Why be confident when a low-confidence prediction like $0.7$ yields the same classification as a high-confidence prediction like $0.9$?)
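Here is a minimal sketch of the proper-scoring-rule claim, assuming the true probability of class $1$ is $0.7$: the expected crossentropy is minimized by predicting $0.7$ itself, not anything more extreme.

  import numpy as np

  # True P(1) (assumed for illustration) and a grid of candidate predictions q
  p = 0.7
  q = np.linspace(0.01, 0.99, 99)

  # Expected cross-entropy when y ~ Bernoulli(p):
  # E[-y*log(q) - (1-y)*log(1-q)] = -p*log(q) - (1-p)*log(1-q)
  expected_ce = -p * np.log(q) - (1 - p) * np.log(1 - q)

  print('minimizer:', q[np.argmin(expected_ce)])  # ~0.7, the true probability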

Dave
  • Good question. I suspect part of the answer is that you can overfit to proper scoring rules just as easily as to other KPIs if you use them in-sample. After all, OLS is fitted by maximizing the log likelihood, which is the log score, a proper scoring rule - but that OLS can overfit is common knowledge. – Stephan Kolassa Jun 30 '21 at 14:38
  • I think the key term to google is Expected Calibration Error (ECE). I suspect this post will answer your question http://alondaks.com/2017/12/31/the-importance-of-calibrating-your-deep-model/ – Aleksejs Fomins Jun 30 '21 at 15:20
  • @StephanKolassa Why would that be so unique to neural networks and not logistic regression? Is it a matter of a neural network having (perhaps) millions of parameters but the logistic regression maybe having dozens? – Dave Jul 02 '21 at 16:48
  • @Dave: yes, that makes sense. Logistic regression can also overfit if you over-parameterize it. And conversely, I would not expect a simple network architecture to overfit badly. – Stephan Kolassa Jul 02 '21 at 19:25
  • @StephanKolassa indeed it can. LR can even overfit when it is not over-parameterised, which is why regularised (ridge) logistic regression is a very useful tool to have in your statistics toolbox. – Dikran Marsupial Jul 26 '21 at 07:10
  • @StephanKolassa I found an ICML paper by Guo, ["On calibration of modern neural networks"](http://proceedings.mlr.press/v70/guo17a/guo17a.pdf), that seems to align with what I posit. [I think Guo misses some elements of calibration](https://stats.stackexchange.com/questions/552533/does-guos-on-calibration-of-modern-neural-networks-miss-the-probabilities-of), but the paper does mention that log loss (the paper calls it "NLL", if you are doing CTRL+F) can be overfitted without overfitting the accuracy based on the category with the highest probability. – Dave Nov 17 '21 at 17:04

2 Answers

4

"If neural networks are optimizing a proper scoring rule like cross-entropy loss, how can this be?"

This is likely to be traditional over-fitting of the training data. A deep neural network can implement any mapping that a radial basis function neural network can implement (they are both universal approximators). Consider a problem with a small data set and a narrow width for the Gaussian radial basis functions. It is possible that you might be able to place a basis function directly over each positive pattern, such that its value has decayed to nearly zero by the time you get to the nearest negative pattern. This model will give a probability of class membership of essentially zero or one for every training pattern (probably way over-confident) and a training-set cross-entropy of zero. This means there will also be a zero cross-entropy solution for a suitably large deep neural network as well (the good thing is that this solution is a lot harder to find for a DNN - sometimes local minima are a good thing).
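A rough sketch of that failure mode (using RBF kernel features plus an almost unregularized logistic regression as a stand-in for the network; the data set, gamma and C are arbitrary choices for illustration, not anything from the answer above):

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import log_loss
  from sklearn.metrics.pairwise import rbf_kernel

  rng = np.random.default_rng(0)

  # Small, noisy binary problem: the true P(1) is only ever 0.7 or 0.3
  X = rng.normal(size=(60, 2))
  y = rng.binomial(1, np.where(X[:, 0] > 0, 0.7, 0.3))
  X_test = rng.normal(size=(2000, 2))
  y_test = rng.binomial(1, np.where(X_test[:, 0] > 0, 0.7, 0.3))

  # Very narrow Gaussian basis functions centred on the training points
  gamma = 200.0
  Phi = rbf_kernel(X, X, gamma=gamma)
  Phi_test = rbf_kernel(X_test, X, gamma=gamma)

  # Almost no regularization, so the training cross-entropy can be driven towards zero
  clf = LogisticRegression(C=1e6, max_iter=10000).fit(Phi, y)

  print('train log loss:', log_loss(y, clf.predict_proba(Phi)))            # close to 0
  print('test  log loss:', log_loss(y_test, clf.predict_proba(Phi_test)))  # far from 0
  print('train probabilities:', np.round(clf.predict_proba(Phi)[:5, 1], 3))  # near 0 or 1, not the true 0.7/0.3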

Making architecture or hyper-parameter choices gives more ways in which to over-fit the data, but I suspect the largest part of the problem is traditional over-fitting of the training set, unless steps are taken to avoid it.

BTW, using cross-entropy as the model selection criterion for tuning the model is not without its own problems: for instance, if you have one very confident misclassification, the entire cross-entropy is dominated by the contribution of that one test example. Something a little less sensitive, like the Brier score, might be better (if less satisfying).
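As a small illustration with toy numbers (not from a real model): 20 held-out cases that are all classified correctly and reasonably calibrated, then the same set with one catastrophically confident mistake. The mean cross-entropy is dominated by that single case, while the Brier score stays bounded and accuracy barely moves.

  import numpy as np

  def mean_log_loss(y, p):
      # average cross-entropy; -log(0) is infinite, so one "impossible" case can ruin it
      return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

  def brier(y, p):
      # mean squared error on the probabilities, bounded between 0 and 1
      return float(np.mean((p - y) ** 2))

  y = np.array([1] * 10 + [0] * 10)
  p = np.array([0.8] * 10 + [0.2] * 10)
  print(mean_log_loss(y, p), brier(y, p), np.mean((p > 0.5) == y))   # approx 0.22, 0.04, 1.0

  # One positive case the model is (almost) certain is negative
  p_bad = p.copy()
  p_bad[0] = 1e-12
  print(mean_log_loss(y, p_bad), brier(y, p_bad), np.mean((p_bad > 0.5) == y))  # approx 1.59, 0.09, 0.95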

Dikran Marsupial
  • This is interesting and gets a +1 from me, but then why would the accuracy (or something similar) be decent out-of-sample? – Dave Jul 26 '21 at 10:00
  • I wouldn't recommend using accuracy for model selection either (because it is brittle). The problem is that you can have decent (but not perfect) out-of-sample accuracy, but the model can be very confident in one of its out-of-sample misclassifications, which results in an essentially infinitely bad out-of-sample cross-entropy. There are ways of getting round this, such as placing a limit on the plausible confidence of the model, but something like the Brier score may be easier. – Dikran Marsupial Jul 26 '21 at 10:09
  • I use least-squares support vector machines a fair bit, and it seems consistent to use the least-squares error as the out-of-sample criterion for model selection, since it is also used in-sample for fitting the model. It would be nice to use out-of-sample cross-entropy for model selection for kernel logistic regression, but you run into the problem I mentioned earlier in some cases. – Dikran Marsupial Jul 26 '21 at 10:11
  • BTW I did a bit of a comparative study on this sort of thing for LSSVMs for a machine learning challenge at a conference https://ieeexplore.ieee.org/abstract/document/1716307 (pre-print here: http://theoval.cmp.uea.ac.uk/publications/pdf/ijcnn2006a.pdf ). Looks like least-squares based model selection criterion (PRESS) was a good approach. – Dikran Marsupial Jul 26 '21 at 10:19
  • It’s the infinitely-bad crossentropy that’s tripping me up. If a developer is optimizing an out-of-sample metric like accuracy (probably even AUC), one bad miss like that should get washed out by the other good predictions, but the out-of-sample crossentropy loss will be poor. That should alert the modeler that something weird has happened…which gets missed by looking at the threshold-based metrics. – Dave Jul 26 '21 at 10:27
  • The cross-entropy includes a $t_i \log(y_i)$ term, where $y_i$ is an output of the model and $t_i$ the desired output. If $y_i = 0$ and $t_i = 1$, then this will be the log of zero, which is infinite. Accuracy only cares whether $y_i$ was the right side of 0.5, but doesn't care how wrong it is when it is wrong, so one catastrophically confident error is no big deal. AUC is a measure of the quality of the ranking, but is bounded, so is better in this sense than XENT, but doesn't measure the calibration of the probabilities. – Dikran Marsupial Jul 26 '21 at 10:41
  • The problem is that if you use xent for a model selection metric, rather than just a diagnostic measure, then it can lead the model selection astray just because of a problem with one individual test case, rather than seeing the big picture. It isn't really weird, it is just an indication of over-fitting the training sample, which will also be picked up by e.g. Brier score, but without the problem caused by the unboundedness of xent. – Dikran Marsupial Jul 26 '21 at 10:43
0

There are some interesting properties of the cross-entropy loss (and likewise the logistic loss).

Not only do we want to classify the instance correctly, we also want the model to have strong opinions.

Please check the following example; note that weakly but correctly classifying a data point is almost as bad as weakly and wrongly classifying it.

  import numpy as np

  # Cross-entropy for a single example with a one-hot target vector
  def cross_entropy(pred_prob, target):
      return -np.sum(np.log(pred_prob) * target)

  target = np.array([1,0,0])

  pred_prob = np.array([1/3,1/3,1/3])
  print('3 classes, even dist\t',cross_entropy(pred_prob, target))

  pred_prob = np.array([0.3334,0.3333,0.3333])
  print('3 classes, weakly right \t',cross_entropy(pred_prob, target))

  pred_prob = np.array([0.3333,0.3333,0.3334])
  print('3 classes, weakly wrong \t',cross_entropy(pred_prob, target))
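To see the flip side, here is a small added contrast (the cross_entropy function is repeated so the snippet runs on its own): a confidently correct prediction earns a near-zero loss, while a confidently wrong one is punished very heavily, which is the pressure towards strong opinions mentioned above.

  import numpy as np

  # Same cross-entropy as above, repeated so this runs stand-alone
  def cross_entropy(pred_prob, target):
      return -np.sum(np.log(pred_prob) * target)

  target = np.array([1, 0, 0])

  pred_prob = np.array([0.98, 0.01, 0.01])
  print('3 classes, confidently right\t', cross_entropy(pred_prob, target))  # ~0.02

  pred_prob = np.array([0.01, 0.98, 0.01])
  print('3 classes, confidently wrong\t', cross_entropy(pred_prob, target))  # ~4.61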

I also had a related question here

What are the impacts of choosing different loss functions in classification to approximate 0-1 loss

In addition, as discussed in the other answer, one major reason a model is confidently wrong is overfitting.

I recently had some interesting ideas on what may be happening inside a DNN. I think the overfitted model is essentially learning some "hash functions" and memorizing the mapping from "the hash to the target".

In this way, it is easy to get a very low loss on the training data. And because this "hash" is specific to certain training examples, the model will very likely be overconfident and have a very strong opinion in favour of one class.

Haitao Du