Imagine we have pictures of three animals: dogs, cats, and horses. We train an image classifier and compute a confusion matrix, from which we notice that the model tends to predict that dogs are horses.
But then we read Cross Validated and learn that threshold-based scoring rules like accuracy have serious flaws, so we want to look at the predicted probabilities themselves rather than just taking the category with the highest probability.
Is there a way to adapt the confusion matrix to the probability outputs so that we can still see that the model tends to believe that dogs are horses? The idea that comes to mind is to sum the predicted probability vectors within each true class (this reduces to the ordinary confusion matrix if we first "round" each probability vector so that the category with the highest predicted probability gets probability $1$ and the rest get $0$). Has this been explored in any literature?
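To make the idea concrete, here is a minimal sketch of what I mean (the function name `soft_confusion_matrix` and the toy numbers are my own illustration, not from any reference):

```python
import numpy as np

def soft_confusion_matrix(y_true, y_prob):
    """Sum predicted probability vectors grouped by true class.

    y_true : (n,) integer class labels
    y_prob : (n, k) predicted probabilities, each row summing to 1

    Entry [i, j] is the total probability mass the model assigned
    to class j across all samples whose true class is i.
    """
    n_classes = y_prob.shape[1]
    cm = np.zeros((n_classes, n_classes))
    for label, probs in zip(y_true, y_prob):
        cm[label] += probs
    return cm

# Toy example with classes 0 = dog, 1 = cat, 2 = horse (made-up numbers):
y_true = np.array([0, 0, 1, 2])
y_prob = np.array([
    [0.5, 0.1, 0.4],   # a dog the model half-believes is a horse
    [0.4, 0.1, 0.5],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

print(soft_confusion_matrix(y_true, y_prob))

# "Rounding" each row to a one-hot vector on the argmax recovers the
# ordinary (hard) confusion matrix:
hard = np.eye(y_prob.shape[1])[y_prob.argmax(axis=1)]
print(soft_confusion_matrix(y_true, hard))
```

In this toy example the dog row of the soft matrix has substantial mass in the horse column even though one of the two dogs is still argmax-classified correctly, which is exactly the kind of signal the hard confusion matrix can hide.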