
A machine learning classifier can be calibrated so that, when it predicts a probability of 0.6 that datapoint $i$ is of class A, that prediction is correct 60% of the time.

In the binary setting, this can be visualised with a reliability curve, or measured with a metric like Mean Calibration Error, which is the weighted root-mean-squared error between predicted probabilities and true probabilities on a calibration plot (see here).
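For reference, a binary reliability curve can be produced with scikit-learn's `calibration_curve`; a minimal sketch on made-up toy data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Toy data: predicted probabilities and labels drawn so the model is
# perfectly calibrated (purely illustrative).
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)
y_true = (rng.uniform(size=1000) < y_prob).astype(int)

# Bin the predictions and compute the empirical positive frequency per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```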

My question is: how do you extend this to the multiclass setting? Clearly it can't be visualised with a single reliability curve, but is one curve per class appropriate? Or does this depend on the classifier being used (for example, I'm using an SVM with OVA)? Is the Brier score or log-loss the best way to go (the volatility of log-loss puts me off a bit), or is it possible to extend Mean Calibration Error to multiclass, and if so, how? (Another possibility is CAL, defined here.)

TMrtSmith
  • Flag to @B_Miner re: https://stats.stackexchange.com/questions/6599/calibrating-a-multi-class-boosted-classifier/6665#6665 – TMrtSmith Aug 16 '18 at 15:13
  • Definitely have a look at the link in the prior comment - one of the referenced papers describes how to treat a multiclass problem as binary OVR with individually calibrated probabilities, which can then be normalised across the individual classes. – songololo Jun 27 '19 at 09:45
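Regarding the OVR approach described in the second comment: one practical route is scikit-learn's `CalibratedClassifierCV`, which for multiclass input fits a calibrator per class in a one-vs-rest fashion and then normalises the resulting probabilities across classes. A minimal sketch on a synthetic dataset (the parameter choices here are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_classes=3,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Per-class sigmoid (Platt scaling) on the SVM's decision values,
# one-vs-rest, with the calibrated probabilities renormalised to sum to 1
svm = SVC(decision_function_shape="ovr")
clf = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)  # each row sums to 1
```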

1 Answer


Following Guo et al., I ended up using the Expected Calibration Error, defined as $$\mathrm{ECE} = \sum_{m=1}^{M}\frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$$ where the $n$ predictions are grouped by confidence into $M$ equal-width bins $B_m$, $\mathrm{acc}(B_m)$ is the accuracy within bin $B_m$, and $\mathrm{conf}(B_m)$ is the average predicted confidence within it.

In extending this to the multiclass setting, one can take the maximum predicted probability for each prediction as its confidence, or, if desired, average the error over the top $k$ predicted classes.
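A minimal NumPy sketch of the binned ECE above, using the maximum predicted probability as the confidence (function and variable names are my own):

```python
import numpy as np

def expected_calibration_error(y_true, probs, n_bins=15):
    """Binned ECE: sum_m |B_m|/n * |acc(B_m) - conf(B_m)|.

    y_true : (n,) integer class labels
    probs  : (n, n_classes) predicted class probabilities
    """
    confidences = probs.max(axis=1)     # top-1 confidence per prediction
    predictions = probs.argmax(axis=1)  # predicted class
    accuracies = (predictions == y_true).astype(float)

    # Equal-width confidence bins over (0, 1], as in Guo et al.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |B_m|/n is the fraction of samples falling into this bin
            ece += in_bin.mean() * abs(accuracies[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece
```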

TMrtSmith