We have two classifiers, A and B, trained on the same training set and then evaluated on the same test set. Both classifiers output a risk score, i.e. the predicted probability of the outcome occurring. What statistical tests do you use to compare (see the setup sketch after this list):
- the AUCs of the two classifiers?
- the expected risk calibration of the two classifiers?
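For concreteness, here is a minimal sketch of the setup being asked about, assuming scikit-learn and a synthetic dataset; the choice of logistic regression and gradient boosting as classifiers A and B is purely illustrative. The structural point is that both sets of risk scores are computed on the same test cases, so they are paired rather than independent.

```python
# Two classifiers fit on the same training set and scored on the same test set.
# Classifier choices and the synthetic data are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_b = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Risk scores: predicted probability of the positive class for each test case.
p_a = clf_a.predict_proba(X_test)[:, 1]
p_b = clf_b.predict_proba(X_test)[:, 1]

print("AUC A:", roc_auc_score(y_test, p_a))
print("AUC B:", roc_auc_score(y_test, p_b))
# Because p_a and p_b come from the same test cases, any comparison of the two
# AUCs has to treat them as paired observations.
```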
We define calibration as the extent to which the predicted probabilities are good estimates of the observed outcome rates. A model is perfectly calibrated when P(Y = 1 | classifier output = p) = p for all p in (0, 1].
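To illustrate this definition (not the statistical test being asked for), a minimal sketch continuing from the setup above, assuming scikit-learn's `calibration_curve` and the `y_test`, `p_a`, `p_b` arrays from the previous block: it bins the predicted probabilities and compares the mean prediction in each bin with the observed outcome rate, which should coincide under perfect calibration.

```python
# Empirical check of the calibration definition: within each probability bin,
# the mean predicted risk should equal the observed outcome rate.
# Reuses y_test, p_a, p_b from the sketch above; n_bins=10 is arbitrary.
from sklearn.calibration import calibration_curve

for name, scores in [("A", p_a), ("B", p_b)]:
    obs_rate, mean_pred = calibration_curve(y_test, scores, n_bins=10)
    print(f"Classifier {name}")
    for o, m in zip(obs_rate, mean_pred):
        print(f"  mean predicted risk {m:.2f} -> observed rate {o:.2f}")
```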