We have two classifiers, A and B, trained on the same training set and then evaluated on the same test set. Both classifiers output a risk score, i.e. the predicted probability of the outcome occurring. What statistical tests do you use to compare (see the setup sketch after this list):
- the AUCs of the two classifiers?
- the expected risk calibration of the two classifiers?
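For concreteness, here is a minimal sketch of the setup being asked about, assuming scikit-learn and a synthetic dataset; the choice of logistic regression and gradient boosting as classifiers A and B is purely illustrative. The structural point is that both sets of risk scores are computed on the same test cases, so they are paired rather than independent.

```python
# Two classifiers fit on the same training set and scored on the same test set.
# Classifier choices and the synthetic data are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_b = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Risk scores: predicted probability of the positive class for each test case.
p_a = clf_a.predict_proba(X_test)[:, 1]
p_b = clf_b.predict_proba(X_test)[:, 1]

print("AUC A:", roc_auc_score(y_test, p_a))
print("AUC B:", roc_auc_score(y_test, p_b))
# Because p_a and p_b come from the same test cases, any comparison of the two
# AUCs has to treat them as paired observations.
```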
We define calibration as the extent to which the predicted probabilities are good estimates of the observed outcome rates. A model is perfectly calibrated when P(Y = 1 | classifier output = p) = p for all p in (0, 1].
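To illustrate this definition (not the statistical test being asked for), a minimal sketch continuing from the setup above, assuming scikit-learn's `calibration_curve` and the `y_test`, `p_a`, `p_b` arrays from the previous block: it bins the predicted probabilities and compares the mean prediction in each bin with the observed outcome rate, which should coincide under perfect calibration.

```python
# Empirical check of the calibration definition: within each probability bin,
# the mean predicted risk should equal the observed outcome rate.
# Reuses y_test, p_a, p_b from the sketch above; n_bins=10 is arbitrary.
from sklearn.calibration import calibration_curve

for name, scores in [("A", p_a), ("B", p_b)]:
    obs_rate, mean_pred = calibration_curve(y_test, scores, n_bins=10)
    print(f"Classifier {name}")
    for o, m in zip(obs_rate, mean_pred):
        print(f"  mean predicted risk {m:.2f} -> observed rate {o:.2f}")
```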