After understanding the benefits of AUC, I was surprised to learn that in some scenarios it might not be a good performance metric for evaluating a classification model. Below are the two scenarios:
1. Scale invariance is not always desirable:
For example, sometimes we really do need well-calibrated probability outputs, and AUC won't tell us about that (see the first sketch after this list).
2. Classification-threshold invariance is not always desirable:
In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase in false negatives). AUC isn't a useful metric for this type of optimization (see the second sketch after this list).
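Here is how I currently picture the first scenario: a minimal sketch (assuming NumPy and scikit-learn; the arrays are made-up toy data) of two models that rank the examples identically, and therefore get the same AUC, while only one of them outputs probabilities that look honest:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([0, 0, 0, 1, 1, 1])

# Model A: probabilities that spread out toward 0 and 1.
p_a = np.array([0.10, 0.20, 0.30, 0.70, 0.80, 0.90])

# Model B: same ranking of the examples (so the same AUC), but every score
# is squashed toward 0.5, so they no longer read as usable probabilities.
p_b = np.array([0.45, 0.46, 0.47, 0.53, 0.54, 0.55])

print(roc_auc_score(y_true, p_a), roc_auc_score(y_true, p_b))        # identical AUC: 1.0 and 1.0
print(brier_score_loss(y_true, p_a), brier_score_loss(y_true, p_b))  # calibration-sensitive metric differs
```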
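And for the second scenario, a minimal sketch (again toy data, with hypothetical costs COST_FP and COST_FN that I made up) of scoring a spam filter by a cost-weighted error count at different thresholds, which is something AUC cannot express because it averages over all thresholds:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                      # 1 = spam
p_spam = np.array([0.05, 0.2, 0.4, 0.6, 0.3, 0.55, 0.8, 0.95])   # model scores

COST_FP, COST_FN = 10.0, 1.0  # hypothetical: flagging legit mail is 10x worse

def total_cost(threshold):
    y_pred = (p_spam >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FP * fp + COST_FN * fn

# The cost-minimizing threshold is pushed upward because false positives
# are penalized more heavily.
for t in (0.3, 0.5, 0.7):
    print(t, total_cost(t))
```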
Questions:
Could anyone explain, with an example, what well-calibrated probability outputs of a model are, and how AUC fails as an evaluation metric in that case?
Could anyone suggest a good metric for the second scenario?