
After understanding the benefits of AUC, I was surprised to learn that in some scenarios it may not be a good performance metric for evaluating a classification model. Below are the two scenarios:

1. Scale invariance is not always desirable:

For example, sometimes we really do need well-calibrated probability outputs, and AUC won’t tell us about that.

2. Classification-threshold invariance is not always desirable:

In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase in false negatives). AUC isn't a useful metric for this type of optimization.

Questions:

  • Could anyone explain, with an example, what well-calibrated probability outputs of a model are, and how AUC fails to evaluate a model in that condition?

  • Could anyone suggest a good metric for the second scenario?

Scortchi - Reinstate Monica
Anu
  • Your point 2 seems to confuse threshold invariance with the lack of a cost function. These are 2 pretty different things to me; I think you could try to clarify this a bit. – Calimo Nov 05 '18 at 07:21
  • Here, the "cost" is not a cost function. I think the author describes a scenario in which one needs false positives to be as low as possible, no matter how much the false negatives increase (in which case AUC will not help!). And classification-threshold invariance is a property of AUC: it measures the quality of the model's predictions irrespective of which classification threshold is chosen (i.e., across all classification thresholds). – Anu Nov 05 '18 at 20:12
  • It might help to be clear about what you're quoting (& from where), to distinguish it from what you're asking. Else e.g. you appear to assert that you sometimes need well-calibrated probability outputs while at the same time asking what they are. – Scortchi - Reinstate Monica Nov 06 '18 at 17:20
  • @Scortchi, I referenced the [article](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) above and quoted 2 statements from it, then asked my question about them so that it's clear to the audience what I am looking for and where I looked for it. I apologize if anything in my post or its structure isn't clear; please feel free to ask and I will elaborate! – Anu Nov 06 '18 at 18:41
  • It's just a matter of formatting - I've done it now after finding the quoted text in the link. (FYI there's a button in the question menu that applies this formatting to selected text.) – Scortchi - Reinstate Monica Nov 06 '18 at 19:27
  • In terms of "evaluating the performance of a classification model" I always favor sensitivity, specificity and confidence for both the positive as well as the negative case. – mroman Nov 06 '18 at 20:10
  • @mroman, could you post some links/code/blog posts to back up your statement? It would be very useful to me & the community. Also, it would be great if you added some examples of why you don't think AUC is a good metric; you can add them here as an answer. :) – Anu Nov 06 '18 at 23:47
  • I don't really use AUC/ROC directly (mostly because they are for binary classification); I tend to only use Youden's index. If the costs aren't equal, then I assign weights to the sensitivities/specificities when computing the index, which gives me a metric of how good my classification is given the associated costs. I don't know if this method has an associated name, but that's what I use because it's the most meaningful to me. You can also calculate a similar index using the confidences instead, which takes real-world distributions into account. – mroman Nov 07 '18 at 11:27
  • https://mroman.ch/guides/sensspec.html describes this with somewhat more detail. – mroman Nov 07 '18 at 11:27
  • You might want to read this Q&A https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it and especially Frank Harrell's answer – mdewey Nov 08 '18 at 13:34

2 Answers


First a little disclaimer: I don't have the academic credentials to back up anything I'm saying here. This is just what I use in practice.

There's a metric called Youden's index, which is:

$$Y = -1 + sensitivity + specificity$$

If $Y = 0$ then your classification system is random, if $Y = 1$ then it is a perfect classification system.

It is possible to favor sensitivity over specificity or vice-versa by adding weights:

$$Y = -1 + sensitivity \cdot 2 \cdot w + specificity \cdot 2 \cdot (1 - w)$$

If your classifier is detecting spam and the cost of a false positive is high, you want as many true negatives as possible and thus favor specificity over sensitivity. By adding weights you get an index of how good your classification system is given the associated costs of wrong classifications. You can also plot a $Y$ curve (or curves, even multi-dimensional) if you have parameters in your classification system, and then calculate the area under the curve (or the volume, in case you have two parameters), or you can simply sum up the $Y$s over each combination of parameters. This can easily be extended to multiple classes:

$$Y = \frac{-N + \sum_{i}{sensitivity_{i}\cdot 2\cdot w_{i} + specificity_{i}\cdot 2 \cdot (1-w_{i})}}{N}$$

I use this to compare neural networks that distinguish between multiple classes, where it's more important to correctly classify a few of the classes and the remaining classes are not that important (e.g. being able to recognize a stop sign is much more important than being able to recognize a sign that tells you at which time of the day you're allowed to park there). Weights allow me to do this by configuring how important it is to recognize something (sensitivity) and how important it is to recognize that something isn't something (specificity); e.g. I want high specificity on the parking sign but the sensitivity isn't that important, and I want high sensitivity on the stop sign.
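In code, the weighted index might look something like this (a rough sketch, not anything official; the multi-class version assumes per-class one-vs-rest sensitivities and specificities, and all names are made up for the example):

```python
# Sketch of the weighted Youden's index described above.
# `w` is the weight from the binary formula, `weights` holds the per-class w_i.
import numpy as np

def weighted_youden(sensitivity, specificity, w=0.5):
    """Binary case: Y = -1 + 2*w*sensitivity + 2*(1-w)*specificity."""
    return -1 + 2 * w * sensitivity + 2 * (1 - w) * specificity

def multiclass_weighted_youden(sensitivities, specificities, weights):
    """Multi-class case: Y = (-N + sum_i [2*w_i*sens_i + 2*(1-w_i)*spec_i]) / N."""
    sens, spec, w = map(np.asarray, (sensitivities, specificities, weights))
    n = len(sens)
    return (-n + np.sum(2 * w * sens + 2 * (1 - w) * spec)) / n

# Spam-style example: favour specificity (few false positives) over sensitivity.
print(weighted_youden(sensitivity=0.80, specificity=0.99, w=0.3))
# With w = 0.5 this reduces to the plain Youden's index, sensitivity + specificity - 1:
print(weighted_youden(0.80, 0.99))  # ~0.79
```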

mroman


Due to how the ROC curve is calculated, the higher your true positives are (i.e., the more accepting your model is), the more likely false positives become. So for your first question, I think what it means is to have a model whose probability outputs are precise enough not to inflate the false percentages, but you won't know that from AUC because you're just looking at the area of a trapezoid.
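As a rough illustration (a sketch with scikit-learn, not from the article the question quotes): two models whose scores rank the examples identically get exactly the same AUC, even if one of them outputs badly miscalibrated probabilities; a calibration-sensitive metric such as the Brier score does see the difference.

```python
# Sketch: two sets of scores with identical ordering (hence identical ROC AUC)
# but very different calibration, which the Brier score picks up.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                                       # true labels
p_a = np.clip(0.35 + 0.4 * y + rng.normal(0, 0.15, 2000), 0.01, 0.99)   # model A's probabilities
p_b = p_a ** 4                                                          # model B: same ranking, squashed toward 0

print("AUC   A:", roc_auc_score(y, p_a), "B:", roc_auc_score(y, p_b))        # identical
print("Brier A:", brier_score_loss(y, p_a), "B:", brier_score_loss(y, p_b))  # very different
```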

For part 2, the KS (Kolmogorov–Smirnov) statistic is a good alternative.
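For example (a sketch, assuming you compare the model's score distributions for the actual positives and the actual negatives with scipy's two-sample KS test):

```python
# Sketch: the KS statistic is the maximum distance between the empirical CDFs of
# the scores given to positives vs. negatives; larger = better class separation.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                                  # true labels
scores = np.clip(0.3 + 0.4 * y + rng.normal(0, 0.2, 2000), 0, 1)   # model scores

result = ks_2samp(scores[y == 1], scores[y == 0])
print(result.statistic)
```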

James S.