
I'm learning about performance measures for binary classifiers. While reading about the AUC-ROC score, I came across the article *Measuring classifier performance: a coherent alternative to the area under the ROC curve* (Hand, 2009). The author claims that:

...the AUC is equivalent to averaging the misclassification loss over a cost ratio distribution which depends on the score distributions. Since the score distributions depend on the classifier, this means that, when evaluating classifier performance, the AUC evaluates a classifier using a metric which depends on the classifier itself. That is, the AUC evaluates different classifiers using different metrics. It is in that sense that the AUC is an incoherent measure of classifier performance.

and furthermore:

...this is effectively what the AUC does—it evaluates different classifiers using different metrics. It is as if one measured person A's height using a ruler calibrated in inches and person B's using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used.

(emphasis added)

Given that the AUC-ROC score is in such widespread use, this seems like a bold claim. If it is true, then using the AUC-ROC to compare the performance of different classifiers is completely wrong, since each classifier would effectively be evaluated against its own implicit cost-ratio distribution. The author proposes a new (better?) performance metric called the $H$ measure. Unfortunately, I can't entirely follow the maths involved in the article.

Is this author correct? Should we ditch the AUC-ROC completely in favour of this $H$ measure?


Added:

Just realized there's even an R implementation of this measure: https://cran.r-project.org/web/packages/hmeasure/index.html
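For reference, here is a minimal sketch (based on my reading of the package documentation, not code from the article) of how that package could be used to put the $H$ measure next to the AUC for two classifiers; the toy data and the classifier names `clf_A`/`clf_B` are made up:

```r
library(hmeasure)

set.seed(1)

# Toy data: binary labels plus scores from two hypothetical classifiers
y <- rbinom(200, 1, 0.3)
scores <- data.frame(
  clf_A = plogis(2 * y + rnorm(200)),  # more separated, i.e. stronger, scores
  clf_B = plogis(1 * y + rnorm(200))   # weaker scores
)

# HMeasure() takes the true labels and one column of scores per classifier,
# and returns (among other things) a data frame of metrics that includes
# both H and AUC, so the two measures can be compared side by side.
# The optional severity.ratio argument sets the assumed misclassification-cost ratio.
results <- HMeasure(y, scores)
results$metrics[, c("H", "AUC")]
```

If Hand is right, the interesting cases would presumably be the ones where the AUC and $H$ columns rank the classifiers differently.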

Gabriel
  • "Given that the usage of the AUC-ROC score is pretty widespread, this seems like a bold claim." Well, by far the majority of people have no idea what p-values are or how to interpret them, but they form the basis of probably millions of papers. – Forgottenscience Jun 23 '20 at 15:10
  • I guess that's a fair point. – Gabriel Jun 23 '20 at 15:13
  • Check out what Frank Harrell says about proper scoring rules and the use of AUC. The gist of his thoughts on AUC is that it is useful to see if a classifier is performing decently (e.g. "Hey, AUC=0.8...we must be pretty good" or "Yuck, AUC=0.55...back to work!") but is not to be used for comparing classifiers. Perhaps the H measure is a proper scoring rule, though there are easier ones than what I see in Hand's papers, such as Brier score and log loss (cross-entropy). – Dave Jun 23 '20 at 15:14
  • This bit interests me, "*is not to be used for comparing classifiers*", because that is precisely what I am doing. Do you have a link to Harrell's article (?) by any chance? – Gabriel Jun 23 '20 at 15:16
  • Here are a couple of links. I can't remember where Harrell said that AUC was fine for assessing one model but not comparing models, though. https://www.fharrell.com/post/class-damage/ https://stats.stackexchange.com/questions/339919/what-does-it-mean-that-auc-is-a-semi-proper-scoring-rule https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email But it does appear that Harrell is right that the machine learning world often makes a mistake in using threshold-based (or other improper) scoring rules. – Dave Jun 23 '20 at 15:25
  • Regarding the "widespread" part: accuracy is probably the most widespread classifier evaluation measure, which doesn't stop it from being [crappy](https://stats.stackexchange.com/q/312780/1352). – Stephan Kolassa Jun 23 '20 at 15:25
  • [Here is a sampling of Frank Harrell on (AU)ROC.](https://stats.stackexchange.com/search?tab=votes&q=user%3a4253%20roc) – Stephan Kolassa Jun 23 '20 at 15:31
