14

Can someone explain what area under the curve means for someone with absolutely no stats knowledge? For example, if a model claims an AUC of 0.9, does that mean that it makes an accurate prediction 90% of the time?

Alexis
  • 26,219
  • 5
  • 78
  • 131
Forest
  • 243
  • 6

4 Answers

17

AUC is difficult to understand and interpret even with statistical knowledge. Without such knowledge I'd stick to the following stylized facts:

  1. An AUC close to 0.5 means the model performed no better than random classification: it was no better than a silly random number generator at marking the samples as positive or negative.
  2. AUC is used by some to compare models.
  3. Higher AUC suggests better demonstrated performance in classification.
  4. AUC is a noisy metric.
  5. Max AUC is 1, for a classification model that is never wrong.
  6. Although the minimum AUC is technically 0, an AUC below 0.5 makes little sense: an AUC of 0 means that simply switching the positive and negative labels gives you a perfect classification. (The short code sketch after this list illustrates facts 1, 5 and 6.)
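If it helps to see facts 1, 5 and 6 in action, here's a minimal sketch, assuming NumPy and scikit-learn are available; the labels and scores are simulated purely for illustration:

```python
# Minimal illustration of facts 1, 5 and 6 with simulated data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)              # true labels: 0 = negative, 1 = positive

random_scores = rng.random(10_000)               # scores unrelated to the labels
perfect_scores = y + 0.01 * rng.random(10_000)   # every positive scores above every negative

print(roc_auc_score(y, random_scores))    # ~0.5: no better than random (fact 1)
print(roc_auc_score(y, perfect_scores))   # 1.0: the maximum (fact 5)
print(roc_auc_score(y, -perfect_scores))  # 0.0: flip the sign/labels and it's perfect again (fact 6)
```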
Aksakal
  • 55,939
  • 5
  • 90
  • 176
  • Thank you! That does help. Hearing that it's hard to understand even with stats knowledge probably helps explain why I was struggling to find a clear answer before. :-) – Forest Oct 12 '21 at 21:11
  • 1
    most people don't sweat about the subtleties, and use it to compare models. in some fields it's an accepted metric by peers – Aksakal Oct 12 '21 at 21:33
  • 4
    I'd add to the list of facts that the maximum possible AUC, for a model that is *never wrong* is 1.0. – Gregor Thomas Oct 13 '21 at 04:17
  • 1
    @GregorThomas That phrasing is a tad strong. Sycorax posted a comment to a deleted answer that I will quote here. "A model that ranks all positives higher than all negatives has AUC 1, but is not required to be 100% correct. An example of this is a model that assigns all positives a predicted probability of 0.49, and all negatives a predictive probability 0.48. This model is only "correct" for the proportion of negatives (because they score below 0.5), but has AUC of 1 nevertheless." – Dave Oct 13 '21 at 14:19
  • 4
    @Dave An AUC represents a model, but not one specific classifier - a classifier requires a threshold and is represented by one point on the ROC curve. An AUC of 1 indicates that *if you set the threshold properly*, the resulting classifier is never wrong. You can still pick a sub-optimal threshold and get less than 100% sensitivity/specificity, but there would be no reason to. There's no reason a probabilistic classification threshold must be set at 0.5 - in your example, if you set the threshold at 0.485, the classifier is indeed 100% accurate. The model just isn't well calibrated. – Nuclear Hoagie Oct 13 '21 at 15:01
  • I'd phrase (5) slightly differently: 'If the classifier is "never wrong," then the AUC is 1.' But the AUC can be 1 even if the model is "always wrong" about one of the classes, for instance if the predicted probabilities of the positive class are 0.49 but the predicted probabilities of the negative class are 0.48. In this case, the positives are all misclassified using the cutoff 0.5 because 0.49 < 0.5, but the AUC is still 1 because all the positives are ranked higher than all of the negatives. Clearly, you could choose a different cutoff, but that needs to be explicit. – Sycorax Oct 13 '21 at 15:34
  • 1
    @Sycorax Agree, I think the disconnect is between a "model" and a "classifier". A classifier *does not have an AUC*, it has a sensitivity and specificity. A classifier is represented by one point on the ROC curve, but you can't compute area under a point. Point 5 could be phrased as "the AUC is 1 for a model that ranks samples perfectly". – Nuclear Hoagie Oct 13 '21 at 17:05
  • @Dave & Sycorax I see your points, and of course you are techincally correct, but I strongly disagree about the need to include them in this context. Communicating the concept of AUC to someone "with no stats knowledge" as "stylized facts", as this answer is doing, should not include technical details of what can go wrong with choosing a cut-off. I'm imagining a manager who wants a high level understanding of a model comparison slide in a business presentation - the points in the original answer provide a perfect level of detail for that audience.... – Gregor Thomas Oct 13 '21 at 17:50
  • ...The missing detail was the upper bound - it's important to communicate that *the upper bound is 1*--it's helpful to know what the bounds of a metric are--but it's also worth noting that for real world problems that upper bound isn't really attainable, so presenting a model with an AUC of, say, .95 can be seen as "pretty darn good" not "still working on that last 0.05 before the model is ready". Sure, the phrasing could be adjusted, but as stated I think it concisely conveys that point. – Gregor Thomas Oct 13 '21 at 17:51
  • "1. AUC close to 0.5 means a model performance wasn't better than randomly classifying subjects. It wasn't better than a silly random number generator..." This is only true for balanced data. For imbalanced data - i.e., most datasets - a classifier with an AUC of 0.5 can be much better than random. Consider data with prevalence (proportion of positive class) of 0.7. A classifier which assigns probability > 0.5 to each sample has an AUROC of 0.5. Yet it has an accuracy of 70%, much better than the random classifier, which has 50% accuracy by definition. – ljubomir Oct 19 '21 at 23:32
  • @ljubomir for unbalanced data the random classifier would be randomly marking 70% as positive. – Aksakal Oct 19 '21 at 23:44
  • @Aksakal: maybe you are referring to weighted guess classifier, which is not truly random because it is prevalence-aware: [link](https://blog.revolutionanalytics.com/2016/03/classification-models.html) A truly random guess classifier has a probability of error of 50% for binary classification – ljubomir Oct 20 '21 at 00:32
8

To keep things reasonably simple, an AUC of 0.9 would mean the following: if you randomly picked one person/thing from each class of outcome (e.g., one person with the disease and one without), there is a 90% chance that the one from the class of interest (the group being modelled, here those with the disease) has the higher value. (This could instead be the lower value if the thing of interest is associated with the reference or default class.)

So if the AUC for predicting "being male" versus "being female" using height was 0.9, this would mean that if you took a random male and a random female, 90% of the time, the male would be taller.
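If you want to check that reading numerically, here is a minimal sketch, assuming NumPy and scikit-learn are available; the height distributions are made-up numbers chosen only so that the pairwise probability lands near 0.9:

```python
# Estimate AUC as the fraction of (male, female) pairs in which the male is taller,
# and compare it with the area computed from the ROC curve.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
male = rng.normal(178, 7, size=500)      # made-up heights in cm
female = rng.normal(165, 7, size=500)

heights = np.concatenate([male, female])
is_male = np.repeat([1, 0], 500)

# Fraction of male/female pairs where the male has the higher value.
pairwise = (male[:, None] > female[None, :]).mean()

print(pairwise)                          # roughly 0.9 for these made-up distributions
print(roc_auc_score(is_male, heights))   # the same number, obtained as area under the ROC curve
```

The two numbers agree because the proportion of correctly ordered pairs is exactly what the area under the ROC curve measures.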

user215517
  • 337
  • 1
  • 5
  • Came here to give this answer. Sure, it's an area under a curve, but you don't need to understand that curve to give this explanation. I'd only add that the scale of AUC is practically from 0.5 to 1 - if you can't even get a 50% chance then your model is worse than random guessing. – Michael Lugo Oct 13 '21 at 14:30
  • Are you sure you’re not talking about 1-specificity? AUC is difficult to pin to ratios and percentages – Aksakal Oct 13 '21 at 19:11
  • @Aksakal A proof, more complex than was asked for, of this relationship is given [here](https://stats.stackexchange.com/questions/190216/why-is-roc-auc-equivalent-to-the-probability-that-two-randomly-selected-samples) – user215517 Oct 13 '21 at 20:44
  • @user215517 the answer talks about AUC being proportional. Also when constructing ROC you run through all thresholds, so the meaning of “90% of the time” needs to be defined – Aksakal Oct 13 '21 at 20:51
  • @Aksakal Perhaps we're reading things differently, but from the question linked to: "The second [interpretation] is that the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example, i.e. P(score(x+)>score(x−))." which is what I was (hoping I was) saying above. "90% of the time" has a standard frequentist interpretation here. The answer that says "AUC is proportional to the number of correctly ordered pairs" is saying the same thing. The proportion of correctly ordered pairs is the AUC. – user215517 Oct 13 '21 at 21:09
  • @Aksakal The probabilistic interpretation is standard. It follows from a [simple change of variable](https://stats.stackexchange.com/questions/180638/how-to-derive-the-probabilistic-interpretation-of-the-auc?rq=1). – Hasse1987 Oct 14 '21 at 00:00
  • 1
    @MichaelLugo If your AUC is less than 0.5, then you should pursue the [Costanza method](https://www.youtube.com/watch?v=cKUvKE3bQlY). – Acccumulation Oct 14 '21 at 19:51
5

A classifier is a rule for assigning an individual to a category ("positive" or "negative") depending on some of its characteristics.

Some classifiers will provide each individual with a number between $0$ and $1$, with $0$ being "totally sure it's negative" and $1$ being "totally sure it's positive". We usually take $0.5$ as the threshold between what we call "positive" and what we call "negative", but this is not always the case.

Taking a low threshold will result in more true positives but also more false ones. Taking a higher threshold will reduce the number of false positives, but we'll also leave as negative some of the cases that were actually positive (thus fewer true positives as well). So in the end, since no classifier is perfect, choosing a threshold is a compromise between the two kinds of error.

Each point on the ROC curve represents the true and false positive rates for one of the possible thresholds we could choose. The AUC is the area below that curve. A high AUC indicates that the model can achieve a low FPR (false positive rate) without losing too much TPR (true positive rate), and vice versa. (Note that the area below the ROC curve will be large if you already get a high TPR at an FPR close to 0.)

SIMPLIFIED EXAMPLE: let's say you want to use a person's height to determine whether they're a man or a woman. Your classifier will choose some height $X$ and predict that everyone above height $X$ is male and everyone below it is female.

If you choose a very high $X$, like $1.90$m, you will hardly ever mislabel a woman as male, but you will also "miss" many men. On the other hand, if you pick a low $X$ like $1.50$m, you will correctly identify almost all men, but you will also classify a lot of women as male. For each $X$ you choose, you'll get different true and false positive rates, but the choice is ultimately somewhat arbitrary, depending on which type of error worries you most.

In this context, we could plot the ROC curve from the different TPRs and FPRs; the AUC would then give us an idea of how good a classifier we can hope to get using height (as opposed to some other classifier we could have built from weight, age, blood pressure...). (See user215517's answer.)
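To make the threshold sweep concrete, here is a minimal sketch of the height example, assuming NumPy and scikit-learn are available; the height distributions and the three thresholds are my own made-up choices:

```python
# Sweep a few height thresholds for "predict male if height >= X", report the
# TPR/FPR trade-off at each, then let sklearn trace the whole ROC curve and its area.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(7)
heights = np.concatenate([rng.normal(1.78, 0.07, 500),    # men (the "positives"), in metres
                          rng.normal(1.65, 0.07, 500)])   # women (the "negatives")
is_male = np.repeat([1, 0], 500)

for x in (1.50, 1.70, 1.90):                  # a low, a middling and a high threshold
    pred = heights >= x
    tpr = pred[is_male == 1].mean()           # share of men correctly labelled male
    fpr = pred[is_male == 0].mean()           # share of women wrongly labelled male
    print(f"X = {x:.2f} m   TPR = {tpr:.2f}   FPR = {fpr:.2f}")

fpr, tpr, thresholds = roc_curve(is_male, heights)   # every possible threshold at once
print("AUC =", roc_auc_score(is_male, heights))
```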

David
  • 2,422
  • 1
  • 4
  • 15
0

Following up on the comment from @Nuclear Hoagie, the ROC curve for a model is generated by evaluating classifiers using a sequence of thresholds for declaring positive or negative. The AUC represents the area under the curve over the entire range of possible thresholds. Often, only a restricted range of thresholds is really of interest. When this is the case, AUC may not be the best way to compare models.
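As one concrete way to do that (entirely a sketch on my part, with made-up scores and an arbitrary FPR cut-off of 0.1, not something the answer prescribes), you can compute the area only over the low-false-positive part of the ROC curve:

```python
# Compare two made-up models on the full AUC and on the area restricted to FPR <= 0.1.
import numpy as np
from sklearn.metrics import auc, roc_curve, roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2_000)                # made-up labels
scores_a = y + rng.normal(0, 0.8, size=2_000)     # made-up model A scores
scores_b = y + rng.normal(0, 1.0, size=2_000)     # made-up model B scores

def partial_auc(y_true, scores, max_fpr=0.1):
    """Un-normalised area under the ROC curve for FPR in [0, max_fpr]."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    keep = fpr <= max_fpr
    # close the region at exactly max_fpr by interpolating the TPR there
    fpr_cut = np.append(fpr[keep], max_fpr)
    tpr_cut = np.append(tpr[keep], np.interp(max_fpr, fpr, tpr))
    return auc(fpr_cut, tpr_cut)

for name, s in (("A", scores_a), ("B", scores_b)):
    print(name, round(roc_auc_score(y, s), 3), round(partial_auc(y, s), 3))
```

If I remember the API correctly, scikit-learn's roc_auc_score also accepts a max_fpr argument that computes a standardised version of the same restricted area, if you'd rather not roll your own.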

hsafer
  • 1
  • 2