
I have two models for which I calculate train and test performance. Both use the same algorithm (LightGBM) with the same hyperparameters; only the data differ (the second one is trained on the first one's data plus some more).

The first model returns 0.73 AUC with 0.32 precision and 0.44 recall (all on the test set). The second model returns a considerably lower AUC (0.65) but higher precision (0.35) and recall (0.66). Isn't that a paradox? Oh, and in case it matters, I should mention that the problem is imbalanced (10% class 1).

Shaido
  • How big are the two training sets? – Demetri Pananos Oct 07 '21 at 15:18
  • It may be that the high-AUC model outperforms the low-AUC model at thresholds other than the one you’re using (probably $0.5$), so check other thresholds. // Performance metrics do not have to agree; that’s why we have multiple metrics. // The best metrics tend to be so-called strictly proper [tag:scoring-rules] like log loss (cross-entropy loss) and Brier score, both of which perform fine when the classes are imbalanced. Frank Harrell has written about this on his blog. // Log loss and Brier score, both of which are strictly proper scoring rules, won’t even always agree! – Dave Oct 07 '21 at 15:20
  • Is the second dataset less imbalanced than the first? Also, you should be using the area under the precision-recall curve instead of just precision and recall at whatever the default threshold is. – hahdawg Oct 07 '21 at 22:07
  • @DemetriPananos 2500 observations the first one, 4000 the second one – Georgios Sarantitis Oct 08 '21 at 08:47

1 Answer


Remember that ROC curves are constructed by considering all thresholds, while metrics like accuracy, sensitivity, specificity, precision, and recall use only one threshold. If you configure your software to calculate precision and recall as the threshold is varied, I would expect you to find that the high-AUC model tends to outperform the low-AUC model.
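As a minimal sketch of that check (assuming scikit-learn; `y_test`, `prob_1`, and `prob_2` are placeholder names for your test labels and each model's predicted probabilities):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def threshold_sweep(y_true, y_prob, thresholds=np.linspace(0.05, 0.95, 19)):
    """Yield (threshold, precision, recall) over a range of probability cutoffs."""
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        yield (t,
               precision_score(y_true, y_pred, zero_division=0),
               recall_score(y_true, y_pred, zero_division=0))

# y_test, prob_1, prob_2 (each model's predict_proba(X_test)[:, 1]) are assumed
# to already exist; they stand in for your own objects.
# print(roc_auc_score(y_test, prob_1), roc_auc_score(y_test, prob_2))
# for t, p, r in threshold_sweep(y_test, prob_1):
#     print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```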

However, it is usually preferable to evaluate the probability predictions themselves rather than thresholded class labels. Two common ways of doing this are log loss ("cross-entropy loss" in many neural network circles) and Brier score. Frank Harrell has two good blog posts about this topic.

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

Classification vs. Prediction
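To make this concrete, here is a short sketch of scoring the probability predictions directly (assuming scikit-learn; `model_1`, `model_2`, `X_test`, and `y_test` are placeholder names for your fitted LightGBM classifiers and held-out data):

```python
from sklearn.metrics import brier_score_loss, log_loss

# Score the probability predictions themselves; no threshold is involved.
# Lower is better for both metrics.
for name, model in [("model 1", model_1), ("model 2", model_2)]:
    prob = model.predict_proba(X_test)[:, 1]   # predicted probability of class 1
    print(name,
          "log loss:", round(log_loss(y_test, prob), 4),
          "Brier score:", round(brier_score_loss(y_test, prob), 4))
```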

Stephan Kolassa wrote a nice answer to a question of mine that gets at this topic, too.

Note that strictly proper scoring rules like log loss and Brier score need not agree about which model performs better (this is fairly easy to simulate; see the sketch below), so it should not be expected that AUC and precision, or AUC and recall, agree on the better model, either.
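As one constructed illustration (my own toy numbers, not from the answer): compare predictions that always state the 10% base rate against sharper predictions that are badly overconfident on a quarter of the positives.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

n_pos, n_neg = 100, 900                 # ~10% positives, as in the question
y = np.array([1] * n_pos + [0] * n_neg)

prob_a = np.full(1000, 0.10)            # calibrated but uninformative: base rate everywhere
prob_b = np.concatenate([
    np.full(25, 1e-7),                  # confidently wrong on 25 of the positives
    np.full(75, 0.90),                  # sharp and correct on the remaining positives
    np.full(900, 0.02),                 # near zero on the negatives
])

for name, prob in [("A", prob_a), ("B", prob_b)]:
    print(name,
          "log loss:", round(log_loss(y, prob), 3),
          "Brier score:", round(brier_score_loss(y, prob), 3))
# Log loss prefers A (B is punished heavily for its confident mistakes),
# while the Brier score prefers B.
```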

Dave
  • Are there any publications I could cite to support this point about using proper scoring metrics? I'm having a hard time convincing others. – N Blake Oct 07 '21 at 16:30
  • That first link from Harrell's blog links to a Journal of Statistical Software paper, ["Evaluating Probabilistic Forecasts with scoringRules"](https://www.jstatsoft.org/article/view/v090i12). Perhaps start there and consult the references inside. The [Wikipedia article on scoring rules](https://en.wikipedia.org/wiki/Scoring_rule) should have some references, too. – Dave Oct 07 '21 at 17:19
  • Also see https://fharrell.com/post/addvalue – Frank Harrell Oct 08 '21 at 12:20