
I’ve read that precision-recall (PR) curves are preferred over ROC curves when a dataset is imbalanced, because they focus more on the model’s performance in correctly identifying the minority/positive class.

At what point (rule of thumb?) does it make more sense to primarily use PR to evaluate a classifier instead of AUC-ROC score? I imagine if the dataset has 40% positive class, AUC is still appropriate? But what about at 30% or 20% positive class? What level is considered “imbalanced” where PR is preferred?

Insu Q
    "Unbalanced" datasets are not a problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) However, precision and recall are: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) (everything said about accuracy at that thread also applies to precision and recall). – Stephan Kolassa May 11 '20 at 04:42
  • @StephanKolassa so what’s the rule of thumb? I read the links and most of the examples were 1% positive class and 99% negative class. Are you suggesting that’s the answer? – Insu Q May 11 '20 at 12:31
  • No. Per my question and my answer to the accuracy question, there is no problem with unbalanced data, unless you use inappropriate quality measures like accuracy. Use an appropriate *probabilistic* model, and "unbalance" will naturally be expressed as low probabilities. – Stephan Kolassa May 11 '20 at 14:20
  • @StephanKolassa I might not have asked my question correctly. I know there’s no problem with unbalanced data. A lot of real-world data is unbalanced. My question is, is there a point in that level of unbalance where using PR curves makes more sense than using AUC? If you have too few positive examples in a dataset, the AUC can appear to be high and when you look at the PR curve, it’s obvious there’s room for improvement. When your dataset has 49% positives and 51% negatives, technically it’s unbalanced but AUC is fine to use. When it’s 5% positives, you probably want to look at a PR curve. – Insu Q May 11 '20 at 14:30
  • I advocate not using precision/recall at all. See the links above for my argument. [This may be helpful for context.](https://stats.meta.stackexchange.com/q/5000/1352) – Stephan Kolassa May 11 '20 at 14:43

2 Answers


I agree with the comments. That said, I have used AUC-ROC for binary classification with a class imbalance of 5% positive / 95% negative, and was still able to get a pretty good model.
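To see concretely how the two metrics can disagree on the same scores, here is a small self-contained sketch (toy data I made up, with NumPy-only implementations of both metrics; nothing here comes from the answer above):

```python
import numpy as np

def roc_auc(y, s):
    """P(score of a random positive > score of a random negative)."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def average_precision(y, s):
    """Area under the PR curve: mean precision at the rank of each positive."""
    order = np.argsort(-s)
    ys = y[order]
    hits = np.cumsum(ys)                      # positives found so far
    ranks = np.arange(1, len(y) + 1)
    return (hits[ys == 1] / ranks[ys == 1]).mean()

# Toy data: 5% positives, scores only moderately informative.
rng = np.random.default_rng(0)
n, prevalence = 2000, 0.05
y = (rng.random(n) < prevalence).astype(int)
s = rng.normal(loc=y.astype(float), scale=1.0)  # positives shifted up by 1

auc = roc_auc(y, s)
ap = average_precision(y, s)
# ROC AUC looks respectable (about 0.76 in expectation for this setup),
# while average precision sits well below it, because the PR curve's
# baseline is the 5% prevalence rather than 0.5.
```

The same numbers come out of `sklearn.metrics.roc_auc_score` and `sklearn.metrics.average_precision_score`; the hand-rolled versions are just to make the definitions explicit.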

Stochastic
  • The concordance probability (AUROC) is not used for _classification_ (forced choice) but rather for assessing the pure predictive discrimination of a continuous _prediction_. And as you said it is unaffected by extreme imbalance. – Frank Harrell Nov 24 '20 at 12:43

Context

How much the imbalance matters also depends on the dataset size.

A model trained on 5-10% positives and 90-95% negatives with 50 or 500 samples is in a very different situation from one trained on 10,000 samples.
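To make the sample-size point concrete, here is a small simulation (my own toy setup, NumPy only): at the same 10% prevalence and the same signal strength, the ROC AUC estimated from 50 samples fluctuates far more across repeated datasets than the estimate from 2,000 samples.

```python
import numpy as np

def roc_auc(y, s):
    """P(random positive outscores random negative)."""
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

def auc_spread(n, prevalence=0.10, trials=200, seed=1):
    """Std. dev. of the AUC estimate across repeated datasets of size n."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(trials):
        y = (rng.random(n) < prevalence).astype(int)
        if y.sum() in (0, n):                 # need both classes present
            continue
        s = rng.normal(loc=y.astype(float))   # same signal strength each run
        aucs.append(roc_auc(y, s))
    return np.std(aucs)

small, large = auc_spread(50), auc_spread(2000)
# With 50 samples (~5 positives) the AUC estimate is far noisier than
# with 2000 samples (~200 positives), even though nothing else changed.
```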

Opinion

A model that sees 1 positive sample and tries to learn from it is in a different situation from one that sees hundreds of positive samples (even if those represent only 5% of the whole dataset).

As a rough rule of thumb: anything between 20-40% positives is considered imbalanced, around 5-10% is heavily imbalanced, and below 5% is extremely imbalanced.
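One way to motivate thresholds like these: the baseline of a PR curve is the positive prevalence itself, while the ROC baseline stays at 0.5 no matter how imbalanced the data is. A short NumPy sketch (my own illustration, using deliberately uninformative scores so the metric sits at its baseline):

```python
import numpy as np

def average_precision(y, s):
    """Mean precision at the rank of each positive (area under PR curve)."""
    order = np.argsort(-s)
    ys = y[order]
    hits = np.cumsum(ys)
    ranks = np.arange(1, len(y) + 1)
    return (hits[ys == 1] / ranks[ys == 1]).mean()

rng = np.random.default_rng(7)
n = 20000
aps = {}
for prevalence in (0.40, 0.20, 0.05):
    y = (rng.random(n) < prevalence).astype(int)
    s = rng.random(n)            # scores carry no information about y
    aps[prevalence] = average_precision(y, s)
# Each AP lands near the prevalence itself (~0.40, ~0.20, ~0.05), while a
# random ranker's ROC AUC would stay near 0.5 at every prevalence. Only
# the PR curve's shifting baseline makes the imbalance visible.
```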

Resampling

Multiple resampling methods exist; however, whether they actually improve your model is tricky to judge, since an increase in recall usually comes with a large decrease in precision (when you oversample the minority class).
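For illustration, the simplest such scheme, random oversampling of the minority class, just duplicates minority rows until the classes are balanced (a plain-NumPy sketch; the function name is my own, and `imbalanced-learn`'s `RandomOverSampler` does the same job in practice). It changes the class prior the model sees, which is precisely why recall tends to rise while precision falls:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class
    has as many samples as the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for c, n_c in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=n_max - n_c, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# 5% positives before resampling, 50% after.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 5 + [0] * 95)
X_bal, y_bal = random_oversample(X, y)
```

Note that only the training set should ever be resampled; evaluating on resampled data would hide exactly the precision cost described above.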

ombk