
I am working on a binary classification problem with an imbalanced dataset of 977 records (77:23 class ratio). My label 1 ("POS not met") is the minority class.

Currently, without any over-/under-sampling techniques (as they are not recommended), I get the performance below (using the `class_weight` parameter, though):

[Image: Confusion matrix - train dataset]

And my roc_auc score is 0.8024156371012354

Based on the above results, it feels to me like my model still doesn't perform well on "POS not met", which is our label 1.

However, my AUC is 80%, which feels like a decent figure.

Now my questions are as follows:

a) Irrespective of the business decision to keep or reject the model, based on the above metrics alone, how do I know how well my model is performing?
I read that AUC describes the model's ability to discriminate between the positive and negative classes. **Does that mean my model's decision threshold should be 0.8?** While my model is good at discriminating between positive and negative, it is bad at identifying "not met" as "not met" (recall only 60%). But my dataset is imbalanced, though. Would AUC still apply?

b) Is my dataset even imbalanced in the first place?
What is considered imbalanced? 1:99, 10:90, 20:80, etc.?
Is there any measure (like a correlation coefficient) that can indicate the level of imbalance?

c) Based on the above matrix, how should I interpret the F1-score, recall and AUC together?
What does a high AUC but a poor F1 mean?

**Update**

I used the code below (found online) to get the best F1 at different thresholds.
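It was something like this minimal sketch of a threshold sweep (assuming predicted class-1 probabilities in `y_proba` and true labels in `y_test`; the names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

# Sweep candidate thresholds and keep the one giving the highest F1
# (y_proba: predicted class-1 probabilities, y_test: true labels; illustrative names)
thresholds = np.arange(0.05, 1.0, 0.05)
f1s = [f1_score(y_test, (y_proba >= t).astype(int)) for t in thresholds]
best = int(np.argmax(f1s))
print(f"best threshold = {thresholds[best]:.2f}, best F1 = {f1s[best]:.3f}")
```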

[Image: F1 scores at different thresholds; the best F1 occurs around a threshold of 0.45-0.5]

The Great
  • I’ve changed some of the tags to reflect the question at hand, rather than the fact that you are interested in neural networks and `sklearn` software, as those are unrelated to the question you asked. – Dave Feb 23 '22 at 11:03

1 Answer


Out of context, it’s really hard to say how good performance is. While your AUC around $0.8$ could be quite good, it could be that your performance is rather pedestrian or that even a value of $0.55$ is excellent.

A key point to remember for the $F1$ score is that it requires a threshold, while AUC is calculated over all thresholds, and your software is using a default threshold of $0.5$ that might be wildly inappropriate. You might find it informative to write a bit of code that calculates the $F1$ over a range of thresholds, something like:

from sklearn.metrics import f1_score  # `probs`: your predicted class-1 probabilities, `y_true`: your true labels

for threshold in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    preds = (probs >= threshold).astype(int)   # map probability outputs to hard 0/1 categories
    print(threshold, f1_score(y_true, preds))  # calculate F1 at this threshold

I suspect you will find a better $F1$ score at a different threshold. With a few more lines, you can plot the $F1$ as a function of the threshold, which you might find useful.
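For instance, a minimal sketch of such a plot (again assuming your predicted probabilities are in `probs` and your true labels in `y_true`, with `matplotlib` available; adapt the names to your code):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

# F1 as a function of the decision threshold
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, (probs >= t).astype(int)) for t in thresholds]

plt.plot(thresholds, f1s, marker="o")
plt.xlabel("Decision threshold")
plt.ylabel("F1 score")
plt.show()
```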

This question at least alludes to the decision-making aspect of the problem, too, where a hard decision about a category must be made. I, along with plenty of high-reputation members here, would argue that, unless you know what you gain from correct classifications and what you lose from incorrect classifications, you have no business making hard categorical predictions and should be predicting probabilities. Nonetheless, the code exercise above should show that you can tweak the threshold as needed to make hard classifications once the gains and losses are known.

I will close with some of the usual links I post on this topic. Having seen several of your posts here and on Data Science, I highly recommend Frank Harrell’s blog posts.

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

https://www.fharrell.com/post/class-damage/

https://www.fharrell.com/post/classification/

https://stats.stackexchange.com/a/359936/247274

Proper scoring rule when there is a decision to make (e.g. spam vs ham email)

Dave
  • Thanks for the help. Upvoted. What if my optimization of thresholds results in 0.5 as the best threshold? – The Great Feb 23 '22 at 11:19
  • @TheGreat Best in what sense, highest $F1$ value? – Dave Feb 23 '22 at 11:21
  • Following our previous discussion, and discussions with others, I am not doing oversampling anymore. I am not sure what part of my post gave you the impression that I am still doing that or treating imbalance as a problem. – The Great Feb 23 '22 at 11:22
  • Yes, 0.5 gives the highest F1. – The Great Feb 23 '22 at 11:22
  • You’re still thinking in terms of hard classifications. Please read Harrell’s blog for more information on the benefits of considering the probabilities. // If $0.5$ gives you the best $F1$ score, so be it, though I have my doubts that this will occur. Even if it does, I have my doubts about how useful $F1$ is as a measure of model performance. – Dave Feb 23 '22 at 11:27
  • So, you suggest that I create a ROC AUC curve and a lift chart and leave it at that... no need to evaluate the model by a confusion matrix. Is that what you mean? – The Great Feb 23 '22 at 12:18
  • A confusion matrix requires a threshold and hard categorical predictions. Once you read Harrell’s blog posts I linked, you’ll have a better idea of why those are less useful than they first appear. – Dave Feb 23 '22 at 12:41
  • I updated the post above with the `f1 score` for different thresholds. You can see that it gets the best `f1-score` only at 0.45-0.5. It would be useful to know why you had doubts yesterday (as you mentioned earlier) about the best `f1` being at the 0.5 threshold. – The Great Feb 24 '22 at 05:09