2

I'm trying to fit multiple Stochastic Gradient Descent models to a dataset where the target (binary, 0 or 1) is very imbalanced, i.e., the success rate (fraction of positives) is about 0.0001.

Out of all the models I've trained, I would like to select the best model based on the validation log-loss and validation AUC. Unfortunately, the average values of the test log-loss (0.001) and the test AUC (0.99) don't allow me to differentiate the models (as all the values are almost the same).

Are these metrics (AUC and LogLoss) good performance metrics for a highly imbalanced classification task? What metrics would allow me to differentiate the models and choose the best one?

Thanks

Aymen
  • How is the model going to be used? This is the biggest question that should influence which metric you should use to select your model – TBSRounder Jun 09 '17 at 17:50
  • After training the model, I'll be using the predicted probabilities. I won't be using 0s or 1s. Does it answer your question? – Aymen Jun 09 '17 at 17:54
  • Log-loss measures how well the probabilities reflect the data they were trained on, so if you're going to be using the probabilities directly, that's the way to go. You should expect that the values of log loss only differ slightly on an absolute scale, as you are attempting to predict a very rare event, meaning all your predicted probabilities would (and should) be small. This isn't a problem for *comparing* models, but you will want to bootstrap to make sure any differences you observe are consistent across bootstrap samples. – Matthew Drury Jun 09 '17 at 17:55
  • Thanks. What do you mean by bootstrap? Cross-validated on multiple (different) test sets? – Aymen Jun 09 '17 at 17:57
  • Yeah, log losses don't really make sense on an absolute scale; they are for comparing different models of the same process. – Matthew Drury Jun 09 '17 at 17:58
  • The bootstrap is a technique for estimating the sampling distribution of quantities you are estimating. In this case, you're fitting models and estimating their log loss to compare them. You want to be sure that anything you observe is not due to chance, but is consistent across different training sets. You can search the site for the bootstrap and you'll find a lot of good content (a minimal sketch follows after these comments). – Matthew Drury Jun 09 '17 at 18:05
  • Related: https://stats.stackexchange.com/questions/222558/classification-evaluation-metrics-for-highly-imbalanced-data – Anton Tarasenko Jan 18 '18 at 17:07
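
A minimal sketch of the bootstrap comparison described in the comments above, assuming scikit-learn-style models; `model_a`, `model_b`, `X_test`, and `y_test` are placeholder names, not anything from the original question:

```python
import numpy as np
from sklearn.metrics import log_loss

def bootstrap_log_loss_diff(y_true, proba_a, proba_b, n_boot=1000, seed=0):
    """Resample the evaluation set and record the log-loss difference (model A - model B)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # draw a bootstrap sample (with replacement)
        if y_true[idx].sum() == 0:             # with a 0.0001 positive rate, a resample may
            continue                           # contain no positives; skip those
        diffs.append(log_loss(y_true[idx], proba_a[idx], labels=[0, 1])
                     - log_loss(y_true[idx], proba_b[idx], labels=[0, 1]))
    return np.array(diffs)

# proba_a = model_a.predict_proba(X_test)[:, 1]   # hypothetical fitted SGD classifiers
# proba_b = model_b.predict_proba(X_test)[:, 1]
# diffs = bootstrap_log_loss_diff(np.asarray(y_test), proba_a, proba_b)
# print("model A better in", (diffs < 0).mean(), "of resamples")
```

If one model is genuinely better, its log-loss should be lower in the large majority of resamples, not just on average.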

2 Answers

1

I think the best way to assess the performance of a classifier on highly imbalanced classes is to look at the precision-recall curve. You can also use the area under this curve as a metric.
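
A minimal sketch of this with scikit-learn, using synthetic labels and probabilities as stand-ins for real validation data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Synthetic stand-in for real validation labels and predicted probabilities.
rng = np.random.default_rng(0)
y_val = rng.binomial(1, 0.001, size=100_000)                                  # rare positives
proba = np.clip(0.001 + 0.02 * y_val + rng.normal(0, 0.005, size=y_val.shape), 0, 1)

precision, recall, _ = precision_recall_curve(y_val, proba)
pr_auc = auc(recall, precision)                        # area under the PR curve (trapezoidal)
ap = average_precision_score(y_val, proba)             # step-wise summary of the same curve
print(f"AUC-PR: {pr_auc:.4f}   average precision: {ap:.4f}")
```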

Nik
  • In fact, better than the precision-recall curve is the precision-recall gain (PRG) curve; see [here](http://www.cs.bris.ac.uk/~flach/PRGcurves/) for details and implementations (a rough sketch of the gain transformation follows below). – darXider Jun 09 '17 at 19:49
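
As a rough illustration only (the implementations linked above should be preferred), a sketch of the gain transformation, assuming the definitions from Flach & Kull (2015): precision gain = (precision − π) / ((1 − π) · precision) and recall gain = (recall − π) / ((1 − π) · recall), where π is the positive-class prevalence:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def prg_points(y_true, proba):
    """Map (precision, recall) points to (precision gain, recall gain)."""
    pi = np.mean(y_true)                                 # positive-class prevalence
    precision, recall, _ = precision_recall_curve(y_true, proba)
    with np.errstate(divide="ignore", invalid="ignore"):
        prec_gain = (precision - pi) / ((1 - pi) * precision)
        rec_gain = (recall - pi) / ((1 - pi) * recall)
    keep = recall >= pi                                  # gains are only meaningful past the baseline
    return np.clip(rec_gain[keep], 0, 1), np.clip(prec_gain[keep], 0, 1)

# rec_gain, prec_gain = prg_points(np.asarray(y_val), proba)  # y_val / proba are placeholder names
```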
0

The beginning of an answer.

There are several commonly used metrics for imbalanced data sets. Some of them are precision, recall, the F1 score, balanced accuracy, the Matthews correlation coefficient, and the area under the precision-recall curve.

Unfortunately, I do not know of a good intuition for why to choose one of these metrics over another.
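
A minimal sketch of computing several of these with scikit-learn; `y_val` and `proba` are placeholder names for validation labels and predicted probabilities (NumPy arrays):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             balanced_accuracy_score, matthews_corrcoef,
                             average_precision_score)

def imbalance_report(y_true, proba, threshold=0.5):
    """Compute a handful of metrics that stay informative when positives are rare."""
    y_pred = (proba >= threshold).astype(int)          # hard labels only for the label-based metrics
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "average_precision": average_precision_score(y_true, proba),  # uses the probabilities directly
    }

# imbalance_report(y_val, proba)   # placeholder validation labels and probabilities
```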

Jacques Wainer