
I am looking for comprehensive instructions and, ideally, out-of-the-box solutions (preferably in Python) for evaluating different classifiers (which are already trained) for a multiclass classification problem on an unbalanced dataset.

To illustrate further: I have about a dozen classifiers that are trained on the same unbalanced dataset with a handful of categories. Now I would like to

1) compare the classifiers against the ground truth:

How well do they perform at classifying on a per-class basis (compared to a chance-based model), and what is a sensible average of the per-class performances?

2) compare the classifiers against each other:

Are they significantly different in what they classify data instances as? Are they significantly different in their overall performance (e.g. in accuracy per class)?

I have looked into many test statistics and evaluation metrics; some of them are:

  • overall accuracy (bad for imbalanced datasets)
  • Cohen's kappa
  • Chi-squared goodness-of-fit test
  • McNemar's test
  • AUROC
  • Brier score
  • Youden Index
  • Informedness
  • F-Score

I encountered differing accounts, however, of whether these are suited for the imbalanced multiclass scenario and under which conditions they can be used. Most of the guides and explanations I read limited themselves to the binary classification case.

I did find the pycm package, though, which computes many statistics (including most of the above), also for multiclass problems. But its documentation is rather sparse, and I am not sure whether it handles the unbalanced multiclass scenario correctly.
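
A minimal usage sketch, assuming pycm's documented `ConfusionMatrix(actual_vector=..., predict_vector=...)` interface; the attribute names below are taken from its documentation and may differ between versions:

```python
# Sketch only: assumes pycm's documented ConfusionMatrix interface;
# attribute names may vary between pycm versions.
from pycm import ConfusionMatrix

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # ground-truth labels
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]   # one classifier's predicted labels

cm = ConfusionMatrix(actual_vector=y_true, predict_vector=y_pred)

print(cm.Kappa)        # Cohen's kappa (chance-corrected overall agreement)
print(cm.ACC)          # per-class accuracy, as a dict keyed by class label
print(cm.TPR)          # per-class recall / sensitivity
print(cm.Overall_ACC)  # plain overall accuracy, for comparison
```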

Now I am looking for some clear instructions on which tests I can apply to my case, or on how I need to format my data to be suited for a given test. (I read about binarization of multiclass labels and "one vs. all" a couple of times, for example, but the approaches I found involved retraining the models (e.g. here), which is not an option for me.)
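
For concreteness, this is the kind of data formatting I mean: binarizing only the labels and the predicted probabilities for scoring, with the already-trained models left untouched. The probabilities below are made up and stand in for a model's `predict_proba` output; the sketch assumes scikit-learn >= 0.22 for the `multi_class` argument of `roc_auc_score`:

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score, brier_score_loss

classes = [0, 1, 2]
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])

# Stand-in predicted class-membership probabilities of one trained classifier
# (in practice this would be clf.predict_proba(X_test); each row sums to 1).
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6],
                  [0.1, 0.3, 0.6],
                  [0.3, 0.5, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.7, 0.1]])

# Macro-averaged one-vs-rest AUROC; the binarization happens inside the metric,
# so no model is retrained.
auc_ovr = roc_auc_score(y_true, proba, multi_class='ovr', average='macro')

# Per-class binary Brier scores from the same labels and probabilities.
y_bin = label_binarize(y_true, classes=classes)
per_class_brier = {c: brier_score_loss(y_bin[:, i], proba[:, i])
                   for i, c in enumerate(classes)}
print(auc_ovr, per_class_brier)
```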

edit:

I am not asking why accuracy is not a good metric. I am asking which tests are suited for the unbalanced multiclass case.

asked by lo tolmencre
  • It's best to use a proper scoring rule, such as the Brier score. See [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) Then unbalanced classes are a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Apr 14 '19 at 15:15
  • Possible duplicate of [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models) – Stephan Kolassa Apr 14 '19 at 15:15
  • @StephanKolassa This is not a duplicate of that question, though. I am not asking about accuracy in particular and I know why it is not a good measure. – lo tolmencre Apr 14 '19 at 15:18
  • [My answer at the proposed duplicate](https://stats.stackexchange.com/a/312787/1352) explains why one should use proper scoring rules to evaluate probabilistic predictions. It seems to be exactly what you are looking for. As per my comment above, proper scoring rules have no problems whatsoever with unbalanced classes. If "use a proper scoring rule, and here are reasons why" is not an answer, could you perhaps explain why? – Stephan Kolassa Apr 14 '19 at 18:04
  • I would like to know what tests (scoring rules in this case) are suited for unbalanced multiclass scenarios. I had read your answer on that thread even before I posted my question, but it did not help me identify which scoring rules are suited. For example, AUROC is for dichotomous variables, whilst being a scoring rule. My question in that case would be: can I make AUROC work for the multiclass case, and how? Furthermore: what scoring rules do I need to apply to test the things I mentioned under (1) and (2)? Are scoring rules even suited for comparing classifiers amongst themselves? – lo tolmencre Apr 15 '19 at 06:08
  • Scoring rules all work on probabilistic predictions and actual outcome, by evaluating the predicted probability for a test sample to show the outcome actually observed. (The different scoring rules like the log score or the Brier score then proceed by transforming this probability in different ways.) Thus, they have no problems with unbalanced classes - as long as the class membership probability predictions are correct, the scoring rule will pick this up, no matter whether it's 1% or 50%. ... – Stephan Kolassa Apr 15 '19 at 06:40
  • ... So scoring rules will always compare your predictions against reality, and you can compare the scores between models. (Usually, lower scores are better, but some people use the opposite convention - check the definition of the score you are using.) You can of course always look at the scores your models achieve on specific target classes and compare scores only on specific classes. ... – Stephan Kolassa Apr 15 '19 at 06:42
  • ... I don't know of a way to make AUROC work for multi-class classification. Or ROC curves in general. The horizontal axis would need to turn into a $k-1$-dimensional space if you have $k$ classes, and you would evaluate over a simplex in that space. Anyway, [AUROC is not very good for distinguishing models](https://stats.stackexchange.com/a/384194/1352). Hope that helped! – Stephan Kolassa Apr 15 '19 at 06:43
  • [This post by Frank Harrell may be helpful.](http://www.fharrell.com/post/class-damage/) Or [Gneiting & Katzfuss (2014)](https://doi.org/10.1146/annurev-statistics-062713-085831). Or [Merkle & Steyvers (2013)](https://doi.org/10.1287/deca.2013.0280). These two articles are mostly about numerical predictions, not classification, but everything holds with probabilistic predictions instead of predictive densities. – Stephan Kolassa Apr 15 '19 at 06:53
  • @StephanKolassa Okay, thanks! What if I want to directly compare two models, though? Can I - instead of the ground truth classes - pass the predicted classes of a model to the scoring function? So that then I would do `score(predictedClassesModelA, predictedProbsModelB)` instead of `score(groundTruthClasses, predictedProbsModelB)`? Is that valid? Because directly comparing two models can be done with McNemar (at least for dichotomous variables, not sure about multiclass) and Cohen's Kappa. And I would like to end up with a test statistic like the kappa coefficient or a p-value. – lo tolmencre Apr 15 '19 at 07:25
  • No, scoring rules always measure the agreement between a (probabilistic) prediction and an actual observation. You can use them to compare which one of two (or more) models predicts your holdout test sample better. – Stephan Kolassa Apr 15 '19 at 09:01
  • Okay, and for determining significance in the differences in some given attribute I need to go back to ordinary test statistics? Which would lead me back to my original question: I also need to know if a given model is actually *significantly* better/worse (in some attribute, say precision) than some other model, or if it makes significantly different judgements (McNemar and Chi-squared). – lo tolmencre Apr 15 '19 at 13:48
  • I don't know of any work being done on significance testing for scoring rules. A simple way would be to bootstrap it. – Stephan Kolassa Apr 15 '19 at 14:19
  • What do you mean by bootstrapping in this context? – lo tolmencre Apr 15 '19 at 14:52
  • In each bootstrap replicate, divide your sample randomly into a training and a testing set. Fit models to the training set, probabilistically predict the test set, record the value of the scoring rule. Do this 1,000 times. You get 1,000 bootstrap replicates of the scoring rule value of each model. Then you can see whether, e.g., one model consistently outperforms another one, by having a lower scoring rule value on 95% of the bootstrap replicates. – Stephan Kolassa Apr 15 '19 at 15:06 (a code sketch of this procedure appears after the comment thread)
  • Ah, I see. Unfortunately not practical for me. One training takes 12 hours. Can you recommend any of the ordinary tests for this other than Kappa? Regarding McNemar and Chi Square goodness of fit I am not sure if they are even suited to my case. – lo tolmencre Apr 15 '19 at 15:11
  • Ah. No, I can't really think of anything. Maybe you could do model selection on a smaller subsample? – Stephan Kolassa Apr 15 '19 at 16:35
  • You mean smaller subsample of the training data to reduce training times in the bootstrapping? – lo tolmencre Apr 16 '19 at 03:59
  • Yes, exactly. Use a smaller sample to select the model through bootstrapping, then fit the selected model on the whole sample. (Maybe re-run the model selection multiple times on different subsamples, to check whether the selected models vary a lot. If so, they are probably all equally good, or bad.) – Stephan Kolassa Apr 16 '19 at 06:12
  • Ok, I can try that. Do you know of an implementation of the Brier score for Python? I looked into the code of `sklearn.metrics.brier_score_loss`. But under the hood it does a one-vs-all test. Is that the only way to do the Brier score? If you have $n$ classes, do you need to do $n$ tests, where in test $i$ you compare class $i$ against the classes $\{\,j \mid 1 \le j \le n,\ j \ne i\,\}$ in a binary way? Or is there a version of the Brier score that does not require this binarization? – lo tolmencre Apr 16 '19 at 10:55
  • When I manually compute it with `np.mean(np.sum((probs - targets)**2, axis=1))` where `targets` is a vector of one-hot vectors: `[[0. 1. 0. 0. 0.] [1. 0. 0. 0. 0.]]` and `probs` is a vector of vectors summing to one: `[[0.07 0.41 0.35 0.11 0.06] [0.03 0.33 0.29 0.03 0.32]]`, I get a value between 0 and 2, it seems. That is mentioned on the Wikipedia page for the Brier score. But that does not mean what I am doing there is valid. – lo tolmencre Apr 16 '19 at 10:56 (a runnable version of this computation appears after the comment thread)
  • Your second comment looks good. I'm not familiar with sklearn and don't quite understand how it would do a one-vs-all test. – Stephan Kolassa Apr 16 '19 at 15:27
  • Okay thanks. sklearn takes one of $n$ classes and treats it as class 1 and all other classes as class 0. Then it apparently does the binary Brier score on the input binarized like this. – lo tolmencre Apr 16 '19 at 17:25
  • @StephanKolassa I have read the original paper by Brier now, which you gave me in the other thread. In there, he does not mention that the score is robust against imbalanced classes. Then I found a blog article stating "[...] the average Brier score will present optimistic scores on an imbalanced dataset, rewarding small prediction values that reduce error on the majority class.". That sounds like it contradicts what you said. I assume I am just missing something... Can you comment on that? The source is https://machinelearningmastery.com/how-to-score-probability-predictions-in-python/ – lo tolmencre Apr 17 '19 at 15:46
  • Also, is there a convention for which scores are "excellent", "good", "fair", "poor" etc? – lo tolmencre Apr 17 '19 at 15:49
  • I just skimmed the page you link. I don't quite see the problem. Looking at [this plot](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Line-Plot-of-Predicting-Brier-for-Imbalanced-Dataset.png) of the Brier score against predicted probabilities for an unbalanced 10:1 dataset, the score is minimized by a prediction of 0.1, which is exactly as it should be. I don't see how this is "optimistic". The Brier score is known to be proper, so it will be minimized in expectation by predicting the actual probabilities. – Stephan Kolassa Apr 17 '19 at 15:52 (a small numeric illustration of this appears after the comment thread)
  • And no, there is no overall consensus on what is a "good" score. What is achievable depends on your specific application. Some questions are easily answered, so you should aim for a small score, but others are hard, and a small score may simply not be achievable. [How to know that your machine learning problem is hopeless?](https://stats.stackexchange.com/q/222179/1352) – Stephan Kolassa Apr 17 '19 at 15:53
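
A minimal sketch of the bootstrap comparison described in the comments, assuming scikit-learn-style models with `fit`/`predict_proba` and using a multiclass Brier score as the scoring rule; `model_a`, `model_b`, `X`, and `y` are placeholders for one's own classifiers and data:

```python
# Sketch of the bootstrap model comparison suggested in the comments.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

def multiclass_brier(y_true, proba, classes):
    """Mean squared distance between one-hot targets and predicted probabilities.

    `classes` must be in the same (sorted) order as the columns of `proba`."""
    onehot = label_binarize(y_true, classes=classes)
    return np.mean(np.sum((proba - onehot) ** 2, axis=1))

def bootstrap_scores(model, X, y, classes, n_replicates=1000, seed=0):
    """Refit `model` on random train/test splits and record its Brier score."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_replicates):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y,
            random_state=rng.randint(2**31 - 1))
        model.fit(X_tr, y_tr)
        scores.append(multiclass_brier(y_te, model.predict_proba(X_te), classes))
    return np.array(scores)

# Placeholder usage (model_a, model_b, X, y are not defined here):
# scores_a = bootstrap_scores(model_a, X, y, classes=[0, 1, 2])
# scores_b = bootstrap_scores(model_b, X, y, classes=[0, 1, 2])
# np.mean(scores_a < scores_b)   # fraction of replicates in which A beats B
```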
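
A runnable version of the manual multiclass Brier computation from the comments, using the example vectors given there, plus a check that it equals the sum over classes of the binary one-vs-rest Brier scores that sklearn's `brier_score_loss` computes:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Example vectors from the comments: one-hot ground truth and predicted
# probabilities (each row sums to one).
targets = np.array([[0., 1., 0., 0., 0.],
                    [1., 0., 0., 0., 0.]])
probs = np.array([[0.07, 0.41, 0.35, 0.11, 0.06],
                  [0.03, 0.33, 0.29, 0.03, 0.32]])

# Multiclass Brier score as in Brier's original definition (range 0..2).
brier_multi = np.mean(np.sum((probs - targets) ** 2, axis=1))

# The same quantity, assembled from one binary one-vs-rest Brier score per class
# (pos_label=1 is passed explicitly because some columns contain no positives).
brier_ovr_sum = sum(brier_score_loss(targets[:, k], probs[:, k], pos_label=1)
                    for k in range(targets.shape[1]))

print(brier_multi, brier_ovr_sum)   # identical up to floating-point error
```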
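
And a small numeric illustration of the properness point under a 10:1 class imbalance: the expected Brier score of a constant probability forecast $p$ is $0.1\,(1-p)^2 + 0.9\,p^2$, which is minimized at $p = 0.1$, the true positive rate:

```python
import numpy as np

# Expected Brier score of a constant forecast p when 10% of cases are positive.
p_grid = np.linspace(0.0, 1.0, 101)
expected_brier = 0.1 * (1 - p_grid) ** 2 + 0.9 * p_grid ** 2

print(p_grid[np.argmin(expected_brier)])   # -> 0.1, the true class rate
```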
