
I have a multiclass classification problem with 5 classes and highly imbalanced data:

  • class 01: 6
  • class 02: 100
  • class 03: 9300
  • class 04: 200
  • class 05: 34

I have used k-fold cross-validation with k = 10 and 5 algorithms:

  • Logistic regression (LR)
  • Linear discriminant analysis (LDA)
  • K-nearest neighbors (KNN)
  • CART
  • Naive Bayes (NB)

and I got these results:

Algorithm: mean accuracy (standard deviation of accuracy) over the 10 folds

  • LR: 0.489479 (0.095705)
  • LDA: 0.901222 (0.001977)
  • KNN: 0.949483 (0.002300)
  • CART: 0.939122 (0.002691)
  • NB: 0.950761 (0.002713)
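
The question doesn't include code, but a minimal sketch of the kind of comparison described above, assuming scikit-learn (not named in the question) and with `X`, `y` as hypothetical stand-ins for the actual feature matrix and labels:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "NB": GaussianNB(),
}

# Stratified folds keep the class proportions roughly equal in each fold; with
# only 6 samples in the smallest class, scikit-learn will warn that the least
# populated class has fewer members than n_splits=10.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# X, y: hypothetical feature matrix and label vector
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.6f} ({scores.std():.6f})")
```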

My question is: is accuracy enough to choose the best model among these five, or should I use precision, recall, F1 score, AUC, etc.?
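
For comparison, the per-class metrics named in the question can be computed from out-of-fold predictions. A minimal sketch, again assuming scikit-learn and the hypothetical `X`, `y`:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# Out-of-fold hard predictions for one of the models (X, y hypothetical)
y_pred = cross_val_predict(GaussianNB(), X, y, cv=cv)

# Per-class precision, recall and F1. Since class 03 holds roughly 96% of the
# samples (9300 of 9640), overall accuracy can look high even when the small
# classes are rarely predicted correctly.
print(classification_report(y, y_pred, digits=3))
```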

  • None of them. [Use proper scoring rules.](https://stats.stackexchange.com/a/312787/1352) – Stephan Kolassa May 27 '20 at 12:39
  • @StephanKolassa thanks for answering my question. I know now that accuracy is not always the best metric to evaluate a model, but what do you mean by proper scoring rules, and how can I use them? – Nour elhouda Khettache May 27 '20 at 16:55
  • From his first link: "Strictly proper scoring rules are scoring rules that are only minimized in expectation if the predictive density is the true density." Two examples are Brier score and crossentropy loss. What I don't totally follow @StephanKolassa is what to do when the proper metric disagrees with accuracy and there is a discrete decision to make. – Dave May 27 '20 at 16:57
  • @NourelhoudaKhettache: [the Wikipedia page](https://en.wikipedia.org/wiki/Scoring_rule) gives you a first introduction. [Our tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info) gives more information. – Stephan Kolassa May 28 '20 at 05:47
  • @Dave: I would argue that your question mixes different concepts, namely *classification/class prediction* and *decision*. (Using accuracy also conflates the two.) You should first aim at calibrated and sharp probabilistic (!) predictions, and proper scoring rules will help you there. *Then* you can base a decision on this probabilistic prediction, also taking costs of wrong actions into account. Conversely, costs of decisions have no role in the prediction part. [I illustrate the difference here.](https://stats.stackexchange.com/a/312124/1352) – Stephan Kolassa May 28 '20 at 05:50
  • @StephanKolassa I posted a question about this topic: https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email. I think the conversation I have with my boss at the end is a real issue for a professional statistician, and having an answer to that question from her would be a valuable post on CV. I’ve looked at your post and Doc Harrell’s blog post you linked there, but I still don’t have an answer for management about why we should prefer the lower exam score just because the test taker was confident in answering (so to speak). – Dave May 28 '20 at 10:24
  • @Dave: I have answered that question, let's continue the discussion there. I'd just like to comment on "prefer the lower exam score just because the test taker was confident in answering". The metaphor doesn't work. We prefer a lower probabilistic prediction (i.e., a more confident prediction of "FALSE") *conditional on it being correct*. If the instance turned out to be TRUE, then of course we would prefer a less confident "FALSE" prediction, i.e., a higher $\hat{p}$. – Stephan Kolassa May 28 '20 at 15:10
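
The comments above recommend proper scoring rules such as log loss (cross-entropy) and the Brier score, which are evaluated on predicted class probabilities rather than hard labels. A minimal sketch of how they could be computed, again assuming scikit-learn and the hypothetical `X`, `y`:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import log_loss
from sklearn.preprocessing import label_binarize
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# Out-of-fold predicted probabilities for one of the models (X, y hypothetical)
proba = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=cv,
                          method="predict_proba")

# Log loss (cross-entropy): lower is better, minimized in expectation by the
# true class probabilities.
print("log loss:", log_loss(y, proba))

# Multiclass Brier score: mean squared difference between the one-hot encoded
# labels and the predicted probabilities; lower is better.
y_onehot = label_binarize(y, classes=np.unique(y))
print("Brier score:", np.mean(np.sum((y_onehot - proba) ** 2, axis=1)))
```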

0 Answers