I read several posts online about evaluation metrics for classification models. Only accuracy, precision, recall, F1 score, ROC, AUC, and the confusion matrix are mentioned. However, I found that a couple of Kaggle competitions use log loss as the evaluation metric, for example, Dogs vs. Cats Redux: Kernels Edition.
It is the *preferred* metric and is one example of a strictly proper scoring rule. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jul 10 '21 at 05:40
@Dave: do you want to post your comment(s) as an answer? [Better to have a short answer than no answer at all.](https://stats.meta.stackexchange.com/a/5326/1352) Anyone who has a better answer can post it. – Stephan Kolassa Jul 10 '21 at 08:20
1 Answer
YES, this is a reasonable evaluation metric. In particular, log loss is a strictly proper scoring rule: its expected value is minimized by the true probability values.
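As a quick illustration (a minimal sketch assuming NumPy; the true probability of $0.7$ and the grid of candidates are arbitrary choices), the expected log loss over a grid of predicted probabilities is smallest exactly where the prediction equals the true probability:

```python
import numpy as np

p_true = 0.7                           # true probability of the positive class
q_grid = np.linspace(0.01, 0.99, 99)   # candidate predicted probabilities

# Expected log loss under the true distribution:
# E[loss] = -(p*log(q) + (1-p)*log(1-q))
expected_loss = -(p_true * np.log(q_grid) + (1 - p_true) * np.log(1 - q_grid))

best_q = q_grid[np.argmin(expected_loss)]
print(f"Expected log loss is minimized at q = {best_q:.2f}")  # ~0.70
```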
A standard reason to prefer a metric like accuracy is that it seems easy to interpret. "I got an accuracy of $95\%$, so that's like an $\text{A}$ in school, and I am happy." I would argue that accuracy has to be evaluated in context. A standard way for an accuracy of $95\%$ to be poor performance is when $99\%$ of the cases belong to one class: you could get a higher accuracy just by predicting the majority class every time.
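A rough simulation of that imbalance example (NumPy only; the sample size and the $1\%$ positive rate are purely illustrative) shows the majority-class baseline already reaching about $99\%$ accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.01).astype(int)   # roughly 1% positive class

always_majority = np.zeros_like(y)             # predict class 0 for every case
accuracy = (always_majority == y).mean()
print(f"accuracy of the majority-class baseline: {accuracy:.3f}")  # ~0.990
```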
Consequently, I do not see accuracy as easy to interpret, and I do not buy the argument to prefer accuracy over a strictly proper scoring rule like log loss due to the ease with which accuracy can be interpreted.
