
Scoring classification model performance often feels somewhat abstract (looking at you, AUC scores...). There's always the accuracy score, which has the advantage of being easy to comprehend and is great for explaining how well the model will work to someone else (say, the people who will actually use its predictions). I intuitively expect there to be an equally common method for probability predictions, for example a simple "average distance from truth" along the lines of:

| Truth | Prediction | Score |
| ----- | ---------- | ----- |
|   1   |     0.97   |  0.03 | 
|   0   |     0.35   |  0.35 |
|   1   |     0.76   |  0.24 |
|   0   |     0.42   |  0.42 |

The score for the model as a whole would then be the average of those per-row scores: 0.26 in this case. That's easy enough to do manually (see the sketch below), but it surprises me that a) this isn't a common scoring metric and b) there doesn't seem to be any built-in method for it in the scikit-learn API.
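
A minimal sketch of that calculation in plain Python, using the numbers from the table:

```python
# "Average distance from truth", computed by hand on the table above.
truth = [1, 0, 1, 0]
prediction = [0.97, 0.35, 0.76, 0.42]

# Per-row distance between the true label and the predicted probability.
distances = [abs(t - p) for t, p in zip(truth, prediction)]

print(sum(distances) / len(distances))  # 0.26
```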

So my question is this: is "average distance from truth" a useful scoring metric, and if not, why not?

Dan Scally
  • I might be misinterpreting your question, but I think you are missing the existence of *[scoring rules](https://en.wikipedia.org/wiki/Scoring_rule)*. What you describe is extremely similar to the *[Brier score](https://en.wikipedia.org/wiki/Scoring_rule#Brier/quadratic_scoring_rule)* - `scikit-learn` has it (see the sketch after these comments). In general though, most scoring rules are derived from Decision Science/Forecasting approaches, so core CS/ML practitioners are not directly exposed to them as part of their academic training. – usεr11852 Feb 23 '19 at 23:11
  • @usεr11852 thanks for your comments. Brier score is certainly the most similar to the metric I'm asking about here. In your comment below you note that this isn't a proper scoring rule and shouldn't be preferred - I take your point but I still think that the metric has use, again particularly in explaining model performance to someone who's going to use the predictions but without knowing how they're calculated. – Dan Scally Feb 24 '19 at 07:42
  • No problem Dan. I see your point. For more details you might want to see: https://stats.stackexchange.com/questions/20581/ – usεr11852 Feb 24 '19 at 09:10
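
For completeness, a minimal sketch of the Brier score mentioned in the comments, computed with scikit-learn's `brier_score_loss` on the question's example data; it averages the *squared* distances from truth rather than the absolute ones:

```python
from sklearn.metrics import brier_score_loss

truth = [1, 0, 1, 0]
prediction = [0.97, 0.35, 0.76, 0.42]

# Mean of the squared distances: (0.03**2 + 0.35**2 + 0.24**2 + 0.42**2) / 4
print(brier_score_loss(truth, prediction))  # ~0.0894
```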

2 Answers


The metric you describe is in fact very common: it's the mean absolute error, or MAE. In scikit-learn you can find it in the `metrics` submodule as `mean_absolute_error`.
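
A minimal sketch, scoring the question's example table with the built-in function:

```python
from sklearn.metrics import mean_absolute_error

truth = [1, 0, 1, 0]
prediction = [0.97, 0.35, 0.76, 0.42]

# Identical to the manual "average distance from truth" above.
print(mean_absolute_error(truth, prediction))  # 0.26
```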

It's usually used for regression tasks rather than classification, which may be why you haven't encountered it. Still, when it is used to compare classification algorithms, there are certain caveats:

  • Like accuracy, it has problems with imbalanced datasets: it will produce deceptively good (low) scores for algorithms that just predict the majority class and are therefore useless (see the sketch after this list).
  • MAE doesn't tell you whether your classifier is better at predicting positives or negatives.
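
A sketch of the first caveat on made-up data: with 90% negatives, a degenerate model that always predicts 0 earns a flattering MAE without learning anything.

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0.0] * 100  # always predict the majority class

print(mean_absolute_error(y_true, y_pred))  # 0.1 -- low (good), yet useless
```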

So to answer your question: It is a common, useful scoring metric, but less often used for classifiers (more common for regressors).

Denwid
  • Please note that while MAE might appear intuitive, it is *not* a proper scoring rule so it should not be first choice metric. – usεr11852 Feb 23 '19 at 23:57
  • Ah! Do you know I searched for ages thinking this _must_ exist, but always with the "classifier" term, how irritating. Thank you for the answer @Denwid. – Dan Scally Feb 24 '19 at 07:32
  • Brier score (a.k.a. MSE, mean squared error) is a proper scoring rule that can easily be used instead of MAE. (IMHO, while still not a proper scoring rule, MAE is already better than accuracy.) – cbeleites unhappy with SX Feb 25 '19 at 12:05

In addition to @Denwid's answer:

cbeleites unhappy with SX