
Scoring classification model performance often feels somewhat abstract (looking at you, AUC scores...). There's always the accuracy score, which has the advantage of being easy to comprehend and is great for explaining how well the model will work to someone else (say, the people who will actually use its predictions). I intuitively expect there to be an equally common method for probability predictions, for example a simple "average distance from truth" along the lines of:

| Truth | Prediction | Score |
| ----- | ---------- | ----- |
|   1   |     0.97   |  0.03 | 
|   0   |     0.35   |  0.35 |
|   1   |     0.76   |  0.24 |
|   0   |     0.42   |  0.42 |

The score for the model as a whole would then be the average of those per-row scores: 0.26 in this case. That's easy enough to do manually (see the sketch below), but it surprises me that a) this isn't a common scoring metric and b) there doesn't seem to be any built-in method for it in the scikit-learn API.
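
A minimal sketch of that calculation in plain Python, using the numbers from the table:

```python
# "Average distance from truth", computed by hand on the table above.
truth = [1, 0, 1, 0]
prediction = [0.97, 0.35, 0.76, 0.42]

# Per-row distance between the true label and the predicted probability.
distances = [abs(t - p) for t, p in zip(truth, prediction)]

print(sum(distances) / len(distances))  # 0.26
```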

So my question is this: is "average distance from truth" a useful scoring metric, and if not, why not?

Dan Scally
  • I might be misinterpreting your question, but I think you are missing the existence of *[scoring rules](https://en.wikipedia.org/wiki/Scoring_rule)*. What you describe is extremely similar to the *[Brier score](https://en.wikipedia.org/wiki/Scoring_rule#Brier/quadratic_scoring_rule)* - `scikit-learn` has it (see the sketch after these comments). In general though, most scoring rules are derived from Decision Science/Forecasting approaches, so core CS/ML practitioners are not directly exposed to them as part of their academic training. – usεr11852 Feb 23 '19 at 23:11
  • @usεr11852 thanks for your comments. Brier score is certainly the most similar to the metric I'm asking about here. In your comment below you note that this isn't a proper scoring rule and shouldn't be preferred - I take your point but I still think that the metric has use, again particularly in explaining model performance to someone who's going to use the predictions but without knowing how they're calculated. – Dan Scally Feb 24 '19 at 07:42
  • No problem Dan. I see your point. For more details you might want to see: https://stats.stackexchange.com/questions/20581/ – usεr11852 Feb 24 '19 at 09:10
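
For completeness, a minimal sketch of the Brier score mentioned in the comments, computed with scikit-learn's `brier_score_loss` on the question's example data; it averages the *squared* distances from truth rather than the absolute ones:

```python
from sklearn.metrics import brier_score_loss

truth = [1, 0, 1, 0]
prediction = [0.97, 0.35, 0.76, 0.42]

# Mean of the squared distances: (0.03**2 + 0.35**2 + 0.24**2 + 0.42**2) / 4
print(brier_score_loss(truth, prediction))  # ~0.0894
```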

2 Answers


The metric you describe is in fact very common: it's the mean absolute error, or MAE. In scikit-learn you can find it in the `metrics` submodule as `mean_absolute_error`.
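
A minimal sketch, scoring the question's example table with the built-in function:

```python
from sklearn.metrics import mean_absolute_error

truth = [1, 0, 1, 0]
prediction = [0.97, 0.35, 0.76, 0.42]

# Identical to the manual "average distance from truth" above.
print(mean_absolute_error(truth, prediction))  # 0.26
```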

It's usually used for regression tasks rather than classification, which may be why you haven't encountered it. Still, when it is used to compare classification algorithms, there are certain caveats:

  • Like accuracy, it has problems with imbalanced datasets: it will produce deceptively good (low) scores for algorithms that just predict the majority class and are therefore useless (see the sketch after this list).
  • MAE doesn't tell you whether your classifier is better at predicting positives or negatives.
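
A sketch of the first caveat on made-up data: with 90% negatives, a degenerate model that always predicts 0 earns a flattering MAE without learning anything.

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0.0] * 100  # always predict the majority class

print(mean_absolute_error(y_true, y_pred))  # 0.1 -- low (good), yet useless
```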

So to answer your question: It is a common, useful scoring metric, but less often used for classifiers (more common for regressors).

Denwid
  • Please note that while MAE might appear intuitive, it is *not* a proper scoring rule so it should not be first choice metric. – usεr11852 Feb 23 '19 at 23:57
  • Ah! Do you know I searched for ages thinking this _must_ exist, but always with the "classifier" term, how irritating. Thank you for the answer @Denwid. – Dan Scally Feb 24 '19 at 07:32
  • Brier score (a.k.a. MSE, mean squared error) is a proper scoring rule that can easily be used instead of MAE. (IMHO, while still not a proper scoring rule, MAE is already better than accuracy.) – cbeleites unhappy with SX Feb 25 '19 at 12:05

In addition to @Denwid's answer:

cbeleites unhappy with SX