Summarising Precision/Recall Measures in Multi-class Problem

Question

I have a hierarchical multi-class classification system that classifies records into about 500 different categories. I want to summarise the performance of the classifier in a simple way.

A measure of accuracy on validation data is easy to implement: the number of correctly coded records divided by the number of all coded records. For each class, we can look at binary (one-vs-rest) measures of precision and recall to summarise the performance relative to that class.
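As a minimal sketch of these two calculations: the function names and the toy labels below are illustrative assumptions, not part of the actual system, and a class that is never predicted (or never occurs) is arbitrarily given a score of 0.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Correctly coded records divided by all coded records.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_precision_recall(y_true, y_pred):
    # One-vs-rest precision and recall for every class seen in the data.
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # how often each class was predicted
    actual = Counter(y_true)     # how often each class truly occurs
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[c] / predicted[c] if predicted[c] else 0.0
        recall = true_pos[c] / actual[c] if actual[c] else 0.0
        scores[c] = (precision, recall)
    return scores

# Toy data with three classes (hypothetical).
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

acc = accuracy(y_true, y_pred)
scores = per_class_precision_recall(y_true, y_pred)
```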

However, there doesn't seem to be a generally accepted way to combine the per-class binary precisions and recalls into summaries of precision and recall across the entire set of classes. There appear to be a few ways to approach this summary:

Take a simple average (arithmetic/geometric/harmonic) of each class's precision/recall.
Take a weighted average (weighted by number of examples, etc) of each class's precision/recall.
Use bookmaker's informedness/markedness which seems to have a natural generalisation in the multiclass context.

Are there advantages to using one of these approaches in particular? Is there a generally accepted way to do this that I've just been missing?

Potential duplicate of https://stats.stackexchange.com/questions/51296/how-do-you-calculate-precision-and-recall-for-multiclass-classification-using-co — Brandmaier, Jun 30 '17 at 07:37
@Brandmaier Thanks for your comment. That's not really the same question - the question there is about computing the binary precision/recalls, which I'm comfortable with. I'm asking for good practice in summarising all of the binary precisions/recalls into a single measure over all of the classes. — RoryT, Jul 03 '17 at 00:23

score 2 · Answer 1 · answered Jul 06 '17 at 12:20

As far as I know, there isn't a "de facto" standard way of calculating precision and recall for multi-class classification.

Your approaches are what I too would try:

Class-wise harmonic mean.
Class-wise weighted harmonic mean (if the classes are imbalanced), with each class weighted by its share of the data (i.e. class weight = number of class examples / number of total examples).
Class-wise geometric mean (another approach if the classes are imbalanced).
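To make the three averages concrete, here is a rough sketch in plain Python. The helper name `summarise`, the example recall values and the class counts are all made up for illustration; the convention that any zero score collapses the harmonic and geometric means to 0 is also an assumption you may want to change.

```python
import math

def summarise(values, weights=None):
    # Arithmetic, harmonic and geometric means of per-class scores.
    # `weights` would typically be the number of examples in each class;
    # leaving it as None gives the unweighted ("macro") averages.
    if weights is None:
        weights = [1.0] * len(values)
    pairs = [(v, w) for v, w in zip(values, weights) if w > 0]
    total = sum(w for _, w in pairs)
    arithmetic = sum(v * w for v, w in pairs) / total
    if any(v == 0 for v, _ in pairs):
        # A single zero score drives both of these means to zero.
        harmonic = geometric = 0.0
    else:
        harmonic = total / sum(w / v for v, w in pairs)
        geometric = math.exp(sum(w * math.log(v) for v, w in pairs) / total)
    return {"arithmetic": arithmetic, "harmonic": harmonic, "geometric": geometric}

# Hypothetical per-class recalls and class sizes.
recalls = [2 / 3, 0.5, 1.0]
support = [3, 2, 1]

macro = summarise(recalls)
weighted = summarise(recalls, support)
```

One thing to keep in mind when choosing: in a single-label problem, the arithmetic mean of the recalls weighted by class size is just the overall accuracy, so it adds no new information. The harmonic and geometric means are never larger than the arithmetic mean, and they punish classes with very low scores more heavily.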

There are also other metrics to evaluate the performance of your model, besides precision and recall:

Multi-class version of the ROC curve. A tutorial is also available.
Generalized F1-score, G-mean, etc.
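For the last item, one common reading (an assumption on my part, since several definitions are in use) is a macro-averaged F-beta score, i.e. the plain average of the per-class F-beta values, and a G-mean taken as the geometric mean of the per-class recalls. A small sketch with made-up per-class (precision, recall) pairs:

```python
import math

def f_beta(precision, recall, beta=1.0):
    # F-beta for a single class; beta=1 gives the usual F1.
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def macro_f(scores, beta=1.0):
    # Unweighted average of the per-class F-beta scores.
    return sum(f_beta(p, r, beta) for p, r in scores) / len(scores)

def g_mean(recalls):
    # Geometric mean of the per-class recalls.
    return math.prod(recalls) ** (1 / len(recalls))

# Hypothetical (precision, recall) pairs for three classes.
scores = [(1.0, 2 / 3), (0.5, 0.5), (0.5, 1.0)]

macro_f1 = macro_f(scores)
gm = g_mean([r for _, r in scores])
```

Note that the G-mean drops to 0 as soon as one class has zero recall, which with around 500 classes may happen often; that can be either a useful warning or too harsh, depending on what you need.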
