Multi-class evaluation: found different macro F1 scores, which one to use?

Question

I want to evaluate my multi-class classifier against a gold reference and obtain a single score that reflects its performance. My data contains many classes that are important but rare, so I was advised to use macro F1.

However, I am now confused, because this paper^* shows that two different macro F1 formulas are in use and that their scores can differ by 0.5. The two formulas are:

1. the average of the individual (class-wise) F1 scores

2. the F1 score (harmonic mean) of the averaged class-wise precision and the averaged class-wise recall

First question: Do I understand correctly that the paper shows formula 1 to be preferable to formula 2?

Second question: I also do not fully understand whether they mean that these scores can differ by 0.5 on a scale of [0,100], which would be negligible, or by 0.5 on a scale of [0,1], which would be extreme.

^* Opitz, J. and Burst, S., 2019. Macro F1 and Macro F1. arXiv preprint arXiv:1911.03347.

score 0 · Accepted Answer · answered Jun 15 '20 at 08:12

After reading the paper again in depth, I found the answers to my questions:

Answer to first question: Yes, it is (much) better to use formula 1, that is, to calculate macro F1 as the average of the class-wise F1 scores. The other macro F1 formula (the harmonic mean of the averaged class-wise precision and the averaged class-wise recall) can lead to overly optimistic scores, because it is easily biased by particularities of the error type distribution. Formula 1 does not suffer from this issue.
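For reference, the two formulas can be written out as follows. The notation is an assumption chosen for illustration and not necessarily the paper's: $n$ is the number of classes, and $P_i$ and $R_i$ are the precision and recall of class $i$.

$$
\text{Formula 1 (averaged F1):}\quad
\frac{1}{n}\sum_{i=1}^{n}\frac{2\,P_i\,R_i}{P_i+R_i}
$$

$$
\text{Formula 2 (F1 of averages):}\quad
\frac{2\,\bar{P}\,\bar{R}}{\bar{P}+\bar{R}},
\qquad
\bar{P}=\frac{1}{n}\sum_{i=1}^{n}P_i,\quad
\bar{R}=\frac{1}{n}\sum_{i=1}^{n}R_i
$$

In formula 1 the harmonic mean is taken per class, so a class with high precision but very low recall (or the reverse) gets a low F1 that pulls the average down. In formula 2 the averaging happens first, so a high-precision class and a high-recall class can compensate for each other before the harmonic mean is applied.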

Answer to second question: Quite surprisingly to me, the maximum difference is 0.5 on a scale of [0,1] or, equivalently, 50 on a scale of [0,100]. This means that the two macro F1 formulas can yield extremely different scores for the same classifier.
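To make the size of the gap concrete, here is a minimal sketch in plain Python that computes both formulas on a constructed two-class example. The toy data and all function names (`per_class_precision_recall`, `averaged_f1`, `f1_of_averages`) are hypothetical and chosen for illustration; they are not taken from the paper. Precision or recall with a zero denominator is treated as 0 here, which is also an assumption.

```python
def per_class_precision_recall(gold, pred):
    """Return {label: (precision, recall)} for every label seen in gold or pred."""
    stats = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        n_pred = sum(1 for p in pred if p == label)  # tp + fp
        n_gold = sum(1 for g in gold if g == label)  # tp + fn
        # Assumption: an undefined ratio (zero denominator) counts as 0.
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        stats[label] = (precision, recall)
    return stats


def harmonic_mean(p, r):
    """F1 of one precision/recall pair; 0 if both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def averaged_f1(gold, pred):
    """Formula 1: mean of the class-wise F1 scores."""
    stats = per_class_precision_recall(gold, pred)
    return sum(harmonic_mean(p, r) for p, r in stats.values()) / len(stats)


def f1_of_averages(gold, pred):
    """Formula 2: harmonic mean of mean precision and mean recall."""
    stats = per_class_precision_recall(gold, pred)
    mean_p = sum(p for p, _ in stats.values()) / len(stats)
    mean_r = sum(r for _, r in stats.values()) / len(stats)
    return harmonic_mean(mean_p, mean_r)


# Toy data: 1000 gold "A" items and 1 gold "B" item.
# The classifier labels a single "A" item correctly and calls everything else "B".
gold = ["A"] * 1000 + ["B"]
pred = ["A"] + ["B"] * 999 + ["B"]

score_1 = averaged_f1(gold, pred)
score_2 = f1_of_averages(gold, pred)
print(f"averaged F1:    {score_1:.4f}")
print(f"F1 of averages: {score_2:.4f}")
print(f"difference:     {score_2 - score_1:.4f}")
```

In this example class A has precision 1 and recall 0.001, while class B has precision 0.001 and recall 1. Both class-wise F1 scores are therefore about 0.002, so formula 1 gives roughly 0.002. Formula 2 averages precision and recall to 0.5005 each and returns 0.5005, a gap of just under 0.5 for a classifier that is wrong on almost every item.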
