I want to evaluate my multi-class classifier against a gold reference and obtain a single score that reflects its performance. My data contains many classes that are important but rare, so macro F1 was recommended to me.
However, I am now confused, because this paper* shows that two different macro F1 formulas are in use and that their scores can differ by 0.5. These are the two formulas:
1. the average of the individual (class-wise) F1 scores
2. the F1 score computed from the class-averaged precision and the class-averaged recall
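To check that I am reading the two definitions correctly, here is a minimal sketch of how I understand them. The function names and the toy labels are my own and are not taken from the paper, and treating an undefined precision or recall (zero denominator) as 0 is an assumption on my part.

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Return {class: (precision, recall, f1)} for every class in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        # Assumption: a zero denominator yields a score of 0.0.
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores


def macro_f1_average_of_f1(gold, pred):
    """Formula 1: arithmetic mean of the class-wise F1 scores."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def macro_f1_of_averages(gold, pred):
    """Formula 2: harmonic mean of macro precision and macro recall."""
    scores = per_class_prf(gold, pred)
    n = len(scores)
    macro_p = sum(p for p, _, _ in scores.values()) / n
    macro_r = sum(r for _, r, _ in scores.values()) / n
    return 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r else 0.0


# Toy example (made up): class "c" is rare and never predicted.
gold = ["a", "a", "a", "a", "b", "c"]
pred = ["a", "a", "a", "b", "b", "a"]

print(round(macro_f1_average_of_f1(gold, pred), 4))  # 0.4722
print(round(macro_f1_of_averages(gold, pred), 4))    # 0.4861
```

On this toy input both functions return values on a [0, 1] scale, and the two formulas already disagree slightly, although by far less than 0.5.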
First question: Do I understand correctly that they show it is better to use formula 1 than formula 2?
Second question: I am also unsure whether they mean that the scores can differ by 0.5 on a [0, 100] scale, which would be fairly negligible, or by 0.5 on a [0, 1] scale, which would be rather extreme.
* Opitz, J. and Burst, S., 2019. Macro F1 and Macro F1. arXiv preprint arXiv:1911.03347.