
I am comparing an ML classifier to a number of benchmark classifiers using F1 scores. By AUPRC, my classifier does worse than the benchmark methods. When I compared F1 scores, however, I got the curious result that my classifier does better than the other methods. After looking at the precision-recall curve, I realized that my model does indeed perform better in precision/recall at the probability threshold where the F1 score is evaluated by default (classifier probability > 0.5), but worse at other thresholds.

When I report these results, should I be comparing models on the maximum possible F1 score? Please note that I am working with an imbalanced dataset, and the precision/recall tradeoff matters for my use case (hence F1 as the performance metric).
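For concreteness, this is roughly how I am producing these numbers (a minimal sketch with scikit-learn; `my_clf`, `benchmark_clf`, and the test split are placeholders rather than my actual pipeline):

```python
# Sketch of the comparison described above (scikit-learn assumed;
# my_clf, benchmark_clf, X_test, y_test are placeholders).
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_recall_curve)

for name, clf in [("mine", my_clf), ("benchmark", benchmark_clf)]:
    proba = clf.predict_proba(X_test)[:, 1]          # P(class = 1)

    # AUPRC (average precision) summarizes performance over all thresholds
    auprc = average_precision_score(y_test, proba)

    # F1 at the default 0.5 probability cutoff
    f1_default = f1_score(y_test, (proba > 0.5).astype(int))

    # F1 along the whole precision-recall curve
    prec, rec, thr = precision_recall_curve(y_test, proba)
    # drop the final (precision=1, recall=0) point, which has no threshold
    f1_curve = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
    best = np.argmax(f1_curve)

    print(f"{name}: AUPRC={auprc:.3f}  F1@0.5={f1_default:.3f}  "
          f"max F1={f1_curve[best]:.3f} at threshold {thr[best]:.3f}")
```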

John Smith
  • On this site you will find many cautions against using things like F1 scores to compare models. It would be better to specify the relative benefits and costs of different types of correct and incorrect classifications to help choose a corresponding probability threshold, and then see how the expected net benefit differs among the models. Is there really no benefit in identifying true negatives, which are [ignored in F1 calculations](https://en.wikipedia.org/wiki/F1_score#Criticism)? Adding some information to your question about costs and benefits might help provide a more useful answer. – EdM Apr 22 '20 at 20:30
  • Notwithstanding what @EdM said (+1); this shows why the $F_1$ score is a potentially misleading measurement for comparing classifier performance. Given we care to obtain the maximum possible $F_1$ score (if indeed that is relevant for us), there is no reason to use a fixed threshold. I would focus on having a careful way of picking the threshold used to compute $F_1$ (e.g. via using a separate fold to pick the optimal threshold). – usεr11852 Apr 23 '20 at 13:57

1 Answer


Even if you are looking for rare events in an unbalanced data set such that a focus on precision and recall is warranted and true negatives aren't of interest, you should be wary of using the $F_1$ score as your way to compare models.

The $F_1$ score is a specific weighting (harmonic mean) of precision and recall predicated on equal importance of precision and recall. If you really and truly consider them equally important, then fine. More generally, if "recall is considered $\beta$ times as important as precision" then you should use the $F_\beta$ score instead:

$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}. $$

This reduces to $F_1$ for equal weighting of precision and recall. So even in this circumstance it is still important to consider the (often hidden) tradeoffs involved in choosing a performance measure.
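As a quick illustration (a minimal sketch with toy labels and predictions, not your data), scikit-learn exposes this weighting directly, and the choice of $\beta$ changes how the same predictions are scored:

```python
# Minimal sketch: F_beta in scikit-learn on toy labels/predictions (placeholders).
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # toy labels
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # toy predictions: precision 2/3, recall 1/2

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))              # beta = 1: harmonic mean, ~0.57
print(fbeta_score(y_true, y_pred, beta=2))   # beta = 2: recall weighted more, ~0.53
print(fbeta_score(y_true, y_pred, beta=0.5)) # beta = 0.5: precision weighted more, ~0.62
```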

Once you have taken into account the relative importance of precision and recall for your application with some $F_\beta$ score, I agree with @usεr11852 that you should then find (in some careful, generalizable way) the predicted-probability cutoff that maximizes the score. So for comparing among models it would seem best to use the maximum score achievable. If you choose $\beta =1$, then that's the maximum $F_1$ score.
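A rough sketch of that procedure, assuming scikit-learn and using `clf`, `X`, `y`, and `beta` as placeholders for your model, data, and chosen weighting:

```python
# Sketch: pick the F_beta-maximizing threshold on a validation fold,
# then report the score on untouched test data.
# (scikit-learn assumed; clf, X, y, beta are placeholders.)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score, precision_recall_curve

# Split off validation and test folds, preserving class proportions (imbalanced data)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, stratify=y_rest, random_state=0)

clf.fit(X_train, y_train)

# Choose the threshold on the validation fold only
proba_val = clf.predict_proba(X_val)[:, 1]
prec, rec, thr = precision_recall_curve(y_val, proba_val)
fbeta = ((1 + beta**2) * prec[:-1] * rec[:-1]
         / (beta**2 * prec[:-1] + rec[:-1] + 1e-12))
best_threshold = thr[np.argmax(fbeta)]

# Report the score at that fixed threshold on the test fold
proba_test = clf.predict_proba(X_test)[:, 1]
print(fbeta_score(y_test, (proba_test >= best_threshold).astype(int), beta=beta))
```

The point is that the threshold is chosen on data separate from the data used for the final comparison, so the reported score is an honest estimate rather than one tuned to the test set.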

EdM