
I was reading Jurafsky and Martin's *Speech and Language Processing*, 3rd edition, chapter 4, pages 12–13.

[Screenshot of the book excerpt defining the $F$-measure as the weighted harmonic mean of precision and recall.]

Can you explain why it is good to give more weight to the smaller of the two terms, namely $\frac{1}{\text{Precision}}$ or $\frac{1}{\text{Recall}}$?

Here is the link to the book chapter (freely available from the official author).
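For reference, since the screenshot is not shown here, the passage in question defines the $F$-measure as the weighted harmonic mean of precision $P$ and recall $R$ (this is the standard formula from that chapter):

$$F_\beta = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2+1)PR}{\beta^2 P + R}, \qquad \beta^2 = \frac{1-\alpha}{\alpha},$$

which for $\beta = 1$ reduces to $F_1 = \frac{2PR}{P+R}$.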

user27286
  • It isn't. [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) The same problems apply to sensitivity and specificity, and indeed to all evaluation metrics that rely on hard classifications, so also to all $F_\beta$ scores. Instead, use probabilistic classifications, and evaluate these using proper scoring rules. – Stephan Kolassa Mar 17 '21 at 06:50
  • @StephanKolassa: Yes, that's a fair point. But I also don't understand, in this context, how it weighs the smaller of the two more heavily. (Did you get that part?) – user27286 Mar 17 '21 at 07:04
  • Actually, yes, I did see that. It looks like there are really two questions in your mind: (1) how does the harmonic mean "weigh the smaller number more heavily" than the arithmetic mean, and (2) why is this a good idea? As to (1), I would say that is an unfortunate choice of words. It's just a reformulation of the $\mathrm{HM} \le \mathrm{GM} \le \mathrm{AM}$ inequality; there is no "weighting" involved. ... – Stephan Kolassa Mar 17 '21 at 07:11
  • ... As to (2), the book gives no reason, and none come to my mind immediately. It may be a kind of "precautionary principle" - we care about both precision and recall, but in summarizing them we would rather be conservative and be closer to the *smaller* of the two so we are not led into overoptimism if the larger one is "really" large. (Note that the "weighting the smaller of the two" refers to precision and recall, not their reciprocals as you write at the end of your question.) – Stephan Kolassa Mar 17 '21 at 07:14
  • @StephanKolassa: I see. This is a good explanation, and it cleared up my mistaken idea of the "reciprocals being weighed". If you get time, write an answer and let me thank you for your time by accepting it. – user27286 Mar 17 '21 at 07:17

1 Answer


It looks like there are really two questions in your mind:

  1. How does the harmonic mean "weigh the smaller number more heavily" than the arithmetic mean?
  2. Why is this a good idea?

My thoughts:

  1. I would say that is an unfortunate choice of words. It's just a reformulation of the $\mathrm{HM} \le \mathrm{GM} \le \mathrm{AM}$ inequality; there is no "weighting" involved. (All means can be weighted, but that's a separate question.) See the numeric sketch after this list.

  2. As to this, the book gives no reason, and none come to my mind immediately. It may be a kind of "precautionary principle" - we care about both precision and recall, but in summarizing them we would rather be conservative and be closer to the smaller of the two so we are not led into overoptimism if the larger one is "really" large. (Note that the "weighting the smaller of the two" refers to precision and recall, not their reciprocals as you write at the end of your question.)
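To make point 1 concrete, here is a minimal sketch in plain Python (the precision/recall pair is made up for illustration) comparing the three classical means; the harmonic mean simply lands closest to the smaller of the two values, which is all the book's phrasing amounts to:

```python
from math import sqrt

precision, recall = 0.9, 0.1  # deliberately imbalanced pair

arithmetic = (precision + recall) / 2                     # simple average
geometric = sqrt(precision * recall)                      # square root of the product
harmonic = 2 * precision * recall / (precision + recall)  # this is exactly F1

print(f"AM = {arithmetic:.2f}")  # 0.50
print(f"GM = {geometric:.2f}")   # 0.30
print(f"HM = {harmonic:.2f}")    # 0.18 -- closest to min(P, R) = 0.1
```

So $\mathrm{HM} \le \mathrm{GM} \le \mathrm{AM}$ holds, and the summary stays conservative whenever one of the two components is small.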

Finally, in my opinion we should not care about any $F_\beta$ score at all; see [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) The same problems apply to sensitivity and specificity, and indeed to all evaluation metrics that rely on hard classifications, so also to all $F_\beta$ scores. Instead, use probabilistic classifications, and evaluate these using proper scoring rules.
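As an illustration of that last point, here is a small sketch (labels and predicted probabilities are made up for illustration) scoring probabilistic predictions with the Brier score, a strictly proper scoring rule, via scikit-learn:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical ground-truth labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 1])
p_hat = np.array([0.10, 0.40, 0.35, 0.80, 0.90])

# Brier score = mean squared difference between predicted probability and
# outcome; lower is better. It rewards well-calibrated probabilities rather
# than hard 0/1 classifications.
print(brier_score_loss(y_true, p_hat))  # 0.1285
```

Unlike an $F_\beta$ score, no classification threshold is needed, so the evaluation does not depend on an arbitrary cutoff.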

Stephan Kolassa