Suppose I have N human classifiers who can only predict 0 or 1 (not probabilistically, and disregarding their own uncertainty: they either know or they don't), and each yields different precision/recall metrics on the same dataset. Is there anything wrong with saying the average precision is simply the arithmetic mean?
$$\frac{0.95 + 0.82 + 0.92}{3} \approx 0.90$$
Or is it "better" to compute a final score from the majority vote (and if so, why? Doesn't that add extra steps for calculating model uncertainty)?
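To make the comparison concrete, here is a rough sketch of the two aggregation strategies I mean (the labels and predictions are synthetic, and I'm assuming scikit-learn's `precision_score`):

```python
import numpy as np
from sklearn.metrics import precision_score

# Synthetic ground truth and three binary (0/1) classifiers,
# simulated as noisy copies of the true labels (illustrative only).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
preds = [np.where(rng.random(100) < 0.85, y_true, 1 - y_true)
         for _ in range(3)]

# Strategy 1: arithmetic mean of each classifier's precision.
per_clf = [precision_score(y_true, p) for p in preds]
mean_precision = np.mean(per_clf)

# Strategy 2: precision of the majority-vote ensemble.
votes = np.sum(preds, axis=0)        # number of 1-votes per sample
majority = (votes >= 2).astype(int)  # at least 2 of 3 classifiers say 1
vote_precision = precision_score(y_true, majority)

print("per-classifier precision:", [f"{p:.2f}" for p in per_clf])
print(f"mean of precisions:      {mean_precision:.2f}")
print(f"majority-vote precision: {vote_precision:.2f}")
```

The two numbers generally differ, since the majority vote produces a single new set of predictions rather than averaging metrics of the individual classifiers.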