Weighting for precision and recall

Question

I want to integrate the notion of weighting into an evaluation. Is it appropriate/correct to calculate precision and recall scores by applying a weight to each true positive, false positive and false negative? In my case, the items have a ranking and/or an associated value. For example:

           rank  value  test1   test2   test3
           1     99.3           x
correct    2     87.2    x      x
           3     66.9    x      x
           -                        
incorrect  4     33.1                   x
           5     12.8    x              x

Values in ranks 1 to 3 are correct, while those in ranks 4 and 5 are incorrect. My question is: can we calculate TP, FP and FN by summing the values of the corresponding items for a given test result? For this toy example this would yield:

test1: TP = 87.2 + 66.9 = 154.1, FP = 12.8, FN = 99.3
       precision = 154.1 / (154.1 + 12.8) = 0.92
       recall = 154.1 / (154.1 + 99.3) = 0.61
       f-score = 2 (0.92 * 0.61) / (0.92 + 0.61) = 0.73

test2: TP = 99.3 + 87.2 + 66.9 = 253.4, FP = 0, FN = 0
       precision = 253.4 / (253.4 + 0) = 1.0
       recall = 253.4 / (253.4 + 0) = 1.0
       f-score = 2 (1.0 * 1.0) / (1.0 + 1.0) = 1.0

test3: TP = 0, FP = 33.1 + 12.8 = 45.9, FN = 99.3 + 87.2 + 66.9 = 253.4
       precision = 0 / (0 + 45.9) = 0
       recall = 0 / (0 + 253.4) = 0
       f-score = 2 (0 * 0) / (0 + 0) = 0/0, undefined (taken as 0 by convention)

Which all seems fine, but is it mathematically sound? Are there cases where this would fall apart or provide unreliable results?
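The toy calculation above can be reproduced with a minimal Python sketch. It assumes the weight of each item is its "value" column and that zero denominators map to a score of 0 (a common convention, but an assumption here). The names `values`, `correct`, `tests` and `weighted_prf` are hypothetical, chosen only for illustration.

```python
# Weights taken from the "value" column of the toy table, keyed by rank.
values = {1: 99.3, 2: 87.2, 3: 66.9, 4: 33.1, 5: 12.8}
correct = {1, 2, 3}  # ranks labelled "correct"

# Ranks marked with an "x" in each test column.
tests = {
    "test1": {2, 3, 5},
    "test2": {1, 2, 3},
    "test3": {4, 5},
}


def weighted_prf(selected, correct, values):
    """Return (precision, recall, f-score) using summed values as TP/FP/FN."""
    tp = sum(values[r] for r in selected & correct)
    fp = sum(values[r] for r in selected - correct)
    fn = sum(values[r] for r in correct - selected)
    # Assumption: a zero denominator yields a score of 0.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    # Equivalent form of the F-score that avoids 0/0 when precision = recall = 0.
    denom = 2 * tp + fp + fn
    f_score = 2 * tp / denom if denom > 0 else 0.0
    return precision, recall, f_score


results = {name: weighted_prf(sel, correct, values) for name, sel in tests.items()}
for name, (p, r, f) in results.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

Writing the F-score as 2·TP / (2·TP + FP + FN) gives the same number as the harmonic-mean form whenever the latter is defined, and it stays defined for test3, where precision and recall are both 0.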

Do not use accuracy to evaluate a classifier: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352), [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352), [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352). The *exact* same arguments apply to precision and recall. — Stephan Kolassa, Jul 08 '19 at 10:24
Thanks very much for the links. I see in your linked post that you say "A scoring rule is a mapping that takes a probabilistic prediction q̂ and an outcome y to a loss...". Maybe I should have specified that I am not evaluating a probabilistic classifier, but rather the presence or absence of certain ranked words (correct/incorrect) in several lists (the tests in my example), so I'm not sure whether it can be considered as making a probabilistic prediction, nor what scoring rule would be appropriate for my task. Any suggestions welcome. — ongenz, Jul 08 '19 at 12:12
If I understand correctly, then you have documents that should contain certain words (and should not contain others), and your KPI would measure whether a mandatory word is contained or not, and whether a forbidden word is contained or not, and you plan on using precision/recall to do this. Correct? If so, I think we might be better able to help you (possibly) if you could give us a little more information on what your documents/lists and words actually are, and what you plan on doing with your KPI. — Stephan Kolassa, Jul 08 '19 at 12:21
That's the general idea I guess. I posted a "dummy" example with a better explanation of my problem [here](https://stats.stackexchange.com/q/414731/217138). — ongenz, Jul 08 '19 at 13:21
