How to calculate F1, Precision, and Recall for Multi-Label Multi-Classification

Question

I have a dataset for a predictive model as follows:

Sample1	Sample2	Sample3	Sample4	Red	Yellow	Blue	Green	White	Black	Orange
65	21	55	40	0	0	1	0	1	0	0
31	40	44	30	0	0	0	0	0	0	0
33	44	56	66	1	0	0	1	0	0	1
63	77	57	43	0	0	0	0	0	0	0
33	26	54	45	1	1	0	1	0	0	0
84	23	44	43	0	0	1	0	0	1	1
24	31	56	30	0	1	0	1	0	0	0

where the colour columns are the target labels.

I found that the recall score for:

Reds only was 70%
Yellows only was 50%
Greens only was 80%
Whites only was 60%
Blacks only was 70%
Orange only was 80%

I found that the precision for:

Reds only was 60%
Yellows only was 40%
Greens only was 70%
Whites only was 50%
Blacks only was 60%
Orange only was 90%

I found that the F1 score for:

Reds only was 65%
Yellows only was 45%
Greens only was 75%
Whites only was 55%
Blacks only was 65%
Orange only was 85%

How can I find Recall, Precision, and F1 score for all colours?

score -1 · Answer 1 · answered Nov 25 '21 at 06:46

All of F1, recall, precision (and others) rely crucially on two-class classification. Essentially, they need a notion of true/false positive/negative, which only makes sense if you have one target class and "everything else".
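For reference, these are the standard definitions, with TP, FP and FN denoting the true positives, false positives and false negatives counted for one target class against "everything else":

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```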

Thus, in a multiclass scenario, you can assess (say) the F1 score for one of your classes, which is then the target class, while everything else is the non-target class. You can of course repeat this exercise for each of your classes separately. This is exactly the output you are getting.
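As a minimal sketch of this per-class exercise, the following plain Python treats each colour column in turn as the target class. The true labels are the colour columns from the question's table. The predictions in `y_pred` are invented purely for illustration, and returning 0.0 when a denominator is zero is a convention I chose here, not something the question specifies.

```python
LABELS = ["Red", "Yellow", "Blue", "Green", "White", "Black", "Orange"]

# True labels: the colour columns of the question's table.
y_true = [
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 0, 0],
]

# Hypothetical predictions, made up for this example only.
y_pred = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 0],
]


def per_label_scores(y_true, y_pred, col):
    """Score column `col` as the target class against everything else."""
    pairs = [(t[col], p[col]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumed convention: a score with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


scores = {name: per_label_scores(y_true, y_pred, i) for i, name in enumerate(LABELS)}
for name, (p, r, f) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Each entry of `scores` is one precision/recall/F1 triple per colour, which is the same kind of per-class output listed in the question.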

Thus, there is simply no notion of "overall" F1, precision, recall etc. What you can do is calculate the average of these KPIs across classes, possibly weighted by how often each target class appears in your test set.
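A rough sketch of both kinds of average, using the per-colour numbers reported in the question (no scores were reported for Blue, so it is left out). The `support` counts are hypothetical, since the question does not say how often each colour occurs in the test set. Replace them with your real counts.

```python
# Per-colour scores as reported in the question.
recall = {"Red": 0.70, "Yellow": 0.50, "Green": 0.80,
          "White": 0.60, "Black": 0.70, "Orange": 0.80}
precision = {"Red": 0.60, "Yellow": 0.40, "Green": 0.70,
             "White": 0.50, "Black": 0.60, "Orange": 0.90}
f1 = {"Red": 0.65, "Yellow": 0.45, "Green": 0.75,
      "White": 0.55, "Black": 0.65, "Orange": 0.85}

# Hypothetical number of test samples carrying each colour (assumption).
support = {"Red": 20, "Yellow": 10, "Green": 30,
           "White": 10, "Black": 10, "Orange": 20}


def unweighted_average(scores):
    """Plain mean over classes: every class counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, support):
    """Mean over classes, weighted by how often each class appears."""
    total = sum(support[name] for name in scores)
    return sum(scores[name] * support[name] for name in scores) / total


for label, scores in [("recall", recall), ("precision", precision), ("F1", f1)]:
    print(f"{label}: unweighted={unweighted_average(scores):.3f} "
          f"weighted={weighted_average(scores, support):.3f}")
```

With equal support for every colour the two averages coincide. They diverge as the class frequencies become more unbalanced.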

Note that every single criticism of accuracy in the following thread applies equally to F1, precision, recall etc.: "Why is accuracy not the best measure for assessing classification models?" Specifically, optimizing any of these will give you biased predictions of the true probabilities of class memberships, and suboptimal decisions, and the same applies to optimizing weighted or unweighted averages of these KPIs. Instead, use probabilistic classifications and assess these using proper scoring rules - and note also that proper scoring rules have no problems whatsoever with multiclass situations.
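To make this concrete, here is a minimal sketch of two common proper scoring rules, the Brier score and the log loss, applied to per-colour predicted probabilities. It treats each colour as a separate 0/1 outcome and averages over all samples and colours, which is one possible way to handle the multi-label setup in the question. The probabilities in `y_prob` are invented for illustration.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    n = sum(len(row) for row in y_true)
    return sum((p - t) ** 2
               for t_row, p_row in zip(y_true, y_prob)
               for t, p in zip(t_row, p_row)) / n


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the 0/1 outcomes under the predictions."""
    total, n = 0.0, 0
    for t_row, p_row in zip(y_true, y_prob):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            n += 1
    return total / n


# First two rows of the question's table (colour columns only).
y_true = [
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

# Hypothetical predicted probabilities, one per colour (assumption).
y_prob = [
    [0.1, 0.2, 0.8, 0.1, 0.6, 0.1, 0.2],
    [0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1],
]

print(f"Brier score: {brier_score(y_true, y_prob):.4f}")
print(f"Log loss:    {log_loss(y_true, y_prob):.4f}")
```

Lower is better for both scores, and neither requires you to pick a classification threshold first.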
