5

I have developed a tool that recognizes a set of six classes, and I tested and evaluated its ability to recognize these classes using the F-score (aka F-measure). I then tested two other tools that recognize the same set of six classes and evaluated them on the same test set on which my tool was evaluated. The following table shows the F-measures calculated for the tools on the six classes (not the actual values):

        | My tool | Tool A  | Tool B
------------------------------------
class 1 |  0.431  |  0.297  |  0.327
class 2 |  0.388  |  0.348  |  0.334
class 3 |  0.979  |  0.826  |  0.790
class 4 |  0.290  |  0.389  |  0.238
class 5 |  0.990  |  0.730  |  0.642
class 6 |  0.886  |  0.516  |  0.566

Since the tools are all tested on the same test set, I cannot use the Kruskal–Wallis or Mann–Whitney U test, as the data fails the independence assumption (as suggested by the accepted answer of THIS question). What test should I use to check whether my tool is significantly different from / better than the other two tools (pair-wise)?

EDIT: I need the test to be performed (pair-wise) on accuracy measures such as the F-measure, not on the raw result data used to calculate them.

EDIT 2: The problems I am trying to solve are binary classification problems, and I have six such problems, each of which is handled and tested separately. Each class type has its own separate test dataset, which is used to test how accurate the three tools are at recognizing the corresponding class type.

  • I encourage readers who are searching for an answer to the same question I asked above to read the following article: Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, pp. 1-30. – PatternRecognition Mar 03 '16 at 18:55

2 Answers

1

You can probably produce a list of (predicted class, true class) pairs for each tool instead of just the F-measure. In that case I would suggest using McNemar's significance test to check for improvements. A good paper that describes how to perform this test is

Salzberg, S., On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery, 1 (1997).
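
For concreteness, here is a minimal sketch of how that contingency table and test could be set up in Python with statsmodels, assuming the per-example (predicted, true) information were available; the two correctness arrays below are made-up placeholders, not the OP's data:

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # 1 = tool predicted the true class, 0 = it did not (one entry per test example)
    my_tool_correct = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
    tool_a_correct  = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

    # 2x2 table of agreement/disagreement between the two tools
    table = np.array([
        [np.sum((my_tool_correct == 1) & (tool_a_correct == 1)),   # both correct
         np.sum((my_tool_correct == 1) & (tool_a_correct == 0))],  # only my tool correct
        [np.sum((my_tool_correct == 0) & (tool_a_correct == 1)),   # only Tool A correct
         np.sum((my_tool_correct == 0) & (tool_a_correct == 0))],  # both wrong
    ])

    # exact=True uses the binomial test on the discordant cells,
    # which is the safer choice for small counts
    result = mcnemar(table, exact=True)
    print(result.statistic, result.pvalue)

The test only uses the discordant cells (examples that one tool got right and the other got wrong), which is exactly the paired, per-example information that the per-class F-measures alone cannot provide.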

Dejan
  • Thanks for your answer. What you suggest is, unfortunately, not possible for some problem-specific reasons. – PatternRecognition Dec 29 '15 at 18:00
  • @PatternRecognition, it would probably help if you could provide whatever specific details about your situation exist that preclude possible solutions. – gung - Reinstate Monica Dec 30 '15 at 01:43
  • @gung Thanks for your comment. One main reason is simply that this pair information has not been recorded, and re-running the tests is not possible time- and resource-wise. So, I need to do whatever I can with the reported F-measures. – PatternRecognition Dec 30 '15 at 09:11
  • One must add that the McNemar test is only for **binary** classifiers, but the OP states that the problem is **multi-class** – Jacques Wainer Jan 02 '16 at 14:34
  • @JacquesWainer We can check for statistical improvements with McNemar's test for each class separately, by making the corresponding contingency tables. Moreover, we can make an overall contingency table (including all classes) and check for overall statistical improvements in the multi-class case. – Dejan Jan 02 '16 at 15:00
  • @Petar I understand using McNemar for each class, but what test to use on the full confusion matrix/contingency table? – Jacques Wainer Jan 02 '16 at 16:53
  • @JacquesWainer The problem is, in fact, binary classification, and I have six such binary classification problems, each of which is handled and tested separately. – PatternRecognition Jan 02 '16 at 17:22
  • @PatternRecognition Then in your table, the row names should be dataset 1-6 instead of class 1-6, right? When you write class 1-6 it suggests that you are solving a multi-class problem where, for example, each row represents the F-measure of a one-vs.-rest binary classifier. Anyway, in the multi-class case or in your case, it is possible to create a 2x2 contingency matrix over all datasets (# of instances Tool 1 and Tool 2 classified correctly across all datasets, ...) and perform McNemar's test. – Dejan Jan 02 '16 at 20:31
  • @Petar I have clarified this in **EDIT 2**. Unfortunately, I have no option but to rely solely on the F-measures for comparisons. – PatternRecognition Jan 02 '16 at 20:57
1

1) In your approach, you want to perform a significance test on some quality metric (in this case the F-measure) for each of the classes of the problem. The much more common approach is to perform the statistical significance test on sub-samples of the data. In that case you would perform k-fold cross-validation and measure a quality metric on each test fold, for each of the three tools.

One problem is that you want the quality metric to be the F-measure, and it is not obvious how to extend the F-measure to a multi-class problem (but I am sure someone has already solved this problem). Once you decide how to obtain a single quality measure for each sub-sample and each tool, the canonical thing to do is to perform a Friedman test (your data are paired), and if the test results in a p-value low enough (traditionally 0.05 or below), then perform Wilcoxon signed-rank pairwise comparisons followed by a p-value adjustment (for the multiple comparisons). In this case you want to know whether your tool is better than the other two, so there are only 2 pairwise comparisons, and I would use the Bonferroni correction (instead of using 5% as the p-value threshold, use 2.5%).
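
A rough sketch of this per-fold procedure, assuming each tool can be wrapped as a scikit-learn estimator; the three classifiers and the synthetic dataset below are placeholders for illustration, not the OP's actual tools or data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=500, random_state=0)          # stand-in dataset
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)    # same folds for every tool

    tools = {
        "my tool": LogisticRegression(max_iter=1000),
        "tool A": DecisionTreeClassifier(random_state=0),
        "tool B": GaussianNB(),
    }

    # one F-measure per test fold per tool -> paired samples for the Friedman test
    fold_f1 = {name: cross_val_score(clf, X, y, cv=cv, scoring="f1")
               for name, clf in tools.items()}
    for name, scores in fold_f1.items():
        print(name, np.round(scores, 3))

Because the same cv splitter (with a fixed random_state) is reused for every tool, all tools are scored on identical folds, which is what makes the fold-wise scores paired.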

2) But from a statistical significance point of view, there is nothing wrong with performing the test on the data as you presented it. Each tool generates 6 sets of measures, and you can perform the procedure above on these sets of measures: Friedman + Wilcoxon signed-rank + Bonferroni adjustment. I do not see the problem you stated, that the 6 sets of measures for each tool are not independent. To a first approximation, the 6 values for each tool are independent. In terms of your table, the data in each column are independent (the rows are not - they are paired, and that is why you should use the Friedman and Wilcoxon signed-rank tests).
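
As an illustration of point 2), here is a small sketch that applies exactly this procedure to the F-measures from the table in the question, using SciPy; the printed p-values are simply what SciPy computes for those six paired values and should not be read as a general verdict:

    from scipy.stats import friedmanchisquare, wilcoxon

    # F-measures from the question's table, one value per class
    my_tool = [0.431, 0.388, 0.979, 0.290, 0.990, 0.886]
    tool_a  = [0.297, 0.348, 0.826, 0.389, 0.730, 0.516]
    tool_b  = [0.327, 0.334, 0.790, 0.238, 0.642, 0.566]

    stat, p = friedmanchisquare(my_tool, tool_a, tool_b)
    print("Friedman: chi2 = %.3f, p = %.4f" % (stat, p))

    # only if the Friedman test rejects do we run the two pairwise comparisons,
    # each judged against the Bonferroni-adjusted threshold 0.05 / 2 = 0.025
    if p < 0.05:
        for name, rival in [("Tool A", tool_a), ("Tool B", tool_b)]:
            w, p_pair = wilcoxon(my_tool, rival)
            print("my tool vs %s: W = %.1f, p = %.4f (%s at 0.025)"
                  % (name, w, p_pair,
                     "significant" if p_pair < 0.025 else "not significant"))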

Jacques Wainer
  • Thanks for your answer. Since the problems I am trying to solve involve six binary classification problems each of which is handled and tested separately, as I have clarified in a comment above, does this change your answer? – PatternRecognition Jan 02 '16 at 17:26
  • That is point 2) above. As far as I know, you can go ahead and use Friedman + Wilcoxon signed-rank + Bonferroni on the F-measures for the 6 classes. – Jacques Wainer Jan 02 '16 at 21:09
  • Thank you very much. One last question, please. If I find my tool better than one tool but not the other, I think my conclusion can be as such, cannot it? If it can, do I have to apply Bonferroni correction in such a case to draw such a conclusion? – PatternRecognition Jan 03 '16 at 18:30
  • The Friedman will tell you if all are the same or not. If the Friedman has low p-value than it is not true (or it is unlikelly) that all are the same. Then you need the two wilcoxon comparisons (your tool vs tool A and your tool vs tool B) - but for these comparisons you need to use a p-value of 0.025 and not 0.05 - that is the bonferroni correction. With that threshold it may be the case that one comparison is below the threshold and thus the difference is significant, and the other comparison not, and the difference is not significant – Jacques Wainer Jan 05 '16 at 12:56