I have developed a tool that recognizes a set of six classes, and I evaluated its recognition performance using the F-score (also known as the F-measure). I then ran two other tools that recognize the same six classes on the same test set on which my tool was evaluated, and scored them the same way. The following table shows the F-measures for the three tools on the six classes (placeholder values, not the actual ones):
|         | My tool | Tool A | Tool B |
|---------|---------|--------|--------|
| class 1 | 0.431   | 0.297  | 0.327  |
| class 2 | 0.388   | 0.348  | 0.334  |
| class 3 | 0.979   | 0.826  | 0.790  |
| class 4 | 0.290   | 0.389  | 0.238  |
| class 5 | 0.990   | 0.730  | 0.642  |
| class 6 | 0.886   | 0.516  | 0.566  |
Since the tools are all tested on the same test set, I cannot use the Kruskal–Wallis or Mann–Whitney U test, as the data fail the independence assumption (as suggested by the accepted answer to THIS question). What test should I use to check whether my tool is significantly different from/better than the other two tools (pair-wise)?
EDIT: I need the test to be performed (pair-wise) on accuracy measures such as the F-measure, not on the raw result data used to calculate them.
EDIT 2: The problems I am trying to solve are binary classification problems, and I have six such problems, each of which is handled and tested separately. Each class type has its own separate test dataset, which is used to test how accurately the three tools recognize the corresponding class type.
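To make the pairwise, F-measure-based comparison concrete, here is a minimal sketch of one candidate paired test (not a claim that it is the right choice): an exact sign-flip permutation test on the per-class F-measure differences between my tool and Tool A. With six paired values there are only 2^6 = 64 sign assignments, so the permutation distribution can be enumerated exactly. The values are the placeholders from the table above.

```python
from itertools import product

# Placeholder F-measures per class, taken from the table above.
my_tool = [0.431, 0.388, 0.979, 0.290, 0.990, 0.886]
tool_a  = [0.297, 0.348, 0.826, 0.389, 0.730, 0.516]

def paired_signflip_pvalue(x, y):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis of no difference between the tools, each
    per-class difference is equally likely to be positive or negative,
    so we enumerate all 2^n sign assignments and count how many give a
    test statistic at least as extreme as the observed one.
    """
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small slack for float round-off
            count += 1
        total += 1
    return count / total

print(paired_signflip_pvalue(my_tool, tool_a))  # 0.09375 with these placeholder values
```

The same function would be called once per pairwise comparison (my tool vs. Tool A, my tool vs. Tool B), with a multiple-comparison correction applied to the resulting p-values if needed. The Wilcoxon signed-rank test is the rank-based analogue of this idea, but with only six pairs an exact permutation approach avoids relying on asymptotic approximations.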