For a university seminar I'm reviewing a paper that proposes a new feature selection algorithm. The authors evaluated their algorithm by applying it, alongside two other feature selection algorithms, to 17 datasets. To compare them, they used the resulting feature subsets for classification and documented the accuracy and AUROC achieved by each algorithm.
To see whether the proposed algorithm does better than the other two, I separate the results into three groups (one per algorithm) of 17 instances each (one per dataset). Then I use the Kruskal-Wallis test to check whether there is a significant difference between the algorithms in the distributions of the given quality measures, as sketched below.
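For concreteness, this is roughly what I'm computing (Python with SciPy). The accuracy arrays here are random placeholders standing in for the paper's documented values, not actual results:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Placeholder accuracies: one value per dataset (17) for each of the three
# feature selection algorithms. In practice these would be the accuracy
# (or AUROC) values reported in the paper.
acc_proposed  = rng.uniform(0.7, 0.9, size=17)  # proposed algorithm
acc_baseline1 = rng.uniform(0.7, 0.9, size=17)  # first comparison algorithm
acc_baseline2 = rng.uniform(0.7, 0.9, size=17)  # second comparison algorithm

# Kruskal-Wallis H-test on the three groups of 17 values each
H, p = kruskal(acc_proposed, acc_baseline1, acc_baseline2)
print(f"H = {H:.3f}, p = {p:.4f}")
```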
My question is: is this approach valid? If not, how would you approach this task?