For a university seminar I'm reviewing a paper that proposes a new feature selection algorithm. The authors evaluated their algorithm by applying it, alongside two other feature selection algorithms, to 17 datasets. To compare them, they used the resulting feature subsets for classification and documented the accuracy and AUROC achieved by each algorithm.
To see whether the proposed algorithm does better than the other two, I separate the results into three groups (one per algorithm) of 17 instances each (one per dataset). Then I use the Kruskal-Wallis test to check whether there is a significant difference between the algorithms in the distributions of the given quality measures, as sketched below.
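For concreteness, this is roughly what I'm computing (Python with SciPy). The accuracy arrays here are random placeholders standing in for the paper's documented values, not actual results:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Placeholder accuracies: one value per dataset (17) for each of the three
# feature selection algorithms. In practice these would be the accuracy
# (or AUROC) values reported in the paper.
acc_proposed  = rng.uniform(0.7, 0.9, size=17)  # proposed algorithm
acc_baseline1 = rng.uniform(0.7, 0.9, size=17)  # first comparison algorithm
acc_baseline2 = rng.uniform(0.7, 0.9, size=17)  # second comparison algorithm

# Kruskal-Wallis H-test on the three groups of 17 values each
H, p = kruskal(acc_proposed, acc_baseline1, acc_baseline2)
print(f"H = {H:.3f}, p = {p:.4f}")
```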
My question is: is this approach valid? If not, how would you approach this task?