
I uploaded the data and then used bootstrapping to draw 10 samples from the original data, each with the same length as the original. For each sample, I used 7 distance metrics and calculated accuracy and other performance measures.

First, I am trying to compare the 7 accuracy columns (one per distance metric) using the Friedman test.

Accuracies Matrix
           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,] 0.9753954 0.9771529 0.9789104 0.9789104 0.9806678 0.9771529 0.9806678
 [2,] 0.9736380 0.9806678 0.9806678 0.9806678 0.9841828 0.9771529 0.9771529
 [3,] 0.9753954 0.9841828 0.9806678 0.9771529 0.9806678 0.9771529 0.9718805
 [4,] 0.9771529 0.9859402 0.9789104 0.9789104 0.9824253 0.9824253 0.9841828
 [5,] 0.9736380 0.9806678 0.9771529 0.9824253 0.9824253 0.9806678 0.9771529
 [6,] 0.9701230 0.9789104 0.9736380 0.9806678 0.9841828 0.9824253 0.9753954
 [7,] 0.9912127 0.9912127 0.9859402 0.9859402 0.9859402 0.9841828 0.9824253
 [8,] 0.9789104 0.9806678 0.9859402 0.9859402 0.9841828 0.9806678 0.9789104
 [9,] 0.9806678 0.9841828 0.9876977 0.9824253 0.9841828 0.9859402 0.9841828
[10,] 0.9789104 0.9771529 0.9753954 0.9789104 0.9666081 0.9613357 0.9630931

I got the following result:

    Friedman rank sum test

data:  Datam
Friedman chi-squared = 16.252, df = 6, p-value = 0.01246
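
For reference, output in this format comes from base R's `friedman.test`. A minimal, self-contained sketch of the call (re-entering the matrix by hand; in the session above it was presumably already stored as `Datam`):

```r
# Read the 10x7 accuracies matrix shown above
# (rows = bootstrap samples, columns = distance metrics)
Input = ("
0.9753954 0.9771529 0.9789104 0.9789104 0.9806678 0.9771529 0.9806678
0.9736380 0.9806678 0.9806678 0.9806678 0.9841828 0.9771529 0.9771529
0.9753954 0.9841828 0.9806678 0.9771529 0.9806678 0.9771529 0.9718805
0.9771529 0.9859402 0.9789104 0.9789104 0.9824253 0.9824253 0.9841828
0.9736380 0.9806678 0.9771529 0.9824253 0.9824253 0.9806678 0.9771529
0.9701230 0.9789104 0.9736380 0.9806678 0.9841828 0.9824253 0.9753954
0.9912127 0.9912127 0.9859402 0.9859402 0.9859402 0.9841828 0.9824253
0.9789104 0.9806678 0.9859402 0.9859402 0.9841828 0.9806678 0.9789104
0.9806678 0.9841828 0.9876977 0.9824253 0.9841828 0.9859402 0.9841828
0.9789104 0.9771529 0.9753954 0.9789104 0.9666081 0.9613357 0.9630931
")
Datam = as.matrix(read.table(textConnection(Input)))

# Friedman rank sum test: each row is a block, each column a group
friedman.test(Datam)
```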

That means there is a significant difference between the accuracy groups. So I used the function posthoc.friedman.nemenyi.test from R's PMCMR package to determine which pairs are significantly different, and I got the following:

    Pairwise comparisons using Nemenyi multiple comparison test 
             with q approximation for unreplicated blocked data 

data:  Accuracies Matrix 

     [,1]  [,2]  [,3]  [,4]  [,5]  [,6] 
[1,] 0.088 -     -     -     -     -    
[2,] 0.310 0.998 -     -     -     -    
[3,] 0.185 1.000 1.000 -     -     -    
[4,] 0.027 1.000 0.958 0.991 -     -    
[5,] 0.804 0.830 0.987 0.946 0.576 -    
[6,] 0.987 0.436 0.804 0.645 0.207 0.996

P value adjustment method: none  

How do I interpret the result of posthoc.friedman.nemenyi.test?

jeza
  • What are the actual data? How is the `Accuracies Matrix` set up? – gung - Reinstate Monica Sep 26 '18 at 19:53
  • @gung♦, I calculated the accuracies from different models, then I put them in a matrix. – jeza Sep 26 '18 at 20:03
  • What are they? Are they outputs from some models? For what? – gung - Reinstate Monica Sep 26 '18 at 20:26
  • @gung♦, yes, they are accuracy outputs from different models. – jeza Sep 26 '18 at 20:29
  • What are the models? What are the models' accuracies? Are they classifying objects that they get right or wrong? Something else? What is your situation? What are your data? What are you trying to do? – gung - Reinstate Monica Sep 26 '18 at 20:32
  • @gung♦, this is my data D – jeza Sep 26 '18 at 20:38
  • So you have 7 knn classifiers based on different distance metrics & you want to assess which metric leads to the best model, is that correct? If so, you shouldn't be doing this. – gung - Reinstate Monica Sep 26 '18 at 20:41
  • @gung♦, yes, this is what I want. I am also using other performance measures, such as the Brier score and F1 score. – jeza Sep 26 '18 at 20:45
  • It isn't just about Brier vs accuracy (which is certainly relevant). You don't want to compare aggregate percentages, you want to compare at the level of the individual patterns (by analogy, see my answer [here](https://stats.stackexchange.com/a/89415/7290)). – gung - Reinstate Monica Sep 26 '18 at 20:48
  • @gung♦, you mean that I need to compare McNemar's results? If yes, how do I interpret it? For example, McNemar's Test P-Value: 0.1227 – jeza Sep 26 '18 at 20:55
  • I suspect what you need is to fit a [Rasch model](https://en.wikipedia.org/wiki/Rasch_model), although I don't know that material. Friedman's test (followed by post-hoc tests) is going to be incredibly low-powered. Given that Brier scores are more informative intrinsically, you might as well just skip this. – gung - Reinstate Monica Sep 26 '18 at 20:55
  • You could do a set of McNemar's tests, but you'll need 21 of them. – gung - Reinstate Monica Sep 26 '18 at 20:56
  • In the accuracies matrix, you have 10 models for 10 different distance measures. What are the 7 columns? According to the link there are 150 patterns in the validation set. – gung - Reinstate Monica Sep 26 '18 at 20:59
  • @gung♦, for example, the first value in the first row and column is the accuracy for the first distance and first model; the second column is for another distance metric, and so on. – jeza Sep 26 '18 at 21:03
  • @gung♦, I can find McNemar's tests from each model. – jeza Sep 26 '18 at 21:04
  • Those McNemar's tests are different from the McNemar's tests you need. But I gather now you are combining 10 different models (what, different values for k?) w/ each of 7 different distance metrics. Is that right? That's different again. – gung - Reinstate Monica Sep 26 '18 at 21:08
  • @gung♦, what I did is the following: I uploaded the data, then I used bootstrapping to get 10 different samples from the original data, but with the same length as the original. For each sample, I used 7 distance metrics, and I calculated accuracy and other performance measures. That is it. – jeza Sep 26 '18 at 21:14
  • I see, that makes a little more sense, but I don't think that's really worth doing. You should have enough variability in the original data. You don't need to try to estimate a sampling distribution that way. I would use the original dataset & use more standard methods to evaluate the models. – gung - Reinstate Monica Sep 27 '18 at 00:33
  • @gung♦, because my task is to see if there is a significant difference between accuracies across distance metrics. So is my approach OK, and if yes, how do I interpret the result I mentioned in the question? – jeza Sep 27 '18 at 10:10
  • Are the data logically paired? That is, each row makes sense as a block? – Sal Mangiafico Sep 30 '18 at 15:16
  • @Sal Mangiafico, bootstrapping was used to generate samples from this data. DATA – jeza Sep 30 '18 at 15:20
  • In any case, Friedman's Test would only make sense if the data are logically paired. That is, if it makes logical sense to treat each row as a block. For example, if each block represents one participant or one time point. – Sal Mangiafico Sep 30 '18 at 15:34

1 Answer


The problem is that, without informative row and column labels on the matrix, the results are difficult to understand. ‡

In the following, since there are no column labels, the columns will be labeled V1 to V7 by default. This will make it easy to evaluate the comparisons between them.

if(!require(PMCMR)){install.packages("PMCMR")}

Input =("
0.9753954 0.9771529 0.9789104 0.9789104 0.9806678 0.9771529 0.9806678
0.9736380 0.9806678 0.9806678 0.9806678 0.9841828 0.9771529 0.9771529
0.9753954 0.9841828 0.9806678 0.9771529 0.9806678 0.9771529 0.9718805
0.9771529 0.9859402 0.9789104 0.9789104 0.9824253 0.9824253 0.9841828
0.9736380 0.9806678 0.9771529 0.9824253 0.9824253 0.9806678 0.9771529
0.9701230 0.9789104 0.9736380 0.9806678 0.9841828 0.9824253 0.9753954
0.9912127 0.9912127 0.9859402 0.9859402 0.9859402 0.9841828 0.9824253
0.9789104 0.9806678 0.9859402 0.9859402 0.9841828 0.9806678 0.9789104
0.9806678 0.9841828 0.9876977 0.9824253 0.9841828 0.9859402 0.9841828
0.9789104 0.9771529 0.9753954 0.9789104 0.9666081 0.9613357 0.9630931
")
Matrix = as.matrix(read.table(textConnection(Input)))

Matrix

   ###             V1        V2        V3        V4        V5        V6        V7
   ### [1,] 0.9753954 0.9771529 0.9789104 0.9789104 0.9806678 0.9771529 0.9806678
   ### [2,] 0.9736380 0.9806678 0.9806678 0.9806678 0.9841828 0.9771529 0.9771529
   ### [3,] 0.9753954 0.9841828 0.9806678 0.9771529 0.9806678 0.9771529 0.9718805
   ### [4,] 0.9771529 0.9859402 0.9789104 0.9789104 0.9824253 0.9824253 0.9841828
   ### [5,] 0.9736380 0.9806678 0.9771529 0.9824253 0.9824253 0.9806678 0.9771529
   ### [6,] 0.9701230 0.9789104 0.9736380 0.9806678 0.9841828 0.9824253 0.9753954
   ### [7,] 0.9912127 0.9912127 0.9859402 0.9859402 0.9859402 0.9841828 0.9824253
   ### [8,] 0.9789104 0.9806678 0.9859402 0.9859402 0.9841828 0.9806678 0.9789104
   ### [9,] 0.9806678 0.9841828 0.9876977 0.9824253 0.9841828 0.9859402 0.9841828
   ### [10,] 0.9789104 0.9771529 0.9753954 0.9789104 0.9666081 0.9613357 0.9630931

library(PMCMR)

posthoc.friedman.nemenyi.test(Matrix)

   ###  Pairwise comparisons using Nemenyi multiple comparison test 
   ###           with q approximation for unreplicated blocked data 

   ### data:  Matrix 

   ###    V1    V2    V3    V4    V5    V6   
   ### V2 0.088 -     -     -     -     -    
   ### V3 0.310 0.998 -     -     -     -    
   ### V4 0.185 1.000 1.000 -     -     -    
   ### V5 0.027 1.000 0.958 0.991 -     -    
   ### V6 0.804 0.830 0.987 0.946 0.576 -    
   ### V7 0.987 0.436 0.804 0.645 0.207 0.996
   ###
   ### P value adjustment method: none

The output above is a table of p-values, each comparing two groups. If you are using p = 0.05 as your cutoff, the only significant comparison is V1 vs. V5 (p = 0.027). The rest of the p-values are all greater than 0.05.
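
The same conclusion can be pulled out programmatically with `which()` on the p-value matrix. Here is a base-R sketch, re-entering the p-values above by hand so it stands alone (in practice you would take the matrix straight from `posthoc.friedman.nemenyi.test(Matrix)$p.value`):

```r
# Lower-triangular matrix of Nemenyi p-values, copied from the output above
# (rows V2-V7, columns V1-V6)
PT <- matrix(NA_real_, nrow = 6, ncol = 6,
             dimnames = list(paste0("V", 2:7), paste0("V", 1:6)))
PT[lower.tri(PT, diag = TRUE)] <- c(
  0.088, 0.310, 0.185, 0.027, 0.804, 0.987,  # comparisons with V1
  0.998, 1.000, 1.000, 0.830, 0.436,         # comparisons with V2
  1.000, 0.958, 0.987, 0.804,                # comparisons with V3
  0.991, 0.946, 0.645,                       # comparisons with V4
  0.576, 0.207,                              # comparisons with V5
  0.996)                                     # comparison with V6

# Which entries fall below the 0.05 cutoff?
sig <- which(PT < 0.05, arr.ind = TRUE)
data.frame(group1  = colnames(PT)[sig[, "col"]],
           group2  = rownames(PT)[sig[, "row"]],
           p.value = PT[sig])
# only one row: V1 vs. V5, p = 0.027
```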

It may be useful to translate this matrix of p-values to a compact letter display. In this output, groups sharing a letter are not significantly different. For this I'll use the fullPTable function in the rcompanion package † and multcompLetters from multcompView.

if(!require(multcompView)){install.packages("multcompView")}
if(!require(PMCMR)){install.packages("PMCMR")}
if(!require(rcompanion)){install.packages("rcompanion")}

library(PMCMR)
library(rcompanion)
library(multcompView)

PT  = posthoc.friedman.nemenyi.test(Matrix)$p.value
PT1 = fullPTable(PT)
PT1

multcompLetters(PT1)

   ###    V1   V2   V3   V4   V5   V6   V7 
   ###   "a" "ab" "ab" "ab"  "b" "ab" "ab" 

V1 and V5 are the only two groups not sharing a letter.

Addition: PMCMRplus package

There are a few different post-hoc tests available for Friedman's test in the PMCMRplus package. The function names begin with frdAllPairs. The Nemenyi test appears to produce results similar to those above. For this example, it was necessary to add row labels to the matrix.

if(!require(PMCMRplus)){install.packages("PMCMRplus")}

library(PMCMRplus)

rownames(Matrix) = LETTERS[1:10]

frdAllPairsNemenyiTest(Matrix)

   # Pairwise comparisons using Nemenyi-Wilcoxon-Wilcox all-pairs test for a two-way balanced complete block design
   # 
   #    V1    V2    V3    V4    V5    V6   
   # V2 0.088 -     -     -     -     -    
   # V3 0.310 0.998 -     -     -     -    
   # V4 0.185 1.000 1.000 -     -     -    
   # V5 0.027 1.000 0.958 0.991 -     -    
   # V6 0.804 0.830 0.987 0.946 0.576 -    
   # V7 0.987 0.436 0.804 0.645 0.207 0.996
   # 
   # P value adjustment method: single-step

‡ Note: This answer addresses the primary question: conducting and interpreting the Nemenyi test. It does not weigh in on the discussion in the comments about whether the generation of these data makes sense or whether Friedman's test is the applicable test in this case.

† Caveat: I am the author of this package.

Sal Mangiafico
  • many thanks for your answer. Is this all I can say about the significant pairs? In other words, is V5 greater than V1, or what? What should I say about the significant pairs? – jeza Sep 29 '18 at 23:49
  • To make things simple, yes, V5 is greater than V1. But that's not really the correct interpretation of the Friedman test and post-hocs. A better interpretation is something like: within each row, V5 tends to be greater than V1. This considers the ranks of the data and not the values themselves. – Sal Mangiafico Sep 30 '18 at 00:08
  • OK, why not V1 greater than V5? – jeza Sep 30 '18 at 13:08
  • Two reasons: 1) the Friedman test doesn't look at the actual values, but at the values ranked relative to each other; 2) the Friedman test evaluates the differences between groups within each block. So... saying V1 is greater than V1 is one way to report the results simply, but it's not really what the test is testing for. – Sal Mangiafico Sep 30 '18 at 14:10
  • "So... saying V1 is greater than V1 is one way to report " is it a typo (two V1) – jeza Sep 30 '18 at 14:14
  • Right: V1 and V5. – Sal Mangiafico Sep 30 '18 at 14:48
  • last thing, is there a way to compare the significant pairs? I mean, for example, compare their medians so I can make sure of the result? – jeza Sep 30 '18 at 14:52
  • The Friedman test doesn't really compare medians. If you wanted to compare medians, you would use a different test. That being said, it makes sense to report medians with the Friedman test, at least in most usual circumstances. The medians make sense with your data: `apply(Matrix, 2, FUN = median)` and `plot(apply(Matrix, 2, FUN = median))` – Sal Mangiafico Sep 30 '18 at 15:12
  • Ah, OK, what will I get if medians are calculated? I mean, can I then say which is greater, or what? – jeza Sep 30 '18 at 15:16
  • I don't follow the question. You conducted the Nemenyi test to compare the groups, didn't you? If you want to compare medians per se, you would probably want to use a different test. – Sal Mangiafico Sep 30 '18 at 15:28
  • Yeah, I understand you. I mean, what are the benefits of comparing the medians in this case? Just for more information. Thanks – jeza Sep 30 '18 at 15:32
  • Reporting the medians is useful for your audience to get some sense of the differences among the groups. Likewise, showing a histogram or a box plot for each group is useful. None of these captures the way the Friedman test treats the data as blocked, or the way it handles ranks, so in extreme cases these methods could be misleading. But they're usually pretty helpful for the audience. – Sal Mangiafico Sep 30 '18 at 15:46
  • Here it becomes difficult for me to interpret. What do you think of `"abc" "ad" "d" "d" "abd" "abcd" "bc" "c"`? – jeza Oct 31 '18 at 11:36
  • I mean the compact letter display. – jeza Oct 31 '18 at 11:36
  • The underlying interpretation is "Treatments sharing a letter are not significantly different." In this case, the interpretation seems difficult. It may help to order your treatments by median or another relevant measure and rerun the analysis, so that those with "a" are, e.g., the greatest, and so on. – Sal Mangiafico Oct 31 '18 at 12:20
  • I discovered a mistake in the formula of posthoc.friedman.nemenyi.test. Then I contacted the author of PMCMR. He told me it should no longer be used and pointed me to his new package, PMCMRplus. However, I tried it but I do not know how it works. Could you please help me and edit your answer, for example, so I can understand? Thanks – jeza Nov 08 '18 at 13:49
  • I added a section with the PMCMRplus package. – Sal Mangiafico Nov 08 '18 at 16:34