
I am trying to use the McNemar–Bowker test to test whether two classifiers differ in performance. Since my input matrix is sparse and the sum of some pairs of symmetric cells is less than 10, I am trying to run the exact McNemar–Bowker test with rcompanion's nominalSymmetryTest, like this:

library(rcompanion)

data <- c( 0,0,0,0,0,0,0,0,0,0,
       23,253,35,0,0,0,0,0,0,0,
       9,299,1510,329,7,0,0,0,0,0,
       0,1,289,1193,136,3,0,0,0,0,
       0,0,35,403,4437,338,1,0,0,0,
       0,0,0,15,70,692,114,7,1,0,
       0,0,0,0,3,50,87,18,0,0,
       0,0,0,0,1,14,57,35,15,1,
       0,0,0,0,2,2,16,12,1,0,
       0,0,0,1,0,3,31,33,12,3)
rownames <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
colnames <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
ada_cat <- matrix(data, nrow = 10, ncol = 10, byrow = TRUE,
                  dimnames = list(rownames, colnames))

nominalSymmetryTest(ada_cat,
                    digits = 3,
                    MonteCarlo = TRUE,
                    exact = TRUE,
                    ntrial = 100000)

The results look like this:

$Global.test.for.symmetry
  Dimensions p.value
1    10 x 10      NA

$Pairwise.symmetry.tests
    Comparison  p.value p.adjust
1    1/1 : 2/2 2.38e-07 9.12e-07
2    1/1 : 3/3  0.00391 7.49e-03
3    1/1 : 4/4     <NA>       NA
4    1/1 : 5/5     <NA>       NA
5    1/1 : 6/6     <NA>       NA
6    1/1 : 7/7     <NA>       NA
7    1/1 : 8/8     <NA>       NA
8    1/1 : 9/9     <NA>       NA
9  1/1 : 10/10     <NA>       NA
10   2/2 : 3/3 2.12e-53 4.88e-52
11   2/2 : 4/4        1 1.00e+00
12   2/2 : 5/5     <NA>       NA
13   2/2 : 6/6     <NA>       NA
14   2/2 : 7/7     <NA>       NA
15   2/2 : 8/8     <NA>       NA
16   2/2 : 9/9     <NA>       NA
17 2/2 : 10/10     <NA>       NA
18   3/3 : 4/4    0.117 1.92e-01
19   3/3 : 5/5 1.51e-05 3.86e-05
20   3/3 : 6/6     <NA>       NA
21   3/3 : 7/7     <NA>       NA
22   3/3 : 8/8     <NA>       NA
23   3/3 : 9/9     <NA>       NA
24 3/3 : 10/10     <NA>       NA
25   4/4 : 5/5 1.11e-31 8.51e-31
26   4/4 : 6/6  0.00754 1.33e-02
27   4/4 : 7/7     <NA>       NA
28   4/4 : 8/8     <NA>       NA
29   4/4 : 9/9     <NA>       NA
30 4/4 : 10/10        1 1.00e+00
31   5/5 : 6/6  3.3e-43 3.80e-42
32   5/5 : 7/7    0.625 7.99e-01
33   5/5 : 8/8        1 1.00e+00
34   5/5 : 9/9      0.5 6.76e-01
35 5/5 : 10/10     <NA>       NA
36   6/6 : 7/7 6.33e-07 2.08e-06
37   6/6 : 8/8    0.189 2.90e-01
38   6/6 : 9/9        1 1.00e+00
39 6/6 : 10/10     0.25 3.59e-01
40   7/7 : 8/8 7.24e-06 2.08e-05
41   7/7 : 9/9 3.05e-05 7.02e-05
42 7/7 : 10/10 9.31e-10 5.35e-09
43   8/8 : 9/9    0.701 8.49e-01
44 8/8 : 10/10 4.07e-09 1.87e-08
45 9/9 : 10/10 0.000488 1.02e-03

$p.adjustment
  Method
1    fdr

$statistical.method
         Method
1 binomial test

I am having difficulties in understanding them. Can I conclude from these results that the difference in classifier performance is statistically significant/non-significant?

Thank you!

ILR

2 Answers


ILR.

At the time of writing, the nominalSymmetryTest function doesn't perform an exact version of the global (omnibus) test. The webpage (https://rcompanion.org/handbook/H_05.html) has not yet been updated to reflect this, which may be confusing, but the documentation for the function is correct.

As @IanCampbell noted, the McNemar-Bowker test will fail when there are zeros in certain places in the data matrix. If you switched to the exact=FALSE option, the function would invoke the mcnemar.test function, and still give you an NA result.

So what do you have? You have the results of the 2 x 2 McNemar tests from your larger data table. Since you used the exact=TRUE option, these are actually calculated with the binom.test function. In your case, for example, you have a significant difference from "1" to "2" (or vice versa; I don't know if rows or columns might be a "before" or "after"), and from "1" to "3", and so on.
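For example, the "1/1 : 2/2" pairwise p-value can be reproduced directly with binom.test, using the two discordant cells ada_cat[1, 2] = 0 and ada_cat[2, 1] = 23 (a sketch, assuming the ada_cat matrix from your question):

```r
# exact binomial test on the discordant pair: 0 "successes" out of 0 + 23 trials
binom.test(0, 0 + 23, p = 0.5)$p.value
# 2.384186e-07, matching the 2.38e-07 in the pairwise table
```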

I don't know of a statistical test like McNemar that will work with a matrix with so many zeros like this.

I also note that you have a large sample size. I would advise you not to rely too much on p-values, but also to look at some form of effect-size statistic. If you are relying on the 2 x 2 tables, it's easy to calculate the odds ratio. For example, if the change from "5" to "3" is 35 and the change from "3" to "5" is 7, the odds ratio is 5 or 0.2, depending on which direction you take as the reference.
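With the concrete counts from that example (the two cells come from your matrix, ada_cat[5, 3] and ada_cat[3, 5]):

```r
b <- 35  # changed from "5" to "3" (ada_cat[5, 3])
a <- 7   # changed from "3" to "5" (ada_cat[3, 5])
b / a    # odds ratio in one direction: 5
a / b    # and in the other direction: 0.2
```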

As a final comment, the fact that your categories are labeled "1" to "10" makes me suspect that you should be treating these data as ordered categorical (ordinal) and not nominal categorical, and you should be using a totally different test. (But I don't know without more information).
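If the categories are in fact ordinal, one possibility (only a sketch, treating the labels "1" to "10" as ordinal scores, and assuming the ada_cat matrix from your question) is a paired Wilcoxon signed-rank test on the reconstructed score pairs:

```r
# expand the contingency table into one (row score, column score) pair
# per observation, then test for a systematic shift between the two sets of ratings
idx    <- which(ada_cat > 0, arr.ind = TRUE)            # occupied cells
scores <- idx[rep(seq_len(nrow(idx)), ada_cat[idx]), ]  # one row per observation
wilcox.test(scores[, "row"], scores[, "col"], paired = TRUE)
```

Whether this is appropriate depends on whether the distances between categories are meaningful, which I can't judge without more information.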

Sal Mangiafico
  • Thank you so much for the explanation provided and for the heads-up regarding the p-value reliability! This has been very helpful! – ILR Jun 09 '21 at 16:50

As noted in the R Companion:

For a 2 x 2 table, the most common test for symmetry is McNemar's test. For larger tables, McNemar's test is generalized as the McNemar–Bowker symmetry test. One drawback to the latter test is that it may fail if there are 0's in certain locations in the matrix. McNemar's test may not be reliable if there are low counts in the "discordant" cells. Authors recommend that these cells sum to at least 5, 10, or 25. [Emphasis added]
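You can check directly which symmetric cell pairs fall below those recommended sums (assuming the ada_cat matrix from the question):

```r
# sum each discordant cell with its mirror image across the diagonal
disc <- ada_cat + t(ada_cat)
# how many of the 45 category pairs have a discordant sum below 10?
sum(disc[upper.tri(disc)] < 10)
```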

You're getting NaN statistics and P-values because the following line of stats::mcnemar.test evaluates some elements to Inf:

y[upper.tri(x)]^2/x[upper.tri(x)]
# [1]  21.0434783   7.1111111 207.0928144         Inf   0.0000000   2.4611650         Inf         Inf  17.3571429 131.2727273         Inf         Inf         Inf
#[14]   6.7222222 174.7279412         Inf         Inf         Inf         Inf   0.2500000  24.2012195         Inf         Inf         Inf         Inf   0.0000000
#[27]   1.7142857  19.2533333         Inf         Inf         Inf         Inf   0.5000000   0.0000000  14.0625000   0.1481481         Inf         Inf         Inf
#[40]   0.0000000         Inf   1.3333333  29.0322581  28.2647059  10.0833333

This answer suggested an approach by Evans and Hoenig. This approach is implemented in the fishmethods package. I use igraph to make the matrix transformation convenient.

library(fishmethods)
library(igraph)
# expand the contingency table into one (rating A, rating B) pair per observation
pairs <- get.edgelist(graph.adjacency(ada_cat, mode = "directed"))
compare2(pairs, cont.cor = FALSE, twovsone = FALSE, plot.summary = FALSE, barplot = FALSE)
#$thecall
#compare2(readings = get.edgelist(graph.adjacency(ada_cat, mode = "directed")), 
#    twovsone = FALSE, plot.summary = FALSE, barplot = FALSE, 
#    cont.cor = FALSE)
#
#$McNemar
#     Chisq       pvalue
#1 59.25231 1.387779e-14
#
#$Evans_Hoenig
#     Chisq df pvalue
#1 140.1184  4      0
#
#$difference_frequency
#      difference frequency percentage
# [1,]         -6         1        0.0
# [2,]         -5         0        0.0
# [3,]         -4         5        0.0
# [4,]         -3        34        0.3
# [5,]         -2       126        1.2
# [6,]         -1      1215       11.5
# [7,]          0      8211       77.5
# [8,]          1       985        9.3
# [9,]          2        19        0.2
#[10,]          3         1        0.0
#
#$sample_size
#[1] 10597
Ian Campbell
  • Thank you for your answer, Ian! It's good to know that there are other alternatives out there! – ILR Jun 09 '21 at 16:51