I want to compare the diagnostic efficiencies of two methods in paired samples, using overall correct classification rates (i.e., (sensitivity + specificity) / 2, assuming approximately equal numbers of positive and negative cases). Several papers advocate the use of the McNemar test (e.g. this and this; see also this answer), although some seem to find fault with this approach (e.g. this). However, these papers are very technical and mostly address sensitivity and specificity separately, rather than together in a single overall accuracy rate as I want.
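(To spell out the equal-samples point with made-up counts: when the positive and negative groups are the same size, the averaged rate is just the raw proportion of correct classifications, which is why I work with a single accuracy rate throughout.)
TP = 40; FN = 10 # 50 positive cases, 40 classified correctly (made-up numbers)
TN = 35; FP = 15 # 50 negative cases, 35 classified correctly
(TP / (TP + FN) + TN / (TN + FP)) / 2 # (sensitivity + specificity) / 2 = 0.75
(TP + TN) / (TP + FN + TN + FP) # raw proportion correct, also 0.75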
Assuming that using the McNemar test for this comparison is correct (but tell me if not), I would also like to get a confidence interval for the difference in overall accuracy (i.e., a difference in proportions). Functions for the McNemar test typically report odds ratios or similar, which doesn't seem practical for my purpose.
Now, as far as I understand, the McNemar test boils down to a simple proportion test on the discordant pairs, and therefore I may be able to use the confidence interval from that proportion test. Consider the following example.
repn = 6 # repeat the pattern below 6 times to get n = 102 pairs
outcomes_A = rep(c(0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0), repn) # 1 = correct by method A
outcomes_B = rep(c(1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), repn) # 1 = correct by method B
mean(outcomes_A) # proportion (of correct classifications): 0.6470588
mean(outcomes_B) # proportion (of correct classifications): 0.8235294
mydat = xtabs(~outcomes_A + outcomes_B)
mydat
          outcomes_B
outcomes_A  0  1
         0  6 30
         1 12 54
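Note, by the way, that the accuracy difference depends only on the two discordant cells, which is why working with them alone should suffice:
(mydat[1, 2] - mydat[2, 1]) / sum(mydat) # 0.1764706 = mean(outcomes_B) - mean(outcomes_A)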
Now, mcnemar.test(mydat) gives the same results as prop.test(mydat[1,2], mydat[2,1]+mydat[1,2]), but the latter actually provides a confidence interval. However, this is not the confidence interval I want: it is for the proportion of "P01" observations among the discordant ("P01" and "P10") observations in the conventional matrix input (as above in mydat). I think I can correct this, though, by rescaling with the proportion of discordant observations among all observations.
res = prop.test(mydat[1, 2], mydat[2, 1] + mydat[1, 2])
discr = mydat[2, 1] + mydat[1, 2] # number of discordant pairs (12 + 30 = 42)
# difference = (P01 - P10) / N = (res$estimate * discr - (1 - res$estimate) * discr) / sum(mydat),
# which simplifies to the expression below; the same transformation is applied to the CI limits
(res$estimate * 2 - 1) * discr / sum(mydat) # difference
(res$conf.int[1] * 2 - 1) * discr / sum(mydat) # lower CI limit
(res$conf.int[2] * 2 - 1) * discr / sum(mydat) # upper CI limit
This gives a difference of .1764706, 95% CI [.042812, .2781084]. Reassuringly, the calculated difference of .1764706 is identical to the actual proportion (i.e., accuracy rate) difference calculated from the original data: 0.8235294 - 0.6470588 = 0.1764706.
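The rescaling itself does not depend on how the interval for the discordant proportion was obtained, so as a sanity check the same transformation can be applied to an exact (Clopper-Pearson) interval from binom.test; reusing mydat and discr from above, this gives similar (though not identical) limits:
res_exact = binom.test(mydat[1, 2], mydat[2, 1] + mydat[1, 2])
(res_exact$conf.int * 2 - 1) * discr / sum(mydat) # exact-interval analogue of the CI limits above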
I also compared the outcome of this procedure to a MedCalc example, and the CI is very close, though not identical; I'd guess this is just due to a slightly different variant of the interval calculation.
mydat = matrix(c(212, 256, 144, 707), nrow = 2, ncol = 2)
mydat
# marginal proportions for the two methods and their difference:
# p1 = sum(mydat[1, ]) / sum(mydat) # 0.2699
# p2 = sum(mydat[, 1]) / sum(mydat) # 0.3548
# p2 - p1 # 0.0849
# mcnemar.test(mydat)
res = prop.test(mydat[2, 1], mydat[2, 1] + mydat[1, 2]) # using mydat[2, 1] here so the sign matches p2 - p1
discr = mydat[2, 1] + mydat[1, 2]
(res$estimate * 2 - 1) * discr / sum(mydat) # difference: 0.0849
(res$conf.int[1] * 2 - 1) * discr / sum(mydat) # 0.05492 vs. MedCalc's 0.0555
(res$conf.int[2] * 2 - 1) * discr / sum(mydat) # 0.11324 vs. MedCalc's 0.1143
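Out of curiosity, I also computed a plain Wald interval for the paired difference, using the standard variance formula for a difference of correlated proportions (my own cross-check, not anything taken from the MedCalc documentation). It reproduces MedCalc's limits to their reported precision, so the small discrepancy above seems to come from the interval construction rather than from the difference estimate:
d = (mydat[2, 1] - mydat[1, 2]) / sum(mydat) # paired difference: 0.0849
se = sqrt(mydat[2, 1] + mydat[1, 2] - (mydat[2, 1] - mydat[1, 2])^2 / sum(mydat)) / sum(mydat)
d + c(-1, 1) * qnorm(0.975) * se # 0.05555, 0.11428 vs. MedCalc's 0.0555, 0.1143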
Nonetheless, I'm not sure whether all this reasoning is correct and sensible. So my eventual question: is this an appropriate and ideal procedure for comparing overall accuracy rates and getting a confidence interval for their difference? (And if not, what would be better?)
(Note: I'm well aware of the possibility of comparing AUCs, but that is not what I want; the AUC is again not as practical a measure as the simple accuracy rate.)