I have two groups, say $\text{G}$ and $\text{H}$, and each performs a classification task. Let's say I get the following confusion matrices:
$$\text{G's Performance}\\ \begin{array}{c|ccc} & \text{A-True} & \text{B-True} & \text{C-True}\\ \hline \text{A-predicted} & 41 & 7 & 13\\ \text{B-predicted} & 3 & 40 & 14\\ \text{C-predicted} & 6 & 3 & 13 \end{array}$$
$$\text{H's Performance}\\ \begin{array}{c|ccc} & \text{A-True} & \text{B-True} & \text{C-True}\\ \hline \text{A-predicted} & 13 & 6 & 3\\ \text{B-predicted} & 14 & 41 & 7\\ \text{C-predicted} & 13 & 3 & 40 \end{array}$$
My first thought was to compare accuracies, but $\text{G}$ and $\text{H}$ have exactly the same accuracy; the difference is that $\text{G}$ struggles to classify category $\text{C}$ while $\text{H}$ struggles to classify category $\text{A}$. Testing for unequal accuracy misses this difference.
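For concreteness, here is a quick check of that claim (I enter each matrix with the true classes as columns):

# Both groups classify 94 of 140 cases correctly, so accuracy alone
# cannot distinguish them.
G <- matrix(c(41,3,6, 7,40,3, 13,14,13), nrow = 3)
H <- matrix(c(13,14,13, 6,41,3, 3,7,40), nrow = 3)
sum(diag(G)) / sum(G)   # 0.6714286
sum(diag(H)) / sum(H)   # 0.6714286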
What I would like to do is hypothesis-test whether my two groups, $\text{G}$ and $\text{H}$, are classifying the same way. For this purpose I do not care how well either performs, only whether they classify differently.
Gung's answer here seemed pretty promising, and I think adapting it to my task warrants its own post rather than a discussion in the comments. Here is my attempt at the code.
library(MASS)

# Stack the two confusion matrices into a 3 x 3 x 2 contingency table
# (cells filled column-major: G's matrix first, then H's).
tab <- array(c(41,3,6, 7,40,3, 13,14,13,
               13,14,13, 6,41,3, 3,7,40), dim = c(3,3,2))
tab <- as.table(tab)
names(dimnames(tab)) <- c("predicted", "actual", "classifier")
dimnames(tab)[[1]] <- c("A", "B", "C")
dimnames(tab)[[2]] <- c("A", "B", "C")
dimnames(tab)[[3]] <- c("G", "H")

# Log-linear models with and without the classifier term,
# keeping the actual-by-predicted association in both.
m1 <- loglm(~classifier + actual*predicted, tab)
m2 <- loglm(~actual*predicted, tab)
anova(m2, m1)
##########
LR tests for hierarchical log-linear models

Model 1:
 ~actual * predicted
Model 2:
 ~classifier + actual * predicted

           Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
Model 1    49.24281  9
Model 2    49.24281  8    0.00000         1              1
Saturated   0.00000  0   49.24281         8              0
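To make sure I am reading that table correctly, I checked the printed p-values by hand, assuming each Delta(Dev) is a likelihood-ratio chi-square statistic on Delta(df) degrees of freedom:

pchisq(0.00000, df = 1, lower.tail = FALSE)   # Model 2 vs Model 1: p = 1
pchisq(49.24281, df = 8, lower.tail = FALSE)  # Saturated vs Model 2: p ~ 6e-8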
Is it the $49.24281$ with $p \approx 0$ that tells me $\text{G}$ and $\text{H}$ have significantly different confusion matrices? If so, is this the usual inference that including another variable (classifier) as a predictor yields a model much closer in fit to the saturated model? Gung's code left off a response variable. If I want to do this analysis with the usual glm function, what is the response variable?
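In case it helps frame that last question, here is my guess at the glm version: treat each cell count as the response in a Poisson regression, with the table factors as predictors (the names df, fit1, fit2, and fit_sat are mine):

df <- as.data.frame(tab)   # columns: predicted, actual, classifier, Freq
fit1 <- glm(Freq ~ actual*predicted, family = poisson, data = df)
fit2 <- glm(Freq ~ classifier + actual*predicted, family = poisson, data = df)
fit_sat <- glm(Freq ~ classifier*actual*predicted, family = poisson, data = df)
anova(fit1, fit2, test = "Chisq")     # should match the Model 1 vs Model 2 line
anova(fit2, fit_sat, test = "Chisq")  # should match the Saturated line, if I understand correctly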
(Next up will be what to do when there are multiple factors with levels (I think I get how to do that with more dimensions to the array; see the sketch below) and, yikes, what to do if there is a continuous covariate, so I want to make sure I understand what is going on with this more basic example.)
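Here is a minimal sketch of what I mean by adding dimensions, with a made-up two-level factor I am calling condition and fake Poisson counts purely for illustration:

set.seed(1)
tab2 <- as.table(array(rpois(36, 20), dim = c(3, 3, 2, 2)))   # fake counts
names(dimnames(tab2)) <- c("predicted", "actual", "classifier", "condition")
m3 <- loglm(~classifier + condition + actual*predicted, tab2)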