What is an appropriate method for comparing relative improvement in model performance across different problems?
For example, say I have three different datasets/problems a, b, c, and two models for each problem (model_1 and model_2), and my respective ROC AUC scores are:
a_1: 0.55 ; a_2: 0.65
b_1: 0.70 ; b_2: 0.80
c_1: 0.85 ; c_2: 0.95
Given that the ROC AUC represents the probability that a classifier scores a randomly chosen positive case higher than a randomly chosen negative case, is it valid to use the odds ratios between model_1 and model_2 to describe the improvement?
See:
# Convert a probability to odds
odds <- function(p) {
  p / (1 - p)
}
mods1 <- c(0.55, 0.70, 0.85)  # model_1 AUCs for problems a, b, c
mods2 <- c(0.65, 0.80, 0.95)  # model_2 AUCs for problems a, b, c
odds(mods2) / odds(mods1)     # odds ratio of model_2 vs model_1 per problem
#> [1] 1.519481 1.714286 3.352941
Say for instance:
- "The odds c_2 scores a positive case with a higher score than a negative case is 3.35 times that of c_1."
- "The odds b_2 scores a positive case with a higher score than a negative case is 1.71 times that of b_1."?
- "The odds a_2 scores a positive case with a higher score than a negative case is 1.52 times that of a_1."?
Would rescaling so that 0.5 becomes the minimum value (since 0.5 corresponds to a random classifier) give a more appropriate baseline? E.g.
mods1_rescaled <- (mods1 - 0.5) / 0.5  # map the [0.5, 1] range onto [0, 1]
mods2_rescaled <- (mods2 - 0.5) / 0.5
odds(mods2_rescaled) / odds(mods1_rescaled)
#> [1] 3.857143 2.250000 3.857143
In that case a and c are tied as the problems where model_2 shows the greatest improvement over its counterpart (a ratio of 3.86). What would be an elegant way of restating the interpretation under this rescaling?
What's an appropriate method for comparing the relative improvement of model_2 over model_1 across these contexts? It seems you would need to add in some notion of variability/confidence. What test statistic or method is appropriate to support a statement like "The relative improvement of model_2's ROC AUC is greatest in problem [a/b/c/...]"?
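To make the question concrete, the kind of thing I imagine is a paired, case-level bootstrap within each problem, putting an interval around the (rescaled) odds ratio, roughly as sketched below. The data frame dat and its columns y, score1, score2 are hypothetical stand-ins for one problem's labels and the two models' scores; odds() is the function defined above.
# Rough sketch on a hypothetical data frame `dat` with columns:
#   y      - true 0/1 labels for one problem
#   score1 - model_1's scores; score2 - model_2's scores
auc <- function(score, y) {
  # concordance estimate of the ROC AUC, ties counted as half
  mean(outer(score[y == 1], score[y == 0], ">") +
       0.5 * outer(score[y == 1], score[y == 0], "=="))
}

rescaled_odds_ratio <- function(auc1, auc2) {
  # assumes both AUCs stay above 0.5 in the resamples
  odds((auc2 - 0.5) / 0.5) / odds((auc1 - 0.5) / 0.5)
}

set.seed(1)
boot_ratios <- replicate(2000, {
  idx <- sample(nrow(dat), replace = TRUE)  # resample cases with replacement
  d <- dat[idx, ]
  rescaled_odds_ratio(auc(d$score1, d$y), auc(d$score2, d$y))
})
quantile(boot_ratios, c(0.025, 0.975))  # percentile interval for the ratio
Comparing such intervals across problems is roughly what I mean by "greatest relative improvement", but I don't know whether there is a more standard test statistic for this.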