What is an appropriate method for comparing relative improvement in model performance across different problems?
For example, say I have three different datasets/problems a, b, c, and two models for each problem (model_1 and model_2), and my respective ROC AUC scores are:
a_1: 0.55 ; a_2: 0.65
b_1: 0.70 ; b_2: 0.80
c_1: 0.85 ; c_2: 0.95
Given that the ROC AUC represents the probability that a classifier scores a randomly chosen positive case higher than a randomly chosen negative case, is it valid to use the odds ratios between model_1 and model_2 to describe the improvement?
See:
# Convert a probability to odds
odds <- function(p) {
  p / (1 - p)
}
mods1 <- c(0.55, 0.70, 0.85)  # model_1 AUCs for problems a, b, c
mods2 <- c(0.65, 0.80, 0.95)  # model_2 AUCs for problems a, b, c
odds(mods2) / odds(mods1)     # odds ratio of model_2 vs model_1 per problem
#> [1] 1.519481 1.714286 3.352941
Say for instance:
- "The odds c_2 scores a positive case with a higher score than a negative case is 3.35 times that of c_1."
- "The odds b_2 scores a positive case with a higher score than a negative case is 1.71 times that of b_1."?
- "The odds a_2 scores a positive case with a higher score than a negative case is 1.52 times that of a_1."?
Would rescaling so that 0.5 becomes the minimum value (since 0.5 corresponds to a random classifier) give a more appropriate baseline? E.g.
mods1_rescaled <- (mods1 - 0.5) / 0.5  # map the [0.5, 1] range onto [0, 1]
mods2_rescaled <- (mods2 - 0.5) / 0.5
odds(mods2_rescaled) / odds(mods1_rescaled)
#> [1] 3.857143 2.250000 3.857143
In that case a and c are tied as the problems where model_2 shows the greatest improvement over its counterpart (a ratio of 3.86). What would be an elegant way of restating the interpretation under this rescaling?
What's an appropriate method for comparing the relative improvement of model_2 over model_1 across these contexts? It seems you would need to add in some notion of variability/confidence. What test statistic or method is appropriate to support a statement like "The relative improvement of model_2's ROC AUC is greatest in problem [a/b/c/...]"?
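To make the question concrete, the kind of thing I imagine is a paired, case-level bootstrap within each problem, putting an interval around the (rescaled) odds ratio, roughly as sketched below. The data frame dat and its columns y, score1, score2 are hypothetical stand-ins for one problem's labels and the two models' scores; odds() is the function defined above.
# Rough sketch on a hypothetical data frame `dat` with columns:
#   y      - true 0/1 labels for one problem
#   score1 - model_1's scores; score2 - model_2's scores
auc <- function(score, y) {
  # concordance estimate of the ROC AUC, ties counted as half
  mean(outer(score[y == 1], score[y == 0], ">") +
       0.5 * outer(score[y == 1], score[y == 0], "=="))
}

rescaled_odds_ratio <- function(auc1, auc2) {
  # assumes both AUCs stay above 0.5 in the resamples
  odds((auc2 - 0.5) / 0.5) / odds((auc1 - 0.5) / 0.5)
}

set.seed(1)
boot_ratios <- replicate(2000, {
  idx <- sample(nrow(dat), replace = TRUE)  # resample cases with replacement
  d <- dat[idx, ]
  rescaled_odds_ratio(auc(d$score1, d$y), auc(d$score2, d$y))
})
quantile(boot_ratios, c(0.025, 0.975))  # percentile interval for the ratio
Comparing such intervals across problems is roughly what I mean by "greatest relative improvement", but I don't know whether there is a more standard test statistic for this.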