
Let's say I build two machine learning classifiers, A and B, on the same dataset.

I obtain the ROC curves for both A and B, and the AUC values.

What statistical test should I use to compare these two classifiers? (Let's say A is the one I developed, and B is a baseline model.)

Thanks!

RockTheStar
  • ROC AUC is a Mann-Whitney U statistic, so those confidence intervals are directly relevant here. More discussion in the answers to and comments on this thread: http://stats.stackexchange.com/questions/189411/did-i-just-invent-a-bayesian-method-for-analysis-of-roc-curves – Sycorax May 26 '16 at 03:04
  • Thanks @GeneralAbrial. I read the post, but I am not quite sure about it. So the Mann-Whitney U test is the way to go? – RockTheStar May 26 '16 at 18:23
  • The Mann-Whitney U statistic gives a fairly straightforward statistical hypothesis test: $H_0$: the AUCs are equal; $H_1$: they are unequal. – Sycorax May 26 '16 at 18:33
  • I am not sure the Mann-Whitney U statistic is the right one to use. – RockTheStar May 26 '16 at 21:49
  • (several years late) this is closely related: https://stats.stackexchange.com/questions/358101/statistical-significance-p-value-for-comparing-two-classifiers-with-respect-to/358598#358598 – Sycorax Feb 07 '22 at 16:05

2 Answers


Personally, I suggest using a randomized permutation test.

The area under the curve (AUC) is just one test statistic. You have presumably observed that A's statistic is better than B's, so it is already established that A's AUC is higher than B's on this dataset. What is not established is whether this superiority is due to a systematic difference or due to sheer dumb luck.

So now the question is: is the difference (regardless of which method is better) big enough to warrant attributing it to systematic differences between methods A and B? In other words:

  • What is the probability of observing that A is better than B under the null hypothesis (which states that A and B have no systematic differences)?

Generally, if you go with a randomized permutation test, the procedure to estimate the probability above (the $p$-value) is:

  1. Calculate the AUCs of A and B (which I assume you already did).
  2. Create $C_1$, such that $C_1$ is a pair-wise random shuffle of the lists of scores from A and B. In other words, $C_1$ is a simulation of what a random, non-systematic difference looks like.
  3. Measure the AUC of $C_1$.
  4. Test whether the AUC of $C_1$ is better than the AUC of A. If yes, increment the counter $damn$.
  5. Repeat steps 2 to 4 $n$ times, but instead of $C_1$, use $C_i$ where $i \in \{2, 3, \ldots, n\}$. Usually $n = 1000$, but since the estimate is asymptotically consistent, you are free to use a larger $n$ if you have enough CPU time.
  6. Then, $p = \frac{damn}{n}$.
  7. If $p \le \alpha$, then the difference is significant. Usually $\alpha = 0.05$. Otherwise: we don't know (maybe we need more data). A code sketch of the whole procedure follows this list.
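Here is a minimal sketch of this procedure in Python. It is not from the original post: `permutation_test_auc` is a hypothetical helper name, it assumes you have the per-example predicted scores of A and B on the same labelled test set, and it uses the difference in AUCs as the test statistic (a common variant of the swap-based permutation scheme sketched above).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_test_auc(y_true, scores_a, scores_b, n=1000, seed=0):
    """Paired permutation test for the AUC difference of two classifiers
    scored on the same test set (hypothetical helper, see text above)."""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(y_true, scores_a) - roc_auc_score(y_true, scores_b)
    damn = 0  # the counter from step 4 above
    for _ in range(n):
        # Step 2: simulate the null (no systematic difference) by randomly
        # swapping, per example, which score came from A and which from B.
        swap = rng.random(len(y_true)) < 0.5
        c_a = np.where(swap, scores_b, scores_a)
        c_b = np.where(swap, scores_a, scores_b)
        # Steps 3-4: is the shuffled difference at least as extreme?
        if abs(roc_auc_score(y_true, c_a) - roc_auc_score(y_true, c_b)) >= abs(observed):
            damn += 1
    return damn / n  # step 6: the estimated p-value
```

Called with, for example, hypothetical `scores_a = model_a.predict_proba(X_test)[:, 1]` and the corresponding `scores_b`, it returns the estimated two-sided $p$-value.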
caveman
  • Thanks! Several questions: (1) What is the proper name for this method? (2) In step 2, what is meant by "$C_1$ is a pair-wise random shuffle of the lists of scores from A and B"? A and B are each just a single value, aren't they? – RockTheStar May 26 '16 at 18:24
  • (1) "Approximate Randomization" is the name I learned this under. It seems this is also called a randomized permutation test (as we draw the $C_i$ as random permutations instead of enumerating them exhaustively). (2) Sorry, some notation abuse here: A and B are method names, but here I used them as arrays. Arrays of what? Since you are computing the AUC, you must be calculating the ROC curve, which means you have a score (sensitivity/specificity) per threshold. So here I assumed that A and B are arrays of such sensitivity/specificity numbers, and $C_i$ is just a random pairwise mix between A and B. – caveman May 26 '16 at 23:48
  • Thanks. Really, can I do that: "since you are computing the AUC, you must be calculating the ROC curve, which means you have a score (sensitivity/specificity) per threshold. So here I assumed that A and B are arrays of such sensitivity/specificity numbers, and $C_i$ is just a random pairwise mix between A and B"? I do have such an array (that's how the ROC is formed), but is the statistical test you mention good for this kind of comparison? – RockTheStar May 27 '16 at 00:48
  • Let's wait for the gurus and hear their opinion. Feel free to invite them. Basically, I've seen Approximate Randomization used for measuring whether a difference in accuracy between A and B is significant, but I haven't seen it used for AUC. I am not perfectly confident, but at the same time I can't see any reason why it wouldn't work for AUC, as the method does not seem to be specific to accuracy (as far as I have noticed). Either way, let's wait for the gurus. – caveman May 27 '16 at 01:06
  • Cool. Who are the gurus? – RockTheStar May 27 '16 at 07:43
  • The gurus are the smart people here. Maybe General Abrial is one; whuber and Glen_b are. I am sure there are many more. I'm new here, so I don't know many people. – caveman May 27 '16 at 13:48
  • Ok! I hope they will see this; I am looking forward to it. How can I let them know about this question? – RockTheStar May 27 '16 at 21:22

DeLong et al. (1988) proposed a statistical test for comparing two correlated AUCs, which, like other hypothesis testing methods, depends on sample size and variance.

The key insight is that the empirical AUC is equivalent to the Mann-Whitney two-sample statistic; this equivalence is what allows deriving the asymptotic distribution of the AUC, and hence the statistical test.
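As a quick numerical check of that equivalence (a sketch with simulated labels and scores, not data from the paper), the empirical AUC and the normalized Mann-Whitney U statistic coincide:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

# Simulated example: binary labels and mildly informative scores.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
s = rng.random(200) + 0.5 * y

# Empirical AUC from the ROC curve.
auc = roc_auc_score(y, s)

# The same quantity via the Mann-Whitney U statistic:
# U = (rank sum of the positives) - n1*(n1 + 1)/2, and AUC = U / (n1 * n0).
ranks = rankdata(s)
n1, n0 = int((y == 1).sum()), int((y == 0).sum())
u = ranks[y == 1].sum() - n1 * (n1 + 1) / 2
print(auc, u / (n1 * n0))  # the two numbers agree up to floating point
```

In practice you would not implement DeLong's test by hand; for example, the R package pROC exposes it as roc.test(roc1, roc2, method = "delong").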

For details, see:

  1. DeLong, Elizabeth R., David M. DeLong, and Daniel L. Clarke-Pearson. "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach." Biometrics (1988): 837-845.
  2. NCSS software tutorial