So, let's say that I train two models on the same dataset. I run the experiment once and I get the following results:
- Using a Neural Network I get an AUC ROC of 0.941.
- Using Random Forest I get an AUC ROC of 0.947.
However, both algorithms involve some randomness internally, so if I ran the experiment again the results might vary slightly.
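For concreteness, below is roughly what I mean by "running the experiment again" (the dataset and hyperparameters are just placeholders; `make_classification` stands in for my actual data). I repeat the split and training with different seeds to see how much the AUCs move around:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data; my real dataset is different.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

nn_aucs, rf_aucs = [], []
for seed in range(10):
    # New split and new model initialisation on every run
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    nn = MLPClassifier(max_iter=500, random_state=seed).fit(X_tr, y_tr)
    rf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)

    nn_aucs.append(roc_auc_score(y_te, nn.predict_proba(X_te)[:, 1]))
    rf_aucs.append(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))

print("NN: mean AUC %.3f (std %.3f)" % (np.mean(nn_aucs), np.std(nn_aucs)))
print("RF: mean AUC %.3f (std %.3f)" % (np.mean(rf_aucs), np.std(rf_aucs)))
```

Each run gives slightly different AUC values for both models, which is exactly the variability I am worried about.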
My question is: how should I measure/evaluate the statistical significance of this improvement? When is it safe to claim that one algorithm is doing better than the other?
Also, I have read a lot of machine learning papers that do not test the statistical significance of the difference between the proposed model and the baseline. So I guess that when the difference is big enough, there is no need to evaluate its significance? If so, when is the difference considered big enough? I'd love to hear what the community thinks about this.
Thanks a lot!