25

I want to compare the accuracy of two classifiers for statistical significance. Both classifiers are run on the same data set. From what I have been reading, this leads me to believe I should be using a one-sample t-test.

For example:

Classifier 1: 51% accuracy
Classifier 2: 64% accuracy
Dataset size: 78,000

Is this the right test to be using? If so, how do I calculate whether the difference in accuracy between the classifiers is significant?

Or should I be using another test?

Chris

3 Answers

19

I would probably opt for McNemar's test if you only train the classifiers once. David Barber also suggests a rather neat Bayesian test that seems elegant to me, but it isn't widely used (it is also mentioned in his book).
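
For concreteness, here is a minimal sketch of McNemar's test on paired predictions. The labels and predictions below are made up purely for illustration, and only numpy and scipy are assumed:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Made-up example: true labels and predictions from two classifiers
# evaluated on the *same* test examples (the pairing is what McNemar's test uses).
y_true = rng.integers(0, 2, size=1000)
pred1 = np.where(rng.random(1000) < 0.80, y_true, 1 - y_true)  # ~80% accurate
pred2 = np.where(rng.random(1000) < 0.85, y_true, 1 - y_true)  # ~85% accurate

# Discordant pairs: cases where exactly one of the two classifiers is correct.
b = np.sum((pred1 == y_true) & (pred2 != y_true))  # only classifier 1 correct
c = np.sum((pred1 != y_true) & (pred2 == y_true))  # only classifier 2 correct

# McNemar's chi-squared statistic with continuity correction (1 degree of freedom).
stat = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(stat, df=1)
print(f"b={b}, c={c}, chi2={stat:.2f}, p={p_value:.4g}")
```

The test uses only the discordant pairs, i.e. the examples on which exactly one of the two classifiers is correct; statsmodels also provides an exact version (statsmodels.stats.contingency_tables.mcnemar), if it is installed, should you prefer to avoid the chi-squared approximation.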

Just to add, as Peter Flom says, the answer is almost certainly "yes" just by looking at the difference in performance and the size of the sample (I take it the figures quoted are test-set performance rather than training-set performance).

Incidentally, Japkowicz and Shah have a recent book out, "Evaluating Learning Algorithms: A Classification Perspective". I haven't read it, but it looks like a useful reference for these sorts of issues.

Dikran Marsupial
  • I am running 10-fold cross-validation to get these results. Does that mean they are actually different data sets? That is the total size, which is split into train/test in cross-validation. – Chris Apr 11 '12 at 14:43
  • The accuracies for each fold will not be independent, which will violate the assumptions of most statistical tests, but probably won't be a big issue. I often use 100 random training/test splits and then use the Wilcoxon paired signed-rank test (use the same random splits for both classifiers); a sketch of that procedure appears after this comment thread. I prefer that sort of test as I often use small datasets (as I am interested in overfitting), so the variability between random splits tends to be comparable to the difference in performance between classifiers. – Dikran Marsupial Apr 11 '12 at 15:07
  • (+1) for the Wilcoxon paired signed-rank test (and the link to the book ... if the ToC can fulfil its promises this book can become a must-read for all MLs :O) – mlwida Apr 12 '12 at 06:57
  • @steffen yes, I have been meaning to get a copy of the book for a while; proper performance evaluation is not as common as it should be in machine learning work, and has been a component of my research interests for a while. – Dikran Marsupial Apr 12 '12 at 13:03
  • BTW, my copy of the book has arrived, all I need to do now is find the time to read it! – Dikran Marsupial May 01 '12 at 10:52
  • I have also used signed-rank tests as well as paired t-tests for comparing classifiers. However, each time I report using a one-sided test for this purpose I get a hard time from reviewers, so I have reverted to using two-sided tests! – BGreene Jul 25 '12 at 15:15
  • Given that OP clarified in the comments that the question was actually about cross-validation, would you perhaps consider expanding your answer to cover that topic? We can edit the Q then. This is an important topic and there are a couple of very related (or even duplicate) questions but none has a good answer. In a comment above you recommend using a paired test on the CV estimates and say that you don't think that non-independence is a big issue here. Why not? It sounds to me like a potentially massive issue! – amoeba Nov 05 '15 at 23:45
  • [cont.] Here is one of the related threads http://stats.stackexchange.com/questions/45851 which is self-answered with a couple of references. This paper [Approximate statistical tests for comparing supervised classification learning algorithms](http://web.cs.iastate.edu/~jtian/cs573/Papers/Dietterich-98.pdf) seems very relevant and has almost 2k citations too. It's from 1997, so I am wondering if there are newer authoritative studies and recommendations? Dietterich recommends a 5x2cv procedure that looks a bit strange at first glance. – amoeba Nov 05 '15 at 23:48
  • [cont.] These two Qs http://stats.stackexchange.com/questions/38012 and http://stats.stackexchange.com/questions/93481 (and probably more) could also be closed as duplicates if some of these threads were to become "exemplary". I encourage you to expand your answer to make it such! :) – amoeba Nov 05 '15 at 23:51
  • [cont.] Apologies for lots of comments, but I just realized that Nadeau and Bengio's 2003 [Inference for the Generalization Error](http://www.iro.umontreal.ca/~lisa/bib/pub_subject/language/pointeurs/nadeau_MLJ1597.pdf) is probably very relevant. – amoeba Nov 05 '15 at 23:56
  • @amoeba It is not that the lack of independence is not a substantial issue, but that in practice many other factors have a bigger impact; for instance, the splitting of the data to form training and test sets means that you can get big variability in the p-value (as it is a statistic) if you only use one split. So even though there is an independence issue, it is better to re-sample than to use a single training/test split. The Nadeau and Bengio paper is indeed very relevant. – Dikran Marsupial Nov 06 '15 at 11:11
  • But if adjustments are made, then it becomes very difficult to show any statistically significant difference between the performance of good classifiers on small datasets. It seems to me that the paired t-test at least answers the question of whether the difference in performance between classifiers is significant compared to the difference caused by the random partitioning of the data into training and test sets, even if it doesn't mean that the difference is significant when evaluated over independent training/test datasets, which is of some use. – Dikran Marsupial Nov 06 '15 at 11:17
  • (+1) for introducing the book; it would be very useful for anyone who needs to analyze classifier performance. – Ébe Isaac Nov 14 '17 at 05:34
  • @DikranMarsupial But in theory the test should not be used, because the assumptions are violated, right? How can you justify using it when asked by a reviewer? How can we know that the violation of the assumptions is not a big deal? – Funkwecker May 08 '21 at 15:31
  • @Funkwecker When presenting the results of the test, just point out the likely violation of the assumptions. The real value in these sorts of NHSTs is to impose a degree of self-skepticism on the experimenter, and they are of greatest value when you get a non-significant result. Pointing out the limitations of the test, where you get a significant result, shows just that sort of self-skepticism. I suspect the assumptions of tests are almost always violated to some extent; e.g. the normal distribution is defined by a limiting case, so it is unrealistic for it to hold *exactly* in practice. – Dikran Marsupial May 08 '21 at 16:27
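
To make the procedure from the comments concrete, here is a minimal sketch of comparing two classifiers over 100 random training/test splits with a paired Wilcoxon signed-rank test. The dataset and the two classifiers are placeholders (scikit-learn and scipy are assumed); substitute your own. As noted above, the splits overlap, so the resulting p-value should be read with that caveat in mind.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset and classifiers -- substitute your own.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf_a = LogisticRegression(max_iter=1000)
clf_b = DecisionTreeClassifier(random_state=0)

acc_a, acc_b = [], []
for seed in range(100):
    # The same random split is used for both classifiers, so the accuracies are paired.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    acc_a.append(clf_a.fit(X_tr, y_tr).score(X_te, y_te))
    acc_b.append(clf_b.fit(X_tr, y_tr).score(X_te, y_te))

# Paired Wilcoxon signed-rank test on the per-split accuracies.
stat, p_value = wilcoxon(acc_a, acc_b)
print(f"median acc: {np.median(acc_a):.3f} vs {np.median(acc_b):.3f}, "
      f"W={stat:.1f}, p={p_value:.4g}")
```
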
6

I can tell you, without even running anything, that the difference will be highly statistically significant. It passes the IOTT (interocular trauma test - it hits you between the eyes).

If you do want to do a test, though, you could do it as a test of two proportions; this can be done with a two-sample t-test.

You might want to break "accuracy" down into its components, though: sensitivity and specificity, or false positives and false negatives. In many applications, the costs of the different errors are quite different.
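
As a toy illustration of that decomposition (the labels below are made up, and numpy is assumed):

```python
import numpy as np

# Made-up binary labels and predictions; substitute your own.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

sensitivity = tp / (tp + fn)          # true-positive rate (recall)
specificity = tn / (tn + fp)          # true-negative rate
accuracy = (tp + tn) / len(y_true)    # overall accuracy blends the two
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, accuracy={accuracy:.2f}")
```
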

Peter Flom
  • Agreed - this will clearly be significant. Nitpick: You would use a $z$-test to test two proportions (approximately) - this has to do with the convergence of a binomial distribution to the normal as $n$ increases. See section 5.2 http://en.wikipedia.org/wiki/Statistical_hypothesis_testing – Macro Apr 11 '12 at 14:27
  • On second thought, a $t$-test may still be asymptotically valid, by the CLT, but there must be a reason the $z$-test is usually used here. – Macro Apr 11 '12 at 14:28
  • The accuracy percentages I have put in my question are just an example. – Chris Apr 11 '12 at 14:45
  • As I understand it, the question is not so much about the particular numbers provided in the example, but about a universal method for statistically telling whether the performance difference between the two classifiers is significant or not. – Data Man Aug 09 '20 at 11:22
4

Since accuracy, in this case, is the proportion of samples correctly classified, we can apply a hypothesis test for the difference between two proportions.

Let $\hat p_1$ and $\hat p_2$ be the accuracies obtained from classifiers 1 and 2 respectively, and let $n$ be the number of samples. The numbers of samples correctly classified by classifiers 1 and 2 are $x_1$ and $x_2$ respectively.

$ \hat p_1 = x_1/n,\quad \hat p_2 = x_2/n$

The test statistic is given by

$\displaystyle Z = \frac{\hat p_1 - \hat p_2}{\sqrt{2\hat p(1 -\hat p)/n}}\qquad$ where $\quad\hat p= (x_1+x_2)/2n$

Our intention is to show that the global accuracy of classifier 2, i.e., $p_2$, is better than that of classifier 1, i.e., $p_1$. This frames our hypotheses as

  • $H_0: p_1 = p_2\quad$ (null hypothesis stating both are equal)
  • $H_a: p_1 < p_2\quad$ (alternative hypothesis claiming the newer classifier is better than the existing one)

The rejection region is given by

$Z < -z_\alpha \quad$ (if this holds, reject $H_0$ and accept $H_a$)

where $z_\alpha$ is obtained from the standard normal distribution for a level of significance $\alpha$. For instance, $z_{0.05} = 1.645$ for a 5% level of significance. This means that if the relation $Z < -1.645$ holds, then we could say with 95% confidence ($1-\alpha$) that classifier 2 is more accurate than classifier 1.
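
As a rough check, plugging the figures from the question (51% vs. 64% accuracy on $n = 78{,}000$ test cases) into the formula above gives a $Z$ far beyond the critical value. This sketch simply evaluates the formula numerically and assumes scipy for the normal tail probability:

```python
from math import sqrt
from scipy.stats import norm

n = 78_000
p1_hat, p2_hat = 0.51, 0.64          # accuracies from the question
x1, x2 = p1_hat * n, p2_hat * n      # numbers correctly classified
p_hat = (x1 + x2) / (2 * n)          # pooled proportion

z = (p1_hat - p2_hat) / sqrt(2 * p_hat * (1 - p_hat) / n)
p_value = norm.cdf(z)                # one-sided, since H_a is p1 < p2
print(f"Z = {z:.1f}, one-sided p = {p_value:.3g}")  # Z is around -52; p underflows to 0
```
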

References:

  1. R. Johnson and J. Freund, Miller and Freund’s Probability and Statistics for Engineers, 8th Ed. Prentice Hall International, 2011. (Primary source)
  2. Test of Hypothesis-Concise Formula Summary. (Adapted from [1])
Ébe Isaac
  • Though I agree that a test for proportions could be used, there is nothing in the original question that suggests a one-sided test is appropriate. Moreover, *"we could say with 95% confidence"* is a common misinterpretation. See e.g. here: https://www.metheval.uni-jena.de/lehre/0405-ws/evaluationuebung/haller.pdf – Frans Rodenburg Sep 30 '18 at 11:02
  • Shouldn't $\hat p$ be the average of $\hat p_1$ and $\hat p_2$? So the denominator should be $2n$ in $\hat p= (x_1+x_2)/2n$. – Shiva Tp Sep 30 '18 at 10:38
  • @ShivaTp Indeed. Thanks for pointing out the much-needed typo correction. Edit confirmed. – Ébe Isaac Sep 30 '18 at 11:44
  • I like your answer. I wonder about the following, though. With binary classification, in practice it is often the case that classes are imbalanced (positives being the minority), and so true positives are of greater value than true negatives. This may lead to the conclusion that correctly classifying the minority class is a stronger signal of classifier performance. I wonder whether in such a scenario it would still be correct to use the test of proportions that you proposed? Because this test makes no distinction between TP and TN; it treats all those cases equally. – Data Man Aug 09 '20 at 11:31