Based on estimated classification accuracy, I want to test whether one classifier is statistically better than another classifier on a base dataset. For each classifier, I randomly select a training and a testing sample from the base set, train the model, and test it. I do this ten times for each classifier, so I have ten estimated classification accuracy measurements per classifier. How do I statistically test whether classifier 1 is better than classifier 2 on the base dataset? Which t-test is appropriate to use?
-
Did you test the classifiers on the same samples? i.e., sample1, c1(sample1), c2(sample1)? Or did you use different samples for each classifier? – Josephine Moeller Dec 13 '12 at 20:59
-
Paired t-test would be appropriate in this scenario. – GEL Dec 13 '12 at 22:29
-
@lewellen: accuracy is a proportion: t-tests are usually *not* appropriate. – cbeleites unhappy with SX Dec 14 '12 at 00:00
-
@entropy: Before asking new questions, consider taking the time to go over your old questions and have a look whether you could accept some answers (by clicking that check mark). – cbeleites unhappy with SX Dec 14 '12 at 00:02
-
@cbeleites : Can you expand on what *would* be appropriate? – Josephine Moeller Dec 14 '12 at 00:26
-
@JohnMoeller: "difference of proportions" would be a search term, independent or dependent we don't know yet. If it's paired: McNemar's test. I'm guessing that t-test means rather small sample size, so possibly normal approximation is not a good idea. I'd go for [Statistical Methods for Rates and Proportions](http://onlinelibrary.wiley.com/book/10.1002/0471445428) to look up details. – cbeleites unhappy with SX Dec 14 '12 at 00:43
-
@cbeleites I'm actually interested in the answer to this question. So suppose that I have 30 samples, and I test c1 and c2 on each sample, and record the accuracy for each on each sample. Then you're saying that doing a t-test of the differences of accuracies is *not* the correct thing to do? I thought that the proportion statistic was only appropriate when you're looking at *one* accuracy at a time, i.e., when you're testing the proportion *as* a statistic. It seems that you're saying proportion is the right statistic when you're testing *means* of proportions too. – Josephine Moeller Dec 14 '12 at 01:04
-
@JohnMoeller: I'm saying that each accuracy is a proportion. If you want to compare them, use methods for "difference of proportions". I expanded this into an answer to prevent endless comments. – cbeleites unhappy with SX Dec 14 '12 at 01:31
-
@JohnMoeller I select a new sample each time. Is this incorrect? – entropy Dec 14 '12 at 14:13
2 Answers
A review and critique of some t-test approaches is given in *Choosing between two learning algorithms based on calibrated tests*, *Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms*, and *On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach*.

-
Dietterich says: "The binomial distribution can be well-approximated by a normal distribution for reasonable values of $n$." So far, you did not tell us that you have reasonable $n$. @JohnMoeller's 30 cases are IMHO rather few for the normal approximation (at least without any knowledge about $p_1$ and $p_2$). – cbeleites unhappy with SX Dec 14 '12 at 16:11
-
I have at least 4000 records per class available in the base dataset, so the sample I select can be anything up to that size. The biggest drawback of the difference-of-proportions tests is that they ignore the "internal variation of the learning algorithm". I think this is important for a classifier such as a neural network, which I am using. – entropy Dec 14 '12 at 17:36
-
Well, that is a completely different situation from the one JohnMoeller chimed in with. If by "internal variation" you mean instability of the model: you can measure this. I'll update my answer. – cbeleites unhappy with SX Dec 14 '12 at 17:49
-
To clarify, 30 is the number of times I select test/train partition sets, *not* the number of test points I select. – Josephine Moeller Dec 14 '12 at 19:43
-
@JohnMoeller: sorry, I completely misunderstood that (coming from a field where "a sample" is a physical specimen of some sort). – cbeleites unhappy with SX Dec 15 '12 at 20:13
-
@cbeleites No problem, I realized that I was using the term incorrectly, which is why I clarified. – Josephine Moeller Dec 16 '12 at 20:20
I don't have the Fleiss book at hand, so all this is IIRC.
Answering @JohnMoeller's question in the comments for the moment: the original question is IMHO unanswerable as it is.
> So suppose that I have 30 samples, and I test c1 and c2 on each sample, and record the accuracy for each on each sample.
Doing this, you end up with a 2 x 2 contingency table counting classifier 1 correct/wrong against classifier 2 correct/wrong. This is the starting point for McNemar's test. So this is a paired comparison, which is more powerful than comparing "independent" proportions (which are not completely independent anyway if they come from drawing randomly from the same finite sample).
I cannot look up the "small print" of McNemar's test right now, but 30 samples is not much, so you may even have to switch from McNemar's test to an exact test [such as Fisher's, or something else] that calculates the binomial probabilities directly.
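A minimal sketch of this paired comparison in code (Python with NumPy/SciPy assumed; the `correct_c1` / `correct_c2` arrays are hypothetical per-case correctness indicators for the two classifiers on the *same* test cases):

```python
import numpy as np
from scipy.stats import binom

# Hypothetical per-case results: 1 = correct, 0 = wrong, same test cases for both classifiers
correct_c1 = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1])
correct_c2 = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1])

# 2 x 2 contingency table: classifier 1 correct/wrong vs. classifier 2 correct/wrong
both_correct = np.sum((correct_c1 == 1) & (correct_c2 == 1))
only_c1      = np.sum((correct_c1 == 1) & (correct_c2 == 0))  # discordant pair
only_c2      = np.sum((correct_c1 == 0) & (correct_c2 == 1))  # discordant pair
both_wrong   = np.sum((correct_c1 == 0) & (correct_c2 == 0))

# Exact (binomial) version of McNemar's test: under H0 the discordant pairs
# split 50:50 between the two classifiers.
n_discordant = only_c1 + only_c2
k = min(only_c1, only_c2)
p_value = min(2 * binom.cdf(k, n_discordant, 0.5), 1.0)  # two-sided, capped at 1

print(f"table: [[{both_correct}, {only_c1}], [{only_c2}, {both_wrong}]]")
print(f"discordant pairs: {n_discordant}, exact McNemar p-value: {p_value:.3f}")
```

For reference, statsmodels ships a `mcnemar` function in `statsmodels.stats.contingency_tables` that should give the same result with `exact=True`; the hand-rolled version above just makes the binomial calculation explicit.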
Means of proportions:
It doesn't matter whether you test one and the same classifier 10x with 10 test cases or once with all those 100 cases (the 2 x 2 table just counts all test cases).
If the 10 estimates of accuracy for each classifier in the original question are obtained by random hold-out, 10-fold cross validation, or 10x out-of-bootstrap, the usual assumption is that the 10 surrogate models calculated for each classifier are equivalent (= have the same accuracy), so their test results can be pooled*. For 10-fold cross validation, each case is tested exactly once, so the pooled test sample size equals the total number of cases. For the other methods I'm not so sure: you may test the same case more than once, and depending on the data/problem/application this doesn't amount to as much information as testing a new case.
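To make the pooling concrete, here is a sketch (scikit-learn assumed; `clf1`, `clf2`, `X`, `y` and the helper name are hypothetical): the out-of-fold results of both classifiers are collected case by case, so the pooled comparison stays paired and can go straight into the 2 x 2 table / McNemar sketch above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def pooled_paired_correctness(clf1, clf2, X, y, n_splits=10, seed=0):
    """10-fold CV: every case is tested exactly once by a surrogate of each classifier."""
    correct_c1 = np.empty(len(y), dtype=int)
    correct_c2 = np.empty(len(y), dtype=int)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        # Train both classifiers on the same training fold, test on the same held-out fold
        m1 = clone(clf1).fit(X[train_idx], y[train_idx])
        m2 = clone(clf2).fit(X[train_idx], y[train_idx])
        correct_c1[test_idx] = (m1.predict(X[test_idx]) == y[test_idx]).astype(int)
        correct_c2[test_idx] = (m2.predict(X[test_idx]) == y[test_idx]).astype(int)
    return correct_c1, correct_c2  # pooled over all folds, still paired case by case
```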
*If the surrogate models are unstable, this assumption breaks down. But you can measure this: Do iterated $k$-fold cross validation. Each complete run gives one prediction for each case. So if you compare the predictions for the same test case over a number of different surrogate models, you can measure the variance caused by exchanging some of the training data. This variance is in addition to the variance due to the finite total sample size.
Put your iterated CV results into a "correct classification matrix" with each row corresponding to one case and each column to one of the surrogate models. Now the variance along the rows (after removing all empty elements) is solely due to instability of the surrogate models. The variance in the columns is due to the finite number of cases you used for testing each surrogate model. Say you have $k$ correct predictions out of $n$ tested cases in a column. The point estimate for the accuracy is $\hat p = \frac{k}{n}$; it is subject to variance $\sigma^2 (\hat p) = \sigma^2 \left(\frac{k}{n}\right) = \frac{p (1 - p)}{n}$.
Check whether the variance due to instability is large or small compared to the variance due to finite test sample size.
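A minimal sketch of this check (Python with NumPy/scikit-learn assumed; `clf`, `X`, `y` and the function name are hypothetical). For simplicity each column here holds one complete CV run rather than one individual surrogate model, so every case gets exactly one out-of-fold prediction per column and there are no empty elements to skip; the variance decomposition is the one described above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def correct_classification_matrix(clf, X, y, n_iter=10, n_splits=10):
    """Rows = cases, columns = complete CV runs (one out-of-fold prediction per case and run)."""
    m = np.empty((len(y), n_iter))
    for i in range(n_iter):  # iterate the whole k-fold CV with different random splits
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=i)
        for train_idx, test_idx in cv.split(X, y):
            model = clone(clf).fit(X[train_idx], y[train_idx])
            m[test_idx, i] = (model.predict(X[test_idx]) == y[test_idx]).astype(float)
    return m

# m = correct_classification_matrix(clf1, X, y)
# Variance along the rows: instability caused by exchanging part of the training data.
# instability_var = m.var(axis=1).mean()
# Variance of one column's accuracy estimate due to the finite number of test cases:
# p_hat = m[:, 0].mean(); finite_sample_var = p_hat * (1 - p_hat) / len(y)
```

If the row variance is small compared to $\frac{\hat p (1 - \hat p)}{n}$, the surrogate models can be treated as equivalent and pooling over runs changes little; if it is comparable or larger, instability dominates.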

-
Ah, ok. It's the last bit that clears things up, at least for me. Thanks. – Josephine Moeller Dec 14 '12 at 02:02
-
Thanks for the response. I'm just not clear on the procedure to follow. You say: perform 10-fold cross validation on a single dataset; measure the accuracy on the hold-out sample, that is, compute a 2x2 confusion matrix; add up the ten 2x2 confusion matrices; perform McNemar's test on the aggregated 2x2 confusion matrix. – entropy Dec 14 '12 at 14:41
-
@entropy: 1. The 2x2 contingency table is not the confusion matrix. 2. new sample each time vs. testing both classifiers on the same testing data: paired tests are more powerful (and possible here). See the updated answer. – cbeleites unhappy with SX Dec 14 '12 at 15:56
-
Sorry for that, yes, contingency table. Am I correct to say that McNemar's test translates directly to a multi-class problem as well? – entropy Dec 14 '12 at 17:43
-
@cbeleites thanks so much for the response! I think you have now answered my questions exactly. However, I still don't understand the exact procedure to follow. Would you mind elaborating on the last paragraph? – entropy Dec 14 '12 at 18:35
-
@cbeleites what exactly do you mean by "**iterated** k-fold cross validation"? Do I still perform McNemar's test on the pooled test results? How do you include the variance due to instability of the surrogate models in the significance analysis of the difference? – entropy Dec 15 '12 at 08:30
-
@entropy: I tried to explain [here](http://stats.stackexchange.com/a/31507/4598) about iterated cv. I'd first of all use it to check stability. If your classifiers are stable: no need to pool over cv runs, they all give the same results. If they aren't, your classifiers are bad because of instability. It doesn't make too much sense to choose between them anyway (you may want to construct a bagged classifier and do McNemar's test on out-of-bag test results). McNemar's test after iterated cv: what is the test sample size then? We only know the lower bound, which is the number of independent test cases. – cbeleites unhappy with SX Dec 15 '12 at 20:27