I have trained and tested four different classifiers, and I would now like to compare them. The classifiers have accuracies of 95 %, 90 %, 81 %, and 75 %. I know that there is no unbiased estimator of the variance of $k$-fold cross validation accuracy, which is going to make things difficult.

In order to say more, we need to know at the very least the number of test cases (see the example below).
$k$-fold cross validation cannot measure the variance connected to the fact that your particular data set may not be fully representative of the problem at hand. In other words, the assumption that resampling is a good approximation to obtaining genuinely new samples breaks down here. Another part of the model variance, however, can be measured by iterated/repeated $k$-fold cross validation: the variance in the predictions caused by exchanging a few training cases (model instability).
The residual model variance that remains unknown matters if the question at hand is "Which classifier performs best for this type of application?", i.e. in scenarios where the winning classifier will afterwards be trained on a completely new sample of the same type.

If, instead, the scenario is that a model is to be trained from this particular data set and you then need an estimate of how well that model will predict unknown cases later on, this variance (which resampling validation can never capture) does not matter: the instability measurements should, in practice, tell you all you need to know.
All of this concerns variance uncertainty on the model side.
Note that the finite number of test cases also means that the testing itself is subject to variance. This testing variance is typically assumed to follow a binomial distribution for performance measures of the fraction-of-tested-cases type, i.e. overall accuracy, sensitivity, specificity, predictive values, etc.
You can account for this variance by constructing binomial confidence intervals. E.g. assuming you observe 17 correct out of 21 tests, the 95% confidence interval for the accuracy is roughly 60 - 93 % (point estimate: 81 %). Here, the actual number of independent (different) cases enters: iterating the cross validation does not change this number!
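For reference, a minimal sketch of such a calculation in Python, assuming statsmodels is available (the Wilson score interval is used here; an exact Clopper-Pearson interval via `method="beta"` gives slightly different bounds):

```python
from statsmodels.stats.proportion import proportion_confint

n_correct, n_test = 17, 21   # the observed test results from the example above

# Wilson score interval for the true accuracy at 95% confidence.
ci_low, ci_high = proportion_confint(n_correct, n_test, alpha=0.05, method="wilson")

print(f"point estimate: {n_correct / n_test:.2f}")    # ~0.81
print(f"95% CI: {ci_low:.2f} - {ci_high:.2f}")        # roughly 0.60 - 0.92
```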
Practical considerations:
If the confidence intervals calculated from the finite number of test cases alone overlap, or if a paired test based on those test cases (look e.g. into McNemar's test) is not significant, you cannot claim a difference between the classifiers. In that case, you don't need to worry about the not-yet-accounted-for variance in the models, as additional variance can never lower the total uncertainty.
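When per-case predictions of two classifiers on the same test cases are available, such a paired test can be run e.g. via statsmodels; the 2×2 table below is purely hypothetical and only illustrates the required input format:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired results on the same 21 test cases:
# rows = classifier A correct / wrong, columns = classifier B correct / wrong.
table = np.array([[15, 2],
                  [4, 0]])

# Exact (binomial) McNemar test; only the discordant off-diagonal cells matter.
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.3f}")
```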
The other way round: do sample size (or power) calculations, e.g. with the observed accuracies. If these calculations yield sample sizes far above what you can realistically obtain, report that a data-driven model comparison is not possible and instead choose the final classifier based on your knowledge of the type of classifier, the application, and the data at hand.
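A rough sketch of such a power calculation for two of the observed accuracies, again assuming statsmodels; this treats the two classifiers as independent (unpaired), which is conservative, and the numbers are only illustrative:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# How many test cases per classifier would be needed to distinguish
# accuracies of 0.95 and 0.90 at 5% significance with 80% power?
effect_size = proportion_effectsize(0.95, 0.90)   # Cohen's h
n_per_group = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=0.05, power=0.80,
                                           alternative="two-sided")
print(f"required test cases per classifier: about {n_per_group:.0f}")  # several hundred
```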
Use iterated/repeated $k$-fold cross validation to check the stability of the predictions with respect to slight changes in the training set. If you detect instability, in practice you should probably reconsider the way you set up your classifier rather than trying to include this variance estimate in the comparison.
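A minimal sketch of such a stability check with scikit-learn; the data set and classifier are stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)     # stand-in data set
clf = LogisticRegression(max_iter=5000)        # stand-in classifier

# 5-fold CV repeated 20 times with different random splits: the spread of the
# per-fold accuracies indicates how (un)stable the predictions are when a few
# training cases are exchanged.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f}, spread (std): {scores.std():.3f}")
```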
Decide whether your comparison needs to include the undetectable variance that arises because resampling is not the same as drawing new samples (see the scenarios above). If not, you're fine. If yes, report that your results do not include this possible additional variance (you cannot do more anyway).
If you are training all classifiers in question, you can and should set up a paired comparison: for each particular split, train and test all classifiers on exactly the same cases. This allows far more powerful statistical tests for the comparison.
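A sketch of such a paired setup with scikit-learn (data set and classifiers are again placeholders); the per-fold accuracies, or the pooled per-case predictions, can then be compared with a paired test:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)     # stand-in data set
classifiers = {                                # stand-in classifiers
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(random_state=0),
}

# One fixed set of splits, reused for every classifier, so that all models are
# trained and tested on exactly the same cases (paired design).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_acc = {name: [] for name in classifiers}

for train_idx, test_idx in cv.split(X, y):
    for name, clf in classifiers.items():
        clf.fit(X[train_idx], y[train_idx])
        fold_acc[name].append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

for name, accs in fold_acc.items():
    print(f"{name}: per-fold accuracies {np.round(accs, 3)}")
```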