
I'm trying to build a two-class classifier on a dataset of around 570 samples. I'm evaluating several classification strategies (LDA, QDA, RDA, logistic regression, logistic regression with additional elements such as splines, ...). I'm having difficulty deciding whether I should set aside 25% of my 570 samples as a test set, or just rely on 10-fold cross-validation to estimate my test error.

I know this subject has been discussed here on Stack Exchange before, but I'm still confused when I read the answers.

What I'm thinking right now is that I have to take out 25%, keep it aside, and build my models using the other 75% of the data (the training data). I would also perform 10-fold cross-validation on this training data and use the cross-validation error to decide between the several classification methods/models.

After selecting a final model, I would do a last performance check with the test data (which has not been used otherwise). I cannot use this test data to compare classification methods.
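To make this concrete, here is a minimal sketch of the workflow I have in mind, assuming Python/scikit-learn (the candidate models and the placeholder data are just illustrations, not my actual dataset):

```python
# Sketch of the proposed workflow: 75/25 split, 10-fold CV on the
# training part for model selection, final check on the held-out 25%.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression

X, y = np.random.rand(570, 10), np.random.randint(0, 2, 570)  # placeholder data

# hold out 25% as a final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

candidates = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "logistic": LogisticRegression(max_iter=1000),
}

# 10-fold CV on the training data to compare classifiers
cv_scores = {name: cross_val_score(clf, X_train, y_train, cv=10).mean()
             for name, clf in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# refit the winner on all training data, then one final check on the test set
best = candidates[best_name].fit(X_train, y_train)
print(best_name, "test accuracy:", best.score(X_test, y_test))
```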

Am I correct here?

best regards

statastic

1 Answer


IMHO the most important point is to realize that you have two different (actually independent) issues:

  1. If you want to do data-driven optimization, you need a nested (aka double) testing set-up. The inner testing does the optimization, then an outer validation estimates the selected model's performance (a sketch follows after this list).

  2. For both the inner and the outer testing steps, you can choose any suitable validation strategy, i.e. independent/held out test set or resampling (all flavours of cross validation or out-of-bootstrap etc.)
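Since point 1 is what the question is about, here is a minimal sketch of such a nested set-up, assuming Python/scikit-learn (the candidate list and the placeholder data are mine, not part of the original answer):

```python
# Nested (double) cross-validation: the inner CV performs the data-driven
# model selection, the outer CV estimates the performance of the whole
# "select-then-fit" procedure.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression

X, y = np.random.rand(570, 10), np.random.randint(0, 2, 570)  # placeholder data

# inner testing: choose among the candidate classifiers
pipe = Pipeline([("clf", LogisticRegression(max_iter=1000))])
param_grid = [{"clf": [LinearDiscriminantAnalysis()]},
              {"clf": [QuadraticDiscriminantAnalysis()]},
              {"clf": [LogisticRegression(max_iter=1000)]}]
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
selector = GridSearchCV(pipe, param_grid, cv=inner_cv)

# outer testing: estimate the generalization error of that selection procedure
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
outer_scores = cross_val_score(selector, X, y, cv=outer_cv)
print("nested CV estimate: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```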

To me it does not sound particularly convincing to argue that 1/4 of the data set is large enough to give you a good (precise) independent test set estimate, while you cannot afford that for the inner (optimization) test. And I find it particularly unconvincing if a back-of-the-envelope calculation tells me that the precision of the final outer test is too low to actually distinguish the performances among which the selection took place. OTOH, that seems to be a frequently used setup.
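To put a rough number on that back-of-the-envelope calculation (the observed accuracy of 0.85 is purely illustrative; the 570 cases are taken from the question):

```python
# Back-of-the-envelope precision of a held-out test set estimate
# (normal approximation to the binomial; numbers are illustrative only).
import math

n_test = int(0.25 * 570)      # ~142 cases held out
p_hat = 0.85                  # assumed observed accuracy
se = math.sqrt(p_hat * (1 - p_hat) / n_test)
print("95%% CI roughly: %.3f +/- %.3f" % (p_hat, 1.96 * se))
# -> about +/- 0.06: too coarse to tell, say, 0.83 from 0.87 apart.
```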

Personally, unless computation times are prohibitive (months), I'd go for iterated $k$-fold cross validation or out-of-bootstrap for inner as well as outer testing.
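A minimal sketch of what iterated $k$-fold and out-of-bootstrap could look like, assuming scikit-learn (the classifier and the placeholder data are illustrative only):

```python
# Iterated (repeated) k-fold cross-validation: run k-fold several times
# with different random splits and pool the surrogate-model results.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.utils import resample

X, y = np.random.rand(570, 10), np.random.randint(0, 2, 570)  # placeholder data

rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=rkf)
print("iterated 10-fold: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Out-of-bootstrap: train on a bootstrap resample, test on the cases
# that did not make it into the resample.
oob_scores = []
for i in range(100):
    train_idx = resample(np.arange(len(y)), random_state=i)   # drawn with replacement
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)     # out-of-bootstrap cases
    clf = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
    oob_scores.append(clf.score(X[test_idx], y[test_idx]))
print("out-of-bootstrap: %.3f +/- %.3f" % (np.mean(oob_scores), np.std(oob_scores)))
```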

  • There are genuine advantages of independent test sets, e.g. you basically cannot measure drift and the applicability of the model for future cases with anything but cases that were measured later. But that requires that the independent test set was not just set aside randomly. In practice it would then come from a validation study.

  • Do make sure your inner testing has the power (precision/variance is the bottleneck) to actually distinguish performance changes meaningfully for the optimization.
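As a rough illustration of such a power check (assuming a normal approximation for two independent proportions; the target accuracies are made up, and a paired comparison on the same splits would need somewhat fewer cases, but the order of magnitude is the point):

```python
# Rough sample-size check: how many test cases are needed to distinguish
# e.g. 80% from 85% accuracy (two-sided alpha = 0.05, power = 0.80)?
# Normal approximation for two independent proportions; illustrative only.
import math

p1, p2 = 0.80, 0.85
alpha_z, power_z = 1.96, 0.84            # z-quantiles for alpha/2 and power
p_bar = (p1 + p2) / 2
n = ((alpha_z * math.sqrt(2 * p_bar * (1 - p_bar))
      + power_z * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
print("roughly %d test cases per classifier" % math.ceil(n))
# -> on the order of 900 cases per classifier: far more than 570 samples provide.
```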

cbeleites unhappy with SX
  • So if I get it correctly, you suggest doing a k-fold cross validation and, on each fold, another k-fold cross validation. In this inner k-fold cross validation you do all the model comparisons to see which one performs best. The performance/generalization error of this best model is then checked in the outer cross validation. One thing I still don't get is: when you have decided on your final model and (depending on the model) hyperparameters, on which data are the parameters then estimated in the end? (On all data? On the average of the parameters in each fold?) – statastic May 19 '14 at 04:50
  • @statastic: If you observe different "optimal" hyperparameter sets (or no algorithm consistently wins) for different runs/folds of the inner cross validation, that means that the optimization is not stable. I guess you could then go e.g. for the algorithm that won most frequently. Personally, I'd tend to conclude that the optimization failed and that external knowledge (which algorithm's known characteristics are most adapted to the problem and data) is needed for the decision. Things may be different when fine-tuning continuous hyperparameters. I've burnt my fingers trying to optimize and ... – cbeleites unhappy with SX May 19 '14 at 11:08
  • ... for the moment think that at least in my field (biospectroscopy/chemometrics) few examples exist where I'd trust the optimization results to be better than a well-built model that was not subject to data-driven optimization. But then the largest study I've had my fingers in had only 80 patients, which means that comparisons will essentially be guesswork. 500 cases look much more promising (if they are actually independent cases, and this is not a 10-class problem). – cbeleites unhappy with SX May 19 '14 at 11:11