In machine learning, to get an unbiased estimate of model performance, we split the data 80:20 into training and test sets. We use the training set for model training and for model selection according to the cross-validation error. After we finalize the model, we use the test set to estimate how the model will perform on unseen data. The cross-validation error is a biased estimate, since it is used in the model selection process.
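
For concreteness, here is a minimal sketch of that workflow in scikit-learn (the toy dataset and the candidate models are illustrative assumptions, not part of the question):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 80:20 split; the test set is held back until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Model selection on the training set via cross-validation error.
candidates = {C: LogisticRegression(C=C, max_iter=5000) for C in (0.01, 0.1, 1.0)}
cv_scores = {C: cross_val_score(m, X_train, y_train, cv=5).mean()
             for C, m in candidates.items()}
best_C = max(cv_scores, key=cv_scores.get)

# Final, single use of the test set on the chosen model.
final_model = candidates[best_C].fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```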

However, with a relatively small data sample (hundreds to maybe a few thousand), the test set after an 80:20 split is small, so the test error we get is expected to be highly variable: different random splits might give quite different test errors.
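
That variability is easy to demonstrate. A rough sketch, assuming a small synthetic dataset (n = 300, so only 60 test points per split): repeating the 80:20 split with different seeds moves the test error around noticeably.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

test_errors = []
for seed in range(50):
    # A fresh 80:20 split for each seed.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    test_errors.append(1 - model.score(X_te, y_te))

# With only 60 test points, the spread across splits is substantial.
print(f"test error: mean={np.mean(test_errors):.3f}, "
      f"std={np.std(test_errors):.3f}, "
      f"range=({min(test_errors):.3f}, {max(test_errors):.3f})")
```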

My question is: does it still make sense to use the test set once to get a final estimate, or does it make more sense to just report the cross-validation error?

zesla

1 Answer

If the test set is obtained by randomly splitting off part of the available data (which is equivalent to stopping a $k$-fold CV after the first fold), then cross-validation is better.
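
A quick way to see this equivalence: with an 80:20 split, a single random test set is just the first fold of a shuffled 5-fold CV, while full CV averages over all five folds so that every observation is tested once. A sketch on an assumed synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = LogisticRegression(max_iter=5000)

# Each fold of a shuffled 5-fold CV tests on 20% of the data.
fold_scores = cross_val_score(model, X, y,
                              cv=KFold(n_splits=5, shuffle=True,
                                       random_state=0))

print("single-fold estimate :", fold_scores[0])      # 'stop after first fold'
print("cross-validation mean:", fold_scores.mean())  # pools all 5 folds
```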

Single test sets become worthwhile (with smallish sample sizes) when they test aspects of model validity or possible confounding factors that cross-validation cannot test: performance on data obtained later on (to include possible seasonal or longer-term drift), a test set obtained at other geographic locations, a test set obtained from industrial production lines when the model was set up with lab-prepared calibration samples, and so on. It sometimes also makes sense to use a single test set when its independence can be more easily ensured by organizational measures.

cbeleites unhappy with SX