In machine learning, to get an unbiased estimate of model performance, we split the data 80:20 into training and test sets. We use the training set for model training and for model selection according to the cross-validation error. After we finalize the model, we use the test set to estimate how the model will perform on unseen data. The cross-validation error is a biased estimate, since it is used in the model selection process.
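
For concreteness, here is a minimal sketch of that workflow in scikit-learn (the toy dataset and the candidate models are illustrative assumptions, not part of the question):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 80:20 split; the test set is held back until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Model selection on the training set via cross-validation error.
candidates = {C: LogisticRegression(C=C, max_iter=5000) for C in (0.01, 0.1, 1.0)}
cv_scores = {C: cross_val_score(m, X_train, y_train, cv=5).mean()
             for C, m in candidates.items()}
best_C = max(cv_scores, key=cv_scores.get)

# Final, single use of the test set on the chosen model.
final_model = candidates[best_C].fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```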

However, with a relatively small data sample (hundreds to maybe a few thousand), the test set after an 80:20 split is small, so the test error we get is expected to be highly variable: different random splits might give quite different test errors.
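
That variability is easy to demonstrate. A rough sketch, assuming a small synthetic dataset (n = 300, so only 60 test points per split): repeating the 80:20 split with different seeds moves the test error around noticeably.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

test_errors = []
for seed in range(50):
    # A fresh 80:20 split for each seed.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    test_errors.append(1 - model.score(X_te, y_te))

# With only 60 test points, the spread across splits is substantial.
print(f"test error: mean={np.mean(test_errors):.3f}, "
      f"std={np.std(test_errors):.3f}, "
      f"range=({min(test_errors):.3f}, {max(test_errors):.3f})")
```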

My question is: does it still make sense to use the test set once to get a final estimate, or does it make more sense to just report the cross-validation error?

zesla

1 Answer

If the test set is obtained by randomly splitting off part of the available data (which is equivalent to stopping a $k$-fold CV after the first fold), then cross-validation is better.
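
A quick way to see this equivalence: with an 80:20 split, a single random test set is just the first fold of a shuffled 5-fold CV, while full CV averages over all five folds so that every observation is tested once. A sketch on an assumed synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = LogisticRegression(max_iter=5000)

# Each fold of a shuffled 5-fold CV tests on 20% of the data.
fold_scores = cross_val_score(model, X, y,
                              cv=KFold(n_splits=5, shuffle=True,
                                       random_state=0))

print("single-fold estimate :", fold_scores[0])      # 'stop after first fold'
print("cross-validation mean:", fold_scores.mean())  # pools all 5 folds
```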

Single test sets become worthwhile (with smallish sample sizes) when they test aspects of model validity or possible confounding factors that cross-validation cannot test: performance on data obtained later on (to include possible seasonal or longer-term drift), a test set obtained at other geographic locations, a test set obtained from industrial production lines when the model was set up with lab-prepared calibration samples, and so on. It sometimes also makes sense to use a single test set when its independence can be more easily ensured by organizational measures.

cbeleites unhappy with SX