Standard practice is to split the data into a train/test split, then use the training set for hyperparameter tuning / model selection, for example with cross-validation over the whole training set. Finally, the selected, fixed model is evaluated once on the hold-out test set.
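Concretely, something like this minimal sketch (using scikit-learn just for illustration; the SVC estimator and the `C` grid are arbitrary placeholders):

```python
# Sketch of the standard workflow: one hold-out split, tuning via CV on the
# training set only, a single final evaluation on the test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Single hold-out split; the test set is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameter tuning / model selection via cross-validation on the training set.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# The fixed, refitted best model is evaluated once on the hold-out test set.
print("hold-out accuracy:", search.score(X_test, y_test))
```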
With small datasets this is a serious limitation: a test set that is too small gives performance estimates with high variance, while a training set that is too small is not enough to train a decent model.
To address this, one can repeat the whole process: split the data again into a different train/test split, re-run the tuning / model selection on the new training set, and evaluate on the new test set. Repeat until all of the data has been used as test data (with a possibly different model each time), then average the performance over the splits. Something like the sketch below.
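This is essentially what I understand as nested cross-validation; a rough sketch of what I mean (again scikit-learn for illustration, with the same placeholder estimator and grid):

```python
# Sketch of the repeated / nested version: every sample ends up in an outer
# test fold exactly once, with tuning re-done inside each outer training fold,
# and the outer test scores are averaged.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tuning / selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

# Each outer split tunes (and so may select) a different model on its training
# fold, then scores it on its own held-out fold.
tuned = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
outer_scores = cross_val_score(tuned, X, y, cv=outer_cv)

print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```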
Is this methodology correct / unbiased, or are there better alternatives?