Why do we reserve a test set for final model evaluation?
Let's say you select a model using nested k-fold cross-validation: out of the many candidates you tried, you end up with one really good model along with an estimate of its generalization performance. You might then even choose to retrain this best model on all of the available training data (not reserving any for validation).
Finally, the literature has always told me to evaluate it on the test set to get some performance metric. Then what? You have already arrived at your best model via the cross-validation scheme, so what is this performance metric actually used for?
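For concreteness, here is roughly the workflow I have in mind, as a minimal sketch with scikit-learn (the dataset, estimator, and parameter grid are arbitrary placeholders, not part of my actual setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Inner loop: hyperparameter tuning via cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner_search = GridSearchCV(SVC(), param_grid, cv=5)

# Outer loop: estimate the generalization performance of the whole
# selection procedure (this is the "nested" part).
outer_scores = cross_val_score(inner_search, X_train, y_train, cv=5)
print("Nested CV estimate:", outer_scores.mean())

# Refit the selected model on all of the training data ...
inner_search.fit(X_train, y_train)

# ... and finally evaluate it once on the untouched test set.
print("Test set score:", inner_search.score(X_test, y_test))
```

My question is about what that last line buys me, given that the nested CV estimate already exists.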
In a machine learning competition like Kaggle, it is useful to rank competitors by their performance on a fixed test set and pick a winner. But if you are developing a model for some practical application within a company, what does test set performance give you when you already have your best model, and an estimate of its generalization performance, from cross-validation?
Looking at test set performance and repeatedly tuning the model in response risks overfitting the model to the test set. So unless you engage in this risky back-and-forth tuning dance (which you obviously shouldn't, since it optimistically biases your performance estimate), what is the point of a test set at all?