The comments give good explanations. I also want to point out that what we really want to know is: given some training data $\mathcal{T}$, what is the expected loss $L(f(X),Y)$ for our model $f$ trained on this particular data, i.e., the conditional expectation $\mathbb{E}_{XY}\big[L(f(X),Y) \mid \mathcal{T}\big]$?
However, if we only use cross-validation errors, we don't end up estimating this; instead we estimate the marginal expectation over all possible training sets, i.e., $\mathbb{E}_{\mathcal{T}}\big[\mathbb{E}_{XY}[L(f(X),Y) \mid \mathcal{T}]\big]$.
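To make the distinction concrete, in the notation of Hastie et al. (*The Elements of Statistical Learning*, §7.12) the two targets are

$$\mathrm{Err}_{\mathcal{T}} = \mathbb{E}_{XY}\big[L(f(X),Y)\mid \mathcal{T}\big] \qquad\text{vs.}\qquad \mathrm{Err} = \mathbb{E}_{\mathcal{T}}\big[\mathrm{Err}_{\mathcal{T}}\big],$$

and cross-validation estimates $\mathrm{Err}$, not $\mathrm{Err}_{\mathcal{T}}$.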
Why? Each CV training fold can be thought of as roughly analogous to a bootstrap sample from your available data, which in turn approximates a draw from the underlying data population (this is the key idea behind the bootstrap). So you are training your model on different samples that are approximately drawn from the overall data distribution. Since CV thereby averages over many possible training sets (only approximately, just as in bootstrapping), the average CV loss does not reflect the expected loss of the model trained on your actual training set alone; it mixes in losses from training sets you did not get, but could have.
So the only way to really get at the conditional expected loss is to train the model on your full data and then evaluate it on a large amount of fresh data; a third, held-out set of new data lets you approximate this.
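Here is a minimal simulation sketch of both points, assuming a toy linear-Gaussian data-generating process (all names and settings below are invented for illustration; scikit-learn is used for the model and CV). A large fresh sample stands in for the population, so we can approximate $\mathrm{Err}_{\mathcal{T}}$ for each simulated training set and compare it to that set's 10-fold CV estimate:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p, sigma = 50, 5, 1.0          # training-set size, features, noise level
beta = rng.normal(size=p)         # fixed "true" coefficients

def draw(n_samples):
    """Draw a sample from the assumed population."""
    X = rng.normal(size=(n_samples, p))
    y = X @ beta + sigma * rng.normal(size=n_samples)
    return X, y

# Large fresh sample used to approximate the true conditional risk Err_T
X_pop, y_pop = draw(100_000)

cv_errs, cond_errs = [], []
for _ in range(200):              # 200 hypothetical training sets T
    X, y = draw(n)
    model = LinearRegression().fit(X, y)
    # Err_T: expected loss of the model trained on this particular T,
    # approximated by testing against lots of new data
    cond_errs.append(np.mean((y_pop - model.predict(X_pop)) ** 2))
    # 10-fold CV estimate computed from T alone
    mse = -cross_val_score(LinearRegression(), X, y,
                           scoring="neg_mean_squared_error", cv=10)
    cv_errs.append(mse.mean())

print(f"mean CV error:          {np.mean(cv_errs):.3f}")   # ~ marginal Err
print(f"mean conditional error: {np.mean(cond_errs):.3f}")  # close on average
print(f"corr(CV, conditional):  {np.corrcoef(cv_errs, cond_errs)[0, 1]:.3f}")
```

In runs like this the two means agree closely (CV is slightly pessimistic because each fold trains on $n(k-1)/k$ points), but the correlation between a given training set's CV error and its conditional error is typically small: CV tracks the marginal expectation $\mathrm{Err}$, not $\mathrm{Err}_{\mathcal{T}}$.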