I think this is an "old" question about cross-validation. I benefited from reading the posts and threads How do you use test data set after Cross-validation, What is the difference between test set and validation set?, and Why only three partitions? (training, validation, test), but I still have the following question:
Suppose we have a data set (with both features X and labels Y) for supervised learning, and we split it into training, validation, and test sets. Then:

1. We train our candidate models (each with a different value of its hyper-parameter(s), e.g., the degree of a polynomial) on the training set.
2. We tune the hyper-parameter on the validation set, i.e., we evaluate the models from Step 1 on the validation set and pick the winner.
3. We apply the tuned model (the winner from Step 2) to the test set to get an estimate of its performance on new real-world data, and at that point we stop tuning the hyper-parameter.
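To make the discussion concrete, here is a minimal sketch of this procedure in Python/scikit-learn, using polynomial degree as the single hyper-parameter. The synthetic data, the degree grid, and the 60/20/20 split are all arbitrary choices on my part, not part of any standard recipe:

```python
# Minimal sketch of the train/validation/test procedure above,
# using polynomial degree as the single hyper-parameter.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=600)

# Split into train / validation / test (60 / 20 / 20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 1-2: fit one model per hyper-parameter value on the training set,
# score each on the validation set, and keep the winner.
val_mse, models = {}, {}
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    models[degree] = model
    val_mse[degree] = mean_squared_error(y_val, model.predict(X_val))

best_degree = min(val_mse, key=val_mse.get)

# Step 3: report the winner's error on the untouched test set; no further tuning.
test_mse = mean_squared_error(y_test, models[best_degree].predict(X_test))
print(f"chosen degree: {best_degree}, "
      f"validation MSE: {val_mse[best_degree]:.3f}, test MSE: {test_mse:.3f}")
```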
Now, let's look at Steps 1 and 2. It is clear that the model with the best performance on the training set does not necessarily perform best on the validation set, since it could be overfitting the training set. That is why we have the validation set: to guard against overfitting the training set. We will come back to this point in a minute.
Now let's look at Steps 2 and 3. Say that after Step 2 we have chosen the model (i.e., a specific value of the hyper-parameter) with the best performance on the validation set. If we take its validation score as an estimate of its actual performance on new real-world data, we are usually too optimistic, precisely because we selected that model for its good validation score. That is why we apply the winner from Step 2 to the test set: the score obtained there gives a less biased estimate of this model's performance on new real-world data.
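Here is a toy illustration of that optimism, under the (made-up) assumption that every candidate model has exactly the same true error and the validation scores are just noisy estimates of it. Picking the minimum of noisy estimates gives a score that is better than the truth on average:

```python
# Toy illustration of the optimism in Step 2: if several models all have the
# same true error, the winner's validation score still looks better than the
# truth, because we picked the minimum of noisy estimates. Purely synthetic.
import numpy as np

rng = np.random.default_rng(1)
true_error = 1.0            # assumed identical true error for every candidate
n_models, n_repeats = 20, 10_000

# validation error = true error + estimation noise
val_errors = true_error + rng.normal(scale=0.1, size=(n_repeats, n_models))
winner_scores = val_errors.min(axis=1)   # Step 2: pick the best-looking model

print("mean validation score of the winner:", winner_scores.mean())  # below 1.0 on average
print("true error of every model:          ", true_error)
```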
My question is: in Step 2, will the winner, i.e., the model with the best performance on the validation set, also have the best performance on the test set? Here I only care about model selection, i.e., about ranking the different models; at this step I do not care whether the validation score is a good estimate of the model's actual performance. To me, the whole procedure sounds like the winner of Step 2 is simply overfitting the validation set, which is no different in logic from overfitting the training set. I hope I am missing some critical point here and that someone can correct me. If this logic holds, the winner of Step 2 will not necessarily have the best, or even somewhat better, performance on the test set than the other models we tried. That raises two questions: if Step 2 is just overfitting the validation set, what is the point of this validation step at all? And what criterion should we follow to select the model that is ultimately used (e.g., the one applied to the test set)?
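If it helps, here is one way my question could be probed empirically: repeat the whole split-train-select procedure many times and record how often the Step-2 winner is also the test-set winner. This is only a sketch under a synthetic data-generating process that I made up, so it cannot answer the question in general:

```python
# Empirical probe: how often is the validation-set winner also the test-set
# winner? The data-generating process, model grid and split sizes are all
# arbitrary choices for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def one_run(seed, degrees=range(1, 10)):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(600, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=600)
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed)

    val_mse, test_mse = {}, {}
    for d in degrees:
        m = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(X_tr, y_tr)
        val_mse[d] = mean_squared_error(y_val, m.predict(X_val))
        test_mse[d] = mean_squared_error(y_te, m.predict(X_te))
    # True exactly when the Step-2 winner is also the best model on the test set.
    return min(val_mse, key=val_mse.get) == min(test_mse, key=test_mse.get)

agreement = np.mean([one_run(s) for s in range(200)])
print(f"validation winner == test winner in {agreement:.0%} of runs")
```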
I would argue that if we have multiple validation sets and we pick the model with the best average performance across all of them, then this winner will almost surely also be the winner on a new test set. This is the idea behind n-fold cross-validation, and it uses all of the data to train the model, which I think is better than splitting the data set into 3 parts. In terms of pure model selection (i.e., just horse racing), I think n-fold CV is the better approach. If, in addition, we want an honest estimate of the true error rate, then we can divide the data into 2 parts, apply n-fold CV on one part for model selection, and use the other part to estimate the true error rate.
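A sketch of what I have in mind, again with scikit-learn; the 5 folds, the 80/20 outer split, and the degree grid are arbitrary assumptions:

```python
# Sketch of the alternative described above: k-fold CV on one part of the data
# for model selection, then a single held-out part for an honest error estimate.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=600)

# Hold out 20% once; never touch it during model selection.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection: the average CV error over the folds plays the role of
# "average performance over multiple validation sets".
cv_mse = {}
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse[degree] = -cross_val_score(model, X_dev, y_dev, cv=5,
                                      scoring="neg_mean_squared_error").mean()

best_degree = min(cv_mse, key=cv_mse.get)

# Refit the winner on all development data, then estimate its error once on the held-out part.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(X_dev, y_dev)
print("chosen degree:", best_degree)
print("held-out MSE:", mean_squared_error(y_test, final_model.predict(X_test)))
```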
If the training, validation, and test sets all come from the same distribution and are all large, should we expect the winner of Step 2 to also be the winner in Step 3? Are there any theoretical results that give this procedure at least some justification?