I think this is an "old" question about cross-validation. I benefited from reading the posts and threads How do you use test data set after Cross-validation, What is the difference between test set and validation set?, and Why only three partitions? (training, validation, test), but I still have the following question:
Suppose we have a data set (with both features X and labels Y) for supervised learning, and we split it into training, validation, and test sets. Then:

1. We train our candidate models (each with a different value of its hyper-parameter(s), e.g., the degree of a polynomial) on the training set.
2. We tune the hyper-parameter on the validation set, i.e., we evaluate the models from Step 1 on the validation set and pick the winner.
3. We apply the tuned model (the winner from Step 2) to the test set to get an estimate of its performance on new real-world data, and at that point we stop tuning the hyper-parameter.
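To make the discussion concrete, here is a minimal sketch of this procedure in Python/scikit-learn, using polynomial degree as the single hyper-parameter. The synthetic data, the degree grid, and the 60/20/20 split are all arbitrary choices on my part, not part of any standard recipe:

```python
# Minimal sketch of the train/validation/test procedure above,
# using polynomial degree as the single hyper-parameter.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=600)

# Split into train / validation / test (60 / 20 / 20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 1-2: fit one model per hyper-parameter value on the training set,
# score each on the validation set, and keep the winner.
val_mse, models = {}, {}
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    models[degree] = model
    val_mse[degree] = mean_squared_error(y_val, model.predict(X_val))

best_degree = min(val_mse, key=val_mse.get)

# Step 3: report the winner's error on the untouched test set; no further tuning.
test_mse = mean_squared_error(y_test, models[best_degree].predict(X_test))
print(f"chosen degree: {best_degree}, "
      f"validation MSE: {val_mse[best_degree]:.3f}, test MSE: {test_mse:.3f}")
```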
Now, let's look at Steps 1 and 2. It is clear that the model with the best performance on the training set does not necessarily perform best on the validation set, since it could be overfitting the training set. That is why we have the validation set: to guard against overfitting the training set. We will come back to this point in a minute.
Now let's look at Steps 2 and 3. Say that after Step 2 we have chosen the model (i.e., a specific value of the hyper-parameter) with the best performance on the validation set. If we take its validation score as an estimate of its actual performance on new real-world data, we are usually too optimistic, precisely because we selected that model for its good validation score. That is why we apply the winner from Step 2 to the test set: the score obtained there gives a less biased estimate of this model's performance on new real-world data.
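Here is a toy illustration of that optimism, under the (made-up) assumption that every candidate model has exactly the same true error and the validation scores are just noisy estimates of it. Picking the minimum of noisy estimates gives a score that is better than the truth on average:

```python
# Toy illustration of the optimism in Step 2: if several models all have the
# same true error, the winner's validation score still looks better than the
# truth, because we picked the minimum of noisy estimates. Purely synthetic.
import numpy as np

rng = np.random.default_rng(1)
true_error = 1.0            # assumed identical true error for every candidate
n_models, n_repeats = 20, 10_000

# validation error = true error + estimation noise
val_errors = true_error + rng.normal(scale=0.1, size=(n_repeats, n_models))
winner_scores = val_errors.min(axis=1)   # Step 2: pick the best-looking model

print("mean validation score of the winner:", winner_scores.mean())  # below 1.0 on average
print("true error of every model:          ", true_error)
```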
My question is: in Step 2, will the winner, i.e., the model with the best performance on the validation set, also have the best performance on the test set? Here I only care about model selection, i.e., about ranking the different models; at this step I do not care whether the validation score is a good estimate of the model's actual performance. To me, the whole procedure sounds like the winner of Step 2 is simply overfitting the validation set, which is no different in logic from overfitting the training set. I hope I am missing some critical point here and that someone can correct me. If this logic holds, the winner of Step 2 will not necessarily have the best, or even somewhat better, performance on the test set than the other models we tried. That raises two questions: if Step 2 is just overfitting the validation set, what is the point of this validation step at all? And what criterion should we follow to select the model that is ultimately used (e.g., the one applied to the test set)?
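If it helps, here is one way my question could be probed empirically: repeat the whole split-train-select procedure many times and record how often the Step-2 winner is also the test-set winner. This is only a sketch under a synthetic data-generating process that I made up, so it cannot answer the question in general:

```python
# Empirical probe: how often is the validation-set winner also the test-set
# winner? The data-generating process, model grid and split sizes are all
# arbitrary choices for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def one_run(seed, degrees=range(1, 10)):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(600, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=600)
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed)

    val_mse, test_mse = {}, {}
    for d in degrees:
        m = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(X_tr, y_tr)
        val_mse[d] = mean_squared_error(y_val, m.predict(X_val))
        test_mse[d] = mean_squared_error(y_te, m.predict(X_te))
    # True exactly when the Step-2 winner is also the best model on the test set.
    return min(val_mse, key=val_mse.get) == min(test_mse, key=test_mse.get)

agreement = np.mean([one_run(s) for s in range(200)])
print(f"validation winner == test winner in {agreement:.0%} of runs")
```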
I would argue that if we have multiple validation sets and we pick the model with the best average performance across all of them, then this winner will almost surely also be the winner on a new test set. This is the idea behind n-fold cross-validation, and it uses all of the data to train the model, which I think is better than splitting the data set into 3 parts. In terms of pure model selection (i.e., just horse racing), I think n-fold CV is the better approach. If, in addition, we want an honest estimate of the true error rate, then we can divide the data into 2 parts, apply n-fold CV on one part for model selection, and use the other part to estimate the true error rate.
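A sketch of what I have in mind, again with scikit-learn; the 5 folds, the 80/20 outer split, and the degree grid are arbitrary assumptions:

```python
# Sketch of the alternative described above: k-fold CV on one part of the data
# for model selection, then a single held-out part for an honest error estimate.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=600)

# Hold out 20% once; never touch it during model selection.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection: the average CV error over the folds plays the role of
# "average performance over multiple validation sets".
cv_mse = {}
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse[degree] = -cross_val_score(model, X_dev, y_dev, cv=5,
                                      scoring="neg_mean_squared_error").mean()

best_degree = min(cv_mse, key=cv_mse.get)

# Refit the winner on all development data, then estimate its error once on the held-out part.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(X_dev, y_dev)
print("chosen degree:", best_degree)
print("held-out MSE:", mean_squared_error(y_test, final_model.predict(X_test)))
```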
If the training, validation, and test sets all come from the same distribution and are all large, should we expect the winner of Step 2 to also be the winner in Step 3? Are there any theoretical results that give this procedure at least some justification?