Let's suppose we train a model with 10-fold cross validation. For hyperparameter selection, one can take all combinations of hyperparameters using grid-search. My question: can the test accuracy be higher in combination A than B though the validation accuracy is higher in B than A?
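To make the setup concrete, here is a minimal sketch of what I mean, assuming scikit-learn; the dataset, model and grid are placeholders rather than my actual pipeline:

```python
# Sketch only: placeholder dataset, model and grid, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Grid search over all hyperparameter combinations with 10-fold CV.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

# For each combination: mean accuracy over the 10 validation folds
# (GridSearchCV calls this "mean_test_score") vs. accuracy on the held-out test set.
for params, val_acc in zip(search.cv_results_["params"],
                           search.cv_results_["mean_test_score"]):
    test_acc = SVC(**params).fit(X_train, y_train).score(X_test, y_test)
    print(params, f"validation={val_acc:.3f}", f"test={test_acc:.3f}")
```

The question is whether the ranking of the combinations by the validation column can differ from the ranking by the test column.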
- Since the terminology wrt. validation and testing is not universal (https://stats.stackexchange.com/q/525697/4598), please clarify what is what in your question. – cbeleites unhappy with SX Jun 13 '21 at 12:17
- Validation accuracy: accuracy on the held-out part of the training data that is not available during model training. This validation set is the randomly chosen 10% in each round of 10-fold CV and is used for hyperparameter tuning. Test accuracy: accuracy on the test dataset, which is available neither during model training nor during hyperparameter selection. This dataset is used only for reporting the final model accuracy. – hwloom Jun 13 '21 at 12:31
- So, since you use the validation set to *tune* your model, i.e. you select situations where the validation result looks *good*, why would you expect it to *not* look better than an independent test of the final model later on? – cbeleites unhappy with SX Jun 13 '21 at 12:33
1 Answer
The purpose of the validation set is to give you a view of how well your model performs on unseen data while still allowing some tuning of the model through techniques such as early stopping and hyperparameter selection. Although we aren't explicitly training on the validation set, it still influences the training process. The test set, on the other hand, remains completely untouched and unseen until you have finalised and frozen your model.
Since the validation and test sets contain different examples, there is no reason a particular model must score the same on both; either one can come out higher. The validation set simply gives an idea of how the model generalises to that particular unseen data, which may or may not be representative of your test set and of the learning problem in general.
Cross-validation makes things a little trickier. Even though each fold has its own unseen data, the score used to judge a hyperparameter combination is the average over all folds, and that average still reflects characteristics of the training set. The combination that looks best by this average may therefore not be the one that generalises best to the test set.
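As a hedged illustration of that averaging step, here is a minimal sketch assuming scikit-learn, a synthetic dataset and an arbitrary grid of regularisation strengths (none of which come from the question):

```python
# Sketch only: synthetic data and an illustrative grid, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

candidates = [0.01, 0.1, 1.0, 10.0]  # illustrative hyperparameter grid (here: C)
mean_cv_acc = []
for C in candidates:
    # 10 per-fold accuracies on held-out folds, then the mean that drives selection
    fold_acc = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                               X_train, y_train, cv=10, scoring="accuracy")
    mean_cv_acc.append(fold_acc.mean())

best_C = candidates[int(np.argmax(mean_cv_acc))]
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("chosen by mean CV accuracy:", best_C,
      "| test accuracy:", final.score(X_test, y_test))
# A combination that loses on the mean CV accuracy can still score higher on
# the test set, because the mean only summarises behaviour on the training data.
```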
And finally, the quality of your test set may be poor. It may be unrepresentative of the training data, and it may be easier or harder to fit than the training data. If you're using a large-scale, popular dataset that provides separate training and test (and, if you're lucky, validation) sets, you can be fairly confident that the quality of your datasets is good, but you still cannot be certain.
