
Suppose we split a dataset into 3 parts (train, validation, and test). I know it's important to make sure the test set doesn't influence our decisions during model selection or hyperparameter tuning, or else we may end up overfitting the test set and getting unrealistically good results. So it's clearly wrong to test a model, then change its hyperparameters and train, validate, and test it again on the same test set.

However, what if we trained and validated, for example, 5 different models, and decided we would not modify any of them again? Then we tested each of the 5 models on the (same) unseen test set. If we select the model that achieves the best test result, isn't this the same as trying different hyperparameter combinations and selecting the one that performs best on the test set?

In this sense, is it wrong when research papers propose multiple methods, test them on the same data, and conclude that one of them is the best because it has the best performance on the test set? Isn't the test set supposed not to influence the choice of the best method?

But if this is wrong, how are we supposed to compare methods (from different papers) on the same test set without being biased? I feel there is a contradiction regarding this point.
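
To make the scenario concrete, here is a small sketch (synthetic data and arbitrary models of my own choosing, not taken from any paper). Whether the five candidates are five frozen models or five hyperparameter settings of one model, the selection step is the same argmax over test-set scores:

```python
# Hypothetical illustration: "5 models we promise not to modify again",
# each evaluated on the shared test set, with the winner chosen by that score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Five fixed candidates (here: the same model family with different depths).
models = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
          for d in (2, 4, 6, 8, 10)]

test_scores = [m.score(X_test, y_test) for m in models]
best = int(np.argmax(test_scores))  # the test set now drives the selection -> leakage
print("test scores:", test_scores)
print("model selected because of its test score:", best)
```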

  • Have you found an answer to this? I have a similar question. Essentially, this happens with any dataset we use, right? E.g. anyone can download the MNIST data and create a train, validation, and test split. Say we create a model (e.g. linear regression) and it performs poorly on our MNIST test set. We learn from this and create a deep learning model, which now performs better on our test set. But we cannot report this as our generalization error, because in some way we've overfitted on the test set. – woowz Aug 30 '21 at 06:03
  • You can see this in the state-of-the-art benchmarks too, e.g. SQuAD (the Stanford Question Answering Dataset). Essentially, each time a new model is released we are overfitting on the test set, so we can't report that as our true generalization error. The test set effectively becomes more like a validation set. Do you agree? – woowz Aug 30 '21 at 06:09
  • Unfortunately I haven't found an answer yet. I strongly agree with your point about the state of the art. I believe the best way to really test a model is to deploy it in a real production environment and see how it works with real data. For papers whose models have not been deployed, however, I think the results will always be at risk of overfitting. – Abdulwahab Almestekawy Aug 31 '21 at 15:19
  • I vote to reopen this question, since changing a model and selecting a model are not equivalent. I still find the answers to the question helpful and relevant here as well, yet the questions are not identical/duplicates. – Nikolas Rieble Jan 31 '22 at 19:34
  • I agree with @NikolasRieble. – Matt Krause Feb 04 '22 at 17:55

1 Answer


You could consider the model itself a hyperparameter as well. If you optimize this hyperparameter using the test set and then choose the best model, you are overfitting, with the human in the loop.
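
As a minimal sketch of what I mean (my own illustrative example; the digits data and the three candidate estimators are arbitrary choices, not a prescription): the choice between models is made on a validation split, and the test set is read exactly once, for the model that has already been chosen.

```python
# Treat "which model?" like any other hyperparameter: select it on the
# validation set, then report the test score of the chosen model once.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# 60 / 20 / 20 split into train, validation, and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}

# Model selection uses only the validation set.
val_scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
              for name, model in candidates.items()}
best_name = max(val_scores, key=val_scores.get)

print("validation scores:", val_scores)
print("chosen model:", best_name)
# The test set is used exactly once, after the choice is fixed.
print("test accuracy (reported once):", candidates[best_name].score(X_test, y_test))
```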

I like the sklearn documentation on model selection, which sports the following chart:

[chart: model selection workflow from the scikit-learn documentation]

And it further states:

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
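
As a concrete illustration of that quoted workflow (only a sketch; the SVC, its C grid, and the digits data are my own assumptions, not part of the sklearn page): the hyperparameter search sees only the training data, with cross-validation splits playing the role of the validation set, and the held-out test set is evaluated once at the very end.

```python
# Tune the SVM's C without ever touching the test set, then evaluate once.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validation inside the training data selects C.
search = GridSearchCV(SVC(), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X_train, y_train)

print("C chosen without the test set:", search.best_params_)
# Final evaluation on the untouched test set, done exactly once.
print("test accuracy:", search.score(X_test, y_test))
```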

Additionally, note that what you call the test set is what others call the evaluation set. In this context, I recommend having a look at MFML 069 - Model validation done right.

Nikolas Rieble
  • Fantastic response. The statement that evaluating the performance of multiple models being used for model selection represents leakage of the test set is especially insightful. – Frank Harrell Jan 31 '22 at 13:02