2

Assume that I randomly split the data into training and test sets, build a machine learning model on the training set, and then evaluate its accuracy on the test set, where it turns out to be 0.8. I think we can say that the expected accuracy of the model on future (unseen) data is 0.8. But if I try many different algorithms, I will begin to overfit the test set (which is typically the case in many Kaggle competitions). Note that this overfitting will happen even if I use a separate validation set for hyperparameter tuning.

Somewhat informally, we can say that if one uses the test set only once to estimate accuracy, there is no problem. But if one uses it many times, one overfits the test set, and the measured accuracy is no longer a good estimate of the true error of the model.

My question is: is there a formal treatment of this rather informal statement, for example one that says that if you try $n$ different models on the test set, then the confidence in the estimated accuracy decreases as a function of $n$? I am asking because I cannot see any way to prevent overfitting on the test set in an offline experimental setup where one tries many different algorithms to find the best model. Thanks.
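To make the concern concrete, here is a minimal simulation sketch of the effect I mean, under simplifying assumptions of my own (every candidate model has the same true accuracy of 0.8, the test set has $m$ independent examples, and the specific values of $m$ and the number of repetitions are just illustrative): scoring $n$ candidates on one fixed test set and keeping the best score gives an estimate that is biased upward, and the bias grows with $n$.

```python
# Illustrative simulation: n candidate models, all with true accuracy 0.8, are
# scored on the same fixed test set of m examples; we keep the best observed
# score. The selected score overestimates the true accuracy more as n grows.
import numpy as np

rng = np.random.default_rng(0)
m = 1000          # test-set size (illustrative choice)
true_acc = 0.8    # true accuracy of every candidate model
n_repeats = 2000  # Monte Carlo repetitions

for n_models in [1, 10, 100, 1000]:
    # each model's observed test accuracy is Binomial(m, true_acc) / m
    scores = rng.binomial(m, true_acc, size=(n_repeats, n_models)) / m
    best = scores.max(axis=1)  # accuracy of the "winning" model in each repeat
    print(f"n = {n_models:5d}: mean accuracy of selected model = {best.mean():.3f} "
          f"(true accuracy = {true_acc})")
```

Even though every candidate is equally good, the mean of the selected score drifts upward as $n$ grows, which is exactly the overfitting-on-the-test-set effect I would like to see treated formally.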

Sanyo Mn
  • If you’re doing hyper-parameter tuning using a validation set why would you be evaluating using the test set multiple times? – astel Oct 14 '21 at 17:45
  • @astel suppose that I want to try different ML algorithms such as decision trees, linear regression, SVM, and so forth. I use the validation set to tune the hyperparameters of these algorithms. After hyperparameter tuning I need to see the performance on the test set. So if I use 10 different algorithms it means that I need to use the test set 10 times. – Sanyo Mn Oct 14 '21 at 19:43
  • 1
    Except that isn’t how it works. Algorithm selection and hyper-parameter selection are one and the same. Choosing between a decision tree with depth three and one with depth four is no different from choosing between a decision tree with depth three and a logistic regression. You need to combine algorithm selection and hyper-parameter selection, otherwise you would need yet another test set. See this answer https://stats.stackexchange.com/questions/494900/how-to-avoid-overfitting-bias-when-both-hyperparameter-tuning-and-model-selectin/495120?noredirect=1#comment919818_495120 – astel Oct 14 '21 at 22:54
  • Treating algorithm selection as hyper-parameter selection makes sense, thank you @astel. – Sanyo Mn Oct 15 '21 at 07:11

1 Answer

0

The way to test a machine learning model, i.e. to determine its expected accuracy on new data, is to use a third set of data that has not been used for training or validation.

What is the difference between test set and validation set?

If you use the test set as some sort of second layer of cross validation, to filter different algorithms according to their performance, then it is not a valid test but acts effectively as a validation set.

  • This indeed happens with Kaggle competitions, and with p-values in research; both are susceptible to publication bias.

    There is no real formal way to deal with this.

    The rigorous way to solve it is to re-test with a fourth, independent test set.

    A simple way is to set stricter standards (it is one of the reasons for the $5\sigma$ standard: Origin of "5$\sigma$" threshold for accepting evidence in particle physics?).

  • If this happens in your own research, you could save data by selecting the algorithm along with the hyper-parameter tuning. The choice between algorithms can be seen as a kind of hyper-parameter tuning (the choice of model being the parameter); a minimal sketch of this is shown below.
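Here is a minimal sketch of that last point, assuming scikit-learn (the particular estimators and grids are only illustrative, not a prescription): the pipeline step `clf` is itself treated as a hyper-parameter, so a single cross-validated search covers both the choice of algorithm and its settings, and the test set is touched only once at the very end.

```python
# Sketch: algorithm choice as a hyper-parameter in one cross-validated search,
# so the held-out test set is used only once for the final estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "clf" is a placeholder step; the grid swaps in whole estimators plus their settings.
pipe = Pipeline([("clf", DecisionTreeClassifier())])
param_grid = [
    {"clf": [DecisionTreeClassifier()], "clf__max_depth": [3, 4, 5]},
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1, 10]},
    {"clf": [SVC()], "clf__C": [0.1, 1, 10]},
]

search = GridSearchCV(pipe, param_grid, cv=5)  # validation happens inside the CV
search.fit(X_train, y_train)
print("selected model:", search.best_params_)
print("test accuracy (test set used once):", search.score(X_test, y_test))
```

The final `search.score(X_test, y_test)` is then the single use of the test set; if you go on to compare several such searches against each other on that same test set, you are back in the situation described in the question and would need yet another held-out set.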

Sextus Empiricus
  • I know the difference between validation and test set, please check my reply to @astel above. – Sanyo Mn Oct 14 '21 at 19:45
  • @Sanyo maybe I am not seeing the problem so clearly. If you use the test set as some sort of second layer of cross validation, to filter different algorithms according to their performance, then it is not a valid test. It seems a bit trivial to me and I wonder what needs to be formal about this. The same is true for Kaggle competitions or p-values in research, it is all susceptible to confirmation bias. The way to solve it is re-test it with a fourth test or stricter standards. (For selecting algorithms you could save data by selecting the algorithms along with the hyper-parameter tuning) – Sextus Empiricus Oct 14 '21 at 20:23
  • So, if the difference between testing and cross validation is clear, then what is unclear? – Sextus Empiricus Oct 14 '21 at 20:24