In François Chollet's Deep Learning with Python, it says:
As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
Central to this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But if you repeat this many times—running one experiment, evaluating on the validation set, and modifying your model as a result—then you’ll leak an increasingly significant amount of information about the validation set into the model.
Why does information about the validation data leak into the model if I only evaluate the model's performance on the validation data while tuning hyperparameters? The model is never trained on that data, so how can anything about it leak into the model?
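For concreteness, here is roughly the kind of tuning loop I have in mind. This is my own minimal sketch, not code from the book: it uses scikit-learn rather than Keras to keep it short, the hyperparameter grid and dataset sizes are arbitrary, and the data is pure noise so that no model can genuinely beat 50% accuracy.

```python
# Minimal sketch (my own, not from the book): repeatedly picking hyperparameters
# by their score on the SAME validation set. The labels are random noise, so any
# apparent skill can only come from the selection step itself.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random features and random binary labels: there is nothing real to learn.
X = rng.normal(size=(600, 20))
y = rng.integers(0, 2, size=600)
X_train, X_val, X_test = X[:200], X[200:400], X[400:]
y_train, y_val, y_test = y[:200], y[200:400], y[400:]

best_val_acc, best_model = -np.inf, None
# Try many hyperparameter settings, judging every one on the same validation set.
for depth in range(1, 11):        # hypothetical grid over tree depth ...
    for seed in range(10):        # ... and random seed (max_features makes it matter)
        model = DecisionTreeClassifier(
            max_depth=depth, max_features=0.5, random_state=seed
        ).fit(X_train, y_train)
        val_acc = model.score(X_val, y_val)  # this decision uses the validation labels
        if val_acc > best_val_acc:           # <-- the comparison is where information leaks
            best_val_acc, best_model = val_acc, model

# Typically: best validation accuracy noticeably above 0.5,
# while test accuracy of the chosen model stays close to chance.
print("best validation accuracy:", best_val_acc)
print("test accuracy of chosen model:", best_model.score(X_test, y_test))
```

If I understand the quoted passage, the "best" validation score here is just the maximum of many noisy estimates, so it overstates how good the chosen configuration really is, even though no model was ever fitted on the validation data. Is that the right way to understand the "leak", and if so, what exactly is leaking into the model?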