
I am referring to the Training / Validation / Test split used for choosing a model while guarding against overfitting.

Here is how the argument goes:

  1. We train various models on the Training set. (This one is easy.) Clearly, if there is any noise in this data set, then as we add features to the different models we will overfit the noise in the Training set. This is clear to me.

  2. Then we choose the best model on the Validation set. This will overfit the Validation set. This is not clear to me.

  3. Because the best model has been overfit to the Validation set, to get a sense of the true error it makes we should evaluate it on the Test set.

My query is: when we are doing (2), we can overfit the Validation set only if it contains the same noise as the Training set. However, we randomly shuffled the points into the Training / Validation / Test sets, so it is very unlikely that the Training and Validation sets contain the same noise (I think this phenomenon is called twinning). That is why I think we will not overfit the Validation set.

Another instance where the Validation set can be overfit is when we have a HUGE number of high-variance models; when we choose the best one on the Validation set, the choice will overfit the noise in the Validation set. But if I have only, say, 10 models, this is also unlikely.

That is why I think that we don't need a test set. I think I have misunderstood this topic. Can someone please clarify where I am wrong?

Edit: My apologies for the delay in responding. I would like to clarify my query. We may aim to find the global optimum when using the validation set; however, the contours of the functions that have been fit to the training set are not free to learn the noise in the validation set. That is what I am not convinced about. Can you please give me an example where we overfit the training set and then overfit the validation set?

I'll give you one example. Suppose we are doing k-nearest neighbours and every item in the training/validation set occurs exactly twice. Then we will overfit the training and validation sets and get k = 1: the nearest neighbour will perfectly predict any chosen point. However, in this example we have "twinning"; the SAME noise exists in the training and validation set. Can you show me an example where we overfit the training and validation sets but without twinning?
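
For concreteness, here is a minimal sketch of the twinning scenario I have in mind (the data and the use of scikit-learn's KNeighborsClassifier are just illustrative): every point appears once in the training set and once, as an exact duplicate, in the validation set, so k = 1 looks perfect on validation purely because of the twins.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = rng.integers(0, 2, size=50)        # labels are pure noise

X_train, y_train = X, y                # one copy of each point
X_val, y_val = X.copy(), y.copy()      # exact duplicates: the "twins"

for k in (1, 5, 15):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    print(f"k={k}: validation accuracy = {acc:.2f}")
# k=1 scores 1.0 only because each validation point's nearest neighbour
# is its own duplicate in the training set.
```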

user2338823

2 Answers


One thing that is not widely appreciated is that over-fitting the model selection criterion (e.g. validation set performance) can result in a model that over-fits the training data, or it can result in a model that under-fits the training data.

This example is from my paper (with Mrs Marsupial):

Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11(70):2079−2107, 2010. (www)

Here is a heat-map of the model selection criterion for tuning the hyper-parameters of a Least-Squares Support Vector Machine (or Kernel Ridge Regression model) with different training & validation samples. The training set of 256 patterns is identical each time, but a new sample of 64 patterns is used for each validation set. The criterion is a smoothed error rate on the validation set. You can see there is considerable variation between splits in the optimal hyper-parameters (yellow crosses).

[Figure: heat-maps of the validation-set model selection criterion over the hyper-parameter grid, one panel per validation sample; yellow crosses mark the selected hyper-parameters.]

Here are the corresponding models; as you can see, there is large variation in whether the models over- or under-fit the data.

[Figure: the fitted models corresponding to each panel above, showing varying degrees of over- and under-fitting.]

If you set the hyper-parameters in the position shown in (d) you tend to get a sensible model for all training-test splits, which suggests the problem is in over-fitting the model selection criterion. In this case, as it is only the validation set that changes, we know this is purely due to over-fitting the validation set during model selection.

Consider a classification task, where there are 1000 binary features and one binary response variable, but they are all generated by flipping a fair coin. We then make ten models, each of which is given a disjoint set of 100 of the attributes. We form a training set of 10 patterns, a validation set of 10 patterns and a test set of 1,000,000 patterns. Note that the validation, test and training sets are entirely independent because all of the data is generated at random with no underlying structure.

  • Each model generates its output by picking the attribute that is most similar to the response variable on the training set, out of the 100 attributes it has to choose from. If it is predicting a sequence of 10 random binary values using 100 similar sequences of 10 random variables, then it is highly likely to have an accuracy on the training set greater than 0.5, just by random chance. But we know the true optimal accuracy is only 0.5, so we know it is overfitting the training set.

  • We then use the validation set to pick the best model. We will be picking the model that has the highest accuracy on the validation set. Now in this case, it is rather less likely that the best validation set accuracy will be greater than 0.5, but I would still say it is over-fitting the validation set because you would be choosing the model purely on the basis that the randomness of one of the models was a better match than the others for the randomness of the validation data.

  • So what we will end up with is a model that obviously over-fits the training data (many degrees of freedom in selecting the attribute, so accuracy > 0.5), one that probably over-fits the validation data in the sense of accuracy > 0.5 (fewer degrees of freedom, only 10 models to choose from), but definitely overfits in the sense that the choice is dominated by the noise. But whatever the choice, the test set will show us that the final model is just guessing (which is why you need the test set or nested cross-validation); a minimal simulation of this setup is sketched below.
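
As a sketch only (the variable names and the use of NumPy are illustrative, not taken from the paper), the whole thought experiment fits in a few lines: generate pure coin-flip data, let each model pick its best training attribute, select among the ten models on the validation set, and then check the winner on the large test set, where it drops back to roughly 50% accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_models = 1000, 10          # ten models, disjoint blocks of 100 attributes
n_train, n_val, n_test = 10, 10, 1_000_000

def coins(rows, cols=None):
    """Fair coin flips: no structure relates the features to the labels."""
    return rng.integers(0, 2, size=(rows, cols) if cols else rows)

X_train, y_train = coins(n_train, n_features), coins(n_train)
X_val,   y_val   = coins(n_val,   n_features), coins(n_val)
X_test,  y_test  = coins(n_test,  n_features), coins(n_test)

# "Training": each model picks the attribute in its block that best matches
# the training labels -- this is where it over-fits the training set.
blocks = np.arange(n_features).reshape(n_models, -1)
chosen = [block[np.argmax((X_train[:, block] == y_train[:, None]).mean(axis=0))]
          for block in blocks]

# "Model selection": pick the model with the best validation accuracy --
# the choice is driven entirely by the noise in the validation set.
val_acc = [(X_val[:, f] == y_val).mean() for f in chosen]
best = chosen[int(np.argmax(val_acc))]

print("best training accuracy  :", max((X_train[:, f] == y_train).mean() for f in chosen))
print("best validation accuracy:", max(val_acc))
print("test accuracy of winner :", (X_test[:, best] == y_test).mean())   # ~0.5: just guessing
```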

The basic point is that if data has been used to optimise the model in any way, then performance measured on that data will give an optimistically biased estimate. How biased it is depends on how hard you try to optimise the model (how many feature choices, how many hyper-parameters, how fine a grid you use in grid search, etc.) and the characteristics of the dataset. In some cases it is fairly benign:

Jacques Wainer, Gavin Cawley, "Nested cross-validation when selecting classifiers is overzealous for most practical applications", Expert Systems with Applications, Volume 182, 2021. (www)

Unfortunately, sometimes it can be as large as the difference in performance between a state of the art classifier and an average one (see Cawley and Talbot).
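
A rough sketch of the nested cross-validation mentioned above, using scikit-learn (the dataset, estimator and grid here are placeholders, not anything from the papers): the inner loop performs the model selection and the outer loop evaluates the whole selection procedure, so the tuning cannot bias the final estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # model selection (validation role)
outer = cross_val_score(inner, X, y, cv=5)                  # performance estimation (test role)

print(f"nested CV accuracy: {outer.mean():.3f} +/- {outer.std():.3f}")
# inner.fit(X, y).best_score_ would be the optimistically biased figure,
# because the same folds were used both to choose C/gamma and to score them.
```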

Dikran Marsupial
  • Many thanks for your reply. I need some time to read the 2 papers you have mentioned. I would like to come back for a discussion after that. – user2338823 Nov 15 '21 at 13:56
  • @user2338823 you have edited some text (additional questions) into this answer, but that makes it very difficult to follow the answer. While reading Dikran Marsupial's answer I suddenly get the text "awesome simulation. I have some queries..." and it makes me wonder what Dikran Marsupial is talking about (but only later I see that it is not Dikran Marsupial talking at that point). – Sextus Empiricus Nov 21 '21 at 14:00
  • @SextusEmpiricus My apologies. I wanted it to be convenient for Dikran Marsupial to read my query. How can I fix my mistake? Perhaps I can annotate my section of the response with my name? – user2338823 Nov 21 '21 at 14:03
  • Or post at the bottom of his reply with a reference to the place I have inserted my query? – user2338823 Nov 21 '21 at 14:11
  • @user2338823 It is probably best to ask the questions in the comments, rather than editing the answer. In these simulations, the training set of 256 patterns is fixed, so the differences are purely due to the sampling of the 64-pattern validation set. It would not be a good idea to control for duplicates because the aim is to assess the effect of sampling variation, and it would no longer be an i.i.d. sample if we controlled for duplicates. Also, there will be no exact duplicates, as the data are continuous values sampled from a set of normal distributions. – Dikran Marsupial Nov 22 '21 at 08:40
  • @DikranMarsupial I will put my queries in the comments in the evening. Please allow me some time. – user2338823 Nov 22 '21 at 08:44
  • @user2338823 no problem, as long as you tag me in the comment I will see it (but I am an examiner for a PhD student this week and have to read the thesis a couple more times, so I may not respond all that quickly) – Dikran Marsupial Nov 22 '21 at 08:46
  • @DikranMarsupial sure, please take your time. – user2338823 Nov 22 '21 at 08:47
  • I should add, that in these simulations there is no reuse of test samples. It is a synthetic problem, so I can generate as much i.i.d. data as I like, so the validation sets in each case were completely separate samples. – Dikran Marsupial Nov 22 '21 at 10:33
  • I am experiencing 2 difficulties. 1. I am not able to post my query in the comment box here as it says it is too long. So I think I should put my query at the END of the post by Dikran Marsupial by editing it ***or*** do I post RIGHT at the end where it says Answer Your Question? 2. How do I cross-reference places in Dikran's post? Can someone please clarify? – user2338823 Nov 22 '21 at 12:36
  • I often split long questions/observations over more than one comment block, it doesn't really affect readability. – Dikran Marsupial Nov 22 '21 at 12:39
  • I would like to summarize Figure 6 in the following way: We can have 3 kinds of training-validation splits. Type 1: the train set consists mostly of "easy to classify" examples, and so does the validation set. In this case an overly simple model may overfit the validation set and beat complicated models. Type 2: the train set consists of a healthy mix of "easy to classify" examples and some hard-to-classify ones, and so does the validation set. The "hard to classify" examples are outliers and there won't be too many of them. In this case a model of mediocre complexity may do very well on the validation set. – user2338823 Nov 22 '21 at 12:45
  • Type 3: the train set consists of a LOT of hard-to-classify examples and some easy-to-classify examples. The validation set consists of easy-to-classify plus some hard-to-classify examples. Now an overly complicated model may beat all models simply because it has seen the hard-to-classify examples during training. Would that "theory" be correct? – user2338823 Nov 22 '21 at 12:46
  • "Type 1: Train set consists mostly of "easy to classify" examples" in the simulations, the training set is identical in each case (look at some of the outlying patterns and you will see that they are the same in each diagram). The differences are purely due to the different samples making up the validation sets. – Dikran Marsupial Nov 22 '21 at 12:47
  • Regarding the classification task. May I say the following: The train and validation set are small. They may be similar by chance. Hence the model which picks the attribute most similar to the train set will ALSO do well on the validation set and have accuracy >.5 on the validation set purely by chance. – user2338823 Nov 22 '21 at 12:52
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/131660/discussion-between-dikran-marsupial-and-user2338823). – Dikran Marsupial Nov 22 '21 at 12:56

Choosing the best model is nothing but hyper-parameter optimisation. We're using the training set to learn the parameters and the validation set to learn the hyper-parameters. In HPO, we typically evaluate the model on the candidate configurations and choose the best. In training, we use fancier tools like gradient descent, the Adam optimizer, etc., but they all still aim to find the global optimum. What if we didn't use any of these fancier algorithms and simply iterated over the space of possible parameters? How would that be different from what we do with the validation set? Thus, during the whole process we actually look at both the training and validation sets and tune our model/algorithm. Any evaluation over these sets is therefore not an unbiased estimate of the test performance.
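
As a small sketch of that analogy (the polynomial-degree example and the helper names here are just illustrative), fitting the parameters on the training set and picking the hyper-parameter on the validation set are structurally the same argmin, just taken over different quantities and different data samples:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train, x_val = rng.uniform(-1, 1, 30), rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.3, 30)
y_val   = np.sin(3 * x_val)   + rng.normal(0, 0.3, 15)

def fit(degree):
    """'Training': choose the coefficients (parameters) that minimise
    squared error on the training set."""
    return np.polyfit(x_train, y_train, degree)

def val_error(coefs):
    return np.mean((np.polyval(coefs, x_val) - y_val) ** 2)

# 'Model selection': iterate over candidate degrees (hyper-parameters) and
# keep the one with the lowest validation error -- the same kind of argmin,
# just done by brute force on a different data sample.
best_degree = min(range(10), key=lambda d: val_error(fit(d)))
print("chosen degree:", best_degree)
# Because best_degree was chosen to minimise the validation error, that error
# is now an optimistic estimate; only untouched test data gives an unbiased one.
```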

gunes