I know this question has been asked here before, but after reading the answers I still don't get the difference.
Consider, for instance, a lasso penalized linear regression model. This model has a penalization parameter $\lambda$ that controls the level of shrinkage applied, so (in general) different $\lambda$ values generate different $\beta$ parameters. In this kind of situation, I am used to working with a train/test split, performing cross-validation over the training sample in order to find the penalization parameter that minimizes the prediction error, and once I have found the optimal $\lambda$, I compute the actual prediction error over the test split.
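For concreteness, here is a sketch of what I currently do, using scikit-learn (the dataset is synthetic and the $\lambda$ grid is made up; scikit-learn calls the penalization parameter `alpha`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Cross-validation over a grid of lambda values, on the training split only
grid = GridSearchCV(
    Lasso(max_iter=10000),
    {"alpha": np.logspace(-3, 1, 20)},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
best_lambda = grid.best_params_["alpha"]

# Only now touch the test split: prediction error of the chosen model
test_mse = mean_squared_error(y_test, grid.predict(X_test))
```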
However, in several papers, people consider a train/validate/test split. I found the following description of this train/validate/test split in The Elements of Statistical Learning book:
> The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.
So, as far as I can understand, what I usually do corresponds to the validation step (estimating the prediction error over different $\lambda$ values to find the optimal one) followed by the test step (obtaining the error of the chosen model). But what am I supposed to do in the training step?
Here it says "fit the models". Does this mean using the training set to (roughly speaking) obtain the $\beta$'s associated with different $\lambda$ values? But, if so, what would be the difference between the error computed using the validation set and the error computed using the test set? Neither of these sets would have been used in the creation of the model, so both of them are independent of the training set.
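To make sure I am describing my understanding of the book's procedure correctly, here is how I would sketch it (again with scikit-learn and a made-up dataset and grid; the split proportions are my own choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset
X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=0)

# Three-way split: 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

# Training step: fit one model (one set of betas) per lambda on the training set
lambdas = np.logspace(-3, 1, 20)
models = {lam: Lasso(alpha=lam, max_iter=10000).fit(X_train, y_train)
          for lam in lambdas}

# Validation step: model selection -- pick the lambda with the smallest
# validation error
val_errors = {lam: mean_squared_error(y_val, m.predict(X_val))
              for lam, m in models.items()}
best_lambda = min(val_errors, key=val_errors.get)

# Test step: assess the single chosen model on data unused so far
test_mse = mean_squared_error(y_test, models[best_lambda].predict(X_test))
```

If this sketch is right, then both `X_val` and `X_test` are held out of the fitting, which is exactly the source of my confusion above.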