IMHO, this is fine, provided that it is acknowledged that you don't have an unbiased performance estimate for the final model. Whether over-fitting is a problem depends on how many models there are to choose from, how well each is specified by the training data, and the variance of the hold-out estimate used to select the best model. In many cases, the over-fitting from the discrete choice between three models is likely to be negligible.
Consider this alternative procedure. We use k-fold cross-validation to estimate the performance of the model. In each fold, we use a hold-out set to choose between M1, M2 and M3 at each stage. We then build the final model, using all of the available data, again with a hold-out set to choose between M1, M2 and M3.

In this case, we build the final model in the same way suggested in the question, but the performance estimate is approximately unbiased (for LOOCV at least): because the choice between M1, M2 and M3 is performed separately in each fold, the performance estimate includes a component due to the over-fitting caused by choosing between models on a hold-out set. The procedure set out in the question is essentially the same, just without the attempt to get a performance estimate.
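A minimal sketch of that nested procedure, in case it helps. The dataset, the three candidate models standing in for M1, M2 and M3 (a linear SVM, an RBF SVM and a random forest) and the split sizes are all illustrative assumptions, not part of the answer itself:

```python
# Sketch of the nested procedure: model selection via an internal hold-out
# set, repeated inside every fold of an outer cross-validation loop.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # toy data (assumption)
candidates = {"M1": SVC(kernel="linear"),
              "M2": SVC(kernel="rbf"),
              "M3": RandomForestClassifier(random_state=0)}

def choose_by_holdout(X_tr, y_tr):
    """Pick the best of M1/M2/M3 using an internal hold-out set."""
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X_tr, y_tr, test_size=0.25, random_state=0)
    scores = {name: clone(m).fit(X_fit, y_fit).score(X_hold, y_hold)
              for name, m in candidates.items()}
    return max(scores, key=scores.get)

# Outer k-fold CV: the choice between M1, M2 and M3 is repeated inside every
# fold, so the outer score includes the over-fitting due to that choice.
outer_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    best = choose_by_holdout(X[train_idx], y[train_idx])
    model = clone(candidates[best]).fit(X[train_idx], y[train_idx])
    outer_scores.append(model.score(X[test_idx], y[test_idx]))
print("approximately unbiased performance estimate:", np.mean(outer_scores))

# Final model: repeat exactly the same selection once, on all of the data.
final_name = choose_by_holdout(X, y)
final_model = clone(candidates[final_name]).fit(X, y)
```

The point of the outer loop is only to estimate performance; the final model itself comes from the last three lines, which are the same procedure described in the question.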