Currently I’m focusing on model selection criteria, more specifically: sequential hypothesis testing, information criteria (like AIC and BIC), and the Lasso, all within a regression framework. These methods are useful as a remedy for overfitting and, to some extent, allow us to manage the trade-off between parsimony and completeness of the models in light of a prediction loss function; in other words, they let us manage the bias-variance trade-off. Now, in my main reference, these methods are used as “in-sample methods”, in the sense that each model is estimated on all the data and the best model is chosen without any out-of-sample measure.
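To make the “in-sample” setting concrete, here is a minimal sketch (not from my reference) of selecting a Lasso model by AIC/BIC on the full data set; `X` and `y` are hypothetical simulated data, and I use scikit-learn's `LassoLarsIC` only as one convenient implementation of criterion-based selection:

```python
# Sketch of the "in-sample" approach: selection and estimation use ALL the data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# Hypothetical data, only for illustration.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# Choose the Lasso penalty by AIC / BIC; the same observations drive
# both the coefficient estimates and the model choice.
for criterion in ("aic", "bic"):
    model = LassoLarsIC(criterion=criterion).fit(X, y)
    n_selected = np.sum(model.coef_ != 0)
    print(f"{criterion.upper()}: alpha = {model.alpha_:.3f}, "
          f"{n_selected} predictors kept")
```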
However, the problem at hand (overfitting) is expressed most naturally by splitting the sample into two parts (in-sample and out-of-sample). My doubt is that, even if the methods above allow a good selection among predictors and hence among models, the estimation involves all the data, so it seems to me that metrics like the in-sample MSE end up being too optimistic. My idea is simply to split the data first and then apply the methods above: use only the “in-sample” part for estimation, and compare the models’ performance, in terms of a loss function like MSE, on data never seen before (the “out-of-sample” part), as sketched below.
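A minimal sketch of this split-sample idea, reusing the same hypothetical `X` and `y` from above: each candidate model is estimated on the training (“in-sample”) part only, and every candidate is scored once on the held-out (“out-of-sample”) part:

```python
# Sketch of the proposed workflow: select/estimate in-sample, evaluate out-of-sample.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoLarsIC
from sklearn.metrics import mean_squared_error

# Hold out 30% of the data as the "out-of-sample" part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidates = {
    "OLS (all predictors)": LinearRegression(),
    "Lasso via AIC": LassoLarsIC(criterion="aic"),
    "Lasso via BIC": LassoLarsIC(criterion="bic"),
}

for name, est in candidates.items():
    est.fit(X_train, y_train)   # estimation uses the training data only
    mse = mean_squared_error(y_test, est.predict(X_test))
    print(f"{name}: test MSE = {mse:.1f}")
```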
Is this a good idea? If not, why? Isn’t it better than estimating on all the data?