Overfitting through model selection

Question

I'm asking this question as I found little explanation of this phenomenon otherwhere. I am wondering about how to best deal with overfitting that comes from the model selection itself. Say I want to run some regression on a set of observations. My choice of which model to use (linear, log, exponential) is already in some sense a parameterisation. Even more so if I run several regressions using different models and then choosing the best one. For example, if I am to compare a linear with an exponential model of some sort, am I not (implicitly) doing a regression of the sort:

$y = I(a+bx) + (1-I)(ce^{kx})$

where I is a binary variable that I still determine based on a fitting procedure. Is there a way to quantify (or qualify) to what extent a model may be overfitted because of freedom in model selection?

Related post: [AIC, model selection and overfitting](https://stats.stackexchange.com/questions/211069). Hansen ["A Winner’s Curse for Econometric Models: On the Joint Distribution of In-Sample Fit and Out-of-Sample Fit and its Implications for Model Selection"](http://www.tse-fr.eu/sites/default/files/medias/stories/SEMIN_10_11/ECONOMETRIE/hansen.pdf) (2010) could be relevant. — Richard Hardy, Apr 02 '20 at 15:46
This also relates to the Problem of post-selection inference: e.g. Standard errors are biased if calculated the usual way. There is some literature on that topic — Sebastian, Apr 02 '20 at 15:59

score 1 · Answer 1 · answered Apr 02 '20 at 15:54

A good question. This is sometimes called "overfitting by investigator".

As long as you have plenty of data, separate it into three parts: - training data, used to train each specific model - validation data (also called the dev set) to choose which model to use - test data, used only as the final step, to determine the performance of your chosen model

If you are data poor, this is more difficult.

Another good step is to write a project plan before looking at the data at all. If you can identify an approach at this stage, from prior knowledge, then it avoids the risk you point to.

Overfitting through model selection

1 Answers1