Edit:
The key point I'm trying to understand is whether, during a regression model-building exercise, I need separate datasets to:
1. search for predictors and settle on a functional form
2. estimate the parameters (for the predictors and functional form chosen in #1) that will actually be reported, along with fit statistics, residual analysis results, etc., as a demonstration of model validity
The passage from ALSM seems to suggest that if I used a single dataset for both #1 and #2, the reported coefficient estimates would be biased. Is this correct? The comment from "whuber" quoted below, on the other hand, suggests there is no particular problem.
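To make the two steps concrete, here is a minimal sketch of the workflow I have in mind (plain NumPy; the data-generating process and the "pick the most correlated predictor" selection rule are purely illustrative assumptions on my part, not anything taken from ALSM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 20 candidate predictors, only a weak signal in the first one.
n, p = 200, 20
X = rng.standard_normal((n, p))
y = 0.2 * X[:, 0] + rng.standard_normal(n)

# Step #1 (model development): choose the single predictor with the largest
# absolute sample correlation with y, using only the first half of the data.
half = n // 2
X_dev, y_dev = X[:half], y[:half]
X_hold, y_hold = X[half:], y[half:]

corrs = np.array([np.corrcoef(X_dev[:, j], y_dev)[0, 1] for j in range(p)])
j_star = int(np.argmax(np.abs(corrs)))

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]

# Step #2 (estimation): fit the selected model both on the same data used for
# selection and on the held-out half, so the two estimates can be compared.
print("selected predictor:", j_star)
print("slope, same data as selection:", ols_slope(X_dev[:, j_star], y_dev))
print("slope, held-out data         :", ols_slope(X_hold[:, j_star], y_hold))
```

My question is whether the estimate computed on the same data used for selection is the one that would suffer from the bias the passage describes.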
Original Question:
I'm seeking clarification on the meaning and correctness of the following passage I found in:
- Kutner, Neter, Nachtsheim, Li. Applied Linear Statistical Models (5th Edition, International Edition), p. 375, point #5 in the section "Comments":
If a data set for an exploratory observational study is very large, it can be divided into three parts. The first part is used for model training, the second part for cross-validation and model selection, and the third part for testing and calibrating the final model (Reference 9.10). This approach avoids any bias resulting from estimating the regression parameters from the same data set used for developing the model.
I would like to understand the nature and source of the bias mentioned in the second sentence. I am further confused, having found the following comment:
NB: After you have settled on a model and confirmed its usefulness with the hold-out data, it's fine to recombine the retained data with the hold-out data for final estimation. Thus, nothing is lost in terms of the precision with which you can estimate model coefficients.
here. This suggests that there is no issue with "estimating the regression parameters from the same data set used for developing the model".
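For reference, my reading of the three-way split described in the ALSM passage is roughly the following sketch (using scikit-learn's train_test_split; the proportions and the placeholder data are my own assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))            # placeholder predictors
y = X @ rng.standard_normal(10) + rng.standard_normal(1000)

# 60% training, 20% validation, 20% test; the proportions are my own choice.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Candidate models are fit on the training part, compared on the validation
# part, and the single chosen model is assessed once on the test part.
```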
Note: Reference 9.10:
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.