Edit:
The key point I'm trying to understand is whether, during a regression model-building exercise, I need separate datasets to:
1. search for predictors and settle on a functional form
2. estimate the parameters (for the predictors and functional form chosen in #1) that will actually be reported, along with fit statistics, residual analysis results, etc., as a demonstration of model validity
The passage from ALSM seems to suggest that if I used a single dataset for both #1 and #2, the reported coefficient estimates would be biased. Is this correct? The comment from "whuber" quoted below, on the other hand, suggests there is no particular problem.
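To make the two steps concrete, here is a minimal sketch of the workflow I have in mind (plain NumPy; the data-generating process and the "pick the most correlated predictor" selection rule are purely illustrative assumptions on my part, not anything taken from ALSM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 20 candidate predictors, only a weak signal in the first one.
n, p = 200, 20
X = rng.standard_normal((n, p))
y = 0.2 * X[:, 0] + rng.standard_normal(n)

# Step #1 (model development): choose the single predictor with the largest
# absolute sample correlation with y, using only the first half of the data.
half = n // 2
X_dev, y_dev = X[:half], y[:half]
X_hold, y_hold = X[half:], y[half:]

corrs = np.array([np.corrcoef(X_dev[:, j], y_dev)[0, 1] for j in range(p)])
j_star = int(np.argmax(np.abs(corrs)))

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]

# Step #2 (estimation): fit the selected model both on the same data used for
# selection and on the held-out half, so the two estimates can be compared.
print("selected predictor:", j_star)
print("slope, same data as selection:", ols_slope(X_dev[:, j_star], y_dev))
print("slope, held-out data         :", ols_slope(X_hold[:, j_star], y_hold))
```

My question is whether the estimate computed on the same data used for selection is the one that would suffer from the bias the passage describes.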
Original Question:
I'm seeking clarification on the meaning and correctness of the following passage I found in:
- Kutner, Neter, Nachtsheim, Li. Applied Linear Statistical Models (5th Edition, International Edition), p. 375, point #5 in the section "Comments":
If a data set for an exploratory observational study is very large, it can be divided into three parts. The first part is used for model training, the second part for cross-validation and model selection, and the third part for testing and calibrating the final model (Reference 9.10). This approach avoids any bias resulting from estimating the regression parameters from the same data set used for developing the model.
I would like to understand the nature and source of the bias mentioned in the second sentence. I am further confused, having found the following comment:
NB: After you have settled on a model and confirmed its usefulness with the hold-out data, it's fine to recombine the retained data with the hold-out data for final estimation. Thus, nothing is lost in terms of the precision with which you can estimate model coefficients.
here. This suggests that there is no issue with "estimating the regression parameters from the same data set used for developing the model".
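For reference, my reading of the three-way split described in the ALSM passage is roughly the following sketch (using scikit-learn's train_test_split; the proportions and the placeholder data are my own assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))            # placeholder predictors
y = X @ rng.standard_normal(10) + rng.standard_normal(1000)

# 60% training, 20% validation, 20% test; the proportions are my own choice.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Candidate models are fit on the training part, compared on the validation
# part, and the single chosen model is assessed once on the test part.
```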
Note: Reference 9.10:
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.