In the social science context I come from, the issue is whether you are interested in (a) prediction or (b) testing a focused research question.
If the purpose is prediction, then data-driven approaches are appropriate.
If the purpose is to examine a focused research question, then it is important to consider which regression model specifically tests your question.
For example, if your task were to select a set of selection tests to predict job performance, the aim can in some sense be seen as one of maximising prediction of job performance.
Thus, data-driven approaches would be useful.
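As a rough illustration, here is a minimal Python sketch of what a data-driven approach could look like. The data are simulated and the variable names and the choice of cross-validated lasso are my own assumptions for illustration, not a recommendation of one particular procedure; the point is simply that the tests are retained purely on predictive grounds.

```python
# A minimal sketch of a data-driven approach (simulated data, hypothetical names):
# let cross-validated lasso decide which selection tests to keep, based only on
# how well they predict the job performance criterion.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_applicants, n_tests = 300, 12
test_scores = rng.normal(size=(n_applicants, n_tests))   # 12 hypothetical selection tests
performance = (test_scores[:, :3] @ np.array([0.5, 0.3, 0.2])
               + rng.normal(scale=1.0, size=n_applicants))  # only the first 3 tests matter here

model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(test_scores, performance)

# Tests with nonzero coefficients are retained solely because they improve prediction.
retained = np.flatnonzero(model.named_steps["lassocv"].coef_)
print("Tests retained by the data-driven procedure:", retained)
```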
In contrast, if you wanted to understand the relative role of personality variables and ability variables in influencing performance, then a specific model comparison approach might be more appropriate.
Typically, when exploring focused research questions, the aim is to elucidate something about the underlying causal processes that are operating, as opposed to developing a model with optimal prediction.
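To make the contrast concrete, here is a hedged Python sketch of a focused model comparison, again with simulated data and hypothetical variables (ability, conscientiousness, performance): the question is whether personality explains performance over and above ability, and it is answered by comparing two nested regression models rather than by optimising prediction.

```python
# A minimal sketch of a focused model comparison (simulated data, hypothetical
# variable names): does personality add to the prediction of performance
# over and above ability?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "ability": rng.normal(size=n),
    "conscientiousness": rng.normal(size=n),
})
df["performance"] = (0.4 * df["ability"] + 0.3 * df["conscientiousness"]
                     + rng.normal(scale=1.0, size=n))

# Model 1: ability only.  Model 2: ability plus a personality variable.
m1 = smf.ols("performance ~ ability", data=df).fit()
m2 = smf.ols("performance ~ ability + conscientiousness", data=df).fit()

# The nested F-test maps directly onto the focused research question.
print(anova_lm(m1, m2))
print("Increment in R-squared:", round(m2.rsquared - m1.rsquared, 3))
```

Here the specific pair of models is chosen because it corresponds to the research question, which is quite different from letting an algorithm search for the best-predicting subset of variables.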
When I'm in the process of developing models of underlying processes based on cross-sectional data, I'd be wary of:
(a) including predictors that could theoretically be thought of as consequences of the outcome variable. E.g., a person's belief that they are a good performer is a good predictor of job performance, but this belief is likely to be at least partially caused by the fact that they have observed their own performance.
(b) including a large number of predictors that are all reflective of the same underlying phenomenon. E.g., including 20 items all measuring satisfaction with life in different ways.
Thus, focused research questions rely a lot more on domain-specific knowledge.
This probably goes some way to explaining why data-driven approaches are less often used in the social sciences.