
Consider the following OLS output. My question is: If you have to define a linear equation from the regression, do you then only include significant variables? If so, the equation according to the SPSS output (DV = stockVol0LN) would be:

stockVol0LN = 0.632 * StockVol1LN + 0.278 * StockVol2LN + 0.152 * hbVol0LN - 
              0.050 * hbVol1LN - 0.069 * hbVol2LN - 0.045 * hbAgreeQuality0LN +
              0.064 * wiki0LN - 0.067 * wiki1LN + 0.050 * svi0 + 1.396

[SPSS OLS coefficients table for DV = stockVol0LN]

My question is: Is this the best equation, or are there other criteria besides the significance level to consider when deciding which terms to include in the model?

Michael R. Chernick
Pr0no
  • This is a question about selection. It is a common topic here & elsewhere. You should *not* only include significant variables; to understand why, you may want to read this: [algorithms-for-automatic-model-selection](http://stats.stackexchange.com/questions/20836/20856#20856). You should also search / read around CV under the tags: [tag:feature-selection], [tag:model-selection], & [tag:stepwise-regression], for starters. If there is still something that you want to know after having read through that, edit your Q here to clarify, otherwise this Q should be closed as a FAQ. – gung - Reinstate Monica Aug 27 '12 at 20:02
  • "Best" equation for what purpose? Prediction, estimation, understanding, theory-building? – whuber Aug 27 '12 at 21:27
  • @whuber prediction :-) – Pr0no Aug 27 '12 at 22:11
  • OK, then in your researches pay special attention to mentions of holding out data to use for verifying the model as well as more sophisticated versions known as "cross validation." Using these approaches will make you somewhat immune from the dangers of many model selection procedures such as stepwise regression. – whuber Aug 27 '12 at 22:15

2 Answers


Modeling questions should be grounded first in the science underlying the question, with computer output used as a helpful tool.

The information above is not enough to arrive at a final model, and may not even be enough for the next step.

It is possible for two variables to both be non-significant when adjusted for the other terms in the model, yet for each to be very important when the other is deleted. Suppose your predictors include height in inches and height in centimeters (with some rounding, so that they are not exactly the same). A model that includes both will probably tell you that each is redundant given the other, but remove one and the other may be very important (removing both because of their high p-values would be a major mistake).
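
A quick simulation makes this concrete. The data and the `r2` helper below are illustrative assumptions: two nearly identical height measures are each strongly predictive alone, yet adding the second to a model that already has the first buys almost nothing, which is exactly the situation where both p-values look unimpressive.

```python
import numpy as np

# Illustrative data: the same height recorded in inches and (rounded) centimeters.
rng = np.random.default_rng(0)
n = 200
inches = rng.normal(70, 3, n)
cm = np.round(inches * 2.54)              # nearly, but not exactly, redundant
y = 0.5 * inches + rng.normal(0, 1.0, n)  # outcome driven by height

def r2(X, y):
    """R^2 of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X).reshape(len(y), -1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1 - resid @ resid / tss

print(np.corrcoef(inches, cm)[0, 1])         # ~0.999: each is redundant given the other
print(r2(inches, y))                         # strong on its own
print(r2(cm, y))                             # also strong on its own
print(r2(np.column_stack([inches, cm]), y))  # barely better than either alone
```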

You can also have variables that work synergistically: each is not very predictive by itself, but together they are. The diameter of the arteries in the neck (which may be related to aneurysms and other cerebrovascular conditions) is related to the difference between systolic and diastolic blood pressure. Either measure on its own may have only a weak relationship with the outcome, but together the relationship can be much stronger.
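
The blood-pressure example can be simulated along these lines; the numbers and the `r2` helper are illustrative assumptions. The outcome depends on the difference of the two pressures, while each pressure alone is dominated by a shared baseline that cancels out in the difference:

```python
import numpy as np

# Illustrative data: y is driven by pulse pressure (systolic - diastolic),
# while each pressure alone is dominated by a shared baseline level.
rng = np.random.default_rng(0)
n = 500
baseline = rng.normal(120, 15, n)  # shared component, cancels in the difference
pulse = rng.normal(40, 5, n)       # the part that actually matters for y
systolic = baseline
diastolic = baseline - pulse
y = pulse + rng.normal(0, 1.0, n)

def r2(X, y):
    """R^2 of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X).reshape(len(y), -1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1 - resid @ resid / tss

print(r2(systolic, y))                                # ~0: useless alone
print(r2(diastolic, y))                               # weak alone
print(r2(np.column_stack([systolic, diastolic]), y))  # strong together
```

Screening variables one at a time would discard both predictors here, even though jointly they explain most of the variance.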

Also consider this case: you have two predictors, X1 and X2, and the statistical analysis suggests that you only need one of them in your model, with X1 doing a slightly better job of predicting Y (say $R^2=0.81$ vs. $R^2=0.80$). But X2 is quick, easy, and non-invasive to measure (temperature, or blood pressure), while X1 is the result of a lab test that takes several hours on a biopsy requiring major surgery to obtain (and would rarely be done other than to collect X1). Which is the better predictor to use?

Before deciding on a final model you need to spend more time deciding why you are doing the modeling (understanding relationships, prediction, etc.) and on the science behind the question and data. You should probably also spend more time in a regression class or with a textbook: simply deleting the "nonsignificant" predictors from a fitted equation is not the same as refitting the model with only the significant predictors.
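
That last parenthetical point is worth a small demonstration. With correlated predictors (simulated here as an illustrative assumption), the coefficient of the kept variable changes when the model is refitted without the dropped one, because the kept variable absorbs the part of the signal it shares with the dropped one:

```python
import numpy as np

# Illustrative: dropping a correlated predictor without refitting leaves the
# wrong coefficient on the kept predictor; refitting changes it.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(0, 0.7, n)          # correlated with x1
y = 1.0 * x1 + 0.5 * x2 + rng.normal(0, 1.0, n)

def ols(X, y):
    """OLS coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X).reshape(len(y), -1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

full = ols(np.column_stack([x1, x2]), y)  # [intercept, b1, b2]
refit = ols(x1, y)                        # x1 only, refitted

# Deleting the x2 term by hand keeps b1 near 1.0; refitting pushes the x1
# coefficient up toward 1.0 + 0.5*0.7, because x1 now proxies for x2.
print(full[1], refit[1])
```

So an equation built by copying coefficients from the full fit and dropping terms, as in the question, is not the least-squares fit of the reduced model.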

Greg Snow

One form of model selection is stepwise regression. It applies a threshold to the p-value of an F test to decide which variables to add to the model and which ones to delete. It does this repeatedly, because the significance level for a regression coefficient depends on which other variables are included in the model. Thresholds higher than the conventional 0.05 are often used (0.10 or 0.20, for example). The p-value for the test that a coefficient differs from 0 is the measure most commonly used, but that is not quite the way you went about it. Other measures can also be used, such as additional variance explained.
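
To make the mechanics concrete, here is a minimal forward-stepwise sketch on the partial-F p-value. The data, the 0.10 entry threshold, and the helper names are illustrative assumptions; real implementations (and the full stepwise procedure described above) also include a deletion step, and the caveats about stepwise selection in the comments below apply in full.

```python
import numpy as np
from scipy import stats

# Illustrative data: 5 candidate predictors, only column 0 matters.
rng = np.random.default_rng(0)
n = 150
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, n)

def rss(cols):
    """Residual sum of squares of OLS on the given columns, with intercept."""
    Xd = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

def forward_stepwise(p_enter=0.10):
    """Greedy forward selection: add the candidate with the smallest
    partial-F p-value, as long as it beats the entry threshold."""
    selected = []
    while True:
        best = None
        rss_cur = rss(selected)
        for j in range(X.shape[1]):
            if j in selected:
                continue
            rss_new = rss(selected + [j])
            df2 = n - len(selected) - 2  # residual df of the larger model
            F = (rss_cur - rss_new) / (rss_new / df2)
            pval = stats.f.sf(F, 1, df2)
            if best is None or pval < best[1]:
                best = (j, pval)
        if best is None or best[1] >= p_enter:
            break
        selected.append(best[0])
    return selected

print(forward_stepwise())  # column 0 enters first; noise columns may sneak in
```

Note that each p-value is recomputed conditional on the variables already selected, which is exactly the dependence on other covariates mentioned above, and also why the p-values printed for the final model no longer have their nominal interpretation.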

Michael R. Chernick
  • So aside from the fact that performing a stepwise regression might yield better regressors, the linear equation as I defined it in principle is a right way to do it? – Pr0no Aug 27 '12 at 20:01
  • It is mathematically provable that all the results from a stepwise regression are incorrect. See, e.g., Harrell *Regression Modeling Strategies* pub. by Wiley. – Peter Flom Aug 27 '12 at 20:05
  • @PeterFlom I am not advocating stepwise regression. Frank Harrell has strong negative opinions about using it. But they are opinions. There is nothing correct or incorrect about doing stepwise regression. All model selection procedures have some difficulties associated with them. My point was just to let the OP know that his way does not take account of the effect of other covariates on the p-value. There is no mathematical proof that "all results from a stepwise regression are incorrect." Probably even Frank Harrell would back me up on that. – Michael R. Chernick Aug 27 '12 at 20:11
  • @PeterFlom Harrell's book was published by Springer. Here is a link to it at amazon.com [Regression Modeling Strategies](http://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/1441929185/ref=sr_1_1?s=books&ie=UTF8&qid=1346100618&sr=1-1&keywords=frank+harrell) – Michael R. Chernick Aug 27 '12 at 20:53
  • @MichaelChernick, while I join you in not being as negative about stepwise regression as Peter and Frank Harrell are, I think Peter is probably referring to the fact that the stepwise selection you've described messes up the usual interpretation of the $p$-values that are left after selection. But, depending on _why_ you're doing the stepwise selection, this may not matter e.g. if your goal is prediction and you're deleting terms based on drop in out-of-sample prediction accuracy. – Macro Aug 27 '12 at 21:18
  • Oops, my bad on the publisher. It's a good book, though, whoever published it. – Peter Flom Aug 27 '12 at 21:24
  • Harrell (same cite) on stepwise: "1. It yields R-squared values that are badly biased high 2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution 3. The method yields confidence intervals for effects and predicted values that are falsely narrow 4. It yields P-values that do not have the proper meaning and the proper correction for them is a very difficult problem 5. It gives biased regression coefficients that need shrinkage 6. It has severe problems in the presence of collinearity 9. It allows us to not think about the problem" – Peter Flom Aug 27 '12 at 21:31
  • Oops, now I can't edit that. [citation for above](http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/) – Peter Flom Aug 27 '12 at 21:38
  • @PeterFlom The idea of stepwise regression is to pick the variables in the model. Since it involves a sequence of repeated tests all these properties about R square, confidence intervals etc are irrelevant. None of these facts validate your remark that "all the results from a stepwise regression are incorrect." – Michael R. Chernick Aug 27 '12 at 22:51
  • Yeah, they pretty much do. Those are the outputs you get from a stepwise regression (like any other), and they are all wrong, not irrelevant. Unless once you pick the variables you then throw the model out. – Peter Flom Aug 27 '12 at 22:53
  • Stepwise regression is a variable selection technique, not an estimation procedure. That is why those bad properties are irrelevant. – Michael R. Chernick Aug 27 '12 at 23:03
  • But, 1) Those properties are used by stepwise to make the selection. 2) Once you have a model, you have to do something with it. In addition, given complete random noise, stepwise will still select variables for the model. If people want an automated selection method, LASSO or LAR are better. But automated methods should be a last resort. Science should lead, as @GregSnow said above. – Peter Flom Aug 27 '12 at 23:12
  • @FrankHarrell is a member here; perhaps he'd be interested in this thread. – Peter Flom Aug 27 '12 at 23:14
  • @PeterFlom I know Frank and I know he participates here. I am not defending stepwise regression as a variable selection method, just objecting to your statement. I only mentioned it because of a point I was making about the OP's selection method. – Michael R. Chernick Aug 27 '12 at 23:22
  • Well, maybe we've just gotten all off-track then. Plus we are taking up too much space here. Enough. I've said what I have to say. – Peter Flom Aug 27 '12 at 23:32
  • @Macro Thanks for your interesting comment. It is nice to have you agree with me for a change. – Michael R. Chernick Aug 28 '12 at 12:36
  • It has been known to happen, @Michael! :) – Macro Aug 28 '12 at 12:46