
I have run a multiple linear regression using stepwise regression to select the best model; however, the best model returned contains a non-significant variable. When I remove this variable, the AIC goes up, indicating that the model without the non-significant variable is a worse fit. Should I remove the non-significant predictor, or should I leave it in since the model including it is a better fit?

Frank Harrell
Poppy
    What is your goal here? Prediction or explanation? How big is your data? – TrynnaDoStat Jan 18 '15 at 15:09
    Thanks for your answer below. My goal is prediction and my data set has 590 cases. – Poppy Jan 18 '15 at 16:44
    I give a detailed list of the problems with stepwise model building in my answer here: http://stats.stackexchange.com/questions/115843/backward-selection-multivariate-analysis-usind-r/115850#115850. – Alexis Jan 19 '15 at 01:01
    Suppressor variables are often not significant, yet they can affect fit-statistics a lot. Might be interesting to check for suppressor effects (especially with a view to model interpretation). – StatisticsRat Mar 15 '17 at 16:35

3 Answers


Leave it in. The data are incapable of really telling you which model is "better" unless you use AIC in a highly structured way (e.g. on a pre-specified large group of variables), and removing insignificant variables invalidates the estimate of $\sigma^2$ and all $P$-values, standard errors, and confidence limits in addition to invalidating the formula for adjusted $R^2$. Much is written about these issues on this site.
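To see why post-selection $P$-values can no longer be trusted, here is a small illustrative simulation (not from the answer above; it assumes Python with NumPy and SciPy). With 20 pure-noise predictors, picking the "best" one by $P$-value makes a nominal 5% test reject far more often than 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, sims = 50, 20, 500
hits = 0
for _ in range(sims):
    X = rng.standard_normal((n, p))   # 20 predictors, all pure noise
    y = rng.standard_normal(n)        # outcome unrelated to every predictor
    # p-value of each predictor's simple correlation with y
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
    # selection step: keep the dataset if the "best" predictor looks significant
    if min(pvals) < 0.05:
        hits += 1
print(f"Nominal 5% test rejects in {hits / sims:.0%} of datasets after selection")
```

With 20 independent noise predictors you expect roughly $1 - 0.95^{20} \approx 64\%$ of datasets to show at least one "significant" predictor, which is why the $P$-values, standard errors, and confidence limits reported after any data-driven selection step are too optimistic.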

Frank Harrell

NB: A corollary to Frank Harrell's answer is that stepwise variable selection should not be used in the first place. That is, not only is it a mistake to discard that final 'leftover' non-significant covariate; it was a more fundamental error to employ an automated procedure (stepwise variable selection) designed to produce a cascade of many such mistakes very quickly, in an interdependent and irreproducible fashion.

David C. Norris

You need to test your model on held-out data. AIC is a measure of model fit, not of predictive accuracy. Please read Section 3.3 (Subset Selection) in this book:

http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
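Since the stated goal is prediction, out-of-sample error is the more relevant yardstick than AIC. A minimal sketch of comparing the full and reduced models by cross-validation (assuming Python with scikit-learn; the synthetic data here stands in for the asker's actual 590 cases):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 5 candidate predictors, the 5th having only a weak effect.
rng = np.random.default_rng(1)
n = 200
X_full = rng.standard_normal((n, 5))
y = X_full[:, :4] @ np.array([1.0, 0.5, 0.5, 0.2]) + rng.standard_normal(n)
X_reduced = X_full[:, :4]  # model with the borderline predictor dropped

# Compare out-of-sample error (negative MSE) instead of in-sample AIC.
full_cv = cross_val_score(LinearRegression(), X_full, y,
                          cv=5, scoring="neg_mean_squared_error")
reduced_cv = cross_val_score(LinearRegression(), X_reduced, y,
                             cv=5, scoring="neg_mean_squared_error")
print(f"full model CV MSE:    {-full_cv.mean():.3f}")
print(f"reduced model CV MSE: {-reduced_cv.mean():.3f}")
```

Whichever model has the lower cross-validated error is the better predictive model for this purpose; note that this comparison should be specified before looking at the data, not run repeatedly until a pleasing answer appears.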

Removing variables is suggested for two reasons:

The first is prediction accuracy: the least-squares fit with all variables often has low bias but large variance. Prediction accuracy can sometimes be improved by shrinking coefficients or setting some of them to zero. By doing so we sacrifice a little bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibit the strongest effects. In order to get the “big picture,” we are willing to sacrifice some of the small details.
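The shrinkage argument in the first reason can be illustrated with a small sketch (assuming Python with scikit-learn; the data, the number of true effects, and the lasso penalty `alpha` are all illustrative assumptions, not from the book):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error

# Illustrative setting: many candidate predictors, only a few true signals,
# and barely more observations than predictors.
rng = np.random.default_rng(0)
n_train, n_test, p = 60, 500, 50
beta = np.zeros(p)
beta[:5] = 2.0  # only 5 predictors actually matter
X_tr = rng.standard_normal((n_train, p))
X_te = rng.standard_normal((n_test, p))
y_tr = X_tr @ beta + rng.standard_normal(n_train)
y_te = X_te @ beta + rng.standard_normal(n_test)

ols = LinearRegression().fit(X_tr, y_tr)      # low bias, high variance
lasso = Lasso(alpha=0.2).fit(X_tr, y_tr)      # shrinks/zeroes small coefficients

ols_mse = mean_squared_error(y_te, ols.predict(X_te))
lasso_mse = mean_squared_error(y_te, lasso.predict(X_te))
print(f"OLS test MSE:   {ols_mse:.2f}")
print(f"Lasso test MSE: {lasso_mse:.2f}")
```

In this regime the full least-squares fit nearly interpolates the training data and its test error blows up from variance, while the shrunken fit trades a little bias for much lower variance. Note this is the continuous-shrinkage route the book recommends, not the discrete keep/drop decisions of stepwise selection.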

Arpit Sisodia