I ran a multiple linear regression and used stepwise regression to select the best model, but the best model returned contains a non-significant variable. When I remove this variable the AIC goes up, indicating that the model without the non-significant variable is a worse fit. Should I remove the non-significant predictor, or should I leave it in because the model that includes it is better?
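For concreteness, here is a minimal Python sketch of the kind of comparison I am describing (the data and variable names are hypothetical; `x3` plays the role of the non-significant predictor that stepwise selection kept):

```python
# A minimal sketch of the comparison described above. The data and
# variable names (x1, x2, x3) are hypothetical; x3 plays the role of
# the non-significant predictor that stepwise selection kept.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 590  # same size as the data set mentioned in the comments
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 2 * df["x1"] - df["x2"] + 0.08 * df["x3"] + rng.normal(size=n)

kept = smf.ols("y ~ x1 + x2 + x3", data=df).fit()   # stepwise-selected model
dropped = smf.ols("y ~ x1 + x2", data=df).fit()     # non-significant x3 removed

print(kept.pvalues["x3"])     # x3 itself may not reach p < 0.05
print(kept.aic, dropped.aic)  # yet AIC can still favour keeping x3
```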

- What is your goal here? Prediction or explanation? How big is your data? – TrynnaDoStat Jan 18 '15 at 15:09
- Thanks for your answer below. My goal is prediction and my data set has 590 cases. – Poppy Jan 18 '15 at 16:44
- I give a detailed list of the problems with stepwise model building in my answer here: http://stats.stackexchange.com/questions/115843/backward-selection-multivariate-analysis-usind-r/115850#115850. – Alexis Jan 19 '15 at 01:01
- Suppressor variables are often not significant, yet they can affect fit statistics a lot. It might be interesting to check for suppressor effects (especially with a view to model interpretation). – StatisticsRat Mar 15 '17 at 16:35
3 Answers
Leave it in. The data are incapable of really telling you which model is "better" unless you use AIC in a highly structured way (e.g. on a pre-specified large group of variables), and removing insignificant variables invalidates the estimate of $\sigma^2$ and all $P$-values, standard errors, and confidence limits in addition to invalidating the formula for adjusted $R^2$. Much is written about these issues on this site.
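To illustrate the point, here is a small sketch (not code from this answer, just a simulation of the phenomenon it describes): when the response is pure noise, screening out "insignificant" predictors and refitting makes the survivors look convincingly significant, so post-selection p-values no longer mean what they appear to mean.

```python
# A sketch: with a pure-noise response, screening out 'insignificant'
# predictors and refitting makes the survivors look significant, so
# post-selection p-values are not valid.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # y is unrelated to every column of X

full = sm.OLS(y, sm.add_constant(X)).fit()
keep = np.flatnonzero(full.pvalues[1:] < 0.05)  # 'significant' by luck
refit = sm.OLS(y, sm.add_constant(X[:, keep])).fit()

print(len(keep))          # a few columns typically survive by chance alone
print(refit.pvalues[1:])  # these look 'significant' although y is noise
```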

NB: A corollary to Frank Harrell's answer is that stepwise variable selection should not be used in the first place. That is, not only is it a mistake to discard that final 'leftover' non-significant covariate, but it was even more wrong to employ an automated procedure (stepwise variable selection) designed to produce a cascade of many such mistakes very quickly in an interdependent and irreproducible fashion.
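A small simulation makes the "interdependent and irreproducible" part concrete. This sketch uses a hand-rolled forward selection by AIC as a stand-in for any stepwise routine: run it on bootstrap resamples of the same data and the "best" variable set changes from resample to resample.

```python
# A sketch of the irreproducibility problem, using a hand-rolled forward
# selection by AIC as a stand-in for any stepwise routine: the 'best'
# variable set changes from one bootstrap resample to the next.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 150, 8
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)  # one strong, one weak signal

def forward_aic(X, y):
    """Greedy forward selection: add the variable that lowers AIC most."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic  # intercept-only
    while remaining:
        aics = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic
                for j in remaining}
        j_best = min(aics, key=aics.get)
        if aics[j_best] >= best_aic:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_aic = aics[j_best]
    return sorted(selected)

for b in range(5):
    idx = rng.integers(0, n, size=n)    # bootstrap resample
    print(forward_aic(X[idx], y[idx]))  # the selected set is unstable
```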

You need to evaluate your model on held-out test data: AIC measures in-sample model fit, not predictive accuracy. Please read Section 3.3 (Subset Selection) in this book:
http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
Removing variables is suggested for two reasons.
The first is prediction accuracy: estimates that keep all variables often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero. By doing so we sacrifice a little bias to reduce the variance of the predicted values, and hence may improve overall prediction accuracy.
The second reason is interpretation. With a large number of predictors, we often want to determine a smaller subset that exhibits the strongest effects. In order to get the "big picture," we are willing to sacrifice some of the small details.
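Here is a short sketch of that bias-variance point (simulated data; scikit-learn's ridge regression is used as one example of shrinkage, not as a recommendation from the book): a biased but lower-variance shrinkage estimator can beat unpenalized least squares on held-out data.

```python
# A sketch of the bias-variance trade-off described above (simulated
# data; ridge regression is used as one example of shrinkage).
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n, p = 120, 40              # many predictors relative to n
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0              # only a few real effects
y = X @ beta + rng.normal(scale=2.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)                        # low bias, high variance
ridge = RidgeCV(alphas=np.logspace(-2, 3, 30)).fit(X_tr, y_tr)  # shrunk, biased

print("OLS   test MSE:", mean_squared_error(y_te, ols.predict(X_te)))
print("ridge test MSE:", mean_squared_error(y_te, ridge.predict(X_te)))
```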
