
I have run a multiple linear regression using stepwise regression to select the best model; however, the best model returned contains a non-significant variable. When I remove this variable, the AIC goes up, indicating that the model without the non-significant variable is a worse fit. Should I remove the non-significant predictor, or should I leave it in since the model including it is a better fit?

Frank Harrell
Poppy
    What is your goal here? Prediction or explanation? How big is your data? – TrynnaDoStat Jan 18 '15 at 15:09
    Thanks for your answer below. My goal is prediction and my data set has 590 cases. – Poppy Jan 18 '15 at 16:44
    I give a detailed list of the problems with stepwise model building in my answer here: http://stats.stackexchange.com/questions/115843/backward-selection-multivariate-analysis-usind-r/115850#115850. – Alexis Jan 19 '15 at 01:01
    Suppressor variables are often not significant, yet they can affect fit-statistics a lot. Might be interesting to check for suppressor effects (especially with a view to model interpretation). – StatisticsRat Mar 15 '17 at 16:35

3 Answers


Leave it in. The data are incapable of really telling you which model is "better" unless you use AIC in a highly structured way (e.g. on a pre-specified large group of variables), and removing insignificant variables invalidates the estimate of $\sigma^2$ and all $P$-values, standard errors, and confidence limits in addition to invalidating the formula for adjusted $R^2$. Much is written about these issues on this site.
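To see why post-selection $P$-values can no longer be trusted, here is a small illustrative simulation (not from the answer above; it assumes Python with NumPy and SciPy). With 20 pure-noise predictors, picking the "best" one by $P$-value makes a nominal 5% test reject far more often than 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, sims = 50, 20, 500
hits = 0
for _ in range(sims):
    X = rng.standard_normal((n, p))   # 20 predictors, all pure noise
    y = rng.standard_normal(n)        # outcome unrelated to every predictor
    # p-value of each predictor's simple correlation with y
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
    # selection step: keep the dataset if the "best" predictor looks significant
    if min(pvals) < 0.05:
        hits += 1
print(f"Nominal 5% test rejects in {hits / sims:.0%} of datasets after selection")
```

With 20 independent noise predictors you expect roughly $1 - 0.95^{20} \approx 64\%$ of datasets to show at least one "significant" predictor, which is why the $P$-values, standard errors, and confidence limits reported after any data-driven selection step are too optimistic.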

Frank Harrell

NB: A corollary to Frank Harrell's answer is that stepwise variable selection should not be used in the first place. That is, not only is it a mistake to discard that final 'leftover' non-significant covariate; it was a more fundamental error to employ an automated procedure (stepwise variable selection) designed to produce a cascade of many such mistakes very quickly, in an interdependent and irreproducible fashion.

David C. Norris

You need to test your model on held-out data. AIC is a measure of model fit, not of predictive accuracy. Please read Section 3.3 (Subset Selection) in this book:

http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
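Since the stated goal is prediction, out-of-sample error is the more relevant yardstick than AIC. A minimal sketch of comparing the full and reduced models by cross-validation (assuming Python with scikit-learn; the synthetic data here stands in for the asker's actual 590 cases):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 5 candidate predictors, the 5th having only a weak effect.
rng = np.random.default_rng(1)
n = 200
X_full = rng.standard_normal((n, 5))
y = X_full[:, :4] @ np.array([1.0, 0.5, 0.5, 0.2]) + rng.standard_normal(n)
X_reduced = X_full[:, :4]  # model with the borderline predictor dropped

# Compare out-of-sample error (negative MSE) instead of in-sample AIC.
full_cv = cross_val_score(LinearRegression(), X_full, y,
                          cv=5, scoring="neg_mean_squared_error")
reduced_cv = cross_val_score(LinearRegression(), X_reduced, y,
                             cv=5, scoring="neg_mean_squared_error")
print(f"full model CV MSE:    {-full_cv.mean():.3f}")
print(f"reduced model CV MSE: {-reduced_cv.mean():.3f}")
```

Whichever model has the lower cross-validated error is the better predictive model for this purpose; note that this comparison should be specified before looking at the data, not run repeatedly until a pleasing answer appears.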

Removing variables is suggested for two reasons:

The first is prediction accuracy: the least-squares fit with all variables often has low bias but large variance. Prediction accuracy can sometimes be improved by shrinking coefficients or setting some of them to zero. By doing so we sacrifice a little bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibit the strongest effects. In order to get the “big picture,” we are willing to sacrifice some of the small details.
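The shrinkage argument in the first reason can be illustrated with a small sketch (assuming Python with scikit-learn; the data, the number of true effects, and the lasso penalty `alpha` are all illustrative assumptions, not from the book):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error

# Illustrative setting: many candidate predictors, only a few true signals,
# and barely more observations than predictors.
rng = np.random.default_rng(0)
n_train, n_test, p = 60, 500, 50
beta = np.zeros(p)
beta[:5] = 2.0  # only 5 predictors actually matter
X_tr = rng.standard_normal((n_train, p))
X_te = rng.standard_normal((n_test, p))
y_tr = X_tr @ beta + rng.standard_normal(n_train)
y_te = X_te @ beta + rng.standard_normal(n_test)

ols = LinearRegression().fit(X_tr, y_tr)      # low bias, high variance
lasso = Lasso(alpha=0.2).fit(X_tr, y_tr)      # shrinks/zeroes small coefficients

ols_mse = mean_squared_error(y_te, ols.predict(X_te))
lasso_mse = mean_squared_error(y_te, lasso.predict(X_te))
print(f"OLS test MSE:   {ols_mse:.2f}")
print(f"Lasso test MSE: {lasso_mse:.2f}")
```

In this regime the full least-squares fit nearly interpolates the training data and its test error blows up from variance, while the shrunken fit trades a little bias for much lower variance. Note this is the continuous-shrinkage route the book recommends, not the discrete keep/drop decisions of stepwise selection.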

Arpit Sisodia