10

I've got a linear regression model, along with observations of the response and the candidate predictor variables, and I want to know:

  1. Whether a specific variable is significant enough to remain in the model.
  2. Whether another variable (for which I have observations) ought to be added to the model.

Which statistics can help me out? And how can I compute them most efficiently?

csgillespie
Wilhelm

5 Answers

27

Statistical significance is not usually a good basis for determining whether a variable should be included in a model. Statistical tests were designed to test hypotheses, not select variables. I know a lot of textbooks discuss variable selection using statistical tests, but this is generally a bad approach. See Harrell's book Regression Modeling Strategies for some of the reasons why. These days, variable selection based on the AIC (or something similar) is usually preferred.
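
For concreteness, here is a minimal sketch of AIC-based comparison in base R (my illustration, not part of the original answer; the data and variable names are made up):

    # Compare candidate models by AIC; lower is better. Data are simulated
    # so the example is self-contained.
    set.seed(1)
    dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
    dat$y <- 1 + 2 * dat$x1 + rnorm(50)

    full <- lm(y ~ x1 + x2 + x3, data = dat)
    AIC(full, update(full, . ~ . - x3))  # AICs for the full model and one without x3
    step(full)                           # stepwise search that minimizes AIC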

Rob Hyndman
  • Actually, to the best of my memory, Harrell strongly discourages the use of AIC. I guess cross-validation would probably be the safest method around. – Tal Galili Aug 01 '10 at 01:54
  • AIC is asymptotically equivalent to CV. See answers to http://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other. I checked Harrell before I wrote that answer, and I didn't see any discouragement of the AIC. He does warn about significance testing after variable selection, with the AIC or any other method. – Rob Hyndman Aug 01 '10 at 02:54
  • @Tal: Perhaps from one of his papers rather than the RMS book, I remember Harrell objecting to the use of AIC for simply choosing among a pool of *many* models. I think his point was that you must add a variable at a time and compare two models methodically or use some similar strategy. (To be clear, this is in line with Rob's answer.) – ars Aug 01 '10 at 05:14
  • Doing a quick search, I found Harrell writing the following: "Beware of doing model selection on the basis of P-values, R-square, partial R-square, AIC, BIC, regression coefficients, or Mallows' Cp." He wrote that on 12/14/08, in a mailing-list thread titled "[R] Obtaining p-values for coefficients from LRM function (package Design)". I guess I misunderstood his meaning. – Tal Galili Aug 02 '10 at 16:20
  • @Tal. Thanks for finding that. While it sounds like a general statement, I *think* he probably meant "Beware of looking at p-values after doing model selection on the basis of ...." I don't think it makes sense otherwise. Certainly the discussion in that thread was all about p-values. – Rob Hyndman Aug 02 '10 at 23:12
  • @Tal, @Rob: In that thread, he does say "Be sure to use the hierarchy principle". Perhaps of interest, this discussion from medstats (scroll down for Harrell's response): http://groups.google.com/group/medstats/browse_thread/thread/86c44163b849572 – ars Aug 03 '10 at 02:38
4

I second Rob's answer. An increasingly preferred alternative is to include all your variables and shrink them towards 0. See Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso", Journal of the Royal Statistical Society, Series B, 58(1), 267-288.

http://www-stat.stanford.edu/~tibs/lasso/lasso.pdf
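
As a small sketch, this can be tried with the glmnet package (my choice of package, not something the answer specifies; the data and variable names are made up):

    # Lasso via glmnet: every variable enters, coefficients are shrunk
    # toward 0, and some become exactly 0. Simulated data for illustration.
    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(50 * 3), ncol = 3,
                dimnames = list(NULL, c("x1", "x2", "x3")))
    y <- 1 + 2 * x[, "x1"] + rnorm(50)

    cvfit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 gives the lasso penalty
    coef(cvfit, s = "lambda.min")        # coefficients at the CV-chosen lambda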

user603
  • Is there some way to quantify what is "increasingly preferred" these days? – Tal Galili Aug 01 '10 at 01:55
  • I think it is recognized as scientifically more sound in many fields, in the sense that the shrinkage approach is used more in recent applied stats papers than the *IC approach. That shows a certain, at least tacit, theoretical consensus. – user603 Aug 01 '10 at 12:12
  • @user603 - you also have the potentially massive computational advantage with the shrinkage approach: no need to search over $2^p$ models. – probabilityislogic Jul 02 '11 at 07:12
3

For part 1, you're looking for the F-test. Fit the model with and without the variable, calculate the residual sum of squares from each fit, and form an F-statistic; from that you can find a p-value using either the F-distribution or some other null distribution that you generate yourself (e.g., by permutation).
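
A minimal sketch of that nested-model F-test in R (my illustration; data and names are made up). With $RSS_{reduced}$ and $RSS_{full}$ from the two fits, $F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/(n - p)}$, where $q$ is the number of dropped terms and $p$ the number of parameters in the full model; anova() computes this directly:

    # Partial F-test for one variable, comparing nested lm() fits.
    set.seed(1)
    dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
    dat$y <- 1 + 2 * dat$x1 + rnorm(50)

    full    <- lm(y ~ x1 + x2, data = dat)  # model with the variable in question
    reduced <- lm(y ~ x1, data = dat)       # model without it
    anova(reduced, full)                    # F-statistic from the drop in RSS, plus p-value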

Eric Suh
1

Another vote for Rob's answer.

There are also some interesting ideas in the "relative importance" literature. This work develops methods that seek to determine how much importance is associated with each of a number of candidate predictors. There are Bayesian and frequentist methods. Check the "relaimpo" package in R for citations and code.
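
A short sketch with relaimpo (the lmg metric and the data here are my illustrative assumptions, not something the answer prescribes):

    # Decompose R^2 into shares credited to each predictor.
    library(relaimpo)
    set.seed(1)
    dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
    dat$y <- 1 + 2 * dat$x1 + 0.5 * dat$x2 + rnorm(50)

    fit <- lm(y ~ x1 + x2 + x3, data = dat)
    calc.relimp(fit, type = "lmg", rela = TRUE)  # relative-importance shares summing to 1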

Andrew Robinson
1

I also like Rob's answer. If you happen to use SAS rather than R, you can use PROC GLMSELECT for models that would otherwise be done with PROC GLM, and it works well for some other models as well. See

Flom and Cassell, "Stopping Stepwise: Why Stepwise Selection Methods Are Bad and What You Should Use", presented at various groups, most recently NESUG 2009.

Peter Flom