Suppose I run a bidirectional stepwise selection in R with the model:

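# family named explicitly; direction = "both" requests the bidirectional search
# (when no scope is given, step() otherwise defaults to backward elimination)
step(glm(y ~ a + b + c + d, family = poisson), direction = "both")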

And the result might be:

y ~ d + c
null deviance: 263.6
100 residual degrees of freedom
residual deviance: 132.9
AIC: 648.3

I read that if you run the line:

1-pchisq(residual deviance, residual df)

and the result is "significant" (below 0.05), you need a better model.

But if the stepwise procedure chooses the best model using the Akaike criterion, does that mean I can't have a better model? What if I don't have any other variables or arrangement of variables?

Is the "best" model chosen by the stepwise procedure not necessarily a good model? How can I know?

Maybe this is a very basic question, but I don't get it. Can anyone help me understand the basics of this?

chl
Juan
  • 7
    Stepwise regression has been discussed here a lot, including one [post from yesterday](http://stats.stackexchange.com/questions/35192/what-terms-should-i-include-in-a-linear-regression-model). See also [search on stepwise](http://stats.stackexchange.com/search?q=stepwise). Stepwise does not necessarily give the "best" model for any definition of "best"; all subsets regression will give it for some definition of "best". But substantive knowledge is always key, and there may not be any one best model. – Peter Flom Aug 28 '12 at 10:21
  • 1
    In addition to what Peter said, there are many different criteria that are used in variable selection. Different criteria can lead to different results. – Michael R. Chernick Aug 28 '12 at 11:13

1 Answer

The step function only tries a few possible models within the limits that you give it. The model it chooses only means that none of the models it was compared against improved the AIC (or other criterion); it says nothing about how it compares to the infinite number of other models that were never tested.
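To make the "within the limits that you give it" point concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data frame `dat`, the coefficients, and the quadratic term `I(d^2)` are invented and are not the asker's data. The same search is run twice, once over the four main effects only and once with an upper scope that also allows `I(d^2)`.

```r
set.seed(1)
n <- 200
dat <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))

# Hypothetical truth: y depends on c, d and d^2.
# A search over main effects only can never find the d^2 term.
dat$y <- rpois(n, exp(0.3 + 0.4 * dat$c + 0.3 * dat$d + 0.3 * dat$d^2))

full <- glm(y ~ a + b + c + d, family = poisson, data = dat)

# Search restricted to subsets of the four main effects
sel_small <- step(full, direction = "both", trace = 0)

# Same starting model, but the upper scope also allows a quadratic term in d
sel_wide <- step(full,
                 scope = list(lower = ~ 1, upper = ~ a + b + c + d + I(d^2)),
                 direction = "both", trace = 0)

formula(sel_small)
formula(sel_wide)
AIC(sel_small, sel_wide)
```

Both calls return a model that is "best" for the set of candidates they were allowed to visit. Only the second can find the quadratic term, because only the second was told it exists.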

The suggested lack of fit could mean that there is another variable, e, that you did not include and that would greatly improve the model, or that there are non-linear or interaction effects among the given variables that would improve the model.
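For reference, the lack-of-fit check from the question can be run with the reported numbers (residual deviance 132.9 on 100 residual degrees of freedom). This is only a sketch of that one calculation; the variable names are made up, and the chi-square approximation behind it can be unreliable when the counts are small.

```r
res_dev <- 132.9  # residual deviance reported in the question
res_df  <- 100    # residual degrees of freedom reported in the question

# Upper-tail chi-square probability; equivalent to 1 - pchisq(res_dev, res_df)
p_value <- pchisq(res_dev, df = res_df, lower.tail = FALSE)
p_value
```

With a fitted model object, say `fit`, the same quantities are available as `deviance(fit)` and `df.residual(fit)`.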

Since current computers would take a long time to fit an infinite number of additional models, you should explore the data (plots, diagnostics, etc.) and the science that produced the data and the question to be answered, to figure out which additional models make the most sense and are reasonable, then fit those models to see how they compare.
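As a rough sketch of that "fit the candidates and compare" step, the code below starts from a main-effects model like the one stepwise selected and compares it with two candidates that subject-matter reasoning might suggest. The data are simulated, and the interaction built into them is an assumption made purely for illustration.

```r
set.seed(42)
n <- 200
dat <- data.frame(c = rnorm(n), d = rnorm(n))

# Hypothetical truth: the effect of c depends on d (a c:d interaction)
dat$y <- rpois(n, exp(0.5 + 0.3 * dat$c + 0.3 * dat$d + 0.4 * dat$c * dat$d))

m_main  <- glm(y ~ c + d,          family = poisson, data = dat)  # stepwise-style result
m_inter <- glm(y ~ c * d,          family = poisson, data = dat)  # adds the c:d interaction
m_quad  <- glm(y ~ c + d + I(d^2), family = poisson, data = dat)  # adds curvature in d

# Compare the candidates on AIC
AIC(m_main, m_inter, m_quad)

# Likelihood-ratio test for the nested pair m_main vs. m_inter
anova(m_main, m_inter, test = "Chisq")
```

Which candidates are worth fitting comes from the plots and the science, not from the code. The comparison only ranks the models you thought to propose.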

Also make sure that you understand what question you are trying to answer. Models for predicting future events have a different "best" than models meant to help understand the underlying science. Is the question that stepwise regression answers the same question that you are asking? (I have not yet figured out exactly what question stepwise answers, but I have figured out that it is not any of the questions that I am interested in.)

Scortchi - Reinstate Monica
Greg Snow