I have one dependent variable and many predictors, and I need to fit a multiple linear regression model. I performed a stepwise regression to determine which independent variables to include in the final model. However, I see that when results are published, people usually show more than one model. For instance, they include just the first variable, then the first and the second, then the first, the second and the third, and so on, showing that the adjusted coefficient of determination (adjusted $R^2$) improves at each step.

Do I always have to do that, in order to show that each variable has a contribution (of course if coefficients are significant)? Is there any reference to a formal procedure that better explains this?

Scortchi - Reinstate Monica
Forinstance
  • (0) Just opinion here. (1) No, you don't have to do this. (2) I've never seen this. Usually, if there are multiple models, they are presented for research-domain-specific reasons. (3) Depending on your field of interest there may be standard ways of presenting models, and it's worth fitting in with the crowd. (4) If doing backward selection (and EPV < 50), you might want to consider some form of bootstrapping, either to validate the modeling procedure or to try to show the stability of the variable selection. – charles Feb 13 '14 at 20:40
  • EPV = events per variable BTW (by the way). "Observations per coefficient" would be a better way of putting it outside the context of logistic regression. Why couldn't you include all the independent variables? – Scortchi - Reinstate Monica Feb 13 '14 at 21:42
  • I think the key is the type of model building strategy you choose, e.g. forward inclusion, backward exclusion, and which fit criterion or test is used. See for example: https://onlinecourses.science.psu.edu/stat501/node/91 – tomka Feb 13 '14 at 21:52
  • @tomka: Read [this](http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856) before using (or advocating) stepwise or similar methods. The first step should be determining whether the full model (presumably one including all available predictors in forms thought relevant to predicting the response) is over-fitting to a degree prejudicial to your model-building goals. – Scortchi - Reinstate Monica Feb 13 '14 at 22:28
  • In econ, people commonly show many different model specifications as a way of showing that coefficients are or aren't robust to different controls. Model selection is rarely the motivation for displaying tons of specifications. Are you trying to estimate a causal parameter or to predict? – generic_user Feb 17 '14 at 23:48

1 Answer

If you've used a stepwise method (& see [Algorithms for automatic model selection](http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856) for the drawbacks), you can show the current model at each step, though I'd have thought that's more usual as an exposition of the method than because of any intrinsic interest in each intermediate model. Otherwise there's no point: as @charles says, when several models are presented it's usually to compare models suggested by competing theories, or that differ in the expense of using them for prediction, or (in general) for reasons that depend on what the models say about the things they model.

It may be tempting to view the change in the coefficient of determination as you add each predictor as a measure of its importance for, or contribution to, the model's predictive power. But if the predictors are correlated, as they typically will be for observational data, this can be quite misleading: you get different answers by changing the order in which you add predictors. Jeromy Anglim's blog discusses the issues, & suggests better measures.
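To illustrate the order dependence, here is a minimal sketch on simulated data (not the asker's data). Everything in it is an assumption made for illustration: the helper `r_squared`, the sample size, the seed, and the data-generating process, in which two predictors `x1` and `x2` are built to be strongly correlated and to have equal true coefficients. It computes the gain in $R^2$ attributed to each predictor when it enters first versus second.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Simulated data (assumed for illustration): x2 is built from x1,
# so the two predictors are strongly correlated.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)  # equal true coefficients

r2_x1 = r_squared(x1[:, None], y)
r2_x2 = r_squared(x2[:, None], y)
r2_both = r_squared(np.column_stack([x1, x2]), y)

# Gain in R^2 credited to each predictor under the two entry orders.
gain_x1_first, gain_x2_second = r2_x1, r2_both - r2_x1
gain_x2_first, gain_x1_second = r2_x2, r2_both - r2_x2

print(f"order x1, x2: x1 adds {gain_x1_first:.3f}, x2 adds {gain_x2_second:.3f}")
print(f"order x2, x1: x2 adds {gain_x2_first:.3f}, x1 adds {gain_x1_second:.3f}")
```

Under these assumptions, whichever predictor enters first is credited with most of the explained variance and the second with only a small increment, even though both have the same true coefficient. The total $R^2$ is the same for both orders; only its split between the predictors changes.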

Scortchi - Reinstate Monica