I understand that stepwise regression is computationally intensive in general, but is it only "suitable" in cases where you can drop several variables from the model due to statistical insignificance? And is there a threshold for this to hold true?
- What do you want to use the model for? For prediction or inference? – Stephan Kolassa Mar 09 '19 at 18:40
- See: https://stats.stackexchange.com/questions/215154/variable-selection-for-predictive-modeling-really-needed-in-2016 – kjetil b halvorsen Mar 09 '19 at 19:05
- Many contributors to this site would argue that stepwise regression is [almost never "suitable"](https://stats.stackexchange.com/q/20836/28500). Please look over that page and the page linked in another comment about whether variable selection is needed for predictive models. Editing your question to provide more specifics, based on what you've seen in those linked pages and your answer to the comment about prediction versus inference, is more likely to get you a useful answer. – EdM Mar 09 '19 at 21:00
- @StephanKolassa why does it matter? What does it change in the case of inference, for example? – Digio Mar 09 '19 at 21:39
- Data-based variable selection invalidates *all* p values and inference - your results will appear "more significant" than they are. (Unless you correct for the variable selection, which I have never seen done.) This has been discussed many, many times here. Variable selection is more defensible in prediction. – Stephan Kolassa Mar 09 '19 at 21:43
- @StephanKolassa this doesn't seem intuitive to me. I know variable selection is frowned upon by many contributors to this community, but it is common practice in applied research of all kinds (for prediction as well as inference). Why would selecting a model by adding and removing independent variables based on AIC or p-values be a bad thing? – Digio Mar 10 '19 at 14:20
- @Digio: if you do stepwise variable selection by looking at p values, then the p values of the final model do not have the correct interpretation. They will be biased low. Variables that are actually not significant will appear to be significant. I recommend that you run a simulation to convince yourself of this fact; it's not hard. The very same applies to stepwise selection using AIC. [*Stepwise variable selection invalidates **all** p values and inference.*](https://stats.stackexchange.com/q/179941/1352) – Stephan Kolassa Mar 10 '19 at 15:14
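The simulation Kolassa suggests can be sketched as follows (my own illustrative code, not from the thread). Under a pure null model, where the response is independent of every candidate predictor, a valid p-value rejects at the 5% level about 5% of the time; after selecting the best-looking predictor, the rejection rate is far higher:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, n_sim = 100, 10, 2000
selected_pvals = []
for _ in range(n_sim):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # y is independent of every column of X
    # univariate regression p-value for each candidate predictor
    pvals = [stats.linregress(X[:, j], y).pvalue for j in range(p)]
    # "stepwise" selection keeps the most significant-looking predictor
    selected_pvals.append(min(pvals))

# A valid 5%-level test should reject ~5% of the time under the null;
# after selection over 10 candidates the rate is roughly 1 - 0.95^10 ≈ 0.40.
rate = np.mean(np.array(selected_pvals) < 0.05)
print(f"Rejection rate after selection: {rate:.2f}")
```

This is a deliberately minimal one-step version of stepwise selection, but it already shows the mechanism: the selected coefficient's p-value is the minimum of several, so it is stochastically much smaller than uniform.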
- @StephanKolassa, thanks for the link; I guess that makes sense. It's still quite disturbing that there is no proper way of testing whether coefficients are non-zero once any model-selection method has been performed. From what I gather, Bonferroni correction is the only way. Any thoughts on this? – Digio Mar 10 '19 at 16:19
- @Digio: Bonferroni answers a different question. The problem in stepwise model selection is that the different tests are dependent, and therefore so are their p values. To be honest, I am not aware of *any* justified way of correcting for the bias induced by model selection. p values only have their intended meaning in completely prespecified models. – Stephan Kolassa Mar 10 '19 at 17:15
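Kolassa's point that p-values keep their intended meaning only in prespecified models can be illustrated with data splitting (my own sketch, not mentioned in the thread): if selection happens on one half of the data and the test is run on the untouched other half, the tested model is effectively prespecified relative to the test data, and the null p-value is uniform again:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, n_sim = 200, 10, 2000
pvals_holdout = []
for _ in range(n_sim):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # null: no predictor has a real effect
    Xa, Xb = X[: n // 2], X[n // 2 :]
    ya, yb = y[: n // 2], y[n // 2 :]
    # selection on the first half only
    j = int(np.argmin([stats.linregress(Xa[:, k], ya).pvalue for k in range(p)]))
    # inference on the second half, where the chosen model is prespecified
    pvals_holdout.append(stats.linregress(Xb[:, j], yb).pvalue)

# The holdout rejection rate is close to the nominal 5% level.
rate = np.mean(np.array(pvals_holdout) < 0.05)
print(f"Holdout rejection rate: {rate:.2f}")
```

The cost of this remedy is that both selection and inference use only half the data, which is presumably one reason it is rarely done in practice.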
- @StephanKolassa That is puzzling, but thanks for the tips. I suppose the only way to assess significance would be by calculating empirical p-values with something like [LIME](https://arxiv.org/abs/1602.04938). – Digio Mar 10 '19 at 19:21