I'm implementing a logistic regression model in R and I have 80 variables to choose from. I need to automate variable selection for the model, so I'm using the step function.

I have no problem using the function or finding the model, but when I look at the final model I find that some of the variables chosen by the step function are not significant (I check this with the summary function, looking at the fourth column of $coef, which holds the Wald test p-values). This is a problem because I need all the variables included in the model to be significant.
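For concreteness, the workflow described above looks roughly like the sketch below. The data are simulated and the variable names (`x1`, `x2`, `x3`, `y`) are hypothetical; only `glm`, `step`, and `summary` are taken from the description.

```r
# Simulated stand-in for the real data; all names here are hypothetical.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + 1.2 * dat$x1))

# Full logistic regression, then AIC-based stepwise selection.
full <- glm(y ~ ., data = dat, family = binomial)
sel  <- step(full, direction = "both", trace = 0)

# Coefficient table of the selected model; the fourth column holds
# the Wald test p-values.
wald     <- summary(sel)$coefficients
p_values <- wald[, 4]
print(p_values)
```

Nothing in `step` checks these p-values, which is why a selected model can still contain terms with p > .05.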

Is there any function or any way to get the best model based on AIC or BIC that also requires all the coefficients to be significant? Thanks

amoeba
Dan
  • It seems to me that you have two competing goals here. Goal 1 is to have a model where all variables are significant and Goal 2 is to have the best model based on AIC/BIC. – TrynnaDoStat Feb 02 '15 at 20:43
  • @TrynnaDoStat Thanks for your answer, but I don't think that's the problem. When I build a model in SPSS Modeler, I use all the variables as input and the output is the best model chosen by the forward stepwise method, and all the coefficients are significant (at least by the Wald test) – Dan Feb 02 '15 at 20:51
  • Looking for models where all variables are significant sounds like data dredging. Statistical significance has its original interpretation when you have a prespecified model form and variables. Once you do model search/selection, the statistics can no longer be interpreted as is, because they are the result of a model selection procedure. Meanwhile, if the model you got after stepwise selection has some insignificant coefficients, that need not mean it is a bad model. Perhaps the effect sizes are so big that they compensate for the lack of stat. significance (this is a rough statement, I know). – Richard Hardy Feb 02 '15 at 20:52
  • I am curious: why do you think you need to select a subset of predictors? Also, are your predictors numeric (e.g. income) or nominal (e.g. male/female/other)? – probabilityislogic Dec 14 '19 at 07:01

1 Answer

Using stepwise selection to find a model is a very bad thing to do. Your hypothesis tests will be invalid, and your out-of-sample predictive accuracy will be very poor due to overfitting. To understand these points more fully, it may help you to read my answer here: Algorithms for automatic model selection.

The stepAIC function selects a model based on the AIC, not on whether individual coefficients' p-values are above or below some threshold, as SPSS does. However, when the models being compared differ by a single parameter, the AIC can be understood as using a specific alpha, just not .05. Instead, it's approximately .157. For more on that, see @Glen_b's answers here: Stepwise regression in R – Critical p-value.
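The .157 figure can be checked directly. AIC charges a penalty of 2 per parameter, so a one-parameter term is kept only when its likelihood-ratio chi-square statistic exceeds 2. The sketch below computes the implied alpha. As an added comparison (not part of the answer above), it does the same for BIC, whose penalty is log(n), using n = 864 from the comments.

```r
# AIC keeps a 1-df term only if the likelihood-ratio statistic exceeds 2,
# so the implied significance level is the upper tail of chi-square(1) at 2.
alpha_aic <- pchisq(2, df = 1, lower.tail = FALSE)
print(round(alpha_aic, 3))

# BIC replaces the penalty 2 with log(n); n = 864 is the sample size
# mentioned in the comments, used here purely as an illustration.
n <- 864
alpha_bic <- pchisq(log(n), df = 1, lower.tail = FALSE)
print(alpha_bic)
```

In `step`, the penalty is set through the `k` argument (`k = 2` is the default AIC penalty, and `k = log(n)` gives BIC), so a stricter implied alpha can be had by raising `k`. That does not make the resulting p-values valid.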

gung - Reinstate Monica
  • Thanks! I understand your point perfectly. But suppose that I need to do some automatic selection of the model; do you have any recommendation? I'm asking for the best of the bad ideas, I know, but it is important that the selection is automated. – Dan Feb 02 '15 at 21:07
  • If you want valid hypothesis tests / p-values, you cannot use automatic selection. There isn't really a way around that. If you want out of sample predictive accuracy, you can use the LASSO & select for lambda by cross validation. – gung - Reinstate Monica Feb 02 '15 at 21:11
  • Thank you very much! But now I have another problem; if it is not too much to ask, I would like to ask you a new question. I'm running this code (using the glmnet package): `lasso – Dan Feb 03 '15 at 01:04
  • With a little research I could get a model with coefficients different from 0 by typing the following lines: `cv=cv.glmnet(x,y)` `model – Dan Feb 03 '15 at 01:48
  • Possibly. It's hard to say. How much data do you have? It may be that the intercept is the only thing you have enough data to estimate accurately without overfitting. It may also be that the true values of your coefficients are very close to 0. – gung - Reinstate Monica Feb 03 '15 at 02:27
  • I have 864 rows in my database, with 88 defaults (1 in the target variable) and the rest non-defaults (0). In addition to this, does LASSO handle factor (character) variables? How can I see the coefficient assigned to every level of the factor? – Dan Feb 03 '15 at 02:37
  • With only 88 defaults you don't actually have much information to work with. That's probably why you can't pick up information about more variables. I'm not sure about using factors w/ the LASSO--it does seem like it should be possible, but I'm not familiar enough. – gung - Reinstate Monica Feb 03 '15 at 03:12
  • Thank you very much! You have been of great help! I won't disturb you any more for a while I hope. – Dan Feb 03 '15 at 03:20
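The LASSO suggestion and the factor question from the comments can be sketched together as follows. This is a minimal illustration on simulated data, not the code from the thread: the data frame and its columns (`income`, `age`, `region`, `default`) are hypothetical. It assumes the glmnet package is installed and uses `model.matrix` to expand the factor into dummy columns, since `cv.glmnet` expects a numeric matrix.

```r
library(glmnet)

# Simulated stand-in for a default data set; all names are hypothetical.
set.seed(1)
n <- 300
dat <- data.frame(
  income = rnorm(n),
  age    = rnorm(n),
  region = factor(sample(c("north", "south", "west"), n, replace = TRUE))
)
dat$default <- rbinom(n, 1, plogis(-1.5 + 1.0 * dat$income))

# model.matrix() turns the factor into 0/1 dummy columns (one per
# non-reference level); the intercept column is dropped because
# glmnet adds its own intercept.
x <- model.matrix(default ~ ., data = dat)[, -1]
y <- dat$default

# Logistic LASSO (alpha = 1) with lambda chosen by cross-validation.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# One row per column of x plus the intercept; coefficients shrunk to
# exactly zero are printed as ".".
coefs <- coef(cv_fit, s = "lambda.min")
print(coefs)
```

Note the `family = "binomial"` argument: without it, `cv.glmnet(x, y)` fits a Gaussian model to the 0/1 response. With this dummy coding each factor level is penalised separately, so the LASSO may keep some levels of a factor and drop others, and the result depends on which level is the reference.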