
(Note: This question helps to inform the current one)

I would like to identify variables that are significant at the 95% level in a logistic regression but have very little to no impact on the response. I've read the CV questions on interpreting regression output, as well as the Stanford and UCLA links on interpretation. Using that combined knowledge, I created a table to determine which predictors are either not significant or have little to no effect on the response. But I am not sure I am coming to the correct conclusions, especially about the role confidence intervals play for odds ratios:

library(broom)  # for tidy model output
mdl1 <- glm(am ~ mpg + disp, data = mtcars, family = binomial)
out <- tidy(mdl1)
out[-1] <- round(out[-1], 4)           # round all numeric columns
out$significant <- out$p.value < 0.05  # flag p-values below 0.05
# combine the test results with odds ratios and their profile CIs
cbind(out[-(1:2)], round(exp(cbind(OR = coef(mdl1), confint(mdl1))), 4))
# Waiting for profiling to be done...
#             std.error statistic p.value significant     OR  2.5 %    97.5 %
# (Intercept)    4.7601   -0.4741  0.6354       FALSE 0.1047 0.0000 1192.7499
# mpg            0.1684    1.0095  0.3127       FALSE 1.1853 0.8727    1.7252
# disp           0.0078   -0.9749  0.3296       FALSE 0.9924 0.9749    1.0064

This appears to be a good start. I know the odds ratios, p-values, and confidence intervals for each variable. This case would be easy since none of the predictors are significant, but let's ignore that for the moment. If they were significant and I wanted to see how the confidence intervals can help determine the effect of the predictors, can I use the fact that a confidence interval includes 1.000?

I ask because this Brandon Foltz tutorial says to remove such variables (around the five-minute mark). So these variables would be removed because, with 95% confidence, the interval for the true odds ratio includes 1.00, which would indicate no effect on the response.
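To make that check concrete, here is a minimal sketch (continuing the mdl1 example above) that flags terms whose 95% profile confidence interval for the odds ratio straddles 1.00; the TRUE values follow from the output shown above:

ci <- exp(confint(mdl1))                    # profile CIs on the odds-ratio scale
straddles_one <- ci[, 1] < 1 & ci[, 2] > 1  # TRUE when the interval includes 1
straddles_one[-1]                           # drop the intercept, keep predictors
#  mpg disp 
# TRUE TRUE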

Is this two-step process a good way of using logistic regression output to understand the effect of the predictors?

Pierre L
  • Is there any particular reason why you want to remove non-significant variables? – Michael M Sep 20 '16 at 18:12
  • The ultimate goal of the process is not to predict the outcome but rather to see what variables stand out in terms of effect. – Pierre L Sep 20 '16 at 18:13

1 Answer


If your goal is to find the set of variables that most associate with the outcome, you are in the world of feature selection. For that purpose you can do stepwise regression (the "step" function in R), which is essentially an automated version of the process you describe in your text.
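For illustration, a minimal sketch of backward stepwise selection with "step" on the question's mtcars model (note that step compares models by AIC rather than by the confidence intervals discussed above):

full <- glm(am ~ mpg + disp, data = mtcars, family = binomial)
reduced <- step(full, direction = "backward")  # drop terms while AIC improves
summary(reduced)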

More popular with statisticians are shrinkage methods, in particular the L1-norm penalty (LASSO regression).
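As a minimal sketch, assuming the glmnet package (linked in the comments below) is installed, an L1-penalized fit of the same mtcars model would look like this; predictors whose coefficients are shrunk exactly to zero are the ones dropped:

library(glmnet)  # assumed installed; provides penalized GLMs
x <- as.matrix(mtcars[, c("mpg", "disp")])
y <- mtcars$am
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 gives the LASSO
coef(cvfit, s = "lambda.min")  # coefficients at the cross-validated penalty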

Both of these methods will remove predictors that have no effect on your outcome. More accurately, they will remove predictors that have no extra effect - extra to the remaining predictors - on your outcome.

Both stepwise and LASSO regression methods are available in R.

Filipe
  • Thank you for this. I am considering Lasso/Ridge. I have to explain to other stakeholders **why** logistic regression is no good when they believe it is a good route. – Pierre L Sep 20 '16 at 18:44
  • Can you explain here why you believe it to be bad? In many problems it's perfectly fine, and a GLM can accommodate LASSO. (Ridge will not do the feature selection you want) – Filipe Sep 20 '16 at 18:52
  • You mentioned two alternatives to my approach. I'm assuming you didn't address the direct question because you felt others were better. – Pierre L Sep 20 '16 at 18:53
  • If you feel that glm can work, can you help in determining the correct interpretation? – Pierre L Sep 20 '16 at 18:54
  • I haven't used this package but give this a try: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf – Filipe Sep 20 '16 at 18:59
  • I'm not sure we are understanding each other. When I say help with 'glm' I'm referring to the out-of-box logistic regression model, not the Lasso/Ridge alternative. The bosses are asking to use a regular logistic regression model for this project. 1) Do you feel that it is a good choice? 2) If yes, how do I correctly interpret the confidence intervals? – Pierre L Sep 20 '16 at 19:02
  • I am acknowledging that you have provided alternatives to a regular logistic regression and I will consider those. But my direct question is regarding the feasibility of the regular logistic regression. – Pierre L Sep 20 '16 at 19:04
  • OK got it. So the process you suggest in the post of doing feature selection based on confidence intervals (i.e., whether the interval for the odds ratio includes 1) is feasible. It's stepwise feature selection. It's often used. The statistical test tends to be not on the confidence intervals of the individual coefficients, but a comparison of the residual distributions of the model with N features to the model with N-1 features. If the variance in the residuals of the two models does not differ at some user-decided threshold of probability, then the feature is dropped. – Filipe Sep 20 '16 at 19:14
  • Thank you. I've added a link to the top of the question with more background. – Pierre L Sep 20 '16 at 19:19
  • 2
    @PierreLafortune, whether or not the p-value is significant, the 95% CI on the coefficient covers 0, or the CI for the OR covers 1, are all the same thing. They are a manual version of stepwise selection, as Filipe notes here. Using these methods is *not* recommended. To understand why, it may help to read my answer here: [Algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290). Filipe is correct that LASSO selection is more appropriate for your situation. – gung - Reinstate Monica Sep 20 '16 at 19:51
  • @gung I've read your answer to that question. Filipe says that many statisticians use logistic regression. Is the regression to the mean the explanation that I can give to explain why we should not use it? – Pierre L Sep 20 '16 at 19:59
  • @PierreLafortune, it is known to be invalid (but, yes, many people use it anyway). If you need a reference, you can cite Frank Harrell's book *Regression Modeling Strategies*. If you need a simple argument, you could use my regression to the mean story. – gung - Reinstate Monica Sep 20 '16 at 20:04
  • @gung in your story, I'm trying to understand the visualization of the second race. Is the red 'x' someone who was in the yellow group the first race and fell into the red group for the second? – Pierre L Sep 20 '16 at 20:09
  • @PierreLafortune, red 'x's are people in the bottom third by race time. In the second race, if there is a faint background symbol, that was their grouping in the 1st race. The point is that you are selecting based on both their true value & random chance. The ones that look best tend to be inflated & the ones that look worst tend to be deflated. (Comments aren't the place to discuss this, BTW.) – gung - Reinstate Monica Sep 20 '16 at 20:42