I am finding that trying to do a stepwise logistic regression is far too slow on my data set (6 hours). Is anyone aware of any faster solutions out there? Perhaps one that takes advantage of the multiple processors on my machine?

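model <- glm(y ~ ., data = dat, family = binomial) # 30 or so independent variables; `dat` is a placeholder name for the data frame
# start timer
step(model, trace=1, direction="both", k=log(179188)) # k = log(n) gives the BIC penalty
# end timer -- 6 hours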
gung - Reinstate Monica
user1775655
    How long does it take to estimate the model in the first place? Have you tried direction = "forward" and direction = "backward"? They might be faster, and if they both arrive at the same solution, that would work. You might try something like boosted regression instead of logistic, depending on your goals. – Jeremy Miles Dec 23 '14 at 19:48
    If you are interested in using the logistic model for prediction (as opposed to interpreting the significance and magnitude of the coefficients) I would suggest at least exploring a flexible method from the data-mining arena such as boosted CART, random forests, or neural networks. All can adapt to non-flat response surfaces algorithmically so you don't need to run down the list of interactions. – bsbk Dec 23 '14 at 20:09
    This question appears to be off-topic because it is about how to speed up R code. – gung - Reinstate Monica Dec 23 '14 at 20:59
    I would suggest you do not use stepwise selection. If you want to test hypotheses, stepwise selection will invalidate the reported p-values. If you want to build a predictive model, it will yield a model that is overfitted. There is, in essence, no good reason for using stepwise selection. If that doesn't make sense / you want to learn why, you could read my answer here: [Algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290). – gung - Reinstate Monica Dec 23 '14 at 21:02
    @gung Thank you for the second comment: it shows why we might want to keep this question around, because it can be interpreted as an implicit request for alternatives to stepwise regression. Often enormous improvements in processing speed are achieved by using a better algorithm or procedure compared to throwing more hardware at a problem. – whuber Dec 23 '14 at 21:45
  • @whuber, I turned my comment into an answer. – gung - Reinstate Monica Dec 25 '14 at 03:49

1 Answer

I gather it is the stepwise selection that is slowing you down, so you could speed up your code by skipping it. As it happens, I would advise against stepwise selection for other reasons as well. If you want to test hypotheses, stepwise selection will invalidate the reported $p$-values. If you want to build a predictive model, it will yield a model that is overfitted. There is, in essence, no good reason to use stepwise selection. If that doesn't make sense, or you want to learn why, you could read my answer here: [Algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290).

If you do need to perform variable selection with logistic regression, you can use the LASSO, which penalizes the absolute size of the coefficients and shrinks some of them exactly to zero. Some of the threads listed in this search may be helpful for you.
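
As a rough illustration of the LASSO suggestion, here is a minimal sketch using the `glmnet` package. It assumes `glmnet` is installed, and the data are simulated stand-ins (the names `x`, `y`, `cv_fit`, and `selected` are made up for the example, not taken from the question), so treat it as a starting point rather than a recipe.

```r
# Minimal sketch: L1-penalized (LASSO) logistic regression with glmnet.
# Assumption: the glmnet package is installed. The data below are simulated.
library(glmnet)

set.seed(1)
n <- 2000
p <- 30
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("x", seq_len(p))

# In this toy setup only the first three predictors affect the outcome
eta <- 1.0 * x[, 1] - 0.8 * x[, 2] + 0.5 * x[, 3]
y <- rbinom(n, size = 1, prob = plogis(eta))

# alpha = 1 selects the LASSO penalty; cv.glmnet chooses lambda by cross-validation
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Coefficients at the more conservative "one standard error" lambda
b <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
print(selected)
```

With a real data frame, the predictor matrix could be built with something like `model.matrix(y ~ ., data = dat)[, -1]`, where `dat` is a placeholder name. `cv.glmnet` also accepts `parallel = TRUE` when a `foreach` backend has been registered, which may help on a multi-core machine.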

gung - Reinstate Monica