I need to perform stepwise binary logistic regression (The horror! The horror!) on 1.5 million observations. This takes far too long in SAS, so I'm wondering if I can use R to process it in a multicore environment. Apparently the package glmulti (http://www.jstatsoft.org/v34/i12/paper) will do the trick, but it's not clear to me whether it can run in parallel outside of its genetic algorithm. The genetic algorithm might still work for me, but I don't have a large number of variables (about 30), so it's not necessary. As long as the results of the brute-force and GA approaches could be assured to be similar, I might try it. However, I see others have had problems getting the parallel feature to run: https://stat.ethz.ch/pipermail/r-help/2013-April/351820.html. Any other suggestions on how to parallelize logistic regression in R? A web search turned up a couple of papers, but not much that seemed specific to R. And please spare me a lecture about stepwise regression; I'm very well aware of the pitfalls. I'm replicating someone else's analysis. I'm using a 64-bit Windows system.
1 Answer
In data analysis you typically don't want to reinvent the wheel, and there are R packages for this.
I first thought of biglm, but sorry, that is for linear regression.
GLMs on large data sets can be fit with speedglm:
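install.packages("speedglm")
library(speedglm)

# Simulate 1,000,000 observations: a treatment indicator plus 29 numeric covariates
set.seed(123)
n <- 1000000
trt <- c(rep(1, n / 2), rep(0, n / 2))
x <- matrix(rnorm(n * 29), ncol = 29)
beta <- c(10, rep(1, 29))
# Draw y from the logistic model; thresholding the probability at 0.5 instead
# would make the outcome perfectly separable
eta <- drop(cbind(trt, x) %*% beta)
y <- rbinom(n, 1, plogis(eta))
data <- data.frame(y = y, trt = trt, x = x)
# x in the formula refers to the covariate matrix in the workspace
m <- speedglm(y ~ trt + x, data, family = binomial())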
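A single fit like this does not do stepwise selection, but a selection loop can be wrapped around any fitting function. The following is only a minimal sketch under stated assumptions: it does forward selection by AIC (not a full forward/backward stepwise search), it uses base `glm` so that it runs without extra packages, and it spreads the candidate fits at each step over a socket cluster from the `parallel` package, which is also available on Windows. The names `fit_aic`, `forward_step` and `step_data` are hypothetical helpers made up for this illustration, and the small data set is simulated.

```r
library(parallel)

# Hypothetical worker: AIC of the logistic model with predictors `terms`.
# `step_data` is looked up in the worker's global environment, where
# forward_step() places it once via clusterExport().
fit_aic <- function(terms, response) {
  f <- reformulate(terms, response)
  AIC(glm(f, data = step_data, family = binomial()))
}

# Hypothetical helper: forward selection by AIC. At each step every remaining
# candidate is added to the current model in turn, the candidate fits run in
# parallel, and the best candidate is kept only if it lowers the AIC.
forward_step <- function(data, response, candidates, cl) {
  step_data <- data
  clusterExport(cl, "step_data", envir = environment())  # ship the data once
  selected <- character(0)
  best_aic <- AIC(glm(reformulate("1", response), data = data,
                      family = binomial()))
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    trial_terms <- lapply(remaining, function(v) c(selected, v))
    aics <- unlist(parLapply(cl, trial_terms, fit_aic, response = response))
    if (min(aics) >= best_aic) break
    best_aic <- min(aics)
    selected <- c(selected, remaining[which.min(aics)])
  }
  list(selected = selected, aic = best_aic)
}

# Small simulated example (assumed data, not the data from the question)
set.seed(1)
n <- 2000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))

cl <- makeCluster(2)  # socket cluster with two workers
res <- forward_step(d, "y", c("x1", "x2", "x3"), cl)
stopCluster(cl)
print(res$selected)
```

With about 30 candidates, each step fits at most 30 models, so the work divides naturally across cores. Replacing `glm` in `fit_aic` with `speedglm` is a plausible next step but is not tested here; the workers would need to load that package first, and the AIC would have to be read from the fitted object in whatever way that package provides.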

bdeonovic
- @Benjamin, very interesting, but I don't think it will do stepwise regression. – William Shakespeare Apr 26 '14 at 05:02
- `biglm` can do generalized linear models too. Using either package you can code a stepwise routine. – Scortchi - Reinstate Monica Jul 12 '14 at 10:45
- Why are these two packages fast? Do they use an advanced optimization algorithm? – Haitao Du Jul 20 '16 at 19:46
- As far as I can tell, speedglm does not actually use multiple cores. – Michael Ohlrogge Nov 24 '16 at 21:13