As far as I know, the only variable selection for logistic regression that SAS implements is stepwise selection in PROC LOGISTIC. Stepwise is generally not recommended, for reasons laid out by gung here.
"Modern" methods of variable selection rely on estimators with "built-in" variable selection via shrinkage penalties, i.e. the lasso, or the elastic net, which generalizes the lasso and ridge penalties, plus cross-validation to choose how heavy the penalty should be. (Ridge on its own shrinks coefficients but never sets them exactly to zero, so it does not drop variables.) Bayesian models can also perform variable selection (I've read that the lasso/ridge/elastic net penalties also have a Bayesian interpretation).
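To make the penalty concrete, here is the elastic net objective for logistic regression, written in the parametrization glmnet uses. The notation is my own choice rather than anything from the question: $y_i \in \{0,1\}$ is the response, $x_i$ the predictor vector, $\lambda \ge 0$ the overall penalty weight (the thing cross-validation picks), and $\alpha \in [0,1]$ the mix between the two penalties.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr)\Bigr]
+ \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge. It is the $\lVert\beta\rVert_1$ term that can push coefficients to exactly zero, which is where the "built-in" selection comes from.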
SAS implements the lasso in PROC GLMSELECT, but only for linear regression. I suppose you could code your response as 0/1, forgo the logistic link, run the lasso on that, and then take the selected predictors back to PROC LOGISTIC to re-estimate, but I have no idea whether that is anywhere near a good idea.
Another issue with SAS's lasso implementation is that it does not appear to be as computationally efficient as the glmnet implementation in R (especially when glmnet is used along with sparseMatrix). With large numbers of predictors, SAS may not finish in any reasonable amount of time.
My experience:
300k observations/60 predictors took ~15s in R, several minutes in SAS
300k observations/800 predictors using sparse matrices took ~45s in R, and I stopped SAS 12 hours in (this is what happens when you max out your RAM)
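For anyone who wants to try the R route, this is roughly what a cross-validated lasso logistic fit on a sparse matrix looks like. It is only a sketch on small simulated data, not the code behind the timings above: the object names, the sizes, and the choice of lambda.1se are all assumptions made for illustration.

```r
# Sketch only: simulated data, illustrative names and sizes.
library(Matrix)   # sparseMatrix() and sparse matrix classes
library(glmnet)   # cv.glmnet()

set.seed(1)
n <- 2000   # observations
p <- 50     # predictors

# Build a sparse predictor matrix with (up to) 5 nonzero entries per row.
# Repeated (i, j) pairs are summed by sparseMatrix() by default.
nnz_per_row <- 5
i <- rep(seq_len(n), each = nnz_per_row)
j <- sample.int(p, n * nnz_per_row, replace = TRUE)
x <- sparseMatrix(i = i, j = j, x = rnorm(n * nnz_per_row),
                  dims = c(n, p),
                  dimnames = list(NULL, paste0("x", seq_len(p))))

# Only the first three predictors actually drive the 0/1 response.
beta <- c(2, -2, 1.5, rep(0, p - 3))
eta  <- as.vector(x %*% beta)
y    <- rbinom(n, size = 1, prob = plogis(eta))

# Lasso-penalised logistic regression. alpha = 1 is the lasso, alpha = 0 is
# ridge, values in between give the elastic net. The penalty weight lambda
# is chosen by 10-fold cross-validation.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

# Coefficients at a CV-chosen lambda; exact zeros are predictors dropped.
b        <- coef(cv_fit, s = "lambda.1se")
nonzero  <- as.vector(b) != 0
selected <- setdiff(rownames(b)[nonzero], "(Intercept)")
print(selected)
```

Using s = "lambda.min" instead gives the penalty with the lowest cross-validated deviance, which usually keeps more predictors. If your data start out as a data frame with factors, sparse.model.matrix() from the Matrix package is one way to get a sparse design matrix without building the dense one first.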