Variable selection - K Fold cross-validation or Lasso regression?

Question

I have a set of about 40 predictor variables for a set of 20K subjects. The outcome is a binary yes/no response, so I would like to end up with a logistic regression model. My thought is to use PROC GLMSELECT with k-fold cross-validation to pare down the list of predictors. I would then use the retained variables in a logistic regression model without any further selection (forcing all of them into the model). Is this legit?

Sample code for PROC GLMSELECT:

proc glmselect data=traintest seed=111;
  partition ROLE=selected(train='1' test='0');
  class c1-c30;
  model outcomeyn = n1-n10 c1-c30 / selection=stepwise(choose=CV) cvmethod=random(10);
run;

I'm also not sure what is the correct selection method to use here. I get very different results between using stepwise vs LASSO or LAR. The stepwise selection results in a simpler model, which I would prefer, but I'd like to know the correct way to choose which method to use.

score 0 · Answer 1 · answered May 07 '18 at 19:56


First of all, CV is not a feature selection algorithm. Rather, stepwise selection does the selecting, and CV is used to validate its results. In general, the methods you mentioned need not yield similar results. If you want to try LASSO without the interpretation difficulties, you can run a LASSO model first and then fit an unregularized logistic regression on only the features with non-zero coefficients in the LASSO model.
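A minimal sketch of that two-step idea, written in Python with scikit-learn rather than SAS (the tool choice is an assumption; the answer does not name one). The data are synthetic and hypothetical: 10 numeric predictors of which only the first two affect the outcome. Here 10-fold CV chooses the penalty strength, the $L_1$ penalty does the selection, and the refit uses a very large `C` as a stand-in for "no penalty".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic, hypothetical data: only columns 0 and 1 drive the outcome.
rng = np.random.default_rng(111)
n, p = 2000, 10
X = rng.normal(size=(n, p))
true_logit = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

# Step 1: L1-penalized logistic regression; 10-fold CV picks the
# penalty strength C from a grid of 10 candidate values.
lasso_logit = LogisticRegressionCV(
    Cs=10, cv=10, penalty="l1", solver="liblinear"
)
lasso_logit.fit(X, y)

# Columns whose coefficient was not shrunk exactly to zero.
selected = np.flatnonzero(lasso_logit.coef_[0])

# Step 2: refit on the selected columns only. C=1e6 makes the default
# L2 penalty negligible, which approximates an unpenalized fit.
refit = LogisticRegression(C=1e6, max_iter=1000)
refit.fit(X[:, selected], y)

print("selected columns:", selected)
print("refit coefficients:", refit.coef_[0])
```

One caveat about this design: standard errors and p-values from the refit do not account for the selection step, so they are best read as descriptive.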


Felipe Gerard


Thank you very much, that is helpful. So will this be the proper code: proc glmselect data=traintest seed=111; partition ROLE=selected(train='1' test='0'); class outcomeyn c1-c30; model outcomeyn = n1-n10 c1-c30/ selection=lasso (choose=CV) cvmethod=random(10) ; And is it okay to run LASSO on a dichotomous variable? – E_Woodhouse May 09 '18 at 15:47
I am unfamiliar with glmselect, but you can run logistic regression with $L_1$ regularization. Just make sure it's logistic and not linear regression, because the name "LASSO" was originally used for linear regression with $L_1$ regularization. – Felipe Gerard May 10 '18 at 16:56
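For reference, the distinction drawn in the comment above can be written out. The notation is an assumption, not taken from the thread: $n$ subjects, predictor vectors $x_i$, outcomes $y_i$, coefficients $\beta$, and penalty weight $\lambda \ge 0$. The original LASSO penalizes a squared-error loss:

$$\hat\beta_{\text{lasso}} = \arg\min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_1$$

whereas $L_1$-regularized logistic regression, with $y_i \in \{0, 1\}$, penalizes the negative Bernoulli log-likelihood:

$$\hat\beta_{\text{logit}} = \arg\min_{\beta} \; -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \, x_i^\top \beta - \log\left(1 + e^{x_i^\top \beta}\right) \right] + \lambda \lVert \beta \rVert_1$$

The penalty term is the same in both; only the loss changes, which is why the first is suited to a continuous outcome and the second to a yes/no outcome.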
