I have about 40 predictor variables for roughly 20K subjects. The outcome is a binary yes/no response, so I would like to end up with a logistic regression model. My plan is to use PROC GLMSELECT with k-fold cross-validation to pare down the list of predictors, and then fit a logistic regression model on the selected variables without any further selection (forcing all of them into the model). Is this legit?
Sample code for PROC GLMSELECT:
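proc glmselect data=traintest seed=111;
   /* rows with selected='1' are used for training, selected='0' for testing */
   partition role=selected(train='1' test='0');
   class c1-c30;
   /* stepwise selection; choose=cv keeps the step with the smallest 10-fold CV prediction error */
   model outcomeyn = n1-n10 c1-c30 / selection=stepwise(choose=cv) cvmethod=random(10);
run;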
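For the second step described above, a minimal sketch of the forced-in logistic fit might look like the following. The predictor list (n1 n4 c2 c7) is purely hypothetical and stands in for whatever PROC GLMSELECT retains. The sketch also assumes that outcomeyn is coded 0/1 with 1 as the event of interest and that selected is a character variable.

```sas
/* Sketch only: n1 n4 c2 c7 are placeholder names, not actual selection results */
proc logistic data=traintest;
   /* keep training rows only; assumes selected is character (use selected = 1 if numeric) */
   where selected = '1';
   /* reference-cell coding for the retained class predictors */
   class c2 c7 / param=ref;
   /* no SELECTION= option, so every listed effect stays in the model */
   /* event='1' assumes outcomeyn is coded 0/1 with 1 meaning "yes" */
   model outcomeyn(event='1') = n1 n4 c2 c7;
run;
```

Because the MODEL statement has no SELECTION= option, PROC LOGISTIC fits all listed effects as-is, which is the "force all variables into the model" behavior.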
I'm also not sure which selection method is correct here. I get very different results from stepwise than from LASSO or LAR. Stepwise selection results in a simpler model, which I would prefer, but I'd like to know the correct way to choose between the methods.
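For comparison, a LASSO variant of the same call could be sketched as below. The stop=none suboption is an assumption on my part: it lets the LASSO path run to the end so that choose=cv picks from the full sequence of steps rather than from a path cut short by the default stopping rule. Everything else mirrors the stepwise call.

```sas
/* Sketch only: same data, partition, and folds as the stepwise run */
proc glmselect data=traintest seed=111;
   partition role=selected(train='1' test='0');
   class c1-c30;
   /* stop=none (an assumption) runs the whole LASSO path; choose=cv then picks the step */
   /* with the smallest 10-fold cross-validated prediction error */
   model outcomeyn = n1-n10 c1-c30 / selection=lasso(choose=cv stop=none) cvmethod=random(10);
run;
```

Keeping the seed, partition, and cvmethod= settings identical across runs means that any difference in the selected models comes from the selection method rather than from a different random fold assignment.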