I have a data set with 50 predictors, a mix of categorical and numerical variables, and one dichotomous outcome. I'd like to fit a logistic regression model and validate it with k-fold cross-validation.

However, I am stuck on deciding which predictors to include in my model. I started with initial hypothesis making, where I try to find some reasonable physical entity. However, the resulting model doesn't produce a good AUC (0.74).

Then I tried stepwise regression (backward and backward/forward), using both AIC and BIC, to let the computer guess which variables are better for predicting the outcome. I still can't achieve an AUC better than 0.75.

Therefore, I would like to ask whether there is a gold-standard method for this situation that would help me work out which predictors are best, in order to optimize the predictive power of the model.

I use R for my modeling.

RandomEli
Kostas
  • You need to read the **extensive** discussions of this topic on this site. You started with a false premise. – Frank Harrell Dec 10 '16 at 16:07
  • @FrankHarrell As someone who follows this site regularly, I've read various discussions about variable importance etc. I think it would be useful to the OP if you pointed him to your favorite. He could then use "Related" to do more research. – meh Dec 10 '16 at 16:17
  • http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection#20856 http://stats.stackexchange.com/questions/24752/52-variables-after-backward-variable-selection-on-logistic-regression-on-160-var http://stats.stackexchange.com/questions/215154/variable-selection-for-predictive-modeling-really-needed-in-2016 http://stats.stackexchange.com/search?q=logistic+variable+selection+model-selection+harrell – kjetil b halvorsen Dec 10 '16 at 16:32
  • What do you mean when you say that an AUC of 0.74 isn't any good? What are you comparing that to? – Matthew Drury Dec 10 '16 at 16:37
  • Can you clarify what you meant by 'try to find some reasonable physical entity'? Did you have a scientific hypothesis about the variables? – mdewey Dec 10 '16 at 16:55
  • @FrankHarrell You say the OP starts with a wrong premise. Maybe, to be clear, you can tell the OP what the false premise is. – Michael R. Chernick Dec 10 '16 at 19:33
  • I've never used AUC to judge whether a logistic regression model is any good. Then again, I'm not using logistic regression models for purposes of classification. My point: is this even a classification problem? – The Laconic Apr 07 '17 at 03:03

1 Answer

Not sure about a gold standard, but have you looked at regularization methods such as LASSO? They are used when one is trying to fit a regression with a large number of predictors, and LASSO in particular can double as a variable selection tool because it shrinks some coefficients exactly to zero. The R packages gamlr and glmnet should both allow you to easily run a cross-validated LASSO with logistic regression.
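A minimal sketch of what a cross-validated LASSO logistic regression might look like with glmnet. The data here are simulated as a stand-in for yours, so the names `X`, `y`, `eta`, `selected`, and `cv_auc`, and the choice of 10 folds and `lambda.1se`, are assumptions for illustration only:

```r
library(glmnet)

set.seed(1)

# Simulated stand-in for the real data (an assumption): 500 rows and
# 50 numeric predictors, of which only the first three affect the outcome.
n <- 500
p <- 50
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("x", seq_len(p))
eta <- 1.5 * X[, 1] - 1.0 * X[, 2] + 0.8 * X[, 3]
y <- rbinom(n, size = 1, prob = plogis(eta))

# With a data frame that contains factors, a numeric design matrix could be
# built first, e.g. model.matrix(outcome ~ ., data = df)[, -1], so that
# categorical predictors are expanded into dummy columns.

# 10-fold cross-validated LASSO (alpha = 1), with lambda tuned on AUC.
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                    type.measure = "auc", nfolds = 10)

# Coefficients at the more heavily penalized lambda.1se;
# predictors with nonzero coefficients are the ones LASSO kept.
coefs <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")

# Cross-validated AUC at that same lambda.
cv_auc <- cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.1se]

print(selected)
print(cv_auc)
```

Using `s = "lambda.min"` instead would typically keep more predictors. Note that the cross-validated AUC reported here was also used to pick lambda, so it is somewhat optimistic; an outer validation loop or a held-out set would give a more honest estimate.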

RA334
  • Yes, I would agree with RA334. I think LASSO is the gold standard. glmnet can also be used to fit ridge regression. See the following references for more detailed info: – Alejandro Ochoa Jan 13 '17 at 16:22
  • 1) Zou, Hui, and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301-320. Link: http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2005.00503.x/full 2) The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) – Alejandro Ochoa Jan 13 '17 at 17:27