I have a binary dependent variable and 18 independent variables that I want to use as regressors in a logistic regression. Before fitting, I want to reduce the dimensionality of the data set to guard the model against overfitting. Which dimensionality reduction technique would you recommend, given that 8 regressors are categorical, 1 is ordinal, 9 are count variables, and 1 is interval scaled? The data set has roughly 50000 observations. I'm working in RStudio.
-
How many levels are your 8 categorical variables? – Ian_Fin Dec 07 '16 at 10:11
-
How many cases in each dependent variable category? How do you intend to use your model after it is built? Penalizing rather than completely excluding predictors might be better. – EdM Dec 07 '16 at 10:18
-
homals ("homogeneity analysis") is a method for dimensionality reduction with mixed variables. See this: http://stats.stackexchange.com/questions/108007/correlations-with-categorical-variables/108028#108028 for some ideas. – kjetil b halvorsen Dec 07 '16 at 11:26
-
@Ian_Fin: 6 of the 8 are binary; the other two have 3 levels each. – Joe Dec 08 '16 at 12:49
-
@EdM: the ordinal variable has 5 levels; most of the 9 count variables have roughly 15 levels; the discrete interval-scaled variable has roughly 50 levels. The model will be used to predict the dependent variable. What exactly do you mean by "penalizing"? – Joe Dec 08 '16 at 13:07
-
Of the 50000 cases, how many are in each of the 2 classes? The risk of over-fitting is related to the number of cases in the smaller class, compared against the number of predictors. Also, how are you planning to incorporate the count variables into the regression? – EdM Dec 08 '16 at 14:27
-
Penalization refers to methods that restrict regression coefficients to lower absolute values than they would have in a standard regression. This is a major way to minimize over-fitting. Penalization includes LASSO, ridge regression, and their combination in what's called elastic net. See for example [this summary](http://statweb.stanford.edu/~jtaylo/courses/stats203/notes/penalized.pdf). – EdM Dec 08 '16 at 14:37
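To make the shrinkage idea concrete, here is a minimal sketch of a ridge (L2) penalized logistic regression fitted by plain gradient descent, using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic data, the helper `fit_logistic`, and the values of `l2`, `lr` and `n_iter` are hypothetical and are not taken from the question. It only shows the ridge case; LASSO and elastic net add an L1 term and need a different optimizer. In R, the glmnet package fits all three penalties for `family = "binomial"`.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, l2=0.0, lr=0.5, n_iter=500):
    """Gradient descent on mean log-loss + (l2 / 2) * sum(w_j ** 2).

    The intercept b is left unpenalized. l2 = 0 gives an ordinary
    (unpenalized) logistic regression fit.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        # The penalty adds l2 * w_j to each slope gradient, pulling w_j toward 0.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
    return b, w


def l2_norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


# Hypothetical synthetic data: 5 standardized predictors, only the first 2 matter.
rng = random.Random(0)
true_w = [2.0, -1.5, 0.0, 0.0, 0.0]
X = [[rng.gauss(0.0, 1.0) for _ in true_w] for _ in range(200)]
y = [1 if rng.random() < sigmoid(sum(t * x for t, x in zip(true_w, xi))) else 0
     for xi in X]

b_plain, w_plain = fit_logistic(X, y, l2=0.0)
b_ridge, w_ridge = fit_logistic(X, y, l2=0.5)

print("unpenalized:", [round(v, 3) for v in w_plain])
print("ridge      :", [round(v, 3) for v in w_ridge])
print("norms      :", round(l2_norm(w_plain), 3), round(l2_norm(w_ridge), 3))
```

The ridge coefficients come out with a smaller overall magnitude than the unpenalized ones, which is the restriction described above. In practice the penalty strength would be chosen by cross-validation rather than fixed by hand, and the predictors should be on comparable scales before penalizing.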