
Let $y$ be a binary dependent variable. I am using a logit model to explain the relationship between $y$ and many binary explanatory variables. I am mainly interested in finding which group (or intersection of groups) gives the highest probability of success. In principle this kind of analysis is fairly simple; the main difficulty is that, once all possible interaction effects are considered, I have well into the millions of possible categories.

Do you have any advice on how to approach this problem?

Again, my sticking point here is that I have way too many explanatory dummy variables.

Christian
    An idea: Run a regression tree and pick the leaf with the highest success probability. Relevant interactions are automatically chosen by the separation method. – Michael M Dec 03 '13 at 08:59
  • Could you please elaborate on that point? Or point me to a concise, well-written explanation? PS: I have not studied data mining before. – Christian Dec 05 '13 at 00:19
    The basic idea of classification and regression trees (CART) is as simple as of linear regression. Maybe the wiki article helps you to find good literature about this. http://en.m.wikipedia.org/wiki/Classification_and_regression_tree – Michael M Dec 05 '13 at 17:35
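For concreteness, here is a minimal R sketch of the regression-tree idea suggested in these comments (this illustration is not Michael M's code; the data frame `dat` and its 0/1 response `y` are hypothetical, as are the tuning settings):

```r
## Hedged sketch: fit a classification tree and inspect the leaf with the
## highest estimated success probability.  `dat` and `y` are hypothetical.
library(rpart)

dat$y <- factor(dat$y)                        # 0/1 response as a factor
fit <- rpart(y ~ ., data = dat, method = "class",
             control = rpart.control(cp = 0.001, minbucket = 50))

## Fitted probability of success for the leaf each observation falls in
p_leaf <- predict(fit, type = "prob")[, 2]    # column 2 = P(y = 1)

## A covariate pattern landing in the best leaf
dat[which.max(p_leaf), ]

## The printed tree shows which splits (i.e., interactions) define that leaf
print(fit)
```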

3 Answers


The only implementation of variable selection for logistic regression in SAS, as far as I know, is stepwise selection in PROC LOGISTIC. Stepwise is generally not recommended, for reasons espoused by gung here.

"Modern" methods of variable selection rely on estimators with "built-in" selection via shrinkage penalties, i.e., ridge and lasso, or their generalization, the elastic net, plus cross-validation to choose how heavy a penalty to apply. Bayesian models can also perform variable selection (I've read that the lasso/ridge/elastic-net penalties also have a Bayesian interpretation).

SAS implements the lasso in PROC GLMSELECT, but only for linear regression. I suppose you could code your response as 0/1, forgo the logistic link, plug that in, and then take the selected variables back to PROC LOGISTIC to re-estimate, but I have no idea whether that is anywhere near a good idea.

Another issue with SAS's lasso implementation is that it does not appear to be as computationally efficient as the glmnet implementation in R (especially when used along with sparseMatrix). With large numbers of predictors, it may not finish in any reasonable amount of time.

My experience:
  • 300k observations / 60 predictors: ~15s in R, several minutes in SAS.
  • 300k observations / 800 predictors, using sparse matrices: ~45s in R; I stopped SAS 12 hours in (this is what happens when you max out your RAM).
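For reference, a minimal sketch of the cross-validated lasso / sparse-matrix approach described above. The data frame `dat`, its 0/1 response `y`, and the choice to include all two-way interactions are assumptions for illustration, not a prescription:

```r
## Hedged sketch: lasso-penalized logistic regression on a sparse design matrix.
library(glmnet)
library(Matrix)

## Sparse design matrix with all two-way interactions (restrict the formula
## if even this is too large)
X <- sparse.model.matrix(y ~ .^2, data = dat)[, -1]   # drop the intercept column
y <- dat$y

## Cross-validated lasso (alpha = 1); alpha between 0 and 1 gives the elastic net
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

## Main effects and interactions retained at the CV-chosen penalty
coef(cvfit, s = "lambda.min")

## Covariate pattern with the highest predicted probability of success
phat <- predict(cvfit, newx = X, s = "lambda.min", type = "response")
dat[which.max(phat), ]
```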

Affine

Recursive partitioning (CART) will require an enormous sample size in order to achieve stability and good discrimination ability. It will not handle continuous variables appropriately, and its apparent interactions are often spurious. I suggest fitting a pre-specified logistic regression model with sensible, pre-specified interactions, using a quadratic penalty if the sample size does not support the pre-specified degrees of freedom. The search for the set of $X$ yielding the maximum $\hat{P}$ is then an algebraic problem, except that it is subject to a large multiple-comparison problem that will result in regression to the mean (overprediction of the highest $P$). You can bootstrap the entire process to (1) check the stability of the solution for $X$ and (2) find a multiplicity-penalized confidence interval for $P_{\max}$.
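A rough sketch of how this workflow might look in R with the rms package (this is an illustration of the idea, not code from the answer; the model formula, data frame `dat`, penalty grid, and number of bootstrap replicates are all hypothetical):

```r
## Hedged sketch: pre-specified penalized logistic model, search for the
## covariate pattern with maximum predicted probability, bootstrap the search.
library(rms)

## Pre-specified model with a small set of sensible interactions (hypothetical)
f <- lrm(y ~ x1 * x2 + x3, data = dat, x = TRUE, y = TRUE)

## Choose a quadratic (ridge-type) penalty if the sample size is modest
pt <- pentrace(f, seq(0, 20, by = 1))
fp <- update(f, penalty = pt$penalty)

## Enumerate candidate covariate patterns and find the maximum predicted P
grid <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1)
phat <- predict(fp, newdata = grid, type = "fitted")
grid[which.max(phat), ]

## Bootstrap the entire process to see how stable the winning pattern is
B <- 200
winners <- integer(B)
for (i in seq_len(B)) {
  b  <- dat[sample(nrow(dat), replace = TRUE), ]
  fb <- lrm(y ~ x1 * x2 + x3, data = b, x = TRUE, y = TRUE,
            penalty = pt$penalty)
  winners[i] <- which.max(predict(fb, newdata = grid, type = "fitted"))
}
table(winners)   # how often each candidate pattern wins across resamples
```

Carrying the maximum predicted probability itself through the same bootstrap loop would give the raw material for the multiplicity-penalized interval mentioned in point (2).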

Frank Harrell
  • Frank, can you kindly visit the dilemma [if we ought to create tag "dummy-variables"](https://stats.meta.stackexchange.com/a/4881/3277) and leave there a comment either con or pro? – ttnphns Jul 28 '17 at 09:45

One exploratory idea would be to estimate a latent class model (mixture model) in which the $X$s and $Y$ are all included as conditionally independent observed variables, and then constrain one (or more) classes $i$ so that $P(Y \mid \text{class}_i) = 1$. I don't expect this would do what you are asking, but it might provide some insight into the "types" of patterns that are most strongly related to $Y$.
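A very rough R sketch of the unconstrained version of this idea using the poLCA package follows. The data frame `dat`, the variable names, and the number of classes are hypothetical, and poLCA does not directly support fixing $P(Y \mid \text{class}_i) = 1$, so the constrained version would need other software or custom code:

```r
## Hedged sketch: exploratory latent class model with Y and the X's as
## conditionally independent indicators.  All names and settings are assumed.
library(poLCA)

## poLCA wants manifest variables coded as positive integers (1, 2, ...)
vars    <- c("y", "x1", "x2", "x3")
lca_dat <- data.frame(lapply(dat[, vars], function(v) as.integer(factor(v))))

f  <- cbind(y, x1, x2, x3) ~ 1
m3 <- poLCA(f, data = lca_dat, nclass = 3, maxiter = 5000, nrep = 10)

## Class-conditional response probabilities: look for classes in which
## P(y = success | class) is close to 1
m3$probs$y
```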

D L Dahly