This sounds like a suitable problem for the lasso and friends, which do shrinkage and variable selection. *The Elements of Statistical Learning* describes the lasso and the elastic net for regression and, more relevantly for this problem, for logistic regression.
The authors of the book have made an efficient implementation of the lasso and the elastic net available as an R package called glmnet. I have previously used this package for binary data analysis with data matrices of approximately 250,000 rows and somewhat fewer columns, regressing each column on all of the other columns. If the data matrix is also sparse, the implementation can take advantage of that, and I believe the method can genuinely work on the OP's full data set. Here are some comments on the lasso:
- The lasso achieves variable selection by using a penalty function that is non-smooth (the $\ell_1$-norm), which generally results in parameter estimates where some parameters are exactly equal to 0. How many parameters are estimated to be non-zero, and how much the non-zero parameters are shrunk, is determined by a tuning parameter. The efficiency of the implementation in glmnet relies heavily on the fact that, for a large penalty, only a few parameters are different from 0.
- The selection of the tuning parameter is often done by cross-validation, but even without the cross-validation step the method may be able to give a good sequence of selected variables indexed by the penalty parameter (the glmnet sketch after this list illustrates both the penalty path and the cross-validation).
- A downside of the lasso for variable selection is that it can be unstable in which variables it selects, in particular if the variables are somewhat correlated. The more general elastic net penalty was invented to improve on this instability, but it does not solve the problem completely. The adaptive lasso is another idea for improving variable selection with the lasso.
- Stability selection is a general method suggested by Meinshausen and Bühlmann to achieve greater stability of the selected variables with methods like the lasso. It requires a number of fits to subsamples of the data set and is, as such, much more computationally demanding (a rough sketch of the subsampling idea is also given after this list).
- A reasonable way to think of the lasso is as a method for generating a one-dimensional set of "good" models, ranging from a single-variable model to a more complicated model (not necessarily including all variables), parametrized by the penalty parameter. In contrast, univariate filters only produce a selection, or ordering, of good single-variable models.
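To make the glmnet usage concrete, here is a minimal sketch of a lasso-penalized logistic regression on a sparse data matrix. Everything here is simulated purely for illustration (the matrix `x`, the response `y` and the handful of "true" variables are placeholders), so substitute your own data:

```r
library(glmnet)   # lasso / elastic net
library(Matrix)   # sparse matrices, which glmnet accepts directly

set.seed(1)
n <- 1000; p <- 5000
x <- rsparsematrix(n, p, density = 0.01)       # sparse predictor matrix (dgCMatrix)
beta <- c(rep(2, 5), rep(0, p - 5))            # only the first 5 variables matter
y <- rbinom(n, 1, plogis(as.numeric(x %*% beta)))

## Lasso-penalized logistic regression over a whole sequence of penalties
fit <- glmnet(x, y, family = "binomial", alpha = 1)   # alpha = 1 is the lasso
## Number of non-zero coefficients at each value of the penalty parameter
print(data.frame(lambda = fit$lambda, nonzero = fit$df))

## Cross-validation to select the penalty; most coefficients at the selected
## penalty are exactly 0
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")
```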
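And here is a rough sketch of the subsampling idea behind stability selection, reusing `x` and `y` from above. This is not the full Meinshausen and Bühlmann procedure (there is no error control, and a single penalty value is used), just the selection-frequency part:

```r
library(glmnet)

## Fix one penalty value, here taken from a cross-validated fit on the full data
lambda_fixed <- cv.glmnet(x, y, family = "binomial", alpha = 1)$lambda.min

B <- 50                                   # number of subsamples
sel_count <- numeric(ncol(x))

for (b in seq_len(B)) {
  idx <- sample(nrow(x), floor(nrow(x) / 2))               # half the observations
  fit_b <- glmnet(x[idx, ], y[idx], family = "binomial", alpha = 1)
  beta_b <- as.numeric(coef(fit_b, s = lambda_fixed))[-1]  # drop the intercept
  sel_count <- sel_count + (beta_b != 0)
}

## Variables selected in a large fraction (say more than 60%) of the subsamples
## are considered stable
which(sel_count / B > 0.6)
```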
For Python, scikit-learn provides implementations of methods such as the lasso and the elastic net.