This sounds like a suitable problem for the lasso and friends, which do shrinkage and variable selection. *The Elements of Statistical Learning* describes the lasso and the elastic net for regression and, more relevantly for this problem, for logistic regression.
The authors of the book have made an efficient implementation of the lasso and the elastic net available as an R package called glmnet. I have previously used this package for binary data analysis with data matrices of approximately 250,000 rows and somewhat fewer columns, regressing each column on all of the other columns. If the data matrix is also sparse, the implementation can take advantage of that, and I believe the method can genuinely work on the OP's full data set. Here are some comments on the lasso:
- The lasso achieves variable selection by using a penalty function that is non-smooth (the $\ell_1$-norm), which generally results in parameter estimates where some parameters are exactly equal to 0. How many parameters are estimated to be non-zero, and how much the non-zero parameters are shrunk, is determined by a tuning parameter. The efficiency of the implementation in glmnet relies heavily on the fact that, for a large penalty, only a few parameters are different from 0.
- The selection of the tuning parameter is often done by cross-validation, but even without the cross-validation step the method may be able to give a good sequence of selected variables indexed by the penalty parameter (the glmnet sketch after this list illustrates both the penalty path and the cross-validation).
- A downside of the lasso for variable selection is that it can be unstable in which variables it selects, in particular if the variables are somewhat correlated. The more general elastic net penalty was invented to improve on this instability, but it does not solve the problem completely. The adaptive lasso is another idea for improving variable selection with the lasso.
- Stability selection is a general method suggested by Meinshausen and Bühlmann to achieve greater stability of the selected variables with methods like the lasso. It requires a number of fits to subsamples of the data set and is, as such, much more computationally demanding (a rough sketch of the subsampling idea is also given after this list).
- A reasonable way to think of the lasso is as a method for generating a one-dimensional set of "good" models, ranging from a single-variable model to a more complicated model (not necessarily including all variables), parametrized by the penalty parameter. In contrast, univariate filters only produce a selection, or ordering, of good single-variable models.
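To make the glmnet usage concrete, here is a minimal sketch of a lasso-penalized logistic regression on a sparse data matrix. Everything here is simulated purely for illustration (the matrix `x`, the response `y` and the handful of "true" variables are placeholders), so substitute your own data:

```r
library(glmnet)   # lasso / elastic net
library(Matrix)   # sparse matrices, which glmnet accepts directly

set.seed(1)
n <- 1000; p <- 5000
x <- rsparsematrix(n, p, density = 0.01)       # sparse predictor matrix (dgCMatrix)
beta <- c(rep(2, 5), rep(0, p - 5))            # only the first 5 variables matter
y <- rbinom(n, 1, plogis(as.numeric(x %*% beta)))

## Lasso-penalized logistic regression over a whole sequence of penalties
fit <- glmnet(x, y, family = "binomial", alpha = 1)   # alpha = 1 is the lasso
## Number of non-zero coefficients at each value of the penalty parameter
print(data.frame(lambda = fit$lambda, nonzero = fit$df))

## Cross-validation to select the penalty; most coefficients at the selected
## penalty are exactly 0
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")
```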
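And here is a rough sketch of the subsampling idea behind stability selection, reusing `x` and `y` from above. This is not the full Meinshausen and Bühlmann procedure (there is no error control, and a single penalty value is used), just the selection-frequency part:

```r
library(glmnet)

## Fix one penalty value, here taken from a cross-validated fit on the full data
lambda_fixed <- cv.glmnet(x, y, family = "binomial", alpha = 1)$lambda.min

B <- 50                                   # number of subsamples
sel_count <- numeric(ncol(x))

for (b in seq_len(B)) {
  idx <- sample(nrow(x), floor(nrow(x) / 2))               # half the observations
  fit_b <- glmnet(x[idx, ], y[idx], family = "binomial", alpha = 1)
  beta_b <- as.numeric(coef(fit_b, s = lambda_fixed))[-1]  # drop the intercept
  sel_count <- sel_count + (beta_b != 0)
}

## Variables selected in a large fraction (say more than 60%) of the subsamples
## are considered stable
which(sel_count / B > 0.6)
```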
For Python, scikit-learn provides implementations of methods such as the lasso and the elastic net.