
If I have a dependent variable and $N$ predictor variables, and I want my stats software to examine all possible models, there are $2^N$ possible resulting equations.
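For concreteness, here is a quick illustration of that count in R (toy predictor names, nothing specific to my data):

```r
# Each predictor is either in or out of the model, so with N predictors there
# are 2^N candidate equations (counting the intercept-only model).
N <- 3
predictors <- paste0("x", 1:N)   # hypothetical predictor names
models <- unlist(lapply(1:N, function(k)
  combn(predictors, k, paste, collapse = " + ")))
models              # "x1" "x2" "x3" "x1 + x2" "x1 + x3" "x2 + x3" "x1 + x2 + x3"
length(models) + 1  # 7 non-empty subsets + the intercept-only model = 2^3 = 8
2^c(10, 20, 30)     # the count explodes: 1024, ~1e6, ~1e9
```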

I am curious to find out what the limitations on $N$ are for major/popular statistics software, since as $N$ gets large there is a combinatorial explosion.

I've poked around the various web pages for these packages but have not been able to find this information. I would suspect a limit of around 10-20 for $N$?

If anyone knows (and has links) I would be grateful for this information.

Aside from R and Minitab, I can think of SAS, SPSS, Stata, Matlab, and Excel(?). Are there any other packages I should consider?

Levon
  • @levon9 This question generated a lot of sound answers and comments, so I've +1'd. But please forget about Excel for doing serious work in model selection... – chl Mar 01 '11 at 22:21
  • @chl .. is that because Excel is slow, or just incapable (ie would give inaccurate results?). – Levon Mar 01 '11 at 23:56
  • @levon9, @chl Excel is (in principle) capable of implementing model selection algorithms correctly. It doesn't do so out of the box. Does anyone have a particular add-in in mind? – whuber Mar 02 '11 at 05:55
  • @levon9 @whuber My point about Excel was not related to its performance (which I don't know in this particular case) but it was merely to point to "better" software that provide integrated tools for model building, selection, diagnostics (and yes I must admit I'm a little bit biased toward R or Stata for that purpose). – chl Mar 02 '11 at 08:02

4 Answers


I suspect 30-60 variables is about the best you'll get. The standard approach is the leaps-and-bounds algorithm, which doesn't require fitting every possible model. In R, the **leaps** package is one implementation.

The documentation for the regsubsets function in the leaps package states that it will handle up to 50 variables without complaining. It can be "forced" to do more than 50 by setting the appropriate boolean flag.
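For illustration, here is a minimal R sketch along those lines on simulated data (if I remember the argument names correctly, `nvmax` controls the largest subset size considered, and `really.big = TRUE` is the flag needed once you go past 50 variables; I keep the example small so the exhaustive search finishes quickly):

```r
library(leaps)

# Simulated example: 15 candidate predictors, 200 observations.
set.seed(42)
n <- 200; p <- 15
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(n)  # only x1-x3 truly matter
dat <- data.frame(y = y, X)

# Exhaustive (leaps-and-bounds) search over all subsets.
# Past 50 variables you would also need really.big = TRUE.
fit  <- regsubsets(y ~ ., data = dat, nvmax = p, method = "exhaustive")
summ <- summary(fit)
best <- which.min(summ$bic)  # best subset size by BIC
coef(fit, best)              # coefficients of the chosen subset
```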

You might do a bit better with some parallelization technique, but the number of total models you can consider will (almost undoubtedly) only scale linearly with the number of CPU cores available to you. So, if 50 variables is the upper limit for a single core, and you have 1000 cores at your disposal, you could bump that to about 60 variables.
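Spelling out that arithmetic:

$$\frac{2^{60}}{2^{50}} = 2^{10} = 1024 \approx 1000,$$

so a thousand-fold increase in cores buys you only about ten extra variables.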

cardinal
    **leaps** is great, I like the plots from it, +1. In real applications some averaging techniques work faster (and better) than pretested estimators, even ones found from all possible regression models. So I would suggest going for Bayesian model averaging (the BMA package; a minimal sketch follows this comment thread) or the algorithm I like the most, weighted-average least squares (WALS, http://www.tilburguniversity.edu/research/institutes-and-research-groups/center/staff/magnus/wals/), developed by J.R. Magnus et al. The Matlab code is easily transformable to R code. The good thing about WALS is that its computational burden is of order $N$ instead of $2^N$. – Dmitrij Celov Mar 01 '11 at 11:14
    @Dmitrij, thanks for your comments. I've tried to remain fairly agnostic in my response regarding the utility of all-subsets regression. It seems to me that there is nearly always a better solution, but I felt that might seem like too trite of a response to the OP's question. – cardinal Mar 01 '11 at 12:34
  • @Dmitrij, BMA over main-effects models would still have the same computational complexity as all-subsets regression. No? The main advantage in BMA seems to me to be in trying to figure out which covariates are likely to be influencing the response. BMA does this by essentially averaging the log likelihoods over $2^{n-1}$ submodels. – cardinal Mar 01 '11 at 12:38
  • Thanks for the pointer to the leaps R package! I didn't know about it and it might come in handy in the future. If I could get some information on specific limitations for N for other popular packages it would be very helpful. – Levon Mar 01 '11 at 12:44
    @levon9, I doubt it will vary much at all by package. The algorithm that **leaps** uses has been state of the art for at least 20 years or so. Even if you found an implementation that was *twice* as fast, that would mean you got to increment the number of variables you can handle by one. For every doubling of the speed, you get one more variable. Hardware, not algorithmic, limitations are your bottleneck in this case. – cardinal Mar 01 '11 at 13:00
  • @cardinal, exactly, BMA has the same computational complexity disadvantage as all-subsets regression (in Eviews it is named as combinatorial approach ^_^). It is why I value WALS more, since it both weights the covariates, is faster and is useful if we do have *focus* parameters (weighted estimator has smaller *pre-test* bias) and parameters that go to auxiliary variables we are not certain about and, yes, it solves the problems @Dikran mentioned in his post. Focus variables are theory based (no room to go spurious or over-fitting), large information set fights pre-test bias problem well. – Dmitrij Celov Mar 01 '11 at 16:02
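To make the BMA suggestion from the comments concrete, here is a minimal sketch using the BMA package on simulated data (to the best of my knowledge `bicreg` is the package's BIC-based model-averaging routine for linear regression; treat the details as illustrative only):

```r
library(BMA)

# Simulated data: 15 candidate predictors, only the first 3 truly relevant.
set.seed(42)
n <- 200; p <- 15
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(n)

# Bayesian model averaging over subsets, with models weighted via a BIC approximation.
fit <- bicreg(X, y)
summary(fit)  # posterior inclusion probabilities and model-averaged coefficients
```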

Just a caveat: feature selection is a risky business. The more features you have, the more degrees of freedom you have with which to optimise the feature selection criterion, and hence the greater the risk of over-fitting that criterion and, in doing so, obtaining a model with poor generalisation ability. While an efficient algorithm and careful coding may let you perform all-subsets selection with a large number of features, that doesn't mean it is a good idea, especially if you have relatively few observations. If you do use all-subsets selection, it is vital to properly cross-validate the whole model-fitting procedure, so that the all-subsets selection is performed independently in each fold of the cross-validation. In practice, ridge regression with no feature selection often out-performs linear regression with feature selection (that advice is given in Miller's monograph on subset selection).
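A minimal sketch of what "cross-validating the whole procedure" looks like in R, assuming simulated data and using leaps for the all-subsets search inside each fold (illustrative only):

```r
library(leaps)

# The all-subsets search is re-run inside every fold, so the selection step
# itself is part of what gets validated.
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(X[, 1:2] %*% c(1.5, -1)) + rnorm(n)
dat <- data.frame(y = y, X)

K <- 5
fold <- sample(rep(1:K, length.out = n))
cv_err <- numeric(K)

for (k in 1:K) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]

  # All-subsets selection on the training fold only.
  search <- summary(regsubsets(y ~ ., data = train, nvmax = p))
  best   <- which.min(search$bic)
  vars   <- names(which(search$which[best, -1]))  # drop the intercept column

  # Refit the chosen model on the training fold, assess on the held-out fold.
  refit <- lm(reformulate(vars, response = "y"), data = train)
  cv_err[k] <- mean((test$y - predict(refit, newdata = test))^2)
}

mean(cv_err)  # honest estimate of generalisation error for the whole procedure
```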

Dikran Marsupial
    @Dikran, (+1) good comments. I was trying to avoid going there since it didn't directly address the OP's question. But, I agree. All-subsets is rarely the way to go. And, if you do go that way, you need to understand all the implications. – cardinal Mar 01 '11 at 13:19
  • @Dirkan thank you for the comments, I am a real stats newbie. I realize the danger of overfitting the model when too many variables are in play, so I am just looking at various automated ways (ie without much benefit of insight) such as the stepwise approach (which might get caught in a local maximum) and the exhaustive all-subsets approach - and the computational limits it faces (and the external limitations imposed by packages) – Levon Mar 01 '11 at 13:48
    @levon9, you can get over-fitting that is just as serious when you choose the features, so feature selection does not guard against over-fitting. Consider a logistic regression model used to predict the outcome of flipping a fair coin. The potential inputs are the outcomes of flipping a large number of other fair coins. Some of these inputs will be positively correlated with the target, so the best all-subsets model will select inputs (even though they are useless) and you will get a model that appears to have skill, but in reality is no better than guessing. (A small simulation along these lines follows this comment thread.) – Dikran Marsupial Mar 01 '11 at 14:39
  • @Dikran (+1) the same as @cardinal, I first wrote a similar text, but then decided it is not what @levon9 asked, because he simply was curious about the complexity :) – Dmitrij Celov Mar 01 '11 at 16:06
  • @Dikran +1 because I like such advice. – chl Mar 01 '11 at 22:22
  • @Dikran thanks for the additional clarifications/comments - and sorry about the typo earlier with your name. – Levon Mar 01 '11 at 23:59
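A small simulation in the spirit of the coin-flipping comment above (it uses a single selected predictor rather than a full all-subsets search, which is an even weaker form of selection and is already enough to manufacture apparent significance):

```r
# The target is pure noise, yet the selected "best" coin can look significant.
set.seed(2)
n <- 50; p <- 40
y <- rbinom(n, 1, 0.5)                    # outcome: fair coin flips
X <- matrix(rbinom(n * p, 1, 0.5), n, p)  # candidate inputs: more fair coins

best <- which.max(abs(cor(X, y)))         # pick the coin most correlated with y
fit <- glm(y ~ X[, best], family = binomial)
summary(fit)$coefficients                 # the selected coin often has a small p-value
```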

I was able to generate all possible subsets using 50 variables in SAS. I do not believe there is any hard limitation other than memory and CPU speed.

Edit

I generated the 2 best models for N=1 to 50 variables for 5000 observations.

@levon9 - No, this ran in under 10 seconds. I generated 50 random variables from (0,1)

-Ralph Winters

Ralph Winters
  • For what size of dataset? – M. Tibbits Mar 01 '11 at 16:17
  • Thanks, very useful information. Just curious, did this take long? – Levon Mar 01 '11 at 21:34
  • I have undeleted this post (and merged another of your comments in an edit) because the OP found it useful and others might too. Thank you for your contribution; please keep it up! (If you really think it should be deleted, go ahead and do so; I won't contravene your wishes again.) – whuber Mar 02 '11 at 05:53
  • It seems you are using two different unregistered accounts. I have merged them but you will still need to register. – chl Mar 02 '11 at 14:08

As $N$ gets big, your ability to use maths becomes absolutely crucial; "inefficient" mathematics will cost you at the PC. The upper limit depends on what equation you are solving. Avoiding matrix inverse or determinant calculations is a big advantage.

One way to help increase the limit is to use theorems for decomposing a large matrix inverse into smaller matrix inverses. This can often mean the difference between feasible and not feasible. But it involves some hard work, and often quite complicated mathematical manipulations! It is usually worth the time, though. Do the maths or do the time!
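For reference, the standard partitioned-inverse (Schur-complement) identity is the sort of decomposition presumably meant here:

$$
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
=
\begin{pmatrix}
A^{-1} + A^{-1} B S^{-1} C A^{-1} & -A^{-1} B S^{-1} \\
-S^{-1} C A^{-1} & S^{-1}
\end{pmatrix},
\qquad S = D - C A^{-1} B,
$$

so the full inverse only requires inverting the smaller blocks $A$ and the Schur complement $S$.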

Bayesian methods might give an alternative way to get your result - it might be quicker, which means your "upper limit" will increase (if only because having two alternative ways of calculating the same answer means the quicker of the two can never be slower than either one on its own!).

If you can calculate a regression coefficient without inverting a matrix, then you will probably save a lot of time. This may be particularly useful in the Bayesian case, because "inside" a normal marginalisation integral the $X^{T}X$ matrix does not need to be inverted; you just calculate a sum of squares. Further, the determinant of that matrix forms part of the normalising constant. So "in theory" you could use sampling techniques to numerically evaluate the integral (even though it has an analytic expression), which would be eons faster than trying to evaluate the "combinatorial explosion" of matrix inverses and determinants. (It would still be a "combinatorial explosion" of numerical integrations, but these may be quicker.)

The suggestion above is a bit of a "thought bubble" of mine. I want to actually test it out and see if it's any good. I think it would be: 5,000 simulations, plus calculating exp(sum of squares), plus calculating a weighted-average beta, should be faster than matrix inversion for a big enough matrix.

The cost is that you get approximate rather than exact estimates. There is nothing to stop you from using the same set of pseudo-random numbers to numerically evaluate each integral, which will again save you a great deal of time.

There is also nothing stopping you from using a combination of the two techniques: use exact computation when the matrices are small and simulation when they are big. They are just different numerical techniques for the same part of the analysis - pick whichever is quickest!

Of course this is all a bit of a "hand-wavy" argument; I don't know exactly which software packages are best to use - and, worse, it's hard to figure out which algorithms they actually use.

probabilityislogic
    @probabilityislogic, while your response is interesting, maybe it could be refocused to better address the OP's question. Also, ***no*** software for computing least-squares solutions performs matrix inversion, much less a determinant. Ever. Unless it's inverting a $1 \times 1$ matrix. – cardinal Mar 01 '11 at 12:28
    @probabilityislogic, handling the $2^n$ cases efficiently quickly far outstrips the $O(n^3)$ issues of an efficient least-squares solution. That's where the *leaps-and-bounds* algorithm comes in. – cardinal Mar 01 '11 at 12:30
  • Thanks for the post. "Do the maths or do the time!" :-) .. I'm actually not even trying to figure out the underlying algorithms used by the packages (thought that is interesting to know), at this point really looking for specific information regarding the limitations of N by the major packages. – Levon Mar 01 '11 at 12:47
  • @cardinal The updating and downdating algorithms also exist for various matrix decomposition procedures, I suspect that is what was meant by "matrix inverse" etc. – Dikran Marsupial Mar 01 '11 at 13:04
  • @Dikran, several efficient and numerically stable approaches to least-squares exist, including methods for augmenting or reducing a design matrix by one column at a time. Sometimes it's good to understand what is happening under the surface, even if on most days you don't need to care. – cardinal Mar 01 '11 at 13:29
  • @cardinal - I'm curious about your comment that "least squares" never performs matrix inversion. The main equation for the estimates is $\beta=(X^{T}X)^{-1}X^{T}Y$, and the variance of these estimates is given by $\sigma^{2}(X^{T}X)^{-1}$. The matrix inverse is fundamental to typical least-squares regression, at least in the mathematics - although I may be showing my ignorance of the actual computational procedures followed. – probabilityislogic Mar 05 '11 at 05:16
  • @probabilityislogic, one common approach is to use (some variant of) a $QR$ decomposition. So, we write $X = Q R$, where $Q$ is a matrix with orthogonal columns and $R$ is a square triangular matrix. It's an easy exercise to see that the fitted values can be written as $\hat{y} = Q Q^T y$ and the parameter estimates are the solution to the triangular system of equations $R \hat{\beta} = Q^T y$. Triangular systems are very efficient to solve. The $QR$ decomposition using Householder reflections or Givens rotations is very numerically stable. No matrix inversion necessary. (A short R illustration follows at the end of this thread.) – cardinal Mar 05 '11 at 15:11
  • @cardinal - thanks for that. So I suppose my "thought bubble" reduces to comparing the QR decomposition speed with numerical integration – probabilityislogic Mar 06 '11 at 01:21
  • @probabilityislogic, $QR$ decomposition should never be worse than $O(n p^2)$, and that provides an exact (to within numerical precision) answer. To compare that against Monte Carlo integration would require you specifying at least a couple notions of desired precision. – cardinal Mar 06 '11 at 02:23
  • @cardinal - I would suggest that the MC method has the ability to be "scaled-up" (more simulations) or "scaled-down" (less) according to the precision required. With the QR method, although exact, you are stuck to a degree with the same computing time. For something like all subsets regression, exactness of the answer may not be the number one priority. Once again, if you have two methods - one of them will be quicker. Expanding on my thought bubble would consist of what conditions are required for one method to be quicker than the other - and at what cost. – probabilityislogic Mar 06 '11 at 09:30
  • @probabilityislogic, the issue is still that the $O(2^n)$ work for all-subsets is much larger than the $O(n p^2)$ work of a $QR$ decomposition. In the $QR$ case, once the initial full-regression estimates are obtained, conceivably, only simple one-rank updates can be made to find the parameter estimates for smaller models. These rank-one updates are fast. But, for all-subsets, *every* combination of such updates would have to be made. Your method would require re-estimating the integral every time $X$ changes and is not exact. I'd hazard a guess that that's far less efficient. – cardinal Mar 06 '11 at 15:06
  • @cardinal - one would not be required to "re-estimate" the numerical integral. You simply ignore the posterior samples for those parts of the model which are being excluded. You only need 1 simulation from the full model, and no rank 1 updates are required - this is where I think you will save a lot of time. One such question could be "is $\beta_j$ relevant to my model, *regardless of what other parameters are in the model?* ". Very quick to decide this - just look at the marginal simulated distribution for $\beta_j$. – probabilityislogic Mar 08 '11 at 06:51
  • @cardinal - adding to my above point, suppose you have a "rejection region" e.g. $\frac{|\beta_j|}{SE(\beta_j)} \leq 1$ which you are prepared to declare that the coefficient is "zero" and remove it from the model. Then the $2^n$ all subsets regression on the simulated dataset reduces to the problem of an n-way contingency table, each way has 2 outcomes - in the region or not. The "best models" have the highest probability in this table – probabilityislogic Mar 08 '11 at 08:06
  • @probabilityislogic, your comments really don't have anything to do with all-subsets regression and I wouldn't try to force them into that framework. They appear to have more to do with *model selection* in regression. There are a multitude of such methods, both classical and modern, including threshold approaches like the one you describe. The lasso is one example and it even has a Bayesian interpretation. Usually you need a condition close to orthogonality of the design matrix to guarantee good performance (even asymptotically!). – cardinal Mar 09 '11 at 16:04
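To make the QR comment above concrete, here is a short base-R illustration (just a sketch; it says nothing about what any particular package does internally):

```r
# Least-squares coefficients via a QR decomposition -- no explicit inverse of
# X'X is ever formed; only a triangular system is solved.
set.seed(3)
n <- 100; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design matrix with intercept
beta_true <- c(1, 2, -1, 0.5, 0)
y <- drop(X %*% beta_true) + rnorm(n)

qrX  <- qr(X)                                        # Householder QR of X
beta <- backsolve(qr.R(qrX), qr.qty(qrX, y)[1:p])    # solve R beta = Q'y

cbind(qr_solution = beta, lm_solution = coef(lm(y ~ X - 1)))  # should agree
```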