
I'm reading about best subset selection in *The Elements of Statistical Learning*. If I have 3 predictors $x_1,x_2,x_3$, I create $2^3=8$ subsets:

  1. Subset with no predictors
  2. Subset with predictor $x_1$
  3. Subset with predictor $x_2$
  4. Subset with predictor $x_3$
  5. Subset with predictors $x_1,x_2$
  6. Subset with predictors $x_1,x_3$
  7. Subset with predictors $x_2,x_3$
  8. Subset with predictors $x_1,x_2,x_3$

Then I test all these models on the test data to choose the best one.

Now my question is why is best subset selection not favored in comparison to e.g. lasso?

If I compare the thresholding functions of best subset and lasso, I see that best subset sets some of the coefficients to zero, like lasso. But the other coefficients (the non-zero ones) will still have their OLS values; they will be unbiased. In lasso, by contrast, some of the coefficients will be zero and the others (the non-zero ones) will have some bias. The figure below shows it better: [figure comparing the thresholding functions of best subset and lasso]

In the picture, the part of the red line in the best-subset case lies on the gray one; the other part lies on the x-axis, where some of the coefficients are zero. The gray line marks the unbiased solutions. In lasso, some bias is introduced by $\lambda$. From this figure, best subset looks better than lasso! What are the disadvantages of using best subset?
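For reference, the two curves in such a figure have closed forms in the orthonormal-design case (this is a sketch of the standard textbook formulas, with an illustrative threshold of 1): best subset applies hard thresholding to the OLS estimate, while the lasso applies soft thresholding.

```python
import numpy as np

def hard_threshold(b_ols, t):
    """Best subset (orthonormal X): keep the OLS value unchanged or set it to zero."""
    return np.where(np.abs(b_ols) > t, b_ols, 0.0)

def soft_threshold(b_ols, lam):
    """Lasso (orthonormal X): shrink toward zero by lam, clipping at zero."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

b = np.linspace(-3.0, 3.0, 7)   # a grid of OLS estimates
print(hard_threshold(b, 1.0))   # small estimates zeroed, survivors keep their OLS value
print(soft_threshold(b, 1.0))   # small estimates zeroed, survivors shrunk by 1 (bias)
```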

Stephan Kolassa
Ville
  • .. and what do the curves look like when the randomness in the data causes you to select one of the many wrong subsets and the associated coefficient estimates are far from zero relative to their standard errors? – jbowman Jun 09 '18 at 13:57
  • @jbowman I don't understand it very clearly, why would the randomness in the data cause me to select the wrong one? If I would use cross validation to select the best subset, I would then have smaller chances to select the wrong subset. – Ville Jun 09 '18 at 14:11
  • You seem to be equating "less bias" with "better". What brings you to place such a high value on unbiasedness? – Matthew Drury Jun 19 '18 at 16:47

2 Answers


In subset selection, the nonzero parameters will only be unbiased if you have chosen a superset of the correct model, i.e., if you have removed only predictors whose true coefficient values are zero. If your selection procedure led you to exclude a predictor with a true nonzero coefficient, all coefficient estimates will be biased. This defeats your argument if you will agree that selection is typically not perfect.

Thus, to make "sure" of an unbiased model estimate, you should err on the side of including more, or even all potentially relevant predictors. That is, you should not select at all.

Why is this a bad idea? Because of the bias-variance tradeoff. Yes, your large model will be unbiased, but it will have a large variance, and the variance will dominate the prediction (or other) error.

Therefore, it is better to accept that parameter estimates will be biased but have lower variance (regularization), rather than hope that our subset selection has only removed true zero parameters so we have an unbiased model with larger variance.
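As a toy illustration of this point (synthetic data; the sample size, number of predictors, and ridge penalty below are illustrative choices, not tuned): with many weak predictors and few observations, the unbiased full OLS fit typically loses to a biased but lower-variance regularized fit on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(4)
n, p = 60, 40                                  # few observations, many predictors
X = rng.normal(size=(2 * n, p))
beta = rng.normal(scale=0.3, size=p)           # many small true effects
y = X @ beta + rng.normal(scale=1.0, size=2 * n)

X_tr, X_te, y_tr, y_te = X[:n], X[n:], y[:n], y[n:]

def mse(model):
    """Held-out mean squared prediction error."""
    return np.mean((y_te - model.predict(X_te)) ** 2)

ols_mse = mse(LinearRegression().fit(X_tr, y_tr))   # unbiased, high variance
ridge_mse = mse(Ridge(alpha=10.0).fit(X_tr, y_tr))  # biased, lower variance
print(ols_mse, ridge_mse)  # the regularized fit should predict better here
```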

Since you write that you assess both approaches using cross-validation, this mitigates some of the concerns above. One issue with Best Subset remains: it constrains some parameters to exactly zero and lets the others float freely, so there is a discontinuity in the estimate that is not there with the lasso, where tweaking $\lambda$ past a point $\lambda_0$ smoothly includes or excludes a predictor $p$. Suppose that cross-validation outputs an "optimal" $\lambda$ close to $\lambda_0$, so that we are essentially unsure whether $p$ should be included or not. In this case, I would argue that it makes more sense to constrain the parameter estimate $\hat{\beta}_p$ via the lasso to a small (absolute) value, rather than either completely exclude it, $\hat{\beta}_p=0$, or let it float freely, $\hat{\beta}_p=\hat{\beta}_p^{\text{OLS}}$, as Best Subset does.

This may be helpful: Why does shrinkage work?

Stephan Kolassa
  • Hmm. I don't think this answers why best subset is worse than lasso (which is the main question here). – amoeba Jun 12 '18 at 08:04
  • @amoeba: would you like to elaborate? – Stephan Kolassa Jun 12 '18 at 20:39
  • Well, I understood the question as asking why lasso is preferred to best subset. Imagine we put both in a cross-validation loop, and then either tune the lasso parameter or find the best subset. The lasso is usually recommended. I understood the question as asking *Why?* (see e.g. the title of the Q) and I am not sure your answer actually answers that. Or did I misunderstand your answer? – amoeba Jun 12 '18 at 21:00
  • @amoeba: I did try to answer that question. The OP argued that Best Subset will give unbiased estimates conditional on selecting a subset that encompasses the true model. My argument is that (a) we can't be sure of that, so we may indeed have bias, so (b) we have an incentive to use rather a larger than a smaller subset, but then (c) the bias-variance tradeoff kicks in, and the larger variance dominates the unbiasedness. Lasso trades an increase in bias against a reduction in variance, which may be better. I believe that this sufficiently answers the question. Am I missing something? – Stephan Kolassa Jun 14 '18 at 17:47
  • (I'll be traveling the next couple of days, so I may be unresponsive.) – Stephan Kolassa Jun 14 '18 at 17:47
  • The procedure suggested by OP is to select the "best subset" on the test data. This should take care of the bias-variance tradeoff; small subsets have high bias and low variance, large subsets have low bias but high variance. But we use the test data to find the sweet spot (exactly like when we are tuning lasso). I don't see how "an incentive to use rather a larger than a smaller subset" is relevant here; the suggestion is to use the subset that will perform best on the validation data. Presumably it will have some optimal size, not too small and not too big. – amoeba Jun 14 '18 at 20:57
  • @amoeba: you have a point. I overlooked the part about the test data. One could start arguing about overfitting on the test data, but otherwise, a holdout or CV type approach is of course better than, say, stepwise approaches to finding the "best" subset. – Stephan Kolassa Jun 15 '18 at 07:22
  • One remaining issue for Best Subset is that it constrains some parameters to be exactly zero and lets the others float freely, so there is a discontinuity in the estimate, which isn't there if we tweak the lasso $\lambda$ beyond a point $\lambda_0$ where a predictor $p$ is included or excluded. I'd argue that if we are essentially unsure whether $p$ should be included or not, because $\lambda\approx\lambda_0$, then it makes more sense to constrain the parameter estimate $\hat{\beta}_p$ via the lasso, rather than let it float freely. – Stephan Kolassa Jun 15 '18 at 07:23
  • Agree that this answer doesn't really answer the question - I've added my take on this below... – Tom Wenseleers Jun 19 '18 at 16:41
  • The point is that "best subset" is essentially equivalent to L0 penalty (as @Tom wrote in his answer), and lasso is L1 penalty. So what the question really asks (admittedly in disguise) is why L1 is preferred to L0. You now put forward some argument in your last comment, but your answer itself as it is currently written does not answer the question at all. I suggest you edit it. – amoeba Jun 19 '18 at 18:46
  • Still added a bit more detail in my answer with some extra references to some relevant work... – Tom Wenseleers Jun 19 '18 at 21:43
  • This is getting interesting. Thanks everyone. I'll put this thread in my to do list and address it next week when I'm back from traveling - right now, I don't have the spare bandwidth I should devote to this. – Stephan Kolassa Jun 19 '18 at 21:46
  • @amoeba: thanks once more. I did edit my answer, but I still don't see that "my answer itself as it is currently written does not answer the question at all". We may need to disagree here. – Stephan Kolassa Jun 29 '18 at 09:22

In principle, if the best subset can be found, it is indeed better than the LASSO in terms of (1) selecting the variables that actually contribute to the fit, (2) not selecting the variables that do not contribute to the fit, (3) prediction accuracy, and (4) producing essentially unbiased estimates for the selected variables. One recent paper that argued for the superior quality of best subset over LASSO is Bertsimas et al. (2016), "Best subset selection via a modern optimization lens". An older one giving a concrete example (on the deconvolution of spike trains) where best subset beat LASSO and ridge is de Rooi & Eilers (2011).

The reason that the LASSO is still preferred in practice is mostly that it is computationally much easier to calculate. Best subset selection, i.e. using an $L_0$ pseudonorm penalty, is essentially a combinatorial problem and is NP-hard, whereas the LASSO solution is easy to compute over a whole regularization path using pathwise coordinate descent. In addition, the LASSO ($L_1$ norm penalized regression) is the tightest convex relaxation of $L_0$ pseudonorm penalized regression / best subset selection (bridge regression, i.e. $L_q$ norm penalized regression with $q$ close to 0, would in principle be closer to best subset selection than the LASSO, but it is no longer a convex optimization problem and so is quite tricky to fit).
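The computational contrast can be seen directly in scikit-learn: one call to `lasso_path` returns the coefficients along an entire grid of penalty values via coordinate descent (a sketch on synthetic data; the problem sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=120)

# One call yields the whole regularization path, largest penalty first.
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(coefs.shape)  # (5, 50): 5 predictors by 50 lambda values

# At the strongest penalty every coefficient is zero; as lambda shrinks,
# predictors enter and their coefficients grow toward the OLS values.
print(np.count_nonzero(coefs[:, 0]), np.count_nonzero(coefs[:, -1]))
```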

To reduce the bias of the LASSO one can use derived multistep approaches, such as the adaptive LASSO (where coefficients are differentially penalized based on a prior estimate from a least squares or ridge regression fit) or the relaxed LASSO (a simple version of which is to do a least squares fit on the variables selected by the LASSO). In comparison to best subset, the LASSO tends to select slightly too many variables. Best subset selection is better in this respect, but harder to fit.
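The simple relaxed-LASSO variant mentioned above (select with the lasso, then refit the survivors by plain OLS to remove the shrinkage bias) can be sketched like this; the data and penalty value are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))
# Illustrative truth: only columns 0 and 2 have nonzero effects.
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=150)

# Step 1: lasso for selection only (alpha chosen for illustration, not tuned).
lasso = Lasso(alpha=0.3).fit(X, y)
support = np.flatnonzero(lasso.coef_)       # indices of the selected variables

# Step 2: unpenalized least-squares refit on the selected variables.
ols = LinearRegression().fit(X[:, support], y)
print(support, np.round(ols.coef_, 2))  # refit coefficients are no longer shrunk
```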

That being said, there are now also efficient computational methods for best subset selection / $L_0$ penalized regression, e.g. the adaptive ridge approach described in "An Adaptive Ridge Procedure for L0 Regularization" by Frommlet & Nuel (2016). Note that under best subset selection, too, you still have to use either cross-validation or some information criterion (adjusted $R^2$, AIC, BIC, mBIC...) to determine which number of predictors gives the best prediction performance / explanatory power for the number of variables in your model, which is essential to avoid overfitting.

The paper "Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso" by Hastie et al. (2017) provides an extensive comparison of best subset, the LASSO, and some LASSO variants like the relaxed LASSO; they claim that the relaxed LASSO produced the highest model prediction accuracy under the widest range of circumstances, i.e. they came to a different conclusion than Bertsimas et al. But the conclusion about which is best depends a lot on what you consider best (e.g. highest prediction accuracy, or being best at picking out relevant variables while not including irrelevant ones; ridge regression, for example, typically selects far too many variables, but its prediction accuracy for cases with highly collinear variables can nevertheless be really good).
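A rough sketch of the adaptive-ridge idea in the Frommlet & Nuel paper: iteratively reweighted ridge regression with weights $1/(\beta_j^2+\delta^2)$ approximates the $L_0$ penalty, since $\beta_j^2/(\beta_j^2+\delta^2)\approx 1$ for clearly nonzero coefficients and $\approx 0$ otherwise. (The penalty strength and $\delta$ below are illustrative choices, not the paper's recommendations.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[0, 3]] = [2.0, -1.5]        # a sparse truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam, delta = 2.0, 1e-5                 # illustrative penalty and smoothing constant
w = np.ones(p)
for _ in range(50):
    # Weighted ridge step: large weights push near-zero coefficients to zero.
    beta = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
    w = 1.0 / (beta ** 2 + delta ** 2)

print(np.round(beta, 3))  # near-zero everywhere except the true support
```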

For a very small problem with 3 variables like the one you describe, though, best subset selection is clearly the preferred option.

Tom Wenseleers
  • What does "better" mean in the phrase "it is better than lasso"? – Matthew Drury Jun 19 '18 at 16:48
  • Why is best subset the same as using L0 penalty? Best subset selects the best subset (with the lowest validation error) among subsets of any size; at least that's what OP suggested in their question. L0 penalty requires the subset to be of size $k$ (that is defined by the regularization parameter $\lambda$); one can search for best $k$ using a validation set, and then it's the best subset of size $k$ across all possible $k$... okay, now I see that it's the same :-) – amoeba Jun 19 '18 at 18:43
  • Edited my answer a bit to give some more detail... – Tom Wenseleers Jun 19 '18 at 21:58
  • I don't think any of the answers are addressing the problem of stability. Like stepwise and all possible subsets regression, `lasso` is notoriously unstable. In other words if you were to bootstrap the entire process you'll find too much arbitrariness in the list of features selected. – Frank Harrell Jun 29 '18 at 11:16
  • Yes the variables selected by LASSO can be unstable, and this is even more so the case for best subset regression - elastic net regression is a little bit better in this respect - that tends to include far too many variables then, but selected in a more stable way, and can give better prediction accuracy under high collinearity. But a lot depends on what is the most important criterion for your application - prediction accuracy, the false positive rate of including irrelevant variables or the false negative rate of not including highly relevant variables... – Tom Wenseleers Jun 29 '18 at 15:14