31

For Lasso regression $$L(\beta)=(X\beta-y)'(X\beta-y)+\lambda\|\beta\|_1,$$ suppose the best solution (minimum testing error for example) selects $k$ features, so that $\hat{\beta}^{lasso}=\left(\hat{\beta}_1^{lasso},\hat{\beta}_2^{lasso},...,\hat{\beta}_k^{lasso},0,...0\right)$.

We know that $\left(\hat{\beta}_1^{lasso},\hat{\beta}_2^{lasso},...,\hat{\beta}_k^{lasso}\right)$ is a biased estimate of $\left(\beta_1,\beta_2,...,\beta_k\right)$, so why do we still take $\hat{\beta}^{lasso}$ as the final solution, instead of the more 'reasonable' $\hat{\beta}^{new}=\left(\hat{\beta}_{1:k}^{new},0,...,0\right)$, where $\hat{\beta}_{1:k}^{new}$ is the least-squares estimate from the partial model $L^{new}(\beta_{1:k})=(X_{1:k}\beta_{1:k}-y)'(X_{1:k}\beta_{1:k}-y)$? ($X_{1:k}$ denotes the columns of $X$ corresponding to the $k$ selected features.)

In brief, why do we use Lasso both for feature selection and for parameter estimation, instead of only for variable selection (leaving the estimation of the coefficients of the selected features to OLS)?

(Also, what does it mean that 'Lasso can select at most $n$ features', where $n$ is the sample size?)
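To make this concrete, here is a minimal sketch of the two-stage procedure I have in mind (Lasso for selection, then a plain OLS refit on the selected columns). The simulated data and the scikit-learn estimators are only illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)

# Simulated data: only the first 5 of 50 features carry signal.
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ beta + rng.normal(size=n)

# Stage 1: lasso with a cross-validated penalty, used only to pick the support.
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# Stage 2 (the 'beta_new' proposal): plain OLS on the selected columns only.
ols_refit = LinearRegression().fit(X[:, selected], y)

print("selected features:      ", selected)
print("lasso coefficients:     ", lasso.coef_[selected])
print("OLS refit coefficients: ", ols_refit.coef_)
```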

asked by yliueagle, edited by amoeba
  • That is a very good question. Have you tried a few simulations to see how different the results would be from standard Lasso if you tried it your way? – Placidia Jan 16 '14 at 14:38
  • Did you understand the purpose of "Shrinkage" in LASSO? – Michael M Jan 16 '14 at 15:02
  • @MichaelMayer The purpose of 'Shrinkage' in Lasso is to reduce the number of features, for both better interpretation and better prediction, I think – yliueagle Jan 16 '14 at 15:06
  • The idea's to shrink the coefficient estimates precisely because you've picked the biggest ones. Least-squares estimates are no longer unbiased when you've done feature selection beforehand. – Scortchi - Reinstate Monica Jan 16 '14 at 15:07
  • You should look at the Adaptive Lasso. I guess you will be pleased to see that one. It uses weights based on OLS estimates to reduce bias. – Scratch Jan 16 '14 at 16:16
  • See the following question for a great answer to "What problem do shrinkage methods solve?": http://stats.stackexchange.com/questions/20295/what-problem-do-shrinkage-methods-solve – D L Dahly Jan 16 '14 at 16:26
  • Here's [an example](http://www-personal.umich.edu/~maxhf/HighDim_ATE_GroupLasso.pdf) of this type of approach. – dimitriy Jan 16 '14 at 21:18
  • @Scortchi: Do you have a reference for this claim? My understanding based on the usual proof of unbiasedness is that this is false - adding noise to the true model does not affect unbiasedness. – JohnA May 12 '15 at 23:34
  • @JohnAustin: Unless effects are either signal or noise (a dichotomy that's not plausible in many applications), *and* your feature-selection technique reliably picks only those that are signal, OLS on the selected features produces parameter estimates biased toward high absolute values, resulting typically in a coefficient of determination biased toward high values. Chatfield (1995), "Model uncertainty, data mining, & statistical inference", *JRSS*, **158**, 3, gives a good review of the issues. ... – Scortchi - Reinstate Monica May 13 '15 at 10:08
  • ... Tibshirani (1996), "Regression shrinkage and selection via the lasso", *JRSS*, **58**, 1, has some examples of cases where shrinkage helps & some where it doesn't. – Scortchi - Reinstate Monica May 13 '15 at 10:08
  • @Scortchi: This argument assumes that the linear model is false, which is certainly a reasonable assumption in practice but is not the usual assumption when discussing unbiasedness of linear models, and in particular not the context of the original question (it seems to me). In fact, it's not even clear what is meant by "unbiased" if the linear model is assumed to be false to begin with. – JohnA May 13 '15 at 16:09
  • @JohnAustin: No it doesn't, only that the reduced model got by feature selection *may* be. – Scortchi - Reinstate Monica May 13 '15 at 16:45
  • @Scortchi: If the true model is linear, then the OLS estimates are unbiased. If we remove a bunch of noise variables after feature selection (i.e. that have zero coefficients in the true model), and then re-run OLS using the reduced feature set, the resulting coefficients will still be unbiased. If we remove a "true" feature, then the estimates may be biased. The Lasso theory is very precise about this: If the original model contains all relevant features to begin with, the Lasso will keep the relevant features, and if we then re-run OLS the estimates will be unbiased. – JohnA May 13 '15 at 18:18
  • To be clear: Not saying @Scortchi is wrong, but this is a bit of a grey area when discussing feature selection, and I think this is an important technical point that should be made very clear. – JohnA May 13 '15 at 18:20
  • @JohnAustin: That's correct up to "If the original model contains all relevant features to begin with, the Lasso will keep the relevant features": Lasso doesn't *know* the relevant features (those with non-zero coefficients in the true model) but has to guess them from the data - the shrinkage towards zero of those coefficients kept in can help to compensate for the guessing. – Scortchi - Reinstate Monica May 14 '15 at 19:31
  • @Scortchi: I am not sure what you mean by "guess", but the Lasso is known to be model selection consistent (i.e. selects all relevant features and eliminates all noise variables): [1] Nicolai Meinshausen and Peter Buhlmann. [High-dimensional graphs and variable selection with the Lasso](http://www.jstor.org/discover/10.2307/25463463?uid=2&uid=4&sid=21106406246991). The Annals of Statistics, 34(3):1436–1462, 2006; [2] Peng Zhao and Bin Yu. [On model selection consistency of Lasso](http://dl.acm.org/citation.cfm?id=1248637). The Journal of Machine Learning Research, 7:2541–2563, 2006. – JohnA May 15 '15 at 00:34
  • @JohnAustin: Thanks for those papers. But note that consistency is an asymptotic property, of little comfort with the finite sample sizes we have to work with in practice. – Scortchi - Reinstate Monica May 15 '15 at 08:08
  • @Scortchi: The literature on support recovery and error bounds in a non-asymptotic setting is quite vast. It is generally preferable to use regularizers besides the Lasso (e.g. SCAD, MCP); however, similar conditions exist for the Lasso in a finite-sample framework: [Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using $\ell_1$-Constrained Quadratic Programming (Lasso)](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4839045&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4839045). – JohnA May 15 '15 at 16:24

3 Answers

30

I don't believe there is anything wrong with using LASSO for variable selection and then using OLS. From "Elements of Statistical Learning" (p. 91):

...the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero and in general they are not consistent [Added Note: This means that, as the sample size grows, the coefficient estimates do not converge to the true values]. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an un-restricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since the variables in the second step have less "competition" from noise variables, cross-validation will tend to pick a smaller value for $\lambda$ [the penalty parameter], and hence their coefficients will be shrunken less than those in the initial estimate.
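As a rough sketch of the relaxed-lasso idea from the quote (the simulated data and scikit-learn's `LassoCV` are my own illustrative choices, not from the book), you can run the cross-validated lasso twice, the second time restricted to the predictors the first fit kept:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.concatenate([[3.0, -2.0, 1.5, 1.0, -0.5], np.zeros(p - 5)])
y = X @ beta + rng.normal(size=n)

# First pass: cross-validated lasso on all predictors (chooses the first penalty).
first = LassoCV(cv=5).fit(X, y)
support = np.flatnonzero(first.coef_)

# Second pass: cross-validated lasso on the selected predictors only (chooses a
# second penalty). With fewer noise variables competing, CV tends to pick a
# smaller penalty, so the surviving coefficients are shrunken less.
second = LassoCV(cv=5).fit(X[:, support], y)

print("first-pass coefficients: ", first.coef_[support])
print("second-pass coefficients:", second.coef_)
```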

Another reasonable approach similar in spirit to the relaxed lasso, would be to use lasso once (or several times in tandem) to identify a group of candidate predictor variables. Then use best subsets regression to select the best predictor variables to consider (also see "Elements of Statistical Learning" for this). For this to work, you would need to refine the group of candidate predictors down to around 35, which won't always be feasible. You can use cross-validation or AIC as a criterion to prevent over-fitting.
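And a sketch of this second approach (again with made-up data; the `aic` helper below is a hypothetical scoring function, and exhaustive search is only feasible once the candidate pool is small):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=n)

# Step 1: lasso narrows the pool of candidate predictors.
candidates = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)

def aic(cols):
    """Gaussian AIC for an OLS fit on the given columns (hypothetical helper)."""
    X_sub = X[:, list(cols)]
    resid = y - LinearRegression().fit(X_sub, y).predict(X_sub)
    rss = float(resid @ resid)
    k = X_sub.shape[1] + 1  # slopes plus intercept
    return n * np.log(rss / n) + 2 * k

# Step 2: exhaustive best-subsets search over the candidates, scored by AIC.
# (Only feasible when the candidate pool is small.)
subsets = (s for size in range(1, len(candidates) + 1)
             for s in combinations(candidates, size))
best = min(subsets, key=aic)
print("candidates from lasso:", candidates)
print("best subset by AIC:   ", best)
```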

answered by ahwillia, edited by CoderGuy123
  • Another part of my question is: why can Lasso select at most $n$ features? If this is the case, I think OLS on the selected features will be at least 'good', since OLS is the 'BLUE' (not strictly BLUE, since after selection it is generally biased). Just consider the extreme situation where Lasso selects exactly the right features: conducting OLS on these features will recover the true model, which I think is better than the Lasso estimate. – yliueagle Jan 23 '14 at 15:22
  • The problem is that this "extreme situation" is very unlikely to occur, and there is no way of knowing if LASSO has selected exactly the right features. If LASSO selects too many features, then I think the full OLS model may perform worse than the LASSO estimates. Similarly, ridge regression can outperform OLS if there are too many features (i.e. OLS is overfit). – ahwillia Jan 23 '14 at 18:25
  • See also https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf, the end of Section 2.2: "[...] the least-squares fit on the subset of [...] predictors tends to expand the lasso estimates away from zero. The nonzero estimates from the lasso tend to be biased toward zero, so the debiasing in the right panel can often improve the prediction error of the model. This two-stage process is also known as the relaxed lasso (Meinshausen 2007)." – amoeba May 30 '17 at 12:20
  • I looked into the Meinshausen paper and it actually recommends fitting two penalty parameters, as described in your original quote from The Elements. +1 – amoeba May 30 '17 at 12:29
  • @AlexWilliams But isn't there a sparsity assumption in the previous paragraph about the correlation between the selected set and what is removed being small? – dimitriy Oct 18 '17 at 17:47
  • In the study I'm currently working on, re-fitting using OLS caused a slight decrease in cross-validated R2, about 3 percentage points (double CV). I also tried the adaptive LASSO, but this actually informed me that using just the LASSO estimates was better, congruent with my results. The point of the adaptive lasso paper is that this fact is contingent on your specific dataset. Mine is n=1890, p=1050 (all binary except for one), continuous outcome. Someone will have to do a very large simulation study to figure out some rough guidelines for which approach is best given a dataset's features. – CoderGuy123 Feb 21 '18 at 12:51
15

If your aim is optimal in-sample performance (i.e. highest R-squared), then just use OLS on every available variable. Dropping variables will decrease R-squared.

If your aim is good out-of-sample performance (which is usually much more important), then your proposed strategy will suffer from two sources of overfitting:

  • Selection of variables based on correlations with the response variable
  • Unpenalized OLS estimation of the coefficients of the selected variables

The purpose of LASSO is to shrink parameter estimates towards zero in order to fight the two sources of overfitting above. In-sample predictions will always be worse than those of OLS, but the hope is (depending on the strength of the penalization) to get more realistic out-of-sample behaviour.
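A small simulation along these lines (the data-generating process, sample sizes and scikit-learn estimators are all illustrative assumptions; which method wins out of sample depends on signal strength and noise) compares in-sample and out-of-sample R-squared for full OLS, the lasso, and OLS refit on the lasso-selected variables:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 150, 60
X = rng.normal(size=(n, p))
beta = np.concatenate([[1.0, -1.0, 0.8, 0.6, -0.5], np.zeros(p - 5)])
y = X @ beta + rng.normal(scale=2.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Lasso with a cross-validated penalty, then OLS restricted to the selected columns.
lasso = LassoCV(cv=5).fit(X_tr, y_tr)
cols = np.flatnonzero(lasso.coef_)

fits = [
    ("OLS, all variables   ", LinearRegression().fit(X_tr, y_tr), slice(None)),
    ("lasso                ", lasso, slice(None)),
    ("OLS on selected only ", LinearRegression().fit(X_tr[:, cols], y_tr), cols),
]
for name, model, idx in fits:
    print(name,
          "in-sample R2:", round(model.score(X_tr[:, idx], y_tr), 3),
          " out-of-sample R2:", round(model.score(X_te[:, idx], y_te), 3))
```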

Regarding $p > n$: this (probably) depends on the implementation of LASSO you are using. A variant, LARS (least angle regression), easily handles $p > n$.

answered by Michael M, edited by Scortchi - Reinstate Monica
  • The "Leekasso" (always pick 10 coefficients) is different than the question's proposal (re-estimate OLS with k predictors picked by LASSO) – Affine Jan 16 '14 at 15:42
  • @affine you are completely right. I removed the reference. – Michael M Jan 16 '14 at 15:48
  • This sounds reasonable, but **the inventors of Lasso argue otherwise** and actually recommend using a two-stage procedure with OLS on the Lasso-identified subset (as suggested by the OP), see @Alex'es answer. – amoeba May 30 '17 at 12:31
  • I like this answer because it mentions the selection bias from the search itself; it sure feels like there should be an additional penalty. LASSO as a mere subset selection mechanism - is that all it is? Then why even print out its coefficients at all? – Ben Ogorek Nov 14 '19 at 15:33
3

Regarding the OP's question of why Lasso can select at most $n$ features:

Consider why OLS can fail when there are more predictors ($p$) than observations ($n$): the matrix $X^{T}X$ in $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$ is of size $[p, p]$ but has rank at most $n < p$, so it is singular and its inverse does not exist.

Lasso shrinks coefficients all the way to zero so that this does not happen: it never selects more than $n$ features, so that $X^{T}X$ restricted to the selected columns remains invertible.
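A quick empirical check (a sketch only, using scikit-learn's coordinate-descent `Lasso` on random data; exact counts depend on the data, the penalty and the solver) shows that with $p \gg n$ the number of nonzero lasso coefficients stays at or below $n$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 20, 200          # far more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Count nonzero lasso coefficients along a grid of penalties.
for alpha in [1.0, 0.1, 0.01, 0.001]:
    coef = Lasso(alpha=alpha, max_iter=100_000).fit(X, y).coef_
    print(f"alpha={alpha:<6}  nonzero coefficients: {np.count_nonzero(coef)}  (n = {n})")
```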

answered by jmp111
  • (-1) I don't think this is true. Can you explain more the connection between $(X^TX)^{-1}$ not existing and the lasso? Specifically, what does $X^TX$ have to do with the lasso? There are proofs of the OP's question (the answers here are revealing, for instance: stats.stackexchange.com/questions/38299/…) but this answer doesn't appear to prove it. (Please let me know if I'm mistaken!) – user795305 May 30 '17 at 12:27