7

Using the singular value decomposition (SVD) $A=UDV^T$ of the design matrix $\underset{m\times n}A$, I noticed from the derivation that ridge regression shrinks the coefficients by the factor $\frac{D^2}{D^2+\lambda}$, where $D$ is the diagonal matrix of singular values of $A$. Moreover, as the penalty term $\lambda$ increases, the amount of shrinkage increases.
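For concreteness, here is a minimal numpy sketch (my own illustration, not part of the original question; the sizes, seed, and penalty are arbitrary) that checks this shrinkage factor numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 5, 2.0                      # arbitrary sizes and penalty
A = rng.standard_normal((m, n))             # stand-in for the design matrix
y = rng.standard_normal(m)

# Ridge solution from the normal equations: (A'A + lam*I)^{-1} A'y
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Same solution via the SVD A = U D V^T
U, d, Vt = np.linalg.svd(A, full_matrices=False)
beta_ols = Vt.T @ ((U.T @ y) / d)                       # V D^{-1} U^T y
beta_svd = Vt.T @ (d / (d**2 + lam) * (U.T @ y))        # V D/(D^2+lam) U^T y

print(np.allclose(beta_ridge, beta_svd))                # True
# In the V-basis, the ridge coordinates are the OLS coordinates
# shrunk by the factor d^2 / (d^2 + lam):
print(np.allclose(Vt @ beta_ridge, d**2 / (d**2 + lam) * (Vt @ beta_ols)))  # True
```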

But, what about LASSO regression? Unlike ridge regression, LASSO regression shrinks some of the coefficients to zero. My question:

  • Is there a way to show, in some mathematical fashion, that LASSO regression shrinks some of the coefficients exactly to zero, analogous to the expression above for ridge regression?
    Using the two-predictor case would make it easy to understand. Could you please provide the mathematical steps?

EDIT

Knight & Fu (2000) show that $\hat{\beta}_{lasso}=0$ if and only if $-\lambda I \leq 2\sum\limits_{i}Y_iX_i \leq \lambda I$.
How does that occur?
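One way to see where such a condition can come from (sketched here under the assumption that the criterion is $\sum_i (Y_i - X_i^T\beta)^2 + \lambda\|\beta\|_1$; this is my reading, not a quote from the paper): $\hat{\beta}=0$ minimizes the criterion exactly when $0$ lies in its subdifferential at $\beta=0$. The gradient of the squared-error term at $0$ is $-2\sum\limits_{i} Y_iX_i$ and the subdifferential of $\lambda\|\beta\|_1$ at $0$ is $\lambda[-1,1]^p$, so $$0 \in -2\sum\limits_{i} Y_iX_i + \lambda[-1,1]^p \iff -\lambda \leq 2\sum\limits_{i} Y_iX_i \leq \lambda \quad \text{(componentwise)}.$$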

References:

  • Knight, K. and Fu, W. (2000). Asymptotics for Lasso-Type Estimators. The Annals of Statistics, 28(5), 1356–1378.
Ereck
  • Are you asking for a rigorous proof? Intuition? Geometry? Examples? All of these things could be considered "mathematical". – Matthew Drury Jul 24 '16 at 05:19
  • If the linked duplicate does not answer your question, please indicate more precisely what you need. – Glen_b Jul 24 '16 at 06:03
  • @ Glen_b, @Matthew Drury, I edited my question – Ereck Jul 24 '16 at 13:51
  • The two predictor case is covered in detail here: http://stats.stackexchange.com/questions/154442/lasso-with-two-predictors/154453#154453 – Matthew Drury Jul 24 '16 at 16:12
  • I read the post @Glen_b posted, and I don't see what it doesn't address about this question. What are you asking that is not addressed there? – Matthew Drury Jul 24 '16 at 18:06
  • Also, generally you should define all your terms, what is $D^2$? – Matthew Drury Jul 24 '16 at 18:07
  • What is "the matrix A"? – gung - Reinstate Monica Jul 24 '16 at 23:15
  • This is clearly not a duplicate of *Why does the Lasso provide Variable Selection?*. The question here is not why it provided variable selection, but if the amount of shrinkage can be concisely noted in terms of lambda & other variables. – gung - Reinstate Monica Jul 24 '16 at 23:17
  • @gung, yes, you are right, I think it is not a duplicate. Since writing my question until now (approximately 20 hours) I have been searching in books and articles, but I did not find anything that convinces me or helps me understand this. – Ereck Jul 24 '16 at 23:57
  • @gung, the singular value decomposition theorem states that any matrix $\underset{m\times n}A$ can be expressed as $$A=UD V^T$$. Applying this theorem to the ridge solution, we find the factor above in the final form, which has the effect of shrinking the coefficients toward zero (compared with the least squares solution). – Ereck Jul 25 '16 at 00:04
  • @gung The variables that are not shrunk to zero are the ones selected for inclusion by the LASSO. Explaining how the LASSO is selecting variables is the same as explaining how some variables are shrunk to 0 and others are not. It's possible I missed something, but the mathematics of shrinking to zero-vs-not and selection-vs-not are the same, because they're the same thing – Glen_b Jul 25 '16 at 00:58
  • @Glen_b, I take the OP's question to be searching for an expression analogous to $D^2/(D^2+\lambda)$, but for LASSO instead of ridge. This expresses the rate of shrinkage relative to $\lambda$. Note that for some values of $\lambda$ in LASSO there will be no selection yet shrinkage will still have occurred relative to OLS. This seems to me to be a different question than the linked thread. – gung - Reinstate Monica Jul 25 '16 at 01:09
  • @gung, @Glen_b, I have just read in the Keith and Wenjiang paper "ASYMPTOTICS FOR LASSO-TYPE ESTIMATORS" that $$\hat{\beta}_{lasso}=0$$ if and only if $$-\lambda I\leq2\sum\limits_{i}Y_iX_i\leq\lambda I$$ How does that occur? – Ereck Jul 25 '16 at 01:10
  • I still don't know what the matrix A is. I am familiar w/ SVD, but from your answer, it only seems that A is some original matrix that could be decomposed via SVD. Is it supposed to be the matrix $\bf X^\top X$, eg? Or is it the variance-covariance matrix of $\boldsymbol\beta$? Something else? – gung - Reinstate Monica Jul 25 '16 at 01:15
  • @gung, it is just any matrix, not necessarily symmetric; SVD is a generalization of the spectral decomposition theorem, which decomposes a symmetric matrix. – Ereck Jul 25 '16 at 01:20
  • That makes no sense. How can $D^2/(D^2+\lambda)$ be the amount of shrinkage where $D$ is the "diagonal matrix of the matrix $A$", & $A$ is "just any matrix"? Is $D^2/(D^2+\lambda)$ supposed to indicate 'just any amount' of shrinkage? – gung - Reinstate Monica Jul 25 '16 at 01:45
  • @gung, using the SVD, the least squares solution is $$\hat{\boldsymbol{\beta}}_O=VD^{-1}U^T\boldsymbol{Y}$$ and the ridge solution is $$\hat{\boldsymbol{\beta}}_R=V\left(D^2+\lambda\right)^{-1}DU^T\boldsymbol{Y}$$ Comparing the two solutions, you can see that the ridge solution is just the least squares solution multiplied by the factor $$D^2/(D^2+\lambda)$$ which shrinks the coefficients toward zero because $\lambda >0$. – Ereck Jul 25 '16 at 01:54
  • @Ereck I believe Lasso doesn't admit closed form solution, so maybe you can't get a factor as in ridge regression. However, if you're willing to assume the orthonormal case, I think you can prove that lasso shrinks the solution compared to OLS (e.g. see [here](http://www.few.vu.nl/~wvanwie/Courses/HighdimensionalDataAnalysis/WNvanWieringen_HDDA_Lecture5_LassoRegression_20162017.pdf) page 7) – Robert Smith Jul 25 '16 at 07:05
  • @Ereck you may want to consider editing the expression from the Keith and Wenjiang paper, mentioned in your comment, into your question; it may be a fruitful line to pursue. – Glen_b Jul 25 '16 at 08:52

2 Answers

5

Firstly, I think it's worth noting that the description of what ridge does assumes that the data matrix is orthonormal.
Secondly, the answer to your question is yes under those circumstances. The details may be found in "The Elements of Statistical Learning", p. 69 (section 3.4.3). The short story is that, in the orthonormal case, the lasso estimate is the soft-thresholded least squares estimate, $\beta \to \operatorname{sign}(\beta)\max(|\beta|-\lambda,\,0)$. Please see the book for the complete discussion, better formatting, and details.
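As a small numerical check of this orthonormal-case formula, here is a sketch of my own (not from the book); the sizes, seed, and penalty are arbitrary, and the $\lambda/n$ rescaling accounts for scikit-learn's $\frac{1}{2n}$ scaling of the squared-error term:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, lam = 40, 4, 1.0                                  # arbitrary sizes and penalty
X, _ = np.linalg.qr(rng.standard_normal((n, p)))        # design matrix with orthonormal columns
y = rng.standard_normal(n)

beta_ols = X.T @ y                                      # OLS estimate when X'X = I
soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# scikit-learn minimizes (1/(2n))*||y - Xb||^2 + alpha*||b||_1,
# so alpha = lam/n corresponds to (1/2)*||y - Xb||^2 + lam*||b||_1.
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
print(np.round(soft, 4))
print(np.round(fit.coef_, 4))                           # should agree up to solver tolerance
```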

meh
  • Many thanks. First, the data matrix is not assumed to be orthonormal in the ridge solution. Second, I have read that book; on that page the lasso estimator formula is given for the orthonormal case, which may not be of much use, especially with real data. – Ereck Jul 25 '16 at 16:41
  • @Ereck, the formula for solving the ridge regression doesn't depend on the data matrix being orthonormal, I agree. But I believe the interpretation of the coefficients as shrunken OLS coefficients does depend on an orthonormal decomposition; that is implicit in the use of the SVD. If $X = UDV^T$ is the SVD of $X$, then it is with respect to the basis given by $V$ that the coefficients are 'shrunk'. Again, this is explained better than I can do on p. 66 of ESL. Clearly, writing the ridge 'betas' in terms of the OLS 'betas' has to depend on the choice of basis. – meh Jul 25 '16 at 19:01
5

The question can be answered when one assumes an orthogonal matrix of predictors. Then the right singular vector matrix $\mathbf V$ equals the identity matrix $\mathbf I$, and the derivation below holds only for this case.

Consider the lasso problem $$ \min_{\mathbf \beta} \frac12 ||\mathbf X \mathbf \beta - \mathbf y||^2_2 + \lambda ||\beta||_1 $$

Use the singular value decomposition \begin{align} \mathbf X &= \mathbf U \, \mathbf D\, \mathbf V^T\\ &= \sum_i \mathbf u_i \, d_i \mathbf v^T_i \end{align} and also expand $\beta$ and $\mathbf y$ as \begin{align} \beta &= \sum_i \beta_i \mathbf v_i\,\\ \mathbf y &= \sum_i y_i \mathbf u_i \end{align} Inserting these expansions into the lasso functional (any component of $\mathbf y$ orthogonal to the columns of $\mathbf U$ only contributes a constant and can be dropped) gives \begin{align} \min_{\beta_i} \left(\sum_i \frac12 |\beta_i d_i - y_i|^2 + \lambda |\beta_i|\right) \end{align}

In the SVD basis, the Lasso minimization problem thus decomposes into separate problems for each component.

For a vanishing singular value, $d_i=0$, the solution is $\beta_i=0$, so let's consider only the case $d_i >0$. For this, one can rewrite the previous equation as

$$\min_{\beta_i} \frac12 \left|\beta_i - \frac{y_i}{d_i}\right|^2 + \frac{\lambda}{d_i^2} \left|\beta_i\right|$$

As stated in the other answer, and also derived here, the solution of this problem is $$ \beta^{Lasso}_i \ = \ \begin{cases} 0 &,\ |d_i y_i| \le \lambda \\ \frac{y_i}{d_i} - \operatorname{sign}(y_i)\,\frac{\lambda}{d_i^2} & ,\ |d_i y_i| > \lambda \end{cases} $$
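As a quick sanity check of this per-component formula, here is a small sketch of my own (arbitrary values; brute-force minimization on a grid compared against the closed form):

```python
import numpy as np

def lasso_component(d, y, lam):
    # Closed-form minimizer of 0.5*(b*d - y)**2 + lam*abs(b), assuming d > 0
    if abs(d * y) <= lam:
        return 0.0
    return y / d - np.sign(y) * lam / d**2

rng = np.random.default_rng(2)
for _ in range(5):
    d, y, lam = rng.uniform(0.5, 2.0), rng.normal(), rng.uniform(0.1, 1.5)
    grid = np.linspace(-10.0, 10.0, 2_000_001)          # dense 1-D grid
    obj = 0.5 * (grid * d - y) ** 2 + lam * np.abs(grid)
    print(round(grid[np.argmin(obj)], 4), round(lasso_component(d, y, lam), 4))  # should agree closely
```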

Compare this to the corresponding solution of the ordinary least squares problem and ridge regression (with regularization parameter $\alpha$): \begin{align} \beta_i^{OLS} &= \frac{y_i}{d_i}\\ \beta_i^{ridge} &= \frac{y_i}{\tilde d_i}, \quad \text{where}\ \ \tilde d_i = d_i \cdot \frac{d_i^2+\alpha}{d_i^2} \end{align} As is well known, ridge regression can be obtained by scaling a given singular value $d_i$ with a factor that depends on $d_i$ and $\alpha$.

On the other hand, the Lasso solution is more complex and also depends on the target $y_i$. Somewhat artificially, one can still write it in terms of adjusted singular values as $$ \beta_i^{Lasso} = \frac{y_i}{\bar d_i}, \quad \text{where} \ \ \bar d_i = \frac{d_i}{\left(1 - \frac{\lambda}{|d_i y_i|}\right)\theta\big(|d_i y_i|-\lambda\big)} $$ where $\theta$ is the Heaviside step function, which is zero for $|d_i y_i|<\lambda$ and one otherwise.


EDIT: Please note that for a general predictor matrix, the above statement is wrong, for two reasons. First, and informally: if it were correct, it would be the standard approach for an easy Lasso solution. Second, the formal reason it doesn't work is that the L1-norm is not invariant under orthogonal transformations. So when one expands $\beta$ in two different orthonormal bases, $\beta = \sum_i \beta_i \mathbf e_i = \sum_i b_i \mathbf v_i$, in general $\sum_i |\beta_i| \neq \sum_i |b_i|$.

So what the derivation above solves is not the Lasso problem, but rather the more exotic problem $$ \min_{\mathbf \beta} \frac12 ||\mathbf X \mathbf \beta - \mathbf y||^2_2 + \lambda ||\mathbf V^T \beta||_1 $$
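As a tiny illustration of the non-invariance of the L1-norm mentioned in the EDIT above (my own sketch; dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))          # a random orthogonal matrix
b = rng.standard_normal(5)

print(np.linalg.norm(b, 1), np.linalg.norm(V.T @ b, 1))   # L1 norms: generally different
print(np.linalg.norm(b, 2), np.linalg.norm(V.T @ b, 2))   # L2 norms: equal up to rounding
```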

davidhigh
  • I'm afraid it doesn't even solve the generalized problem, because in the generalized problem there's no relationship between $X$ and $V.$ – whuber Aug 06 '18 at 18:27
  • @whuber: Thanks for your comment. I thought one can insert $\mathbf V \mathbf V^T = \mathbf I$ between $\mathbf X$ and $\mathbf \beta$, and then optimize with respect to $\mathbf V^T \beta$. Not correct? – davidhigh Aug 06 '18 at 19:51
  • It's correct, but it's not the "generalized problem," because $V$ is a function of $X.$ In the generalized problem, $X$ and $V$ are independently given. – whuber Aug 06 '18 at 19:58