So, let's simplify things and say that your $p$ predictor variables are orthonormal. This means that your $m \times p$ sample matrix $X$ has the property that $X^{T}\cdot X = I$. Now, let's use this assumption to expand the LASSO objective function:
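For concreteness, here is a minimal numpy sketch of one way to get such a design (the QR-based construction, the dimensions, and the seed are just illustrative assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 50, 5  # m samples, p predictors, with m >= p

# The Q factor of a random matrix has orthonormal columns.
X, _ = np.linalg.qr(rng.normal(size=(m, p)))

# With orthonormal columns, X^T X is (numerically) the p x p identity.
print(np.allclose(X.T @ X, np.eye(p)))  # True
```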
$||y - X\beta||^{2}_{2} + 2\lambda ||\beta||_{1} \\
= y^{T}y + \beta^{T}X^{T}X\beta - 2y^{T}X\beta + 2\lambda||\beta||_{1} \\
= y^{T}y + \beta^{T}\beta - 2y^{T}X\beta + 2\lambda||\beta||_{1}$
Where I have modified the objective by replacing $\lambda$ with $2\lambda$ for reasons of algebraic shenanigans that will become clear soon. Now the $l_{1}$ norm is not differentiable, so you can't take the gradient. It is convex, however, so we may use the subgradient. The relevant point here is that the subgradient coincides with the regular gradient everywhere except at the point where the $l_{1}$ norm is not differentiable; at that point, it is any vector which produces a 'tangent' plane lying below the function. In the one-dimensional case of the absolute value function $|x|$, this means that at $x = 0$ the subderivative is any slope in the interval $[-1, 1]$. So let's focus on a single one of the $p$ coordinates of $\beta$. The derivative with respect to $\beta_{j}$ for some $1 \leq j \leq p$ is just the $j^{th}$ component of the subgradient. This is
$2\beta_{j} - 2y^{T}x_{j} +2\lambda\cdot\partial|\beta_{j}|$
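As a quick numerical sanity check of this algebra (a sketch only, reusing the QR construction above with arbitrary choices of $y$, $\beta$, and $\lambda$), the expanded objective matches the original, and a finite-difference estimate matches the smooth part of this derivative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 50, 5
X, _ = np.linalg.qr(rng.normal(size=(m, p)))  # orthonormal columns
y = rng.normal(size=m)
beta = rng.normal(size=p)
lam = 0.3

# The original objective and its expanded form agree when X^T X = I.
original = np.sum((y - X @ beta) ** 2) + 2 * lam * np.sum(np.abs(beta))
expanded = y @ y + beta @ beta - 2 * y @ X @ beta + 2 * lam * np.sum(np.abs(beta))
print(np.isclose(original, expanded))  # True

# Finite-difference check of the smooth part's derivative: 2*beta_j - 2*y^T x_j.
j, eps = 2, 1e-6
smooth = lambda b: np.sum((y - X @ b) ** 2)
e_j = np.eye(p)[j]
fd = (smooth(beta + eps * e_j) - smooth(beta - eps * e_j)) / (2 * eps)
print(np.isclose(fd, 2 * beta[j] - 2 * y @ X[:, j]))  # True
```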
Now the subdifferential $\partial|\beta_{j}|$ is $1$ when $\beta_{j} > 0$, $-1$ when $\beta_{j} < 0$, and any value in the interval $[-1, 1]$ when $\beta_{j} = 0$. The last case is what interests us. Since we are trying to find a value that minimizes the objective function, in all three cases we want the subdifferential to contain $0$ (note that the subdifferential is a set). So now we have two constraints:
$0 \in 2\beta_{j} - 2y^{T}x_{j} + 2\lambda\cdot\partial|\beta_{j}| \quad \text{(minimization)} \\
\partial|\beta_{j}| \in [-1, 1] \quad (\beta_{j} = 0)$
Since we have set $\beta_{j} = 0$, the first term vanishes; substituting the interval $[-2\lambda, 2\lambda]$ for $2\lambda\cdot\partial|\beta_{j}|$ in the top condition then gives
$0 \in -2y^{T}x_{j} + [-2\lambda, 2\lambda]$
Where I'm using the interval notation to denote all possible values of the subdifferential term. Since $0$ has to lie between the endpoints of the interval, we can deduce two inequalities:
$-2y^{T}x_{j} - 2\lambda \leq 0 \\
-2y^{T}x_{j} + 2\lambda \geq 0$
Which combined tell us that whenever $\lambda \geq |y^{T}x_{j}|$, $0$ is in the subdifferential and we have satisfied the minimization criterion. Now you can do this for each coordinate $j$ and take the maximum in order to find the $\lambda$ you seek. In fact, the $p$ values $|y^{T}x_{j}|$ form a sequence of thresholds telling you exactly when the $j^{th}$ coefficient is set to $0$. The question then is: what is the significance of $|y^{T}x_{j}|$?
This is just the absolute value of the dot product of the output vector $y$ with the $j^{th}$ predictor variable $x_{j}$, which (for centered data) is proportional to their covariance. So when your regularization penalty $\lambda$ exceeds this quantity for every predictor, the penalty is so great that the regularization drops all terms from the model.
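The same subgradient conditions, still under the orthonormal assumption, also yield the familiar soft-thresholding form of the solution, $\hat{\beta}_{j} = \mathrm{sign}(y^{T}x_{j})\max(|y^{T}x_{j}| - \lambda, 0)$, which makes the sequence of thresholds easy to see numerically. The snippet below is a sketch under that assumption, not a general LASSO solver, and the helper name `lasso_orthonormal` is just my own label:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 50, 5
X, _ = np.linalg.qr(rng.normal(size=(m, p)))  # orthonormal columns
y = rng.normal(size=m)

# Per-coordinate thresholds |y^T x_j|: coefficient j is dropped once lambda reaches this value.
thresholds = np.abs(X.T @ y)
print(np.sort(thresholds))

def lasso_orthonormal(X, y, lam):
    """Closed-form LASSO solution (soft-thresholding) when X has orthonormal columns."""
    z = X.T @ y
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

lam_max = thresholds.max()
print(lasso_orthonormal(X, y, lam_max))        # every coefficient is exactly zero
print(lasso_orthonormal(X, y, 0.9 * lam_max))  # the strongest predictor survives
```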
Now, several simplifying assumptions were made to keep the explanation manageable: most sets of predictors are not orthonormal, and the scaling of the predictors also plays a role here. Still, this should give you a general sense of how the regularization interacts with the predictors.
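If you want to see the all-zero behaviour with an off-the-shelf solver rather than the closed form above, here is a small check using scikit-learn (illustrative only, and still using the orthonormal design for consistency with the derivation). Note that sklearn's `Lasso` minimizes $\frac{1}{2m}||y - X\beta||^{2}_{2} + \alpha||\beta||_{1}$, so the threshold above translates to $\alpha_{\max} = \max_{j}|y^{T}x_{j}|/m$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
m, p = 50, 5
X, _ = np.linalg.qr(rng.normal(size=(m, p)))  # orthonormal columns, as above
y = rng.normal(size=m)

# sklearn scales the squared-error term by 1/(2m), so divide the threshold by m.
alpha_max = np.max(np.abs(X.T @ y)) / m

coef_at = Lasso(alpha=alpha_max, fit_intercept=False).fit(X, y).coef_
coef_below = Lasso(alpha=0.5 * alpha_max, fit_intercept=False).fit(X, y).coef_

print(np.allclose(coef_at, 0.0))   # True: the penalty drops every coefficient
print(np.any(coef_below != 0.0))   # True: a smaller penalty keeps some coefficients
```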
The above description was adapted from course notes in high-dimensional statistics at Rutgers. You can find a similar set of course notes from Yale here: http://statsmaths.github.io/stat612/. Lectures 17 and 19 are most relevant to LASSO methods.