
This is my understanding of glmnet:

if OLS is minimizing RSS, where

$ RSS = \sum_i (y_i - x_i^T \beta)^2 $

I believe glmnet is minimizing:

$ RSS + \sum_j (\alpha |\beta_j| + (1-\alpha) \beta_j^2) $ where $\alpha=\lambda_1/(\lambda_1+\lambda_2) $

$\lambda_1$ and $\lambda_2$ come from lasso and ridge regression, but I'm confused: is $\lambda_1 = \lambda_2$, so that cv.glmnet in the glmnet package of R is solving for a single variable $\lambda$ (along the whole path)? But then $\alpha = 0.5$ always.

If $\lambda_1 = \lambda_2$, is the glmnet objective equivalent to $RSS + \lambda |\beta| + \lambda \beta^2$?

I've read through Hastie et al. (2009), Elements of Statistical Learning, and Zou and Hastie (2005), so now I'm trying to get some clarification on the lambdas and alpha. Thanks.

EDIT:

I found this to be a useful formulation in Friedman et al. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent.

$$ \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2 + \lambda P_\alpha (\beta) $$ where $$ P_\alpha (\beta) = \sum_{j=1}^{p} \left( \tfrac{1}{2} (1-\alpha) \beta_j^2 + \alpha |\beta_j| \right) $$ I thought it provided some intuition into how lambda and alpha exist together.
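To make the mapping concrete for myself, here is a minimal R sketch (toy data, just my own example, not from the paper) of how $\alpha$ and $\lambda$ enter a glmnet call:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)  # toy predictors
y <- rnorm(100)                        # toy response

# alpha fixes the lasso/ridge mix (alpha = 1 is pure lasso, alpha = 0 pure ridge);
# lambda scales the overall strength of P_alpha(beta).
fit <- glmnet(x, y, alpha = 0.5)  # glmnet computes a whole lambda path
coef(fit, s = 0.1)                # coefficients at one particular lambda
```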

Dominik
  • Relevant: http://stats.stackexchange.com/questions/67736/equivalence-between-elastic-net-formulations?rq=1 – Sycorax Oct 05 '16 at 07:04
  • relevant and helpful, but I still don't understand how a single $\lambda$ comes into the picture when the penalty is formulated with $\alpha$ – Dominik Oct 05 '16 at 16:15
  • You haven't correctly given the penalty: the entire term must be multiplied by a second independent parameter, called "$\lambda$" in the documentation. It functions like the $\lambda$ in your edit (and is directly proportional to it). *This lambda has nothing to do with your $\lambda_1$ and $\lambda_2$.* `cv.glmnet` helps you find $\lambda$, but you have to specify $\alpha$. – whuber May 26 '17 at 17:21

2 Answers


$\alpha=\frac{\lambda_1}{\lambda_1+\lambda_2}$ and $1-\alpha=\frac{\lambda_2}{\lambda_1+\lambda_2}$. And because $\lambda_i\ge0,$ it should be clear that $\alpha\in[0,1].$ So in glmnet, $\lambda=\lambda_1+\lambda_2$, and each penalty has a coefficient that is either $\alpha(\lambda_1+\lambda_2)$ or $(1-\alpha)(\lambda_1+\lambda_2)$.

But treating $\alpha$ independently of $\lambda_1, \lambda_2$ is convenient as a conceptual model because it controls how much of a ridge penalty and how much of a lasso penalty is applied, with either extreme arising as a special case. And you can make a model "more lasso" or "more ridge" by adjusting $\alpha$ without having to worry about how to adjust $\lambda_i$ relative to the size of $\lambda_j, j\neq i$. That is, treated separately, $\alpha$ controls where on the continuum from ridge to lasso the elastic net sits, while $\lambda$ controls the overall magnitude of the penalty. The two can be thought of as distinct model hyper-parameters. The two-lambda parameterization, by contrast, ties the penalties together.

And if both $\lambda_1$ and $\lambda_2$ are 0, that should correspond to no penalty, but the fraction $\frac{\lambda_1}{\lambda_1+\lambda_2}=\frac{0}{0}$ is unsightly and indeterminate.
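As a quick numeric illustration of the reparameterization (toy numbers, nothing glmnet-specific):

```r
lambda1 <- 1  # lasso penalty weight
lambda2 <- 3  # ridge penalty weight

lambda <- lambda1 + lambda2              # overall penalty strength: 4
alpha  <- lambda1 / (lambda1 + lambda2)  # mixing parameter: 0.25

# Recovering the original weights from (lambda, alpha):
alpha * lambda        # = lambda1 = 1, the coefficient on sum |beta_j|
(1 - alpha) * lambda  # = lambda2 = 3, the coefficient on sum beta_j^2
```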

Sycorax
  • Thanks. Could you elaborate your 3rd sentence? how did you get to $\lambda = \lambda_1 + \lambda_2$ ? – Dominik Oct 05 '16 at 18:01
  • That's just the definition/convention that glmnet uses. – Sycorax Oct 05 '16 at 18:52
  • ok. and when you said "α controls the range of elastic net compositions on a continuum of ridge to lasso, while λ controls the overall magnitude of the penalty.", doesn't $\alpha$ scale $\lambda$ therefore also controlling the magnitude of the penalty? – Dominik Oct 05 '16 at 19:17
  • The *total* magnitude $\lambda$ isn't changed by $\alpha$. $\alpha$ just changes how much penalty is applied to lasso and how much penalty is applied to ridge. Because it's a convex combination, total penalty remains constant even as you change $\alpha$. If $\alpha$ is 0 or 1, you're doing either lasso or ridge regression because the penalty to one is 0. – Sycorax Oct 05 '16 at 19:25
  • going to have to read up on how L1 vs L2 differ but thanks for the explanations. – Dominik Oct 06 '16 at 01:17
  • This is discussed in the archives but the biggest difference is that lasso will often zero coefficients out, effectively eliminating them, while ridge will just decrease coefficients in absolute value. – Sycorax Oct 06 '16 at 01:25
  • Yeah I've read that, but I meant mathematically. why can $\lambda \beta$ force coefs to zero but $\lambda \beta^2$ not? – Dominik Oct 06 '16 at 13:56
  • That's addressed in this thread: http://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection – Sycorax Oct 06 '16 at 13:58
  • edited my question to include the formulation I found in Friedman et al. 2010. I was still struggling for a layman's explanation of why there are only 2 parameters instead of 3. I think this formulation elucidates your point. What's the difference between $\beta$ and $\beta_j$ in this case? – Dominik Oct 06 '16 at 20:32
  • It's not clear to me where you'd like additional explanation. Since this thread has an accepted answer, it's probably best to ask a new question where you lay out clearly what you know and what you'd like to know. – Sycorax Oct 06 '16 at 21:05
  • @Dominik In your first equation, \beta is a vector. In the penalty equations, you're summing up elements of \beta. So each \beta_j is an element of \beta indexed by j. Is that your question? – Sycorax Oct 06 '16 at 21:23
  • Basically. I think I was just confused that the regularization was simply a summation from the looks of it – Dominik Oct 07 '16 at 04:43

Just to add: From the help file of glmnet, we read:

Note that cv.glmnet does NOT search for values for alpha. A specific value should be supplied, else alpha=1 is assumed by default. If users would like to cross-validate alpha as well, they should call cv.glmnet with a pre-computed vector foldid, and then use this same fold vector in separate calls to cv.glmnet with different values of alpha.

This shows that cv.glmnet doesn't cross-validate over $\alpha$, so the cross-validation is just one-dimensional, as I think you suspect.
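A minimal sketch of what the help file suggests, with toy data and an illustrative grid of $\alpha$ values (both are just my own choices for the example):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100)

# Fix the fold assignment once so every alpha is judged on the same splits
foldid <- sample(rep(1:10, length.out = nrow(x)))

alphas  <- c(0, 0.25, 0.5, 0.75, 1)
cv_fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a, foldid = foldid))

# Smallest cross-validated error achieved along each lambda path, one per alpha
sapply(cv_fits, function(fit) min(fit$cvm))
```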

user795305