
This is my understanding of glmnet:

if OLS is minimizing RSS, where

$ RSS = \sum_i (y_i - x_i^T \beta)^2 $

I believe glmnet is minimizing:

$ RSS + \sum_j (\alpha |\beta_j| + (1-\alpha) \beta_j^2) $ where $\alpha=\lambda_1/(\lambda_1+\lambda_2) $

$\lambda_1$ and $\lambda_2$ come from lasso and ridge regression, but I'm confused: is $\lambda_1 = \lambda_2$, so that cv.glmnet in the glmnet package of R is solving for a single variable $\lambda$ (along the whole path)? But then $\alpha = 0.5$ always.

If $\lambda_1 = \lambda_2$, is the glmnet objective equivalent to $RSS + \lambda |\beta| + \lambda \beta^2$?

I've read through Hastie et al. (2009), Elements of Statistical Learning, and Zou and Hastie (2005), so now I'm trying to get some clarification on the lambdas and alpha. Thanks.

EDIT:

I found this to be a useful formulation in Friedman et al. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent.

$$ \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2 + \lambda P_\alpha (\beta) $$ where $$ P_\alpha (\beta) = \sum_{j=1}^{p} \left( \tfrac{1}{2} (1-\alpha) \beta_j^2 + \alpha |\beta_j| \right) $$ I thought it provided some intuition into how lambda and alpha exist together.
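To make the mapping concrete for myself, here is a minimal R sketch (toy data, just my own example, not from the paper) of how $\alpha$ and $\lambda$ enter a glmnet call:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)  # toy predictors
y <- rnorm(100)                        # toy response

# alpha fixes the lasso/ridge mix (alpha = 1 is pure lasso, alpha = 0 pure ridge);
# lambda scales the overall strength of P_alpha(beta).
fit <- glmnet(x, y, alpha = 0.5)  # glmnet computes a whole lambda path
coef(fit, s = 0.1)                # coefficients at one particular lambda
```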

Dominik
  • Relevant: http://stats.stackexchange.com/questions/67736/equivalence-between-elastic-net-formulations?rq=1 – Sycorax Oct 05 '16 at 07:04
  • relevant and helpful, but I still don't understand how a single $\lambda$ comes into the picture when the penalty is formulated with $\alpha$ – Dominik Oct 05 '16 at 16:15
  • You haven't correctly given the penalty: the entire term must be multiplied by a second independent parameter, called "$\lambda$" in the documentation. It functions like the $\lambda$ in your edit (and is directly proportional to it). *This lambda has nothing to do with your $\lambda_1$ and $\lambda_2$.* `cv.glmnet` helps you find $\lambda$, but you have to specify $\alpha$. – whuber May 26 '17 at 17:21

2 Answers


$\alpha=\frac{\lambda_1}{\lambda_1+\lambda_2}$ and $1-\alpha=\frac{\lambda_2}{\lambda_1+\lambda_2}$. And because $\lambda_i\ge0,$ it should be clear that $\alpha\in[0,1].$ So in glmnet, $\lambda=\lambda_1+\lambda_2$, and each penalty has a coefficient that is either $\alpha(\lambda_1+\lambda_2)$ or $(1-\alpha)(\lambda_1+\lambda_2)$.

But treating $\alpha$ independently of $\lambda_1, \lambda_2$ is convenient as a conceptual model because it controls how much of a ridge penalty and how much of a lasso penalty is applied, with either extreme arising as a special case. And you can make a model "more lasso" or "more ridge" by adjusting $\alpha$ without having to worry about how to adjust $\lambda_i$ relative to the size of $\lambda_j, j\neq i$. That is, treated separately, $\alpha$ controls where on the continuum from ridge to lasso the elastic net sits, while $\lambda$ controls the overall magnitude of the penalty. The two can be thought of as distinct model hyper-parameters. The two-lambda parameterization, by contrast, ties the penalties together.

And if both $\lambda_1$ and $\lambda_2$ are 0, that should correspond to no penalty, but the fraction $\frac{\lambda_1}{\lambda_1+\lambda_2}=\frac{0}{0}$ is unsightly and indeterminate.
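As a quick numeric illustration of the reparameterization (toy numbers, nothing glmnet-specific):

```r
lambda1 <- 1  # lasso penalty weight
lambda2 <- 3  # ridge penalty weight

lambda <- lambda1 + lambda2              # overall penalty strength: 4
alpha  <- lambda1 / (lambda1 + lambda2)  # mixing parameter: 0.25

# Recovering the original weights from (lambda, alpha):
alpha * lambda        # = lambda1 = 1, the coefficient on sum |beta_j|
(1 - alpha) * lambda  # = lambda2 = 3, the coefficient on sum beta_j^2
```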

Sycorax
  • Thanks. Could you elaborate your 3rd sentence? how did you get to $\lambda = \lambda_1 + \lambda_2$ ? – Dominik Oct 05 '16 at 18:01
  • That's just the definition/convention that glmnet uses. – Sycorax Oct 05 '16 at 18:52
  • ok. and when you said "α controls the range of elastic net compositions on a continuum of ridge to lasso, while λ controls the overall magnitude of the penalty.", doesn't $\alpha$ scale $\lambda$ therefore also controlling the magnitude of the penalty? – Dominik Oct 05 '16 at 19:17
  • The *total* magnitude $\lambda$ isn't changed by $\alpha$. $\alpha$ just changes how much penalty is applied to lasso and how much penalty is applied to ridge. Because it's a convex combination, total penalty remains constant even as you change $\alpha$. If $\alpha$ is 0 or 1, you're doing either lasso or ridge regression because the penalty to one is 0. – Sycorax Oct 05 '16 at 19:25
  • going to have to read up on how L1 vs L2 differ but thanks for the explanations. – Dominik Oct 06 '16 at 01:17
  • This is discussed in the archives but the biggest difference is that lasso will often zero coefficients out, effectively eliminating them, while ridge will just decrease coefficients in absolute value. – Sycorax Oct 06 '16 at 01:25
  • Yeah I've read that, but I meant mathematically. why can $\lambda \beta$ force coefs to zero but $\lambda \beta^2$ not? – Dominik Oct 06 '16 at 13:56
  • That's addressed in this thread: http://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection – Sycorax Oct 06 '16 at 13:58
  • edited my question to include the formulation I found in Friedman et al. 2010. I was still struggling for a layman's explanation of why there are only 2 parameters instead of 3. I think this formulation elucidates your point. What's the difference between $\beta$ and $\beta_j$ in this case? – Dominik Oct 06 '16 at 20:32
  • It's not clear to me where you'd like additional explanation. Since this thread has an accepted answer, it's probably best to ask a new question where you lay out clearly what you know and what you'd like to know. – Sycorax Oct 06 '16 at 21:05
  • @Dominik In your first equation, \beta is a vector. In the penalty equations, you're summing up elements of \beta. So each \beta_j is an element of \beta indexed by j. Is that your question? – Sycorax Oct 06 '16 at 21:23
  • Basically. I think I was just confused that the regularization was simply a summation from the looks of it – Dominik Oct 07 '16 at 04:43

Just to add: From the help file of glmnet, we read:

Note that cv.glmnet does NOT search for values for alpha. A specific value should be supplied, else alpha=1 is assumed by default. If users would like to cross-validate alpha as well, they should call cv.glmnet with a pre-computed vector foldid, and then use this same fold vector in separate calls to cv.glmnet with different values of alpha.

This shows that cv.glmnet doesn't cross-validate over $\alpha$, so the cross-validation is just one-dimensional, as I think you suspect.
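A minimal sketch of what the help file suggests, with toy data and an illustrative grid of $\alpha$ values (both are just my own choices for the example):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100)

# Fix the fold assignment once so every alpha is judged on the same splits
foldid <- sample(rep(1:10, length.out = nrow(x)))

alphas  <- c(0, 0.25, 0.5, 0.75, 1)
cv_fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a, foldid = foldid))

# Smallest cross-validated error achieved along each lambda path, one per alpha
sapply(cv_fits, function(fit) min(fit$cvm))
```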

user795305