In the context of LASSO logistic regression, I understand that $\lambda$ is the tuning parameter obtained by cross-validation. There is also the constraint parameter $s$ ($\sum_{i=1}^p |\hat\beta_i| \le s$).

  1. How is the constraint parameter $s$ chosen?

  2. How are $\lambda$, $s$, and the shrinking of the $\hat\beta_i$ to zero related to each other?

  3. What is the decision process, i.e., how is it that some $\hat\beta_i$ are shrunk to zero and some are not?

  • You don't have to choose $\lambda$ by cross-validation; you can specify it a priori. Cross-validation is just a common strategy when you don't already know the ideal $\lambda$ for your case. Note that there is a correspondence between $\lambda$ and $s$, so choosing a $\lambda$ implies choosing an $s$ and vice versa. – gung - Reinstate Monica Nov 08 '14 at 17:56
  • @gung thanks, so $\lambda$ and $s$ are the same? Then how are some parameter estimates $\beta_i$ shrunk to zero while others are not? – Tyrone Williams Nov 08 '14 at 17:59
  • $\lambda$ & $s$ are *not* the same; there is simply a correspondence between them. Someone can give you a full answer explaining the LASSO. – gung - Reinstate Monica Nov 08 '14 at 18:03
  • @gung okay cool, so what is the constraint parameter $s$, and how do you choose $s$? – Tyrone Williams Nov 08 '14 at 18:08

1 Answer

Consider the original formulation of the lasso problem in a linear regression setting: $$ \min_\beta \|y - X \beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq s $$ To do the optimization, we use a Lagrange multiplier and reformulate the problem as $$ \min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 $$ From the two formulations you can see the connection between $\lambda$ and $s$.

(1) As $s$ goes to infinity, the constraint becomes inactive and the problem reduces to ordinary least squares; accordingly, $\lambda$ goes to 0.

(2) As $s$ goes to 0, all the $\beta$'s shrink to 0, as is easily seen from the first formulation; accordingly, $\lambda$ goes to infinity.
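You can also see this correspondence numerically: fit the lasso over a grid of $\lambda$ values and record $\|\hat\beta\|_1$, which is exactly the implied $s$. A minimal sketch (not part of the original answer), assuming scikit-learn and synthetic data; `Lasso`'s `alpha` argument plays the role of $\lambda$:

```python
# Sketch (assumes scikit-learn): trace the lambda <-> s correspondence.
# Lasso's `alpha` plays the role of lambda; s is the implied L1 norm.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    beta = Lasso(alpha=lam).fit(X, y).coef_
    s = np.abs(beta).sum()           # implied constraint level s = ||beta||_1
    n_zero = int(np.sum(beta == 0))  # coefficients shrunk exactly to zero
    print(f"lambda={lam:7.2f}  s={s:8.2f}  zeros={n_zero}")
```

As $\lambda$ grows, the printed $s$ shrinks and more coefficients become exactly zero, matching points (1) and (2).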

In other words, $\lambda$ and $s$ have an inverse relationship. Now for your questions.

  1. How is the constraint parameter $s$ chosen?

In practice, you only need to choose $\lambda$, typically by cross-validation, as others have pointed out; you never have to pick $s$ directly. (A sketch of the cross-validation step follows below.)
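Since the question is about logistic regression, here is a minimal sketch of that cross-validation step, assuming scikit-learn (not part of the original answer). Note that sklearn parameterizes the L1 penalty as $C = 1/\lambda$:

```python
# Sketch (assumes scikit-learn): choose lambda by cross-validation for
# L1-penalized logistic regression. sklearn uses C = 1/lambda, so a
# small C corresponds to a large lambda (stronger shrinkage).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

model = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 20),  # grid of C = 1/lambda values to search
    penalty="l1",
    solver="liblinear",         # liblinear supports the L1 penalty
    cv=5,
).fit(X, y)

print("selected lambda:", 1.0 / model.C_[0])
print("implied s = ||beta||_1:", np.abs(model.coef_).sum())
```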

  2. How are $\lambda$, $s$, and the shrinking of $\hat{\beta}$ to zero related to each other?

This is answered by points (1) and (2) above: as $\lambda$ increases (equivalently, as $s$ decreases), the coefficients are shrunk more strongly toward zero.

  3. What is the decision process, i.e., how are some $\hat{\beta}$'s shrunk to zero and some are not?

This has to do with the L1 constraint. I highly recommend the geometric representation of this problem on p. 71 of The Elements of Statistical Learning. The L1 constraint makes the feasible region a diamond (in terms of two $\beta$'s, as in the figure below). The contours of the residual sum of squares tend to "hit" the corners of this region, and at a corner some $\beta$'s are exactly 0. That is where the sparsity comes from.

[Figure: contours of the residual sum of squares meeting the diamond-shaped L1 constraint region at a corner, where one coefficient is exactly zero.]
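To make the zeroing fully explicit, consider the special case of an orthonormal design matrix, where the lasso solution has a closed form: soft-thresholding of the OLS coefficients. This is a standard result, not shown in the original answer; the sketch below uses the $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ convention:

```python
# Sketch: soft-thresholding, the closed-form lasso solution when the
# columns of X are orthonormal. OLS estimates inside [-lambda, lambda]
# are set exactly to zero; all others are shrunk toward zero by lambda.
import numpy as np

def soft_threshold(beta_ols, lam):
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2, 2.1])
print(soft_threshold(beta_ols, lam=0.5))
# approximately [2.5, -1.0, 0.0, -0.0, 1.6]: small estimates hit zero
```

This is exactly the corner-hitting behavior in the figure: coefficients whose OLS estimate is small relative to $\lambda$ end up exactly at zero, while the rest are merely shrunk.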

SixSigma