Ridge & LASSO norms

Question

This post follows this one: Why does ridge estimate become better than OLS by adding a constant to the diagonal?

Here is my question:

As far as I know, ridge regularization uses an $\ell_2$-norm (Euclidean distance). But why do we use the square of this norm? (A direct application of $\ell_2$ would result in the square root of the sum of the squared betas.)

As a comparison, we don't do this for the LASSO, which uses an $\ell_1$-norm to regularize. But here it's the "real" $\ell_1$ norm (just the sum of the absolute values of the betas, and not the square of this sum).

Can someone help me to clarify?

The penalty term in ridge regression is the squared L2 norm. See these slides written by Tibshirani as an example (slide 7) http://www.stat.cmu.edu/~ryantibs/datamining/lectures/16-modr1.pdf See also here http://en.wikipedia.org/wiki/Tikhonov_regularization — boscovich, Oct 15 '14 at 13:57
Small point of clarification, these are slides from Ryan Tibshirani *not* Rob. — Ellis Valentiner, Oct 15 '14 at 14:03
ok, thanks a lot for the clarification. But I don't understand why squared for L2 and not squared for L1. Don't we have a general formula for any kind of regularization? — PLOTZ, Oct 15 '14 at 14:05
@user12202013: thank you for pointing that out. I didn't notice that. — boscovich, Oct 16 '14 at 11:16

bdeonovic · Answer 1 · 2014-10-16T11:37:13.450

9

There are now many penalized approaches with all kinds of penalty functions (ridge, lasso, MCP, SCAD). The question of why a penalty takes a particular form is basically "what advantages/disadvantages does such a penalty provide?".

Properties of interest might be:

1) nearly unbiased estimators (note all penalized estimators will be biased)

2) Sparsity (note ridge regression does not produce sparse results i.e. it does not shrink coefficients all the way to zero)

3) Continuity (to avoid instability in model prediction)

These are just a few properties one might want in a penalty function.

It is a lot easier to work with a sum in derivations and theoretical work: e.g. $||\beta||_2^2=\sum |\beta_i|^2$ and $||\beta||_1 = \sum |\beta_i|$. Imagine if we had $\sqrt{\left(\sum |\beta_i|^2\right)}$ or $\left( \sum |\beta_i|\right)^2$. Taking derivatives (which is necessary to show theoretical results like consistency, asymptotic normality etc) would be a pain with penalties like that.
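To make the comparison concrete, here is a sketch of the partial derivatives of the four candidates. It assumes every $\beta_j \neq 0$ (my assumption, to sidestep the kink of the absolute value):

$$\frac{\partial}{\partial \beta_j}\|\beta\|_2^2 = 2\beta_j, \qquad \frac{\partial}{\partial \beta_j}\|\beta\|_1 = \operatorname{sign}(\beta_j),$$

$$\frac{\partial}{\partial \beta_j}\|\beta\|_2 = \frac{\beta_j}{\sqrt{\sum_i \beta_i^2}}, \qquad \frac{\partial}{\partial \beta_j}\|\beta\|_1^2 = 2\operatorname{sign}(\beta_j)\sum_i |\beta_i|.$$

In the first row each partial derivative depends on $\beta_j$ alone. In the second row every coordinate is coupled to all the others, which is what makes the derivations messier.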

edited Oct 16 '14 at 11:37

answered Oct 15 '14 at 14:13


ok, thanks. But why squared for L2 and not squared for L1? Don't we have a general formula for any kind of regularization? This is puzzling me... – PLOTZ Oct 15 '14 at 14:34
@PLOTZ I added a bit to my answer. – bdeonovic Oct 15 '14 at 17:19
Thanks a lot Benjamin! For sure it's clearer now! I didn't get this theoretical purpose before your answer. Many thanks for your answer. – PLOTZ Oct 16 '14 at 05:46
@Benjamin: in point #1 did you actually mean "(**not** all penalized estimators will be unbiased)"? Ridge regression –just to name one– is biased. – boscovich Oct 16 '14 at 07:40
whoops yes thanks for catching that! I think in fact all penalized estimators will be biased. – bdeonovic Oct 16 '14 at 11:37

score 6 · Answer 2 · answered Jan 22 '15 at 16:23

Actually, both the square of the $\ell_2$-norm and the $\ell_1$-norm come from the same class of regularization: $\|\boldsymbol{\beta}\|_p^p$ with $p > 0$.

Ridge regression then uses $p=2$ and the lasso $p=1$, but one can use other values of $p$.

For example, you get sparse solutions for all values of $p \leq 1$, and the smaller the value of $p$, the sparser the solution.

For values of $p \leq 1$ your objective is no longer smooth, so the optimization becomes harder; for $p<1$ the objective is non-convex, so the optimization is even harder...
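The loss of convexity for $p<1$ can be checked numerically with a midpoint test. This is only a small sketch: the helper names `lp_penalty` and `midpoint_gap` and the two test points are made up for illustration.

```python
def lp_penalty(beta, p):
    """Return sum_i |beta_i|^p, i.e. the penalty ||beta||_p^p."""
    return sum(abs(b) ** p for b in beta)


def midpoint_gap(a, b, p):
    """Penalty at the midpoint minus the average penalty at the endpoints.

    A convex function never gives a positive value here.
    """
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return lp_penalty(mid, p) - (lp_penalty(a, p) + lp_penalty(b, p)) / 2


# Two hypothetical coefficient vectors chosen only for illustration.
a, b = [1.0, 0.0], [0.0, 1.0]
gaps = {p: midpoint_gap(a, b, p) for p in (0.5, 1.0, 2.0)}
for p, gap in gaps.items():
    print(f"p = {p}: midpoint gap = {gap:+.3f}")
```

A positive gap for $p=0.5$ means the penalty lies above its chord there, so it cannot be convex. For $p=1$ the gap is zero, and for $p=2$ it is negative.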

score 2 · Answer 3 · answered Feb 23 '18 at 19:16

I believe there is an even simpler answer here, although "why" questions are always hard to answer when a technique is developed. The squared $\ell_2$-norm is used so that the regularization term is easily differentiable. Ridge regression minimizes:

$$\|\mathbf{y - X\beta}\|^2_2+\lambda\|\beta\|_2^2$$

Which can also be written: $$\|\mathbf{y - X\beta}\|^2_2+\lambda\beta^T\beta$$

This can now be easily differentiated with respect to $\beta$; setting the gradient to zero gives the closed-form solution:

$$\hat\beta^{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T\mathbf{y}$$

from which all kinds of inference can be derived.
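As a quick numerical check of the closed form, here is a minimal NumPy sketch. The simulated data, the seed, and the value of `lam` are arbitrary choices of mine, not part of the question. It computes $\hat\beta^{\text{ridge}}$ and confirms that the gradient of the objective vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, illustration only
n, k = 50, 3                            # hypothetical sample size and number of predictors
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
lam = 2.0                               # hypothetical penalty weight (lambda)

# Closed form: solve (X^T X + lambda I) beta = X^T y rather than inverting explicitly.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Gradient of ||y - X beta||^2 + lambda * beta^T beta, evaluated at the solution.
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge

print(beta_ridge)
print(np.allclose(grad, 0.0, atol=1e-8))
```

Using `np.linalg.solve` instead of forming the inverse is just the usual numerically safer way to evaluate the same formula.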

score 1 · Answer 4 · answered Feb 23 '18 at 16:43

Consider one other important difference between using the square of the $\ell_2$ norm (i.e. ridge regression) and the unmodified $\ell_2$ norm: the gradient of the $\ell_2$ norm $||x||_2$ at $x \neq 0$ is $\frac{x}{||x||_2}$, so the norm is not differentiable at the zero vector. That is, although the $\ell_2$ norm does not do individual variable selection like the lasso, it could theoretically yield exactly $\beta=0$ as the maximum penalized likelihood estimate. By squaring the $\ell_2$ norm in the penalty, the ridge-type penalty becomes differentiable everywhere and does not by itself yield such a solution.

This behavior is exactly (by my understanding) why the group lasso (Yuan and Lin) and the sparse group lasso (Simon, et al.), etc, use the $\ell_2$ norm (on prespecified subsets of the coefficients) instead of the square of the $\ell_2$ norm.
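The difference is easiest to see in the simplest setting, where the design is the identity, so the problem is $\min_b \tfrac12\|y-b\|_2^2 + \lambda\,\text{penalty}(b)$ and both minimizers have a closed form. That setting, the function names, and the numbers below are my own assumptions for illustration.

```python
import math


def prox_l2_norm(y, lam):
    """Minimizer of 0.5 * ||y - b||^2 + lam * ||b||_2 (block soft-thresholding)."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= lam:
        return [0.0] * len(y)          # the whole vector is set exactly to zero
    scale = 1.0 - lam / norm
    return [scale * v for v in y]


def prox_squared_l2_norm(y, lam):
    """Minimizer of 0.5 * ||y - b||^2 + lam * ||b||_2^2 (pure shrinkage)."""
    return [v / (1.0 + 2.0 * lam) for v in y]


y = [0.3, -0.4]                        # hypothetical data with ||y||_2 = 0.5
lam = 1.0                              # hypothetical penalty weight

b_norm = prox_l2_norm(y, lam)
b_squared = prox_squared_l2_norm(y, lam)
print(b_norm)
print(b_squared)
```

With the unsquared norm, the whole vector collapses to zero as soon as $\|y\|_2 \le \lambda$. With the squared norm, the coefficients are only rescaled by $1/(1+2\lambda)$ and stay nonzero whenever $y$ is.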

score 0 · Answer 5 · answered Jun 04 '20 at 21:43

Another interpretation is the Bayesian one. Regularization with $\|\cdot\|_2^2$ in a frequentist approach is equivalent to placing a Gaussian prior on your weight vector $\beta$ and taking the posterior mode. See here for example: Why is the L2 regularization equivalent to Gaussian prior?
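A short sketch of that equivalence, under assumptions that are mine rather than the linked post's: Gaussian noise $y \mid \beta \sim N(X\beta, \sigma^2 I)$ and an independent prior $\beta \sim N(0, \tau^2 I)$. The negative log posterior is then

$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const},$$

so maximizing the posterior is the same as minimizing $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ with $\lambda = \sigma^2/\tau^2$. The square appears because the Gaussian log density is quadratic in $\beta$.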
