In mathematics, a norm is a function that measures the "length" or "size" of a vector. Among the popular norms are the $\ell_1$, $\ell_2$, and general $\ell_p$ norms, defined as
$$\begin{align}
\|\boldsymbol{x}\|_1 &= \sum_i | x_i | \\
\| \boldsymbol{x}\|_2 &= \sqrt{ \sum_i |x_i|^2 } \\
\| \boldsymbol{x}\|_p &= \left( \sum_i | x_i |^p \right)^{1/p}
\end{align}$$
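As a quick numerical illustration (the vector below is just a made-up example), the three definitions can be computed directly, and NumPy's `np.linalg.norm` gives the same results:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# l1 norm: sum of absolute values
l1 = np.sum(np.abs(x))

# l2 norm: Euclidean length
l2 = np.sqrt(np.sum(np.abs(x) ** 2))

# general lp norm, here with p = 3
p = 3
lp = np.sum(np.abs(x) ** p) ** (1 / p)

# np.linalg.norm implements the same formulas
assert np.isclose(l1, np.linalg.norm(x, 1))  # 7.0
assert np.isclose(l2, np.linalg.norm(x, 2))  # 5.0
assert np.isclose(lp, np.linalg.norm(x, 3))
```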
In machine learning, we often want to predict target values $y$ using a function $f$ of the features $\mathbf{x}$, parametrized by a vector of parameters $\boldsymbol{\theta}$. To fit the model, we minimize a loss function $\mathcal{L}$. We sometimes also want to penalize the parameters by forcing them to take small values; the rationale for such regularization is described, for example, here, here, or here. One way of achieving this is to add a regularization term, e.g. the $\ell_2$ norm of the weight vector (often squared, as below), to the loss and minimize the whole thing:
$$
\underset{\boldsymbol{\theta}}{\operatorname{arg\,min}} \; \mathcal{L}\big(y, \,f(\mathbf{x}; \boldsymbol{\theta}) \big) + \lambda\, \|\boldsymbol{\theta}\|_2^2
$$
where $\lambda\ge0$ is a hyperparameter. So basically, we use norms here to measure the "size" of the model weights. By adding the size of the weights to the loss function, we force the minimization algorithm to seek a solution that, along with minimizing the loss, also keeps the weights small. The $\lambda$ hyperparameter controls how strong this effect is.
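For squared-error loss this is ridge regression, which has a closed-form solution. The sketch below (on made-up synthetic data) shows how increasing $\lambda$ shrinks the norm of the fitted weights:

```python
import numpy as np

# Hypothetical synthetic data: 50 samples, 5 features (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
theta_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form minimizer of ||y - X theta||_2^2 + lam * ||theta||_2^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The l2 norm of the solution decreases as lambda grows
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 100.0)]
print(norms)
```

With $\lambda = 0$ this reduces to ordinary least squares; as $\lambda$ grows, the solution is pulled toward the zero vector.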
Indeed, using an $\ell_2$ penalty can be seen as equivalent to placing Gaussian priors on the parameters, while using an $\ell_1$ penalty is equivalent to placing Laplace priors (though in practice you would need much stronger priors; see e.g. the paper *Shrinkage priors for Bayesian penalized regression* by van Erp et al).
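To sketch why (this is the standard MAP argument, not specific to any one model): the maximum a posteriori estimate maximizes the log-posterior,
$$
\hat{\boldsymbol{\theta}}_{\text{MAP}} = \underset{\boldsymbol{\theta}}{\operatorname{arg\,max}} \; \log p(y \mid \mathbf{x}, \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})
$$
If each $\theta_j \sim \mathcal{N}(0, \tau^2)$ independently, then $\log p(\boldsymbol{\theta}) = -\frac{1}{2\tau^2}\|\boldsymbol{\theta}\|_2^2 + \text{const}$, so maximizing the posterior amounts to minimizing $\mathcal{L} + \lambda \|\boldsymbol{\theta}\|_2^2$ with $\lambda \propto 1/\tau^2$. With a Laplace prior, $\log p(\boldsymbol{\theta}) = -\frac{1}{b}\|\boldsymbol{\theta}\|_1 + \text{const}$, which yields the $\ell_1$ penalty instead.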
For more details, check e.g. the Why L1 norm for sparse models, Why does the Lasso provide Variable Selection?, or When should I use lasso vs ridge? threads.