What is the basic argument on which ridge and lasso regression are based? I went through the Tikhonov regularization wiki, where it is mentioned that
In many cases, the Tikhonov matrix is chosen as the identity matrix, giving preference to solutions with smaller norms. In other cases, low-pass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous.
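For concreteness, the formulation I am looking at (writing out the standard Tikhonov form; the notation $X$, $y$, $\beta$, $\Gamma$, $\lambda$ is mine, not quoted from the article) is

$$\hat{\beta} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \|\Gamma \beta\|_2^2,$$

so choosing $\Gamma = \sqrt{\lambda}\, I$ gives ridge regression with its penalty $\lambda \|\beta\|_2^2$ on the size of the coefficient vector, while replacing that penalty with $\lambda \|\beta\|_1$ gives the lasso.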
I want to understand why solutions with smaller norms are more appealing. Smoothness I can understand, but why smaller norms?