Questions tagged [ridge-regression]

A regularization method for regression models that shrinks coefficients towards zero.

Ridge regression is a technique that penalizes the size of regression coefficients in order to deal with multicollinear variables or ill-posed statistical problems. It is a form of Tikhonov regularization, named after the mathematician Andrey Tikhonov.

Given a set of training data $(x_1,y_1),...,(x_n,y_n)$ where $x_i \in \mathbb{R}^{J}$, the estimation problem is:

$$\min_\beta \sum\limits_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum\limits_{j=1}^J \beta_j^2$$

for which the solution is given by

$$\widehat{\beta}_{ridge} = (X'X + \lambda I)^{-1}X'y$$

which is similar to the OLS estimator but includes the tuning parameter $\lambda$ and the Tikhonov matrix (here the identity matrix $I$, although other choices are possible). Note that, unlike $X'X$ in the OLS estimator, the matrix $X'X + \lambda I$ is always invertible for $\lambda > 0$, even when the model has more parameters than observations, so the estimation problem always has a unique solution.
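As a concrete illustration, here is a minimal NumPy sketch of the closed-form estimator above; the simulated data and parameter values are arbitrary, and in practice one would typically center or standardize the variables first.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))                      # simulated design matrix
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lambda * I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ridge = ridge(X, y, lam=1.0)
beta_ols = ridge(X, y, lam=0.0)                  # lambda = 0 recovers OLS here
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # ridge shrinks the coefficients
```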

Bayesian derivation

The ridge estimator can be derived as the posterior mode of a Bayesian linear regression with a normal prior on $\beta$. Define the likelihood:

$$L(X,Y;\beta,\sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}$$

and a normal prior on $\beta$ with mean $0$ and covariance matrix $\alpha I_p$:

$$\beta \sim N(0,\alpha I_p).$$

Using Bayes rule, we calculate the posterior distribution:

$$P(\beta \mid X,Y) \propto L(X,Y;\beta,\sigma^2)\,\pi(\beta)$$ $$ \propto \Big[\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}\Big]e^{-\frac12\beta^T(\alpha I_p)^{-1}\beta}$$

Maximizing the posterior is equivalent to minimizing the negative log-posterior. Taking logs and dropping additive constants,

$$\log P(\beta \mid X,Y) = \text{const} - \frac12\Big(\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{1}{\alpha}\beta^T\beta\Big),$$

so maximizing the posterior is the same as minimizing

$$\sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{\sigma^2}{\alpha}\sum_{j=1}^{p}\beta_j^2,$$

where $\frac{\sigma^2}{\alpha}$ corresponds to the tuning parameter $\lambda$ from above.
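A quick numerical check of this equivalence (a sketch only; the simulated data and the values of $\sigma^2$ and $\alpha$ are arbitrary choices): minimizing the negative log-posterior directly should reproduce the closed-form ridge estimate with $\lambda = \sigma^2/\alpha$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

sigma2, alpha = 0.25, 2.0            # noise variance and prior variance (arbitrary)
lam = sigma2 / alpha                 # implied ridge penalty

def neg_log_posterior(beta):
    # up to additive constants: RSS / (2 * sigma2) + ||beta||^2 / (2 * alpha)
    return np.sum((y - X @ beta) ** 2) / (2 * sigma2) + beta @ beta / (2 * alpha)

beta_map = minimize(neg_log_posterior, np.zeros(p)).x
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.max(np.abs(beta_map - beta_ridge)))     # essentially zero
```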

The tuning parameter $\lambda$ determines the degree of shrinkage of the regression coefficients. The idea is to accept some bias in order to reduce the variance (see the bias-variance trade-off). With highly multicollinear variables, a small increase in bias can buy a substantial reduction in variance, and therefore a lower mean squared error.

The bias of the ridge regression estimator is $$\operatorname{Bias}(\widehat{\beta}_{ridge}) = -\lambda (X'X + \lambda I)^{-1} \beta.$$ There always exists some $\lambda > 0$ for which the MSE of the ridge regression estimator is smaller than that of the OLS estimator.
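A small simulation sketch of this point (the correlated design, true coefficients, and grid of $\lambda$ values are arbitrary choices): with strongly correlated predictors, some $\lambda > 0$ typically gives a lower total MSE for $\widehat{\beta}$ than OLS ($\lambda = 0$).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, n_rep = 30, 5, 500
beta = np.array([1.0, 2.0, -1.0, 0.5, 0.0])

# strongly correlated design, held fixed across replications
cov = 0.95 * np.ones((p, p)) + 0.05 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

lambdas = [0.0, 0.1, 1.0, 10.0]                  # 0.0 corresponds to OLS
mse = np.zeros(len(lambdas))
for _ in range(n_rep):
    y = X @ beta + rng.normal(size=n)
    for k, lam in enumerate(lambdas):
        b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        mse[k] += np.sum((b - beta) ** 2) / n_rep

print(dict(zip(lambdas, mse.round(3))))          # a moderate lambda should beat lambda = 0
```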

Note that as $\lambda \rightarrow 0$, $\widehat{\beta}_{ridge} \rightarrow \widehat{\beta}_{OLS}$, and as $\lambda \rightarrow \infty$, $\widehat{\beta}_{ridge} \rightarrow 0$. Choosing a good value of $\lambda$ is therefore important. Common methods for this choice include information criteria (AIC or BIC) and (generalized) cross-validation.
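For example, $\lambda$ can be selected by cross-validation; the sketch below uses scikit-learn's RidgeCV on simulated data (the grid of candidate values is purely illustrative, and scikit-learn calls the penalty alpha rather than $\lambda$).

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# efficient leave-one-out cross-validation over a grid of candidate penalties
model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print(model.alpha_)                              # selected penalty (lambda)
```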

691 questions
198 votes, 3 answers

When should I use lasso vs ridge?

Say I want to estimate a large number of parameters, and I want to penalize some of them because I believe they should have little effect compared to the others. How do I decide what penalization scheme to use? When is ridge regression more…
Larry Wang
141 votes, 8 answers

Why L1 norm for sparse models

I am reading books about linear regression. There are some sentences about the L1 and L2 norm. I know the formulas, but I don't understand why the L1 norm enforces sparsity in models. Can someone give a simple explanation?
Yongwei Xing
93 votes, 2 answers

When to use regularization methods for regression?

In what circumstances should one consider using regularization methods (ridge, lasso or least angles regression) instead of OLS? In case this helps steer the discussion, my main interest is improving predictive accuracy.
NPE
87 votes, 3 answers

Why is ridge regression called "ridge", why is it needed, and what happens when $\lambda$ goes to infinity?

Ridge regression coefficient estimate $\hat{\beta}^R$ are the values that minimize the $$ \text{RSS} + \lambda \sum_{j=1}^p\beta_j^2. $$ My questions are: If $\lambda = 0$, then we see that the expression above reduces to the usual RSS. What if…
cgo
74 votes, 5 answers

Unified view on shrinkage: what is the relation (if any) between Stein's paradox, ridge regression, and random effects in mixed models?

Consider the following three phenomena. Stein's paradox: given some data from multivariate normal distribution in $\mathbb R^n, \: n\ge 3$, sample mean is not a very good estimator of the true mean. One can obtain an estimation with lower mean…
amoeba
69 votes, 5 answers

What problem do shrinkage methods solve?

The holiday season has given me the opportunity to curl up next to the fire with The Elements of Statistical Learning. Coming from a (frequentist) econometrics perspective, I'm having trouble grasping the uses of shrinkage methods like ridge…
Charlie
66 votes, 3 answers

Why does ridge estimate become better than OLS by adding a constant to the diagonal?

I understand that the ridge regression estimate is the $\beta$ that minimizes residual sum of square and a penalty on the size of $\beta$ $$\beta_\mathrm{ridge} = (\lambda I_D + X'X)^{-1}X'y = \operatorname{argmin}\big[ \text{RSS} + \lambda…
Heisenberg
64 votes, 6 answers

Is ridge regression useless in high dimensions ($n \ll p$)? How can OLS fail to overfit?

Consider a good old regression problem with $p$ predictors and sample size $n$. The usual wisdom is that OLS estimator will overfit and will generally be outperformed by the ridge regression estimator: $$\hat\beta = (X^\top X + \lambda I)^{-1}X^\top…
amoeba
58 votes, 3 answers

Why does shrinkage work?

In order to solve problems of model selection, a number of methods (LASSO, ridge regression, etc.) will shrink the coefficients of predictor variables towards zero. I am looking for an intuitive explanation of why this improves predictive ability.…
56 votes, 5 answers

How to derive the ridge regression solution?

I am having some issues with the derivation of the solution for ridge regression. I know the regression solution without the regularization term: $$\beta = (X^TX)^{-1}X^Ty.$$ But after adding the L2 term $\lambda\|\beta\|_2^2$ to the cost function,…
user34790
50 votes, 3 answers

Why do we only see $L_1$ and $L_2$ regularization but not other norms?

I am just curious why there are usually only $L_1$ and $L_2$ norms regularization. Are there proofs of why these are better?
user10024395
47 votes, 1 answer

Is regression with L1 regularization the same as Lasso, and with L2 regularization the same as ridge regression? And how to write "Lasso"?

I'm a software engineer learning machine learning, particularly through Andrew Ng's machine learning courses. While studying linear regression with regularization, I've found terms that are confusing: Regression with L1 regularization or L2…
46 votes, 2 answers

When will L1 regularization work better than L2 and vice versa?

Note: I know that L1 has feature selection property. I am trying to understand which one to choose when feature selection is completely irrelevant. How to decide which regularization (L1 or L2) to use? What are the pros & cons of each of L1 / L2…
GeorgeOfTheRF
43 votes, 2 answers

If only prediction is of interest, why use lasso over ridge?

On page 223 in An Introduction to Statistical Learning, the authors summarise the differences between ridge regression and lasso. They provide an example (Figure 6.9) of when "lasso tends to outperform ridge regression in terms of bias, variance,…
42 votes, 4 answers

Ridge, lasso and elastic net

How do ridge, LASSO and elasticnet regularization methods compare? What are their respective advantages and disadvantages? Any good technical paper, or lecture notes would be appreciated as well.
user3269