Questions tagged [regularization]

Inclusion of additional constraints (typically a penalty on model complexity) in the model fitting process, used to prevent overfitting and/or improve predictive accuracy.

Regularization refers to the inclusion of additional components in the model fitting process that are used to prevent overfitting and/or stabilize parameter estimates.

Parametric approaches to regularization typically add a term that penalizes model complexity to the training-error or maximum-likelihood objective, alongside the standard data-misfit term; ridge regression and the LASSO are the canonical examples. In the framework of Bayesian MAP estimation, such a penalty can be interpreted as arising from a prior on the parameter vector (e.g. a Gaussian prior for the ridge penalty, a Laplace prior for the LASSO).
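
A minimal numerical sketch of the two canonical penalties (illustrative only: the simulated data, the penalty weight `lam`, and the hand-rolled coordinate-descent loop are assumptions made for this example, not part of the tag description):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]              # only three informative predictors
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 0.5                                     # illustrative penalty weight

# Ridge: minimize ||y - Xb||^2 + lam * ||b||_2^2  (closed-form solution).
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# LASSO: minimize (1/(2n)) ||y - Xb||^2 + lam * ||b||_1  (no closed form);
# a few sweeps of coordinate descent with soft-thresholding.
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

beta_lasso = np.zeros(p)
for _ in range(200):
    for j in range(p):
        r = y - X @ beta_lasso + X[:, j] * beta_lasso[j]   # partial residual
        beta_lasso[j] = soft_threshold(X[:, j] @ r / n, lam) / (X[:, j] @ X[:, j] / n)

print(np.round(beta_ridge, 2))   # every coefficient shrunk, none exactly zero
print(np.round(beta_lasso, 2))   # uninformative coefficients set exactly to zero
```

The soft-thresholding step induced by the $L_1$ penalty is what drives some coefficients exactly to zero, which is the variable-selection behaviour asked about in several of the questions below.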

Non-parametric regularization techniques include dropout (used in deep learning) and the truncated SVD (used in linear least squares).
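
For the truncated-SVD case, a short illustrative sketch (the helper `tsvd_solve`, the rank cutoff `k`, and the simulated ill-conditioned design are assumptions made for this example):

```python
import numpy as np

def tsvd_solve(X, y, k):
    """Least-squares solution using only the k largest singular values of X.

    Discarding the small singular values stabilizes the solution when X is
    ill-conditioned, at the cost of some bias (a form of regularization).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    return Vt_k.T @ ((U_k.T @ y) / s_k)

# Ill-conditioned design: two nearly collinear columns.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=50), rng.normal(size=50)])
y = X @ np.array([1.0, 1.0, 2.0]) + 0.1 * rng.normal(size=50)

print(np.linalg.lstsq(X, y, rcond=None)[0])  # unstable: huge, offsetting coefficients on the collinear pair
print(tsvd_solve(X, y, k=2))                 # truncated SVD: stable coefficients close to [1, 1, 2]
```

Dropping the smallest singular values plays the same stabilizing role as the ridge penalty above: both damp the directions in parameter space that the data barely constrain.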

Synonyms include: penalization, shrinkage methods, and constrained fitting.

1283 questions
141 votes · 8 answers

Why L1 norm for sparse models

I am reading books about linear regression. There are some sentences about the L1 and L2 norm. I know the formulas, but I don't understand why the L1 norm enforces sparsity in models. Can someone give a simple explanation?
Yongwei Xing
114 votes · 4 answers

Why does the Lasso provide Variable Selection?

I've been reading Elements of Statistical Learning, and I would like to know why the Lasso provides variable selection and ridge regression doesn't. Both methods minimize the residual sum of squares and have a constraint on the possible values of…
Zhi Zhao
88 votes · 3 answers

What is the lasso in regression analysis?

I'm looking for a non-technical definition of the lasso and what it is used for.
Paul Vogt
86 votes · 6 answers

Why is the L2 regularization equivalent to Gaussian prior?

I keep reading this and intuitively I can see this, but how does one go from L2 regularization to saying that this is a Gaussian prior analytically? Same goes for saying L1 is equivalent to a Laplace prior. Any further references would be great.
Anonymous
81 votes · 5 answers

What is regularization in plain English?

Unlike other articles, I found the Wikipedia entry for this subject unreadable for a non-math person (like me). I understood the basic idea, that you favor models with fewer rules. What I don't get is how you get from a set of rules to a…
Meh
74 votes · 5 answers

Unified view on shrinkage: what is the relation (if any) between Stein's paradox, ridge regression, and random effects in mixed models?

Consider the following three phenomena. Stein's paradox: given data from a multivariate normal distribution in $\mathbb R^n, \: n\ge 3$, the sample mean is not a very good estimator of the true mean. One can obtain an estimate with lower mean…
amoeba
70 votes · 6 answers

Why is multicollinearity not checked in modern statistics/machine learning

In traditional statistics, while building a model, we check for multicollinearity using methods such as estimates of the variance inflation factor (VIF), but in machine learning, we instead use regularization for feature selection and don't seem to…
69 votes · 5 answers

What problem do shrinkage methods solve?

The holiday season has given me the opportunity to curl up next to the fire with The Elements of Statistical Learning. Coming from a (frequentist) econometrics perspective, I'm having trouble grasping the uses of shrinkage methods like ridge…
Charlie
66 votes · 3 answers

Why does ridge estimate become better than OLS by adding a constant to the diagonal?

I understand that the ridge regression estimate is the $\beta$ that minimizes the residual sum of squares plus a penalty on the size of $\beta$: $$\beta_\mathrm{ridge} = (\lambda I_D + X'X)^{-1}X'y = \operatorname{argmin}\big[ \text{RSS} + \lambda…
Heisenberg
64 votes · 6 answers

Is ridge regression useless in high dimensions ($n \ll p$)? How can OLS fail to overfit?

Consider a good old regression problem with $p$ predictors and sample size $n$. The usual wisdom is that the OLS estimator will overfit and will generally be outperformed by the ridge regression estimator: $$\hat\beta = (X^\top X + \lambda I)^{-1}X^\top…
amoeba
60 votes · 7 answers

Why is the regularization term *added* to the cost function (instead of multiplied etc.)?

Whenever regularization is used, it is often added onto the cost function such as in the following cost function. $$ J(\theta)=\frac 1 2(y-\theta X^T)(y-\theta X^T)^T+\alpha\|\theta\|_2^2 $$ This makes intuitive sense to me since minimize the cost…
grenmester
58 votes · 3 answers

Why does shrinkage work?

In order to solve problems of model selection, a number of methods (LASSO, ridge regression, etc.) will shrink the coefficients of predictor variables towards zero. I am looking for an intuitive explanation of why this improves predictive ability.…
56 votes · 5 answers

How to derive the ridge regression solution?

I am having some issues with the derivation of the solution for ridge regression. I know the regression solution without the regularization term: $$\beta = (X^TX)^{-1}X^Ty.$$ But after adding the L2 term $\lambda\|\beta\|_2^2$ to the cost function,…
user34790
52 votes · 3 answers

Regularization methods for logistic regression

Regularization using methods such as Ridge, Lasso, ElasticNet is quite common for linear regression. I wanted to know the following: Are these methods applicable for logistic regression? If so, are there any differences in the way they need to be…
Tapan Khopkar
50 votes · 3 answers

Why do we only see $L_1$ and $L_2$ regularization but not other norms?

I am just curious why we usually see only $L_1$- and $L_2$-norm regularization. Are there proofs of why these are better?
user10024395