Questions tagged [lasso]

A regularization method for regression models that shrinks coefficients towards zero and can set some of them exactly to zero. The lasso therefore performs feature selection.

LASSO is an acronym for least absolute shrinkage and selection operator. It is a form of regularization used in the estimation of regression coefficients that shrinks coefficient estimates by penalizing their absolute values (i.e. the $L_1$ norm of the estimates). Some coefficients may be shrunk exactly to zero; thus the lasso performs feature selection. The lasso estimate is also the maximum a posteriori (MAP) estimate in a Bayesian model with an i.i.d. Laplace (double-exponential) prior on the regression coefficients.

In the context of linear regression, we can formulate the LASSO problem as:

Given a set of training data $(x_1,y_1),\ldots,(x_n,y_n)$ with $x_i \in \mathbb{R}^{p}$, we seek the coefficient vector $\hat{\beta}_{LASSO} \in \mathbb{R}^{p}$ that solves:

$$\hat{\beta}_{LASSO} = \underset{\beta} {\text{argmin}} \sum\limits_{i=1}^{n}\Big(y_i - \sum\limits_{j=1}^{p}x_{i,j}\beta_{j}\Big)^2$$

$$ \text{subject to } \sum\limits_{j=1}^{p}|\beta_{j}| \leq t$$

Because the $L_1$ penalty is not differentiable, there is no closed-form solution for $\hat{\beta}_{LASSO}$ in general (an orthonormal design is a notable exception, where the solution reduces to coordinate-wise soft-thresholding), so computing the LASSO is a quadratic programming problem, unlike ridge regression, which has a closed-form solution.
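In practice the lasso is fit with iterative solvers. Below is a minimal sketch in R using the glmnet package (assumed installed from CRAN), which computes the lasso path by cyclical coordinate descent; the synthetic data and the penalty value `s = 0.5` are arbitrary illustrative choices:

```r
# Minimal lasso fit with glmnet (assumed available); glmnet computes the
# solution path by cyclical coordinate descent rather than a closed form.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))      # standardized predictors
y <- X[, 1] * 3 + X[, 2] * (-2) + rnorm(n)  # only two true signals

fit <- glmnet(X, y, alpha = 1)              # alpha = 1 is the lasso penalty
# Coefficients at a moderately strong penalty: most are exactly zero,
# i.e. the lasso has performed feature selection.
round(as.matrix(coef(fit, s = 0.5)), 3)
```

In applied work the penalty strength is usually chosen by cross-validation, e.g. with `cv.glmnet`.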

In a Bayesian context, we can derive the equivalent penalized form by finding the $\beta$ that maximizes the posterior, i.e. the MAP estimate:

Assume a Laplace prior on $\beta$:

$$\pi(\beta|\tau) \propto e^{-\frac{1}{2\tau}\sum_{j=1}^{p} |\beta_j|}$$

If we assume that $y \sim N(X\beta,\sigma^2 I)$, then the posterior of $\beta$ satisfies:

$$P(\beta|X,Y,\sigma^2,\tau) \propto \prod_{i=1}^{n}e^{-\frac{1}{2\sigma^2}(y_i - x_i^T\beta)^2}e^{-\frac{1}{2\tau}\sum_{j=1}^{p} |\beta_j|} $$

Finding the MAP estimate is equivalent to minimizing twice the negative log-posterior, which up to an additive constant is:

$$ -2\log P(\beta|X,Y,\sigma^2,\tau) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \frac{1}{\tau}\sum_{j=1}^{p} |\beta_j| + \text{const} $$

Multiplying through by $\sigma^2$, which does not change the minimizer, and setting $\lambda = \frac{\sigma^2}{\tau}$ yields:

$$\hat{\beta} = \underset{\beta} {\text{argmin}} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^{p} |\beta_j| = \hat{\beta}_{LASSO}$$
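To make the penalized form concrete, here is a bare-bones coordinate-descent sketch in R that minimizes the objective above directly. Holding the other coefficients fixed, the one-dimensional minimizer in $\beta_j$ is obtained by soft-thresholding the partial-residual correlation, $\beta_j \leftarrow S(x_j^T r_j,\, \lambda/2)/\|x_j\|^2$ with $S(z,\gamma) = \mathrm{sign}(z)(|z|-\gamma)^+$. The data, sweep count, and $\lambda$ are arbitrary illustrative choices; a tested solver such as glmnet should be used in practice.

```r
# Bare-bones coordinate descent for the lasso objective
#   sum((y - X %*% beta)^2) + lambda * sum(abs(beta)).
# Illustrative sketch only: fixed sweep count, no convergence check.

soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, lambda, n_sweeps = 200) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_len(p)) {
      # Partial residual: remove predictor j's current contribution.
      r_j <- y - X %*% beta + X[, j] * beta[j]
      # One-dimensional minimizer in beta_j is a soft-thresholded
      # least-squares update (threshold lambda/2 for this scaling).
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- X[, 1] * 3 + X[, 2] * (-2) + rnorm(n)  # only two true signals
round(lasso_cd(X, y, lambda = 50), 3)       # noise coefficients come out exactly 0
```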

1302 questions
198 votes, 3 answers

When should I use lasso vs ridge?

Say I want to estimate a large number of parameters, and I want to penalize some of them because I believe they should have little effect compared to the others. How do I decide what penalization scheme to use? When is ridge regression more…
Larry Wang
141 votes, 8 answers

Why L1 norm for sparse models

I am reading books about linear regression. There are some sentences about the L1 and L2 norm. I know the formulas, but I don't understand why the L1 norm enforces sparsity in models. Can someone give a simple explanation?
Yongwei Xing
114 votes, 4 answers

Why does the Lasso provide Variable Selection?

I've been reading Elements of Statistical Learning, and I would like to know why the Lasso provides variable selection and ridge regression doesn't. Both methods minimize the residual sum of squares and have a constraint on the possible values of…
Zhi Zhao
93 votes, 2 answers

When to use regularization methods for regression?

In what circumstances should one consider using regularization methods (ridge, lasso or least angle regression) instead of OLS? In case this helps steer the discussion, my main interest is improving predictive accuracy.
NPE
88 votes, 3 answers

What is the lasso in regression analysis?

I'm looking for a non-technical definition of the lasso and what it is used for.
Paul Vogt
86 votes, 3 answers

An example: LASSO regression using glmnet for binary outcome

I am starting to dabble with the use of glmnet with LASSO Regression where my outcome of interest is dichotomous. I have created a small mock data frame below: age <- c(4, 8, 7, 12, 6, 9, 10, 14, 7) gender <- c(1, 0, 1, 1, 1, 0, 1, 0, 0) bmi_p…
Matt Reichenbach
83 votes, 11 answers

What are disadvantages of using the lasso for variable selection for regression?

From what I know, using lasso for variable selection handles the problem of correlated inputs. Also, since it is equivalent to Least Angle Regression, it is not slow computationally. However, many people (for example people I know doing…
xuexue
69 votes, 5 answers

What problem do shrinkage methods solve?

The holiday season has given me the opportunity to curl up next to the fire with The Elements of Statistical Learning. Coming from a (frequentist) econometrics perspective, I'm having trouble grasping the uses of shrinkage methods like ridge…
Charlie
67 votes, 6 answers

Standard errors for lasso prediction using R

I'm trying to use a LASSO model for prediction, and I need to estimate standard errors. Surely someone has already written a package to do this. But as far as I can see, none of the packages on CRAN that do predictions using a LASSO will return…
Rob Hyndman
64 votes, 2 answers

Derivation of closed form lasso solution

For the lasso problem $\min_\beta (Y-X\beta)^T(Y-X\beta)$ such that $\|\beta\|_1 \leq t$. I often see the soft-thresholding result $$ \beta_j^{\text{lasso}}= \mathrm{sgn}(\beta^{\text{LS}}_j)(|\beta_j^{\text{LS}}|-\gamma)^+ $$ for the orthonormal…
Gary
58 votes, 3 answers

Why does shrinkage work?

In order to solve problems of model selection, a number of methods (LASSO, ridge regression, etc.) will shrink the coefficients of predictor variables towards zero. I am looking for an intuitive explanation of why this improves predictive ability.…
50 votes, 3 answers

Why do we only see $L_1$ and $L_2$ regularization but not other norms?

I am just curious why there are usually only $L_1$ and $L_2$ norms regularization. Are there proofs of why these are better?
user10024395
47 votes, 1 answer

Is regression with L1 regularization the same as Lasso, and with L2 regularization the same as ridge regression? And how to write "Lasso"?

I'm a software engineer learning machine learning, particularly through Andrew Ng's machine learning courses. While studying linear regression with regularization, I've found terms that are confusing: Regression with L1 regularization or L2…
46 votes, 2 answers

When will L1 regularization work better than L2 and vice versa?

Note: I know that L1 has feature selection property. I am trying to understand which one to choose when feature selection is completely irrelevant. How to decide which regularization (L1 or L2) to use? What are the pros & cons of each of L1 / L2…
GeorgeOfTheRF
45 votes, 3 answers

whether to rescale indicator / binary / dummy predictors for LASSO

For the LASSO (and other model selecting procedures) it is crucial to rescale the predictors. The general recommendation I follow is simply to use a 0 mean, 1 standard deviation normalization for continuous variables. But what is there to do with…