Questions tagged [lasso]

A regularization method for regression models that shrinks coefficients towards zero and can set some of them exactly to zero. The lasso therefore performs feature selection.

LASSO is an acronym for least absolute shrinkage and selection operator. It is a form of regularization used in the estimation of regression coefficients that shrinks coefficient estimates by penalizing their absolute values (i.e. the $L_1$ norm of the estimates). Some coefficients may be shrunk exactly to zero; thus the lasso performs feature selection. The lasso estimate is also the maximum a posteriori (MAP) estimate in a Bayesian model with an i.i.d. Laplace (double-exponential) prior on the regression coefficients.

In the context of linear regression, we can formulate the LASSO problem as:

Given a set of training data $(x_1,y_1),\ldots,(x_n,y_n)$ with $x_i \in \mathbb{R}^{p}$, we seek the coefficient vector $\hat{\beta}_{LASSO} \in \mathbb{R}^{p}$ that solves:

$$\hat{\beta}_{LASSO} = \underset{\beta} {\text{argmin}} \sum\limits_{i=1}^{n}\Big(y_i - \sum\limits_{j=1}^{p}x_{i,j}\beta_{j}\Big)^2$$

$$ \text{subject to } \sum\limits_{j=1}^{p}|\beta_{j}| \leq t$$

Because the $L_1$ penalty is not differentiable, there is no closed-form solution for $\hat{\beta}_{LASSO}$ in general (an orthonormal design is a notable exception, where the solution reduces to coordinate-wise soft-thresholding), so computing the LASSO is a quadratic programming problem, unlike ridge regression, which has a closed-form solution.
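In practice the lasso is fit with iterative solvers. Below is a minimal sketch in R using the glmnet package (assumed installed from CRAN), which computes the lasso path by cyclical coordinate descent; the synthetic data and the penalty value `s = 0.5` are arbitrary illustrative choices:

```r
# Minimal lasso fit with glmnet (assumed available); glmnet computes the
# solution path by cyclical coordinate descent rather than a closed form.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))      # standardized predictors
y <- X[, 1] * 3 + X[, 2] * (-2) + rnorm(n)  # only two true signals

fit <- glmnet(X, y, alpha = 1)              # alpha = 1 is the lasso penalty
# Coefficients at a moderately strong penalty: most are exactly zero,
# i.e. the lasso has performed feature selection.
round(as.matrix(coef(fit, s = 0.5)), 3)
```

In applied work the penalty strength is usually chosen by cross-validation, e.g. with `cv.glmnet`.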

In a Bayesian context, we can derive the equivalent penalized form by finding the $\beta$ that maximizes the posterior, i.e. the MAP estimate:

Assume a Laplace prior on $\beta$:

$$\pi(\beta|\tau) \propto e^{-\frac{1}{2\tau}\sum_{j=1}^{p} |\beta_j|}$$

If we assume that $y \sim N(X\beta,\sigma^2 I)$, then the posterior of $\beta$ satisfies:

$$P(\beta|X,Y,\sigma^2,\tau) \propto \prod_{i=1}^{n}e^{-\frac{1}{2\sigma^2}(y_i - x_i^T\beta)^2}e^{-\frac{1}{2\tau}\sum_{j=1}^{p} |\beta_j|} $$

Finding the MAP estimate is equivalent to minimizing twice the negative log-posterior, which up to an additive constant is:

$$ -2\log P(\beta|X,Y,\sigma^2,\tau) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \frac{1}{\tau}\sum_{j=1}^{p} |\beta_j| + \text{const} $$

Multiplying through by $\sigma^2$, which does not change the minimizer, and setting $\lambda = \frac{\sigma^2}{\tau}$ yields:

$$\hat{\beta} = \underset{\beta} {\text{argmin}} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^{p} |\beta_j| = \hat{\beta}_{LASSO}$$
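To make the penalized form concrete, here is a bare-bones coordinate-descent sketch in R that minimizes the objective above directly. Holding the other coefficients fixed, the one-dimensional minimizer in $\beta_j$ is obtained by soft-thresholding the partial-residual correlation, $\beta_j \leftarrow S(x_j^T r_j,\, \lambda/2)/\|x_j\|^2$ with $S(z,\gamma) = \mathrm{sign}(z)(|z|-\gamma)^+$. The data, sweep count, and $\lambda$ are arbitrary illustrative choices; a tested solver such as glmnet should be used in practice.

```r
# Bare-bones coordinate descent for the lasso objective
#   sum((y - X %*% beta)^2) + lambda * sum(abs(beta)).
# Illustrative sketch only: fixed sweep count, no convergence check.

soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, lambda, n_sweeps = 200) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_len(p)) {
      # Partial residual: remove predictor j's current contribution.
      r_j <- y - X %*% beta + X[, j] * beta[j]
      # One-dimensional minimizer in beta_j is a soft-thresholded
      # least-squares update (threshold lambda/2 for this scaling).
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- X[, 1] * 3 + X[, 2] * (-2) + rnorm(n)  # only two true signals
round(lasso_cd(X, y, lambda = 50), 3)       # noise coefficients come out exactly 0
```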

1302 questions
198 votes, 3 answers

When should I use lasso vs ridge?

Say I want to estimate a large number of parameters, and I want to penalize some of them because I believe they should have little effect compared to the others. How do I decide what penalization scheme to use? When is ridge regression more…
Larry Wang
141 votes, 8 answers

Why L1 norm for sparse models

I am reading books about linear regression. There are some sentences about the L1 and L2 norm. I know the formulas, but I don't understand why the L1 norm enforces sparsity in models. Can someone give a simple explanation?
Yongwei Xing
114 votes, 4 answers

Why does the Lasso provide Variable Selection?

I've been reading Elements of Statistical Learning, and I would like to know why the Lasso provides variable selection and ridge regression doesn't. Both methods minimize the residual sum of squares and have a constraint on the possible values of…
Zhi Zhao
93 votes, 2 answers

When to use regularization methods for regression?

In what circumstances should one consider using regularization methods (ridge, lasso or least angle regression) instead of OLS? In case this helps steer the discussion, my main interest is improving predictive accuracy.
NPE
88 votes, 3 answers

What is the lasso in regression analysis?

I'm looking for a non-technical definition of the lasso and what it is used for.
Paul Vogt
86 votes, 3 answers

An example: LASSO regression using glmnet for binary outcome

I am starting to dabble with the use of glmnet with LASSO Regression where my outcome of interest is dichotomous. I have created a small mock data frame below: age <- c(4, 8, 7, 12, 6, 9, 10, 14, 7) gender <- c(1, 0, 1, 1, 1, 0, 1, 0, 0) bmi_p…
Matt Reichenbach
83 votes, 11 answers

What are disadvantages of using the lasso for variable selection for regression?

From what I know, using lasso for variable selection handles the problem of correlated inputs. Also, since it is equivalent to Least Angle Regression, it is not slow computationally. However, many people (for example people I know doing…
xuexue
69 votes, 5 answers

What problem do shrinkage methods solve?

The holiday season has given me the opportunity to curl up next to the fire with The Elements of Statistical Learning. Coming from a (frequentist) econometrics perspective, I'm having trouble grasping the uses of shrinkage methods like ridge…
Charlie
67 votes, 6 answers

Standard errors for lasso prediction using R

I'm trying to use a LASSO model for prediction, and I need to estimate standard errors. Surely someone has already written a package to do this. But as far as I can see, none of the packages on CRAN that do predictions using a LASSO will return…
Rob Hyndman
64 votes, 2 answers

Derivation of closed form lasso solution

For the lasso problem $\min_\beta (Y-X\beta)^T(Y-X\beta)$ such that $\|\beta\|_1 \leq t$. I often see the soft-thresholding result $$ \beta_j^{\text{lasso}}= \mathrm{sgn}(\beta^{\text{LS}}_j)(|\beta_j^{\text{LS}}|-\gamma)^+ $$ for the orthonormal…
Gary
58 votes, 3 answers

Why does shrinkage work?

In order to solve problems of model selection, a number of methods (LASSO, ridge regression, etc.) will shrink the coefficients of predictor variables towards zero. I am looking for an intuitive explanation of why this improves predictive ability.…
50 votes, 3 answers

Why do we only see $L_1$ and $L_2$ regularization but not other norms?

I am just curious why there are usually only $L_1$ and $L_2$ norms regularization. Are there proofs of why these are better?
user10024395
47 votes, 1 answer

Is regression with L1 regularization the same as Lasso, and with L2 regularization the same as ridge regression? And how to write "Lasso"?

I'm a software engineer learning machine learning, particularly through Andrew Ng's machine learning courses. While studying linear regression with regularization, I've found terms that are confusing: Regression with L1 regularization or L2…
46 votes, 2 answers

When will L1 regularization work better than L2 and vice versa?

Note: I know that L1 has feature selection property. I am trying to understand which one to choose when feature selection is completely irrelevant. How to decide which regularization (L1 or L2) to use? What are the pros & cons of each of L1 / L2…
GeorgeOfTheRF
45 votes, 3 answers

whether to rescale indicator / binary / dummy predictors for LASSO

For the LASSO (and other model selecting procedures) it is crucial to rescale the predictors. The general recommendation I follow is simply to use a 0 mean, 1 standard deviation normalization for continuous variables. But what is there to do with…