How do ridge, LASSO, and elastic net regularization methods compare? What are their respective advantages and disadvantages? Any good technical papers or lecture notes would be appreciated as well.
4 Answers
In The Elements of Statistical Learning book, Hastie et al. provide a very insightful and thorough comparison of these shrinkage techniques. The book is available online (pdf). The comparison is done in section 3.4.3, page 69.
The main difference between Lasso and Ridge is the penalty term they use. Ridge uses an $L_2$ penalty, which limits the size of the coefficient vector. Lasso uses an $L_1$ penalty, which imposes sparsity among the coefficients and thus makes the fitted model more interpretable. Elastic net was introduced as a compromise between these two techniques, with a penalty that is a mix of the $L_1$ and $L_2$ norms.
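For a concrete feel of the difference, here is a minimal sketch (my own illustration, not from the book) that fits all three penalties with scikit-learn on synthetic data; the alpha values and dataset sizes are arbitrary choices for illustration only.

```python
# Sketch: compare how many coefficients each penalty drives to exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# 100 samples, 10 features, only 3 of which actually drive the response
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

models = {
    "ridge":       Ridge(alpha=1.0),                      # L2: shrinks all coefficients
    "lasso":       Lasso(alpha=1.0),                      # L1: sets some coefficients to exactly 0
    "elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),   # mix of L1 and L2
}

for name, model in models.items():
    model.fit(X, y)
    n_zero = np.sum(model.coef_ == 0)
    print(f"{name:12s} zero coefficients: {n_zero}/10")
```

With settings like these, the lasso and elastic net fits typically zero out most of the uninformative coefficients, while ridge only shrinks them toward zero.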
- That is a wonderful reference book. – bdeonovic Apr 09 '14 at 15:42
- Also because the authors are the inventors of these techniques! – Bakaburg Feb 15 '15 at 16:28
- Thank you for giving us a reference to this beautiful book. – Christina May 18 '15 at 12:32
- I highly recommend section 18.4 as well, pages 661-668, which provides more information on lasso vs. elastic net. – Katya Willard Mar 23 '16 at 18:49
- Link to the book is dead as of 14 Oct 2016. – Ashe Oct 14 '16 at 13:25
- New link: http://statweb.stanford.edu/~tibs/ElemStatLearn/ – DJack Oct 19 '16 at 14:51
- @Bakaburg They are the *inventors*, i.e. these are not "age-old" techniques?? – WestCoastProjects Aug 13 '17 at 20:32
- AFAIK Tibshirani and/or Hastie are the inventors of, or at least among the biggest contributors to, $L_1$ and $L_2$ regularization techniques for regression, especially the elastic net. But maybe I am wrong :) – Bakaburg Aug 14 '17 at 10:14
To summarize, here are some salient differences between Lasso, Ridge and Elastic-net:
- Lasso does a sparse selection, while Ridge does not.
- When you have highly correlated variables, Ridge regression shrinks their coefficients towards one another. Lasso is somewhat indifferent between them and generally picks one over the other; depending on the context, one does not know in advance which variable gets picked. Elastic-net is a compromise between the two that attempts to shrink and do a sparse selection simultaneously (see the sketch after this list).
- Ridge estimators are indifferent to multiplicative scaling of the data. That is, if both X and Y variables are multiplied by constants, the coefficients of the fit do not change, for a given $\lambda$ parameter. However, for Lasso, the fit is not independent of the scaling. In fact, the $\lambda$ parameter must be scaled up by the multiplier to get the same result. It is more complex for elastic net.
- Ridge penalizes the largest $\beta$'s more than it penalizes the smaller ones (as they are squared in the penalty term). Lasso penalizes them more uniformly. This may or may not be important. In a forecasting problem with one powerful predictor, Ridge shrinks that predictor's coefficient more than Lasso does.
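Regarding the correlated-variables point above, here is a hedged sketch (my own illustration, not part of the original answer) with two nearly identical predictors; the alpha values are arbitrary and the exact coefficients depend on the random seed.

```python
# Sketch: how each penalty splits weight between two almost-duplicate predictors.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1)),
                    ("elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    print(f"{name:12s} coefficients: {np.round(model.coef_, 2)}")

# Typically ridge spreads the weight roughly evenly across x1 and x2,
# lasso concentrates it on one of them, and elastic net sits in between.
```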

- @balaks For the second point that you made, what does it mean that 'one does not know which variable gets picked'? Did you mean LASSO is indifferent, so it kind of randomly picks one and we don't really know which one is best? – meTchaikovsky Sep 02 '18 at 08:25
- Total non-expert here, but I think the idea is that X1 and X2 may be nearly equivalent explainers of Y, and the tiny, unimportant edge of one over the other leads lasso to select it and floor the other, whereas Ridge would preserve them both. As such, lasso results may be harder to reproduce or cross-validate. This MIGHT be consistent with why we call lasso an unstable selector. (Would love someone else to confirm) – John Vandivier Oct 18 '21 at 00:21
I highly recommend having a look at the book *An Introduction to Statistical Learning* (James et al., 2013).
The reason for this is that *The Elements of Statistical Learning* is intended for individuals with advanced training in the mathematical sciences. In the preface to ISL, the authors write:
> An Introduction to Statistical Learning arose from the perceived need for a broader and less technical treatment of these topics. [...]
>
> An Introduction to Statistical Learning is appropriate for advanced undergraduates or master’s students in statistics or related quantitative fields or for individuals in other disciplines who wish to use statistical learning tools to analyze their data.
- Can you elaborate on why you found this reference to be useful? – J. M. is not a statistician Jul 23 '16 at 15:00
- It's fine to quote a book, but please mark it as a quote and not as your own text. Otherwise it's plagiarism. I edited it for you now. – amoeba Jul 23 '16 at 20:05
The above answers are very clear and informative. I would like to add one minor point from a statistical perspective. Take ridge regression as an example. It is an extension of ordinary least squares (OLS) regression that addresses multicollinearity problems when there are many correlated features. If the linear regression model is
$$Y = Xb + e,$$

the normal-equation solution for multiple linear regression is

$$b = (X^T X)^{-1} X^T Y,$$

and the normal-equation solution for ridge regression is

$$b = (X^T X + kI)^{-1} X^T Y.$$
The ridge solution is a biased estimator of $b$, but one can always find a penalty term $k > 0$ that makes the mean squared error of ridge regression smaller than that of OLS regression.
For LASSO and Elastic-Net, no such analytic solution exists; the coefficients have to be computed numerically (for example, by coordinate descent).
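As a small numerical illustration of the two closed-form solutions above (my own sketch; the penalty $k = 1$ and the data are arbitrary choices):

```python
# Sketch: compute the OLS and ridge closed-form solutions side by side.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
b_true = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
y = X @ b_true + rng.normal(scale=1.0, size=n)

# OLS: b = (X'X)^{-1} X'Y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: b = (X'X + kI)^{-1} X'Y
k = 1.0
b_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

print("OLS  :", np.round(b_ols, 3))
print("Ridge:", np.round(b_ridge, 3))   # shrunk towards zero relative to OLS
```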
