How do ridge, LASSO, and elastic net regularization methods compare? What are their respective advantages and disadvantages? Any good technical papers or lecture notes would be appreciated as well.
4 Answers
In The Elements of Statistical Learning book, Hastie et al. provide a very insightful and thorough comparison of these shrinkage techniques. The book is available online (pdf). The comparison is done in section 3.4.3, page 69.
The main difference between Lasso and Ridge is the penalty term they use. Ridge uses an $L_2$ penalty, which limits the size of the coefficient vector. Lasso uses an $L_1$ penalty, which imposes sparsity among the coefficients and thus makes the fitted model more interpretable. Elastic net was introduced as a compromise between these two techniques, with a penalty that is a mix of the $L_1$ and $L_2$ norms.
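For a concrete feel of the difference, here is a minimal sketch (my own illustration, not from the book) that fits all three penalties with scikit-learn on synthetic data; the alpha values and dataset sizes are arbitrary choices for illustration only.

```python
# Sketch: compare how many coefficients each penalty drives to exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# 100 samples, 10 features, only 3 of which actually drive the response
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

models = {
    "ridge":       Ridge(alpha=1.0),                      # L2: shrinks all coefficients
    "lasso":       Lasso(alpha=1.0),                      # L1: sets some coefficients to exactly 0
    "elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),   # mix of L1 and L2
}

for name, model in models.items():
    model.fit(X, y)
    n_zero = np.sum(model.coef_ == 0)
    print(f"{name:12s} zero coefficients: {n_zero}/10")
```

With settings like these, the lasso and elastic net fits typically zero out most of the uninformative coefficients, while ridge only shrinks them toward zero.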
- That is a wonderful reference book. – bdeonovic Apr 09 '14 at 15:42
- Also because the authors are the inventors of these techniques! – Bakaburg Feb 15 '15 at 16:28
- Thank you for giving us a reference to this beautiful book. – Christina May 18 '15 at 12:32
- I highly recommend section 18.4 as well, pages 661-668, which provides more information on lasso vs. elastic net. – Katya Willard Mar 23 '16 at 18:49
- Link to the book is dead as of 14 Oct 2016. – Ashe Oct 14 '16 at 13:25
- New link: http://statweb.stanford.edu/~tibs/ElemStatLearn/ – DJack Oct 19 '16 at 14:51
- @Bakaburg They are the *inventors*, i.e. these are not "age-old" techniques?? – WestCoastProjects Aug 13 '17 at 20:32
- AFAIK Tibshirani and/or Hastie are the inventors of, or at least among the biggest contributors to, $L_1$ and $L_2$ regularization techniques for regression, especially the elastic net. But maybe I am wrong :) – Bakaburg Aug 14 '17 at 10:14
To summarize, here are some salient differences between Lasso, Ridge and Elastic-net:
- Lasso does a sparse selection, while Ridge does not.
- When you have highly correlated variables, Ridge regression shrinks their coefficients towards one another. Lasso is somewhat indifferent between them and generally picks one over the other; depending on the context, one does not know in advance which variable gets picked. Elastic-net is a compromise between the two that attempts to shrink and do a sparse selection simultaneously (see the sketch after this list).
- Ridge estimators are indifferent to multiplicative scaling of the data. That is, if both X and Y variables are multiplied by constants, the coefficients of the fit do not change, for a given $\lambda$ parameter. However, for Lasso, the fit is not independent of the scaling. In fact, the $\lambda$ parameter must be scaled up by the multiplier to get the same result. It is more complex for elastic net.
- Ridge penalizes the largest $\beta$'s more than it penalizes the smaller ones (as they are squared in the penalty term). Lasso penalizes them more uniformly. This may or may not be important. In a forecasting problem with one powerful predictor, Ridge shrinks that predictor's coefficient more than Lasso does.
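Regarding the correlated-variables point above, here is a hedged sketch (my own illustration, not part of the original answer) with two nearly identical predictors; the alpha values are arbitrary and the exact coefficients depend on the random seed.

```python
# Sketch: how each penalty splits weight between two almost-duplicate predictors.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1)),
                    ("elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    print(f"{name:12s} coefficients: {np.round(model.coef_, 2)}")

# Typically ridge spreads the weight roughly evenly across x1 and x2,
# lasso concentrates it on one of them, and elastic net sits in between.
```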

- @balaks For the second point that you made, what does it mean that 'one does not know which variable gets picked'? Did you mean LASSO is indifferent, so it kind of randomly picks one and we don't really know which one is best? – meTchaikovsky Sep 02 '18 at 08:25
- Total non-expert here, but I think the idea is that X1 and X2 may be nearly equivalent explainers of Y, and the tiny, unimportant edge of one over the other leads lasso to select it and floor the other, whereas Ridge would preserve them both. As such, lasso results may be harder to reproduce or cross-validate. This MIGHT be consistent with why we call lasso an unstable selector. (Would love someone else to confirm) – John Vandivier Oct 18 '21 at 00:21
I highly recommend having a look at the book *An Introduction to Statistical Learning* (James et al., 2013).
The reason for this is that *The Elements of Statistical Learning* is intended for individuals with advanced training in the mathematical sciences. In the preface to ISL, the authors write:
> An Introduction to Statistical Learning arose from the perceived need for a broader and less technical treatment of these topics. [...]
>
> An Introduction to Statistical Learning is appropriate for advanced undergraduates or master’s students in statistics or related quantitative fields or for individuals in other disciplines who wish to use statistical learning tools to analyze their data.
- Can you elaborate on why you found this reference to be useful? – J. M. is not a statistician Jul 23 '16 at 15:00
- It's fine to quote a book, but please mark it as a quote and not as your own text. Otherwise it's plagiarism. I edited it for you now. – amoeba Jul 23 '16 at 20:05
The above answers are very clear and informative. I would like to add one minor point from a statistical perspective. Take ridge regression as an example. It is an extension of ordinary least squares (OLS) regression that addresses multicollinearity problems when there are many correlated features. If the linear regression model is
$$Y = Xb + e,$$

the normal-equation solution for multiple linear regression is

$$b = (X^T X)^{-1} X^T Y,$$

and the normal-equation solution for ridge regression is

$$b = (X^T X + kI)^{-1} X^T Y.$$
The ridge solution is a biased estimator of $b$, but one can always find a penalty term $k > 0$ that makes the mean squared error of ridge regression smaller than that of OLS regression.
For LASSO and Elastic-Net, no such analytic solution exists; the coefficients have to be computed numerically (for example, by coordinate descent).
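As a small numerical illustration of the two closed-form solutions above (my own sketch; the penalty $k = 1$ and the data are arbitrary choices):

```python
# Sketch: compute the OLS and ridge closed-form solutions side by side.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
b_true = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
y = X @ b_true + rng.normal(scale=1.0, size=n)

# OLS: b = (X'X)^{-1} X'Y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: b = (X'X + kI)^{-1} X'Y
k = 1.0
b_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

print("OLS  :", np.round(b_ols, 3))
print("Ridge:", np.round(b_ridge, 3))   # shrunk towards zero relative to OLS
```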
