
I am working through the lab in section §6.6 on ridge regression and the lasso in the book 'An Introduction to Statistical Learning with Applications in R' by James, Witten, Hastie, and Tibshirani (2013).

More specifically, I am trying to apply the scikit-learn Ridge model to the 'Hitters' dataset from the R package 'ISLR'. I have created the same set of features as shown in the R code. However, I cannot get close to the results from the glmnet() model. I have selected a single value of the L2 tuning parameter to compare (the `alpha` argument in scikit-learn).

Python:

from sklearn.linear_model import Ridge

regr = Ridge(alpha=11498)
regr.fit(X, y)

http://nbviewer.ipython.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb
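
For reference, a minimal sketch of how the same feature set might be built in pandas, mirroring the R lab's model.matrix(Salary ~ ., Hitters)[, -1] call. The file name 'Hitters.csv' is an assumption here, and the linked notebook may construct the matrix differently.

import pandas as pd

# Sketch: drop players with missing Salary, then build one dummy column per
# factor level except the first, leaving numeric columns unchanged (the same
# matrix R's model.matrix() produces). The CSV file name is assumed.
hitters = pd.read_csv('Hitters.csv').dropna()
y = hitters['Salary']
X = pd.get_dummies(hitters.drop(columns='Salary'), drop_first=True)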

R:

Note that the argument alpha=0 in glmnet() means that an L2 penalty is applied (ridge regression). The documentation warns against supplying a single value for lambda, but the result is the same as in ISL, where a vector is used.

library(glmnet)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = 11498)

What causes the differences?

Edit:
When using penalized() from the penalized package in R, the coefficients are the same as with scikit-learn.

library(penalized)
ridge.mod2 <- penalized(y, x, lambda2 = 11498)

Maybe the question could then also be: 'What is the difference between glmnet() and penalized() when doing ridge regression?'

New Python wrapper for the actual Fortran code used in the R package glmnet:
https://github.com/civisanalytics/python-glmnet
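
A rough usage sketch in the style of that project's README; the exact API and attribute names are assumptions to be checked against the package. Note that `alpha` here is glmnet's mixing parameter, so `alpha=0` requests a pure ridge penalty:

from glmnet import ElasticNet

m = ElasticNet(alpha=0)  # glmnet-style mixing parameter: 0 = ridge, 1 = lasso
m.fit(X, y)              # fits a lambda path and selects a value by cross-validation
print(m.coef_)           # coefficients at the selected lambda (assumed attribute)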

Jordi
  • Totally unfamiliar with glmnet ridge regression. But by default, `sklearn.linear_model.Ridge` does unpenalized intercept estimation (standard) and the penalty is such that `||Xb - y - intercept||^2 + alpha ||b||^2` is minimized for `b`. There can be factors of `1/2` or `1/n_samples` or both in front of the penalty, making results differ immediately. To factor out the penalty scaling problem, set the penalty to 0 in both cases, resolve any discrepancies there, and then check what adding the penalty back does. And btw IMHO here IS the right place to ask this question. – eickenberg Jul 04 '15 at 12:51
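
Following that suggestion, here is a small self-contained check on the sklearn side (synthetic data, not the Hitters set): with the penalty set to 0, Ridge should reproduce ordinary least squares, so any remaining discrepancy between the two libraries must come from preprocessing rather than from the penalty term.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randn(50)

# with no penalty, Ridge and OLS should agree exactly
ols = LinearRegression().fit(X, y)
unpenalized = Ridge(alpha=0).fit(X, y)
print(np.allclose(ols.coef_, unpenalized.coef_))  # expect True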

2 Answers


Matthew Drury's answer should have a factor of 1/N. More precisely...

The glmnet documentation states that the elastic net minimizes the loss function

$$ \frac{1}{N} \| X\beta - y \|_2^2 + \lambda \left( \frac{1}{2} (1 - \alpha) \, \| \beta \|_2^2 + \alpha \| \beta \|_1 \right) $$

The sklearn documentation says that linear_model.Ridge minimizes the loss function

$$ \| X\beta - y \|_2^2 + \alpha \| \beta \|_2^2 $$

which is equivalent to minimizing

$$ \frac{1}{N} \| X\beta - y \|_2^2 + \frac{\alpha}{N} \| \beta \|_2^2 $$

To obtain the same solution from glmnet and sklearn, both of their loss functions must be equal. This means setting $\alpha = 0$ and $\displaystyle{\lambda = \frac{2}{N} \alpha_{\text{sklearn}}}$ in glmnet.

library(glmnet)
X = matrix(c(1, 1, 2, 3, 4, 2, 6, 5, 2, 5, 5, 3), byrow = TRUE, ncol = 3)
y = c(1, 0, 0, 1)
reg = glmnet(X, y, alpha = 0, lambda = 2 / nrow(X))
coef(reg)

glmnet output: -0.03862100, -0.03997036, -0.07276511, 0.42727955

import numpy as np
from sklearn.linear_model import Ridge
X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Ridge(alpha=1, fit_intercept=True, normalize=True)
reg.fit(X, y)
np.hstack((reg.intercept_, reg.coef_))

sklearn output: -0.03862178, -0.0399697, -0.07276535, 0.42727921

visitor
  • The different definitions of parameters and their scaling used in different libraries are a common source of confusion. – AaronDefazio Mar 30 '17 at 03:31
  • I wouldn't expect that both Gung and I would get this wrong. – Michael R. Chernick Mar 30 '17 at 06:10
  • Yes, both of you got it wrong. Your reasons for rejecting my edit make it clear that neither of you saw my comment "Missing factor of 1/N" at stats.stackexchange.com/review/suggested-edits/139985 – visitor Mar 30 '17 at 08:04
  • Your edit was probably rejected because it changed much more than only what you claim. If you'd like to edit my post and only change the missing factor, please do, but changing my links and wording and code as well is overkill. The comments about your unfair treatment in your answer are inappropriate, and unrelated to the content of the question; please remove them. Your wording also plagiarized my answer, which is not the right way to respond to a rejected edit. We would love your valuable contributions to our community, but please acquaint yourself with our norms before eviscerating us. – Matthew Drury Mar 31 '17 at 16:53
  • I copied your wording because it was initially an edit, not intended to be a new answer. Now that my answer has been merged with this question, I have changed my wording as you requested. As for unfairness, if the rejection was fair, then why ask me to censor that comment? Readers can judge for themselves anyway. What upsets me is that I spent quite a bit of time editing and it would have all been wasted if I hadn't backed it up. Also, the norms of any open community can change over time; they're not gospel. – visitor Mar 31 '17 at 17:29
  • Visitor, I am glad you were able to recover your suggested edit. I am surprised that it was lost to you, though, because this site tends to save almost everything that is ever posted to it. In fact, I can still see your suggested edit to @Matthew's reply. Since moderators have different capabilities than newcomers, I'm unsure whether you have access to that edit, but I think you ought to. At the very least, if you ever believe the site may have lost your work, please flag a moderator because we might be able to find it for you. – whuber Mar 31 '17 at 17:36
  • I recovered that edit only because I saved it on my computer. I was worried it would be lost if it's rejected. True enough, when I visited stats.stackexchange.com/review/suggested-edits/139985 after it was apparently rejected, I got a "page not found" error. Now I can see it again. Maybe I mistyped the link or I've gained enough points to see it. – visitor Mar 31 '17 at 19:20
  • @visitor Sorry if I came off a bit gruff. I really should just be trying to communicate that you seem like a good potential contributor to the site, and I want you to have a good experience. We have some social norms, just like any other group, and you will have a better experience if you stay aware of them. I still think "Matthew Drury's answer is wrong" is quite harsh; there are surely better ways to communicate that my answer is erroneously missing a factor of $\frac{1}{N}$. "X's answer is wrong" reads as a personal attack. – Matthew Drury Mar 31 '17 at 20:41

My answer is missing a factor of $\frac{1}{N}$; please see @visitor's answer above for the correct comparison.


Here are two references that should clarify the relationship.

The sklearn documentation says that linear_model.Ridge optimizes the following objective function

$$ \| X \beta - y \|_2^2 + \alpha \| \beta \|_2^2 $$

The glmnet paper says that the elastic net optimizes the following objective function

$$ \| X \beta - y \|_2^2 + \lambda \left( \frac{1}{2} (1 - \alpha) \, \| \beta \|_2^2 + \alpha \| \beta \|_1 \right) $$

Notice that the two implementations use $\alpha$ in totally different ways: sklearn uses $\alpha$ for the overall level of regularization, while glmnet uses $\lambda$ for that purpose, reserving $\alpha$ for trading off between ridge and lasso regularization.

Comparing the formulas, it looks like setting $\alpha = 0$ and $\lambda = 2 \alpha_{\text{sklearn}}$ in glmnet should recover the solution from linear_model.Ridge.

Matthew Drury
  • And I totally missed that in @eickenberg's comment as well. I do have to use `standardize = FALSE` in `glmnet()` to get identical results. – Jordi Jul 07 '15 at 08:30
  • @Jordi You should definitely standardize if using `linear_model.Ridge` for any real-world analysis. – Matthew Drury Jul 07 '15 at 11:56
  • I understand that the sklearn `linear_model.Ridge` model standardizes the features automatically. Normalization is optional. I wonder why I then need to deactivate standardization in `glmnet()` to get the models to produce identical results. – Jordi Jul 07 '15 at 13:26
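
On that last point: glmnet's default standardize=TRUE rescales each predictor internally, using a standard deviation with a 1/N denominator, and then reports the coefficients back on the original scale, whereas sklearn's Ridge penalizes the features exactly as they are passed in. A minimal sketch of imitating that convention on the sklearn side, reusing the small X and y arrays from the answer above (the choice of alpha is left generic; the lambda conversion discussed above still applies):

import numpy as np
from sklearn.linear_model import Ridge

# imitate glmnet's standardize=TRUE: scale each column by its population
# standard deviation (ddof=0 matches glmnet's 1/N convention), fit the model,
# then map the coefficients back to the original scale of X
sd = X.std(axis=0, ddof=0)
ridge = Ridge(alpha=1.0, fit_intercept=True).fit(X / sd, y)
coef_original_scale = ridge.coef_ / sd
print(np.hstack((ridge.intercept_, coef_original_scale)))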