I'm having trouble reconciling different values for the ridge parameter that minimizes mean squared error when using RidgeCV in scikit-learn (Python) and cv.glmnet (R).
First a few things to note:
- My code takes into account this post, which shows how the "lambda" regularization value in cv.glmnet compares to the "alpha" value in RidgeCV (the mapping I'm using is written out after this list)
- In the link above, @eickenberg states that RidgeCV uses unpenalized intercept estimation, while the glmnet documentation states it uses penalized maximum likelihood
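For reference, the conversion I took from that post (with n = 32 rows in mtcars) is, as I understand it,

$$\alpha_{\text{RidgeCV}} = \frac{n \, \lambda_{\text{glmnet}}}{2}, \qquad \lambda_{\text{glmnet}} = \frac{2 \, \alpha_{\text{RidgeCV}}}{n},$$

which is what the conversion lines in my Python code below implement. If I've misread the post, that mapping could itself be the source of the discrepancy.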
Here is my R code which uses the mtcars data:
# Load libraries, get data & set seed for reproducibility
set.seed(123) # seed for reproducibility
library(glmnet) # for ridge regression
library(dplyr) # for data cleaning
data("mtcars")
# Center y, X will be standardized in the modelling function
y <- mtcars %>% select(mpg) %>% scale(center = TRUE, scale = FALSE) %>% as.matrix()
X <- mtcars %>% select(-mpg) %>% as.matrix()
# Perform 10-fold cross-validation to select lambda
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
# Setting alpha = 0 implements ridge regression
ridge_cv <- cv.glmnet(X, y, alpha = 0, lambda = lambdas_to_try,
standardize = TRUE, nfolds = 10)
# Best cross-validated lambda
lambda_cv <- ridge_cv$lambda.min
print(lambda_cv)
which results in 2.983647
My Python code using the same data (imported from a CSV I exported in R):
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.linear_model import RidgeCV
# %% Define features of interest
df = pd.read_csv("PATH_TO_MTCARS.csv")  # placeholder path to the CSV exported from R
features = ['cyl', 'disp', 'hp', 'drat', 'wt', 'qsec',
'vs', 'am', 'gear', 'carb']
# %% Extract Variables
# Isolate features and target variable
X = df[features]
X = scale(X)
y = df['mpg'] - df['mpg'].mean()
# %% Ridge cross validation
n_alphas = 100
lambdas = np.logspace(-3, 5, n_alphas)
alphas = lambdas * 32 / 2 # convert glmnet lambdas to RidgeCV alphas (n = 32)
ridge_cv = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error')
ridge_cv.fit(X, y)
print(ridge_cv.alpha_ * 2 / 32) # convert RidgeCV alpha to glmnet lambda
which results in 0.8111308
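To sanity-check the grid search itself, here is a rough manual recomputation of the CV curve (this assumes RidgeCV with cv=10 behaves like an unshuffled 10-fold split over a plain Ridge, which is my understanding but not something I've verified in the source):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
# Recompute the 10-fold CV MSE for each candidate alpha on the same X, y as above
cv = KFold(n_splits=10)  # unshuffled folds, which I believe matches cv=10 in RidgeCV
mean_mse = [-cross_val_score(Ridge(alpha=a), X, y,
                             scoring='neg_mean_squared_error', cv=cv).mean()
            for a in alphas]
best_alpha = alphas[int(np.argmin(mean_mse))]
print(best_alpha * 2 / 32)  # convert back to the glmnet lambda scale

If this were to disagree with ridge_cv.alpha_, my mental model of RidgeCV's cross-validation would be wrong, which would at least narrow things down.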
The funny thing is, if I set scale = TRUE in the y assignment line of the R code and remove the X = scale(X) line from the Python code, both runs output 0.4641589. To my mind, those two changes produce non-equivalent models, so I don't understand why the results would agree.
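For completeness, the Python side of that experiment is just the following sketch (the R side only changes scale = FALSE to scale = TRUE in the y assignment; X2, y2, and ridge_cv2 are renamed copies of the variables above):

X2 = df[features].to_numpy()  # no scale(X) this time
y2 = df['mpg'] - df['mpg'].mean()
ridge_cv2 = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error')
ridge_cv2.fit(X2, y2)
print(ridge_cv2.alpha_ * 2 / 32)  # 0.4641589, matching the modified R code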
My understanding of the two functions is:
- the cv.glmnet function standardizes the X-variables automatically (i.e., removes the mean, then divides by the standard deviation)
- cv.glmnet picks the best lambda by averaging the mean squared error of the residuals across the folds
- the RidgeCV function does not standardize the X-variables automatically (see the quick scale() check after this list)
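One detail I can at least pin down on the Python side of the standardization question: sklearn's scale() divides by the population standard deviation (ddof = 0), while R's scale() uses the sample standard deviation (ddof = 1). A quick check (X_pop and X_samp are throwaway names):

import numpy as np
X_raw = df[features].to_numpy()
X_pop = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=0)   # sklearn scale() convention
X_samp = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)  # R scale() convention
print(np.allclose(X_pop, scale(X_raw)))  # True
print(np.abs(X_pop - X_samp).max())      # small but nonzero difference

Which of these conventions cv.glmnet uses for its internal standardization, I don't know; with only 32 rows the two differ enough that it might matter.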
Should I expect RidgeCV and cv.glmnet to produce the same ridge parameter that minimizes MSE, and if so, why doesn't my code do that?