
I'm having trouble reconciling different values for the ridge parameter that minimizes mean squared error when using RidgeCV in scikit-learn (Python) and cv.glmnet (R).

First a few things to note:

  • My code takes into account this post, which shows how the "lambda" regularization value in cv.glmnet compares to the "alpha" value in RidgeCV
  • In the link above, @eickenberg states that RidgeCV uses unpenalized intercept estimation, while the glmnet documentation states it uses penalized maximum likelihood
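To pin down what that conversion should be, here is a closed-form sanity check on synthetic data (not mtcars; "lam" is an arbitrary value). Taking glmnet's documented Gaussian objective for ridge, (1/(2n))*RSS + (lambda/2)*||b||^2, and sklearn's RSS + alpha*||b||^2, the two normal equations give identical coefficients when alpha = n*lambda:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 32, 10  # same shape as mtcars, but synthetic values
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 0.5  # an arbitrary glmnet-style lambda

# glmnet-style ridge (alpha = 0, no intercept, X pre-standardized):
# minimize (1/(2n))*||y - Xb||^2 + (lam/2)*||b||^2
# => normal equations: (X'X/n + lam*I) b = X'y/n
b_glmnet = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

# sklearn-style ridge: minimize ||y - Xb||^2 + a*||b||^2
# => normal equations: (X'X + a*I) b = X'y
a = n * lam
b_sklearn = np.linalg.solve(X.T @ X + a * np.eye(p), X.T @ y)

print(np.allclose(b_glmnet, b_sklearn))  # True
```

This covers only the penalized coefficients; the intercept handling noted above is a separate issue.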

Here is my R code which uses the mtcars data:

# Load libraries, get data & set seed for reproducibility
set.seed(123)    # seed for reproducibility
library(glmnet)  # for ridge regression
library(dplyr)   # for data cleaning

data("mtcars")
# Center y, X will be standardized in the modelling function
y <- mtcars %>% select(mpg) %>% scale(center = TRUE, scale = FALSE) %>% as.matrix()
X <- mtcars %>% select(-mpg) %>% as.matrix()

# Perform 10-fold cross-validation to select lambda
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
# Setting alpha = 0 implements ridge regression
ridge_cv <- cv.glmnet(X, y, alpha = 0, lambda = lambdas_to_try,
                      standardize = TRUE, nfolds = 10)

# Best cross-validated lambda
lambda_cv <- ridge_cv$lambda.min
print(lambda_cv)

which results in 2.983647

My Python code using the same data (imported from a csv I exported in R):

import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.linear_model import RidgeCV

# %% Define features of interest
df = pd.read_csv(PATH_TO_MTCARS)  # placeholder path to the mtcars CSV exported from R

features = ['cyl', 'disp', 'hp', 'drat', 'wt', 'qsec',
            'vs', 'am', 'gear', 'carb']

# %% Extract Variables
# Isolate features and target variable
X = df[features]
X = scale(X)

y = df['mpg'] - df['mpg'].mean()

# %% Ridge cross validation
n_alphas = 100
lambdas = np.logspace(-3, 5, n_alphas)
alphas = lambdas * 32 / 2  # convert glmnet lambdas to RidgeCV alphas, n = 32
ridge_cv = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error')
ridge_cv.fit(X, y)
print(ridge_cv.alpha_ * 2 / 32)  # convert RidgeCV alpha to glmnet lambda

which results in 0.8111308

The funny thing is, if I set "scale = TRUE" in the y assignment line of the R code and remove the X = scale(X) line from the Python code, both output 0.4641589. To my mind these two changes produce non-equivalent models, so I don't understand why the results should agree.
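One sanity check I can do on the Python side (a minimal sketch on synthetic data, not mtcars): for a fixed alpha grid, the alpha selected by RidgeCV is invariant to rescaling y, because the ridge solution is linear in y and every fold's MSE simply scales by the square of the factor. So any sensitivity to standardizing y must be coming in through glmnet or the X pre-processing:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)
n, p = 32, 10  # synthetic stand-in for mtcars
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
y = y - y.mean()

alphas = np.logspace(-3, 5, 100)
# Fit once on y and once on an arbitrarily rescaled y: the per-fold MSE
# curves differ only by a constant factor, so the argmin is the same.
cv1 = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error').fit(X, y)
cv2 = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error').fit(X, 3.7 * y)
print(cv1.alpha_ == cv2.alpha_)  # True
```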

My understanding of the functions is:

  • the cv.glmnet function standardizes (i.e. remove mean then divide by stdev) the X-variables automatically
  • cv.glmnet uses the average mean squared error of residuals among the folds in its cross-validation algorithm to decide which is the best lambda
  • the RidgeCV function does not standardize the X-variables automatically
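Related to the standardization bullets, one detail that matters when replicating glmnet's pre-processing by hand: per the glmnet docs it uses the 1/n variance formula, whereas R's sd() uses 1/(n-1). As far as I can tell, sklearn's scale already matches the 1/n convention (quick check on made-up numbers):

```python
import numpy as np
from sklearn.preprocessing import scale

x = np.array([1.0, 2.0, 3.0, 4.0])
# sklearn's scale divides by the population std (ddof=0), i.e. the 1/n
# formula, matching glmnet; R's sd() would correspond to ddof=1.
print(np.allclose(scale(x), (x - x.mean()) / x.std(ddof=0)))  # True
```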

Should I expect RidgeCV and cv.glmnet to produce the same ridge parameter that minimizes mse, and if so, why doesn't my code do that?

  • Dear user110174, your post shows that you put significant effort into investigating the issue yourself. But please specify what you are asking about; I haven't found any question marks in the text :) – hans Dec 09 '19 at 08:27
  • so I suspect glmnet also standardises Y by default... Note also that for "gaussian", glmnet standardizes y to have unit variance (using 1/n rather than 1/(n-1) formula) before computing its lambda sequence (and then unstandardizes the resulting coefficients); if you wish to reproduce/compare results with other software, best to supply a standardized y. [https://cran.r-project.org/web/packages/glmnet/glmnet.pdf] – seanv507 Dec 09 '19 at 08:30
  • ``glmnet`` does standardize the variables automatically; check that using ``?glmnet`` in R. It also uses the MSE of the residuals in the CV algorithm for gaussian models (also check ``?cv.glmnet`` for ``type.measure``) – runr Dec 09 '19 at 10:41
  • Apologies @hans, I've added a formal question. – optimusLime Dec 09 '19 at 15:18
  • thanks for the response @seanv507. I agree your comment suggests that y would be standardized too; however, that makes the fact that changing "scale=FALSE" to "scale=TRUE" in the y variable "scale" pre-processing step results in a different lambda.min value all the more confusing... If glmnet is standardizing y, it should be insensitive to any standardizing done to y outside of the function. – optimusLime Dec 09 '19 at 17:46
