I asked this question on Stack Overflow, but at this point I'm not sure where it belongs, since it concerns the standardization process of glmnet and the lasso.
I am running glmnet and trying to reproduce its results by passing pre-standardized variables with standardize = FALSE.
library(glmnet)
data(QuickStartExample)
x <- QuickStartExample$x  # recent glmnet versions load the data as a list
y <- QuickStartExample$y

### Standardized Way
x_standardized <- scale(x, center = TRUE, scale = TRUE)
y_standardized <- scale(y, center = TRUE, scale = TRUE)
cv_standardized <- cv.glmnet(x_standardized, y_standardized,
                             intercept = FALSE,
                             standardize = FALSE,
                             standardize.response = FALSE)
# back-transform the slopes to the original scale,
# using the same lambda rule as the comparison below
destandardized_coef <- coef(cv_standardized, s = cv_standardized$lambda.min)[-1] *
  sd(y) / apply(x, 2, sd)
destandardized_coef
# recover the intercept by hand
mean(y) - sum(destandardized_coef * colMeans(x))
### Let glmnet Standardize
cv_normal <- cv.glmnet(x, y)
as.numeric(coef(cv_normal, s = cv_normal$lambda.min))
Initially, I standardize the data myself and back-transform the coefficients. I would expect to get the same results, but for some reason the coefficients come out slightly different.
My question is: how can I extract the same results, and why are the coefficients currently different?
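To convince myself that the back-transformation itself is correct, here is a quick sanity check with plain OLS (my own check, nothing glmnet-specific): without a penalty, the de-standardized coefficients should match a fit on the raw data exactly.
xs <- scale(x)
ys <- scale(y)
# OLS on standardized data, no intercept
b_std <- coef(lm(ys ~ xs - 1))
# back-transform the slopes and recover the intercept
b_back <- b_std * sd(y) / apply(x, 2, sd)
a_back <- mean(y) - sum(b_back * colMeans(x))
# compare with OLS on the raw data -- this returns TRUE
all.equal(unname(c(a_back, b_back)), unname(coef(lm(y ~ x))))
So the algebra is fine; the difference must come from the glmnet fit itself (or the cross-validation).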
EDIT: Here is my updated code, based on @user2974951's response. It does appear that glmnet standardizes differently than R's scale function. I want to note that I am fitting the initial model without an intercept, since I calculate it by hand later. Ultimately this shouldn't matter: because I provide a standardized x and y, the intercept should be zero.
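If I understand that response correctly, the difference is just the denominator: scale uses the sample standard deviation (dividing by n - 1), while glmnet standardizes internally with the population version (dividing by n). A quick check:
# sample sd (n - 1), what scale() uses
sd(y)
# population sd (n), what glmnet reportedly uses internally
sqrt(sum((y - mean(y))^2) / length(y))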
set.seed(123)
library(glmnet)
library(tidyverse)
data(QuickStartExample)
x <- QuickStartExample$x  # recent glmnet versions load the data as a list
y <- QuickStartExample$y

# standardize with the population (1 / n) standard deviation,
# matching what glmnet does internally
n <- length(y)
y_centered <- y - mean(y)
ys <- y_centered / sqrt(sum(y_centered ^ 2) / n)
X_centered <- apply(x, 2, function(col) col - mean(col))
Xs <- apply(X_centered, 2, function(col) col / sqrt(sum(col ^ 2) / n))

set.seed(123)
cv_standardized <- cv.glmnet(Xs, ys,
                             intercept = FALSE,
                             standardize = FALSE,
                             standardize.response = FALSE)
# back-transform; note that coef() on a cv.glmnet object
# defaults to s = "lambda.1se"
destandardized_coef <- coef(cv_standardized)[-1] * sd(y) / apply(x, 2, sd)
personal_standardize <- c(mean(y) - sum(destandardized_coef * colMeans(x)),
                          destandardized_coef)

set.seed(123)
cv_normal <- cv.glmnet(x, y, intercept = TRUE)
glmnet_standardize <- coef(cv_normal, s = cv_normal$lambda.min)
Results:
cbind(personal_standardize, glmnet_standardize)
21 x 2 sparse Matrix of class "dgCMatrix"
personal_standardize glmnet_standardize
(Intercept) 0.15473991 0.145832036
V1 1.29441444 1.340981414
V2 . .
V3 0.62890756 0.708347139
V4 . .
V5 -0.77785401 -0.848087765
V6 0.48387954 0.554823781
V7 . 0.038519738
V8 0.28419264 0.347221319
V9 . .
V10 . 0.010034050
V11 0.09130386 0.186365264
V12 . .
V13 . .
V14 -1.03241106 -1.086006902
V15 . -0.004444318
V16 . .
V17 . .
V18 . .
V19 . .
V20 -0.96190123 -1.069942845
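One more thing I tried (my own sketch, not something from the documentation): since cv.glmnet draws its folds at random and my two calls draw them separately, I also fixed the folds with the foldid argument and compared both fits at the same lambda rule, to rule out cross-validation noise:
# use identical folds for both cross-validations
foldid <- sample(rep(1:10, length.out = nrow(x)))
cv_a <- cv.glmnet(Xs, ys, foldid = foldid,
                  intercept = FALSE, standardize = FALSE)
cv_b <- cv.glmnet(x, y, foldid = foldid)
coef(cv_a, s = "lambda.min")
coef(cv_b, s = "lambda.min")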
Thanks in advance.