
This is my first time working with regularized regression, so I apologize if the answer to this is obvious. I am planning to use the glmnet package to run a regularized logistic regression on my data set. Previously I have been using unregularized logistic regression, and I have evaluated my model and compared different models using repeated random sub-sampling validation. After cross validation I then refit the better-performing model on the entire data set in order to come up with the final model (i.e., the model that will be applied to new incoming data). This procedure follows the advice of this (highly ranked) Cross Validated discussion.
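For context, by "repeated random sub-sampling validation" I mean something roughly like the following minimal sketch (the data, the number of repetitions, the 70/30 split, and the accuracy metric are all illustrative choices, not fixed parts of my setup):

```r
# sketch of repeated random sub-sampling validation for an
# unregularized logistic regression (made-up data, illustrative only)
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x1 - dat$x2))

n_reps <- 50
acc <- numeric(n_reps)
for (i in seq_len(n_reps)) {
  idx    <- sample(n, size = 0.7 * n)          # random 70/30 split
  fit    <- glm(y ~ x1 + x2, data = dat[idx, ], family = binomial)
  p      <- predict(fit, newdata = dat[-idx, ], type = "response")
  acc[i] <- mean((p > 0.5) == dat$y[-idx])     # hold-out accuracy
}
mean(acc)  # average hold-out performance across repetitions
```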

However, I am confused about how to build my final model for application after determining the better model with cv.glmnet. I would have imagined doing something similar to before: take the best model, which in this case means the best value of lambda (and alpha with elastic net), and run glmnet on the entire data set while passing in the best lambda from cv.glmnet. However, according to this Cross Validated discussion:

"you're not actually supposed to give glmnet a single value of lambda. "

So, my concrete question is: how do I implement the results of cv.glmnet in order to build my final model?

EDIT: Based on the comments following the response by "Jogi", it appears that the coefficients of the best cv.glmnet model are the same as the coefficients obtained when the best lambda from cv.glmnet is supplied to glmnet on the entire data set. Is this basically the answer to my question? If so, can anyone elaborate on why this is the case?

Here is a sample:

age     <- c(4, 8, 7, 12, 6, 9, 10, 14, 7) 
gender  <- as.factor(c(1, 0, 1, 1, 1, 0, 1, 0, 0))
bmi_p   <- c(0.86, 0.45, 0.99, 0.84, 0.85, 0.67, 0.91, 0.29, 0.88) 
m_edu   <- as.factor(c(0, 1, 1, 2, 2, 3, 2, 0, 1))
p_edu   <- as.factor(c(0, 2, 2, 2, 2, 3, 2, 0, 0))
f_color <- as.factor(c("blue", "blue", "yellow", "red", "red", "yellow", 
                   "yellow", "red", "yellow"))
asthma <- c(1, 1, 0, 1, 0, 0, 0, 1, 1)
xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[, -1]
x        <- as.matrix(data.frame(age, bmi_p, xfactors))

library(glmnet)  # needed for cv.glmnet

# Lastly, cross validation can also be used to select lambda.
cv.glmmod <- cv.glmnet(x, y = asthma, alpha = 1, family = "binomial")
#plot(cv.glmmod)
(best.lambda <- cv.glmmod$lambda.min)
coef(cv.glmmod, s = "lambda.min")

which outputs:

coef(cv.glmmod, s = "lambda.min")
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)   0.2231436
age           .
bmi_p         .
gender1       .
m_edu1        .
m_edu2        .
m_edu3        .
p_edu2        .
p_edu3        .
f_colorred    .
f_coloryellow .

And the full dataset coefficients are:

fit <- glmnet(x, y = as.factor(asthma), lambda = best.lambda, family = "binomial", alpha = 1)
coef(fit)

coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)   0.2231436
age           0.0000000
bmi_p         .
gender1       .
m_edu1        .
m_edu2        .
m_edu3        .
p_edu2        .
p_edu3        .
f_colorred    .
f_coloryellow .

steve zissou

2 Answers


Instead of requiring you to perform a cross validation manually for each candidate lambda of the penalized regression, the cv.glmnet function does this automatically:

library(glmnet)
data(QuickStartExample)

# your approach: pick a single lambda manually and fit glmnet with it
fit_1 = glmnet(x, y, lambda = 1)


# glmnet's approach: automated cross validation
cvfit = cv.glmnet(x, y)
plot(cvfit)

# coefficients of the final model
coef_cv = coef(cvfit, s = "lambda.min")
# prediction of the final model
predict(cvfit, newx = x[1:5,], s = "lambda.min")

# extract optimal lambda
lambda_opt = cvfit$lambda.min

# manually plugging lambda into glmnet
fit_2 = glmnet(x, y, lambda = lambda_opt)

# compare coefficients - equal
cbind(coef_cv, coef(fit_2))

# compare predictions - equal
cbind(predict(cvfit, newx = x[1:5, ], s = "lambda.min"), predict(fit_2, newx = x[1:5, ]))

So for each lambda, a cross validation is performed and a performance measure is calculated. Via plot(cvfit) you can see the result of the cross validation. Recall that, in general, using glmnet() and plugging in arbitrary lambdas is not recommended. More details can be found in the excellent tutorial: https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf
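To make the "for each lambda" point concrete, the per-lambda cross validation results are stored on the cv.glmnet object itself. A minimal sketch, assuming cvfit was created as in the code above:

```r
# sketch: inspect the per-lambda cross validation results
cvfit$lambda      # the sequence of lambda values that was cross-validated
cvfit$cvm         # mean cross-validated error at each lambda
cvfit$cvsd        # standard error of the CV error at each lambda
cvfit$lambda.min  # lambda achieving the minimum CV error
cvfit$lambda.1se  # largest lambda within one standard error of the minimum
```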

Jogi
  • Hi, thank you for your answer. However, I don't feel that it answers my question. I want to know the best way to create a final model for implementation after running cv.glmnet. according to the linked post, a single value of lambda cannot be passed to glmnet, which is what I would do based upon the answer to the first linked post. – steve zissou Sep 22 '17 at 12:07
  • Addressing the "a single value cannot be passed to glmnet" point - that's wrong. This can be clarified by checking the help file: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf. Addressing "I want to know the best way to create a final model" - I might have misunderstood you, but what cv.glmnet does is pick the optimal lambda. coef(cvfit, s = "lambda.min") delivers the coefficients at the optimal value of lambda, which is what I would call the "final model". For more clarification see: https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf – Jogi Sep 22 '17 at 12:18
  • Thanks for the link, I will read it. I think we have a difference in how we define the "final model". From my perspective the final model, (which is the model I will “apply” in the field), consists of the coefficients for each variable that have been calculated using parameters that were determined from CV, so "lambda.min” is the best parameter which is then passed to glmnet on the entire data. From your perspective the final model consists of the coefficients that result from the best CV fit? Am I understanding correctly? – steve zissou Sep 22 '17 at 12:34
  • if you check out my first link in the posted question, you will see an answer which provides good reasoning for why the final model is based upon the entire data set. – steve zissou Sep 22 '17 at 12:36
  • Also, checking out the glmnet helpfile, it does specify that a single value of lambda shouldn't be passed to glmnet. – steve zissou Sep 22 '17 at 12:46
  • Generally it is very time-intensive to answer a question that requires reading other questions... this makes a quick understanding impossible. I skimmed through the question and I think you missed the point of what CV does. It simply compares predictive power based on, e.g., the mean squared error. For each of the k folds, the number of variables is the same. Now, varying lambda will lead to a different number of variables for each CV run - but not for each fold... – Jogi Sep 22 '17 at 12:52
  • The file says that it is not recommended... but it is indeed possible! Just add lambda_opt = cvfit$lambda.min; fit = glmnet(x, y, lambda = lambda_opt); coef(fit) and compare the coefficients - they're equal... – Jogi Sep 22 '17 at 12:56
  • I'm sorry, but a second opinion is needed. It appears that you haven't understood the question. I see CV as a method to select the best lambda. My question has nothing to do with variable selection (although of course variable selection occurs as a result of different lambdas). Your outline of my approach is incorrect as well. – steve zissou Sep 22 '17 at 13:25
  • A short demo shows that your point about the coefficients being the same is correct. And I've edited my question to show this. If you can edit your response to incorporate this info with an explanation of how that answers my question then I can accept your answer.... – steve zissou Sep 22 '17 at 13:44

This post is somewhat old. However, it appears you never got a final answer, which is: there is no need to run glmnet with the best lambda obtained through cv.glmnet.

As a matter of fact, all quantities returned by cv.glmnet (coefficients, predictions, etc.) already refer to the model fitted on the full data. If you check the documentation, this is made clear at the beginning of the "Details" section.
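A minimal sketch of that point, using glmnet's bundled QuickStartExample data as in the other answer: the full-data fit is stored in the glmnet.fit component of the cv.glmnet object, so refitting at lambda.min is redundant.

```r
library(glmnet)
data(QuickStartExample)  # provides x and y (in newer glmnet versions, use QuickStartExample$x etc.)
cvfit <- cv.glmnet(x, y)

# coefficients taken from the full-data fit stored inside cvfit...
coef(cvfit$glmnet.fit, s = cvfit$lambda.min)
# ...match the coefficients cv.glmnet reports directly:
coef(cvfit, s = "lambda.min")
```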

Check also this post https://stackoverflow.com/questions/48199045/r-coefficients-of-glmnetcvfit