There seems to be a lot of confusion when comparing the use of glmnet within caret to search for an optimal lambda with using cv.glmnet for the same task.
Many questions were posed, e.g.:
Classification model train.glmnet vs. cv.glmnet?
What is the proper way to use glmnet with caret?
Cross-validating `glmnet` using `caret`
but no answer has been given, possibly because the questions lacked a reproducible example. Following the first question, I give a very similar (reproducible) example, but I still end up with the same question: why are the estimated lambdas so different?
library(caret)
library(glmnet)
set.seed(849)
training <- twoClassSim(50, linearVars = 2)
set.seed(849)
testing <- twoClassSim(500, linearVars = 2)
trainX <- training[, -ncol(training)]
testX <- testing[, -ncol(testing)]
trainY <- training$Class
# Using glmnet to directly perform CV
set.seed(849)
cvob1 <- cv.glmnet(x = as.matrix(trainX), y = trainY, family = "binomial",
                   alpha = 1, type.measure = "auc", nfolds = 3,
                   lambda = seq(0.001, 0.1, by = 0.001), standardize = FALSE)
cbind(cvob1$lambda, cvob1$cvm)
# best parameter
cvob1$lambda.min
# best coefficient
coef(cvob1, s = "lambda.min")
# Using caret to perform CV
cctrl1 <- trainControl(method = "cv", number = 3, returnResamp = "all",
                       classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(849)
test_class_cv_model <- train(trainX, trainY, method = "glmnet",
                             trControl = cctrl1, metric = "ROC",
                             tuneGrid = expand.grid(alpha = 1,
                                                    lambda = seq(0.001, 0.1, by = 0.001)))
test_class_cv_model
# best parameter
test_class_cv_model$bestTune
# best coefficient
coef(test_class_cv_model$finalModel, s = test_class_cv_model$bestTune$lambda)
To summarise, the optimal lambdas are:
- 0.055 when using cv.glmnet()
- 0.001 when using train()
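For reference, this is how I would put the two cross-validation curves next to each other (just a plotting sketch built on the objects created above; as far as I can tell, cvob1$cvm holds the per-lambda AUC from cv.glmnet and test_class_cv_model$results holds caret's per-lambda ROC):
# Sketch: overlay both CV curves to see where the optima land
cv_curve    <- data.frame(lambda = cvob1$lambda, AUC = cvob1$cvm)
caret_curve <- test_class_cv_model$results[, c("lambda", "ROC")]
plot(cv_curve$lambda, cv_curve$AUC, type = "l", col = "blue",
     xlab = "lambda", ylab = "CV AUC / ROC",
     ylim = range(c(cv_curve$AUC, caret_curve$ROC)))
lines(caret_curve$lambda, caret_curve$ROC, col = "red")
abline(v = c(cvob1$lambda.min, test_class_cv_model$bestTune$lambda),
       col = c("blue", "red"), lty = 2)
legend("bottomright", legend = c("cv.glmnet", "caret::train"),
       col = c("blue", "red"), lty = 1)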
I know that using standardize=FALSE in cv.glmnet() is not advisable, but I really want to compare both methods under the same prerequisites. My main guess at an explanation is that the sampling of the folds differs between the two approaches, but I set the same seeds and the results are still quite different. A sketch of how I would force identical folds in both methods is below.
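To rule out the fold sampling, this is a minimal sketch of what I would try next (not run yet): build the folds once with caret's createFolds() and hand the same folds to cv.glmnet() via its foldid argument and to trainControl() via its index argument; as far as I know, extra arguments such as standardize = FALSE are passed through train()'s ... on to glmnet, so both fits should then share folds and standardization:
# Sketch: force identical fold membership in both methods (not run yet)
set.seed(849)
folds <- createFolds(trainY, k = 3, returnTrain = FALSE)  # held-out rows per fold

# cv.glmnet wants one fold id per observation
foldid <- integer(length(trainY))
for (i in seq_along(folds)) foldid[folds[[i]]] <- i

cvob2 <- cv.glmnet(x = as.matrix(trainX), y = trainY, family = "binomial",
                   alpha = 1, type.measure = "auc", foldid = foldid,
                   lambda = seq(0.001, 0.1, by = 0.001), standardize = FALSE)

# trainControl wants the training rows of each resample
cctrl2 <- trainControl(method = "cv",
                       index = lapply(folds, function(i) setdiff(seq_along(trainY), i)),
                       classProbs = TRUE, summaryFunction = twoClassSummary)

test_class_cv_model2 <- train(trainX, trainY, method = "glmnet",
                              trControl = cctrl2, metric = "ROC",
                              standardize = FALSE,  # passed on to glmnet, as far as I know
                              tuneGrid = expand.grid(alpha = 1,
                                                     lambda = seq(0.001, 0.1, by = 0.001)))

cvob2$lambda.min
test_class_cv_model2$bestTune$lambda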
So I'm really stuck on why the two approaches differ so much when they should be quite similar. I hope the community has some idea what the issue is here.