
I know this question has been posted many times, but none of the answers fixed my problem. I still get different results each time I run cv.glmnet on my data. Here is my code:

set.seed(123)
library(caret)
library(tidyverse)
library(glmnet)
library(ROCR)
library(doParallel)
registerDoParallel(4, cores = 8)
df <- df %>% select(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13)
training.samples <- df$V2 %>% createDataPartition(p = 0.8, list = FALSE)
train <- df[training.samples, ]
test <- df[-training.samples, ]
x.train <- data.frame(train[, names(train) != "V2"])
x.train <- data.matrix(x.train)
y.train <- train$V2
x.test <- data.frame(test[, names(test) != "V2"])
x.test <- data.matrix(x.test)
y.test <- test$V2

list.of.fits <- list()
for (i in 0:10){
    fit.name <- paste0("alpha", i/10) 
    list.of.fits[[fit.name]] <- cv.glmnet(x.train, y.train, type.measure = c("auc"), alpha = i/10, family = "binomial", parallel = TRUE)
}
coef <- coef(list.of.fits[[fit.name]], s = list.of.fits[[fit.name]]$lambda.1se)
coef

I then revisited others' similar problems, like here, and tried fixing nfolds to 5 and the fold assignment to foldid <- sample(rep(seq(5), length.out = nrow(train))), so the call ended up like this: list.of.fits[[fit.name]] <- cv.glmnet(x.train, y.train, type.measure = c("auc"), alpha = i/10, family = "binomial", nfolds = 5, foldid = foldid, parallel = TRUE).
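
Concretely, the loop then looked like this (a sketch of my change, with foldid drawn once before the loop):

foldid <- sample(rep(seq(5), length.out = nrow(train)))
list.of.fits <- list()
for (i in 0:10){
    fit.name <- paste0("alpha", i/10)
    # same foldid for every alpha; only the penalty mix changes
    list.of.fits[[fit.name]] <- cv.glmnet(x.train, y.train, type.measure = c("auc"),
                                          alpha = i/10, family = "binomial",
                                          nfolds = 5, foldid = foldid, parallel = TRUE)
}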

But I still get very different results when I re-run cv.glmnet on the exact same data. What am I doing wrong here, since I get different results every time, even after the 'fix'?


1 Answer


The different coefficients come about because you extracted them at lambda.1se, which can differ from run to run when the cross-validation folds differ. We can check this using just one alpha value:

library(glmnet)
library(mlbench)
data(Sonar)

Xtrain = as.matrix(Sonar[,-ncol(Sonar)])
Ytrain = Sonar$Class

# fit the same model five times; without foldid the CV folds are re-drawn randomly each time
fits = lapply(1:5, function(i){
    cv.glmnet(Xtrain, Ytrain, alpha = 0, family = "binomial")
})

You can see the lambda.1se values are different:

sapply(fits,"[[","lambda.1se")
[1] 0.4238924 0.6749567 0.3862350 0.4652214 0.5105800

Whereas the coefficient path of the underlying glmnet.fit is the same:

all.equal(fits[[1]]$glmnet.fit$beta, fits[[2]]$glmnet.fit$beta)
[1] TRUE
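
The same can be seen by extracting the coefficients at one fixed penalty across the runs (a quick sketch; s = 0.5 is just an arbitrary value on the lambda path):

# same data and alpha give the same path, so coefficients at a fixed
# penalty agree across runs; this should return TRUE
all.equal(as.matrix(coef(fits[[1]], s = 0.5)),
          as.matrix(coef(fits[[2]], s = 0.5)))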

You need to set foldid, but it is not shown in your example. Below is a simple implementation, using "auc" as the measure (via the type.measure argument):

nfolds = 10
foldid = 1 + (1:nrow(Xtrain) %% nfolds)  # deterministic fold assignment

fits = lapply(1:5, function(i){
    cv.glmnet(Xtrain, Ytrain, alpha = 0, family = "binomial",
              type.measure = "auc", foldid = foldid)
})

Now the lambda.1se values are the same:

sapply(fits,"[[","lambda.1se")
[1] 0.2425668 0.2425668 0.2425668 0.2425668 0.2425668

And if we extract the coefficients, there are no surprises:

head(sapply(fits,function(i)as.matrix(coef(i,s=i$lambda.1se))))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,]  3.524326  3.524326  3.524326  3.524326  3.524326
[2,] -5.469565 -5.469565 -5.469565 -5.469565 -5.469565
[3,] -1.510470 -1.510470 -1.510470 -1.510470 -1.510470
[4,]  0.289252  0.289252  0.289252  0.289252  0.289252
[5,] -2.575135 -2.575135 -2.575135 -2.575135 -2.575135
[6,] -1.388629 -1.388629 -1.388629 -1.388629 -1.388629
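
Putting this together with the alpha loop from your question, something like the following should give reproducible coefficients on every run (a sketch, assuming the x.train and y.train from the question and a registered parallel backend):

nfolds = 10
foldid = 1 + (1:nrow(x.train) %% nfolds)  # fixed fold assignment, reused for every alpha

list.of.fits = lapply(0:10, function(i){
    cv.glmnet(x.train, y.train, type.measure = "auc", alpha = i/10,
              family = "binomial", foldid = foldid, parallel = TRUE)
})
names(list.of.fits) = paste0("alpha", (0:10)/10)

# coefficients at lambda.1se for, e.g., alpha = 0.5
fit = list.of.fits[["alpha0.5"]]
coef(fit, s = fit$lambda.1se)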
  • Thank you for the extensive answer. I will try it once my computer has finished a test run. But I do not want to set `alpha` prior to my model run, as in my script `alpha = i/10`. So will I always get different `lambdas` and, therefore, different `coef`? – Thomas Jul 25 '20 at 09:25
  • You just iterate through different alphas, for example ```lapply(alphas, function(i){ fit <- cv.glmnet(Xtrain, Ytrain, alpha = i, family = "binomial", type.measure = "auc", foldid = foldid); as.matrix(coef(fit, s = fit$lambda.1se)) })``` – StupidWolf Jul 25 '20 at 09:27
  • Got it, I'll try and see. Give me a moment or two. Thank you again! – Thomas Jul 25 '20 at 09:28
  • I now tried the following with your defined foldid: `fits = lapply(1:10, function(i){cv.glmnet(x.train, y.train, alpha = i/10, family = "binomial", type.measure = "auc", foldid = foldid, parallel = TRUE)})`, but I still get different `lambdas`. How can that be? I think it has something to do with my `alpha` not being fixed, right? – Thomas Jul 25 '20 at 14:19
  • Yes, you are supposed to get different lambdas, because the alphas are different. My point is: if you run it again, you get the same lambdas, right? – StupidWolf Jul 25 '20 at 15:27
  • Yeah, I do. But I have never managed to get the same `coefficients` when retrieved from the run. I have no idea what I am doing wrong, but the `foldid` does not help to diminish the 'randomness' in my coefficient output, unfortunately. – Thomas Jul 25 '20 at 15:51
  • Hey, I ran the same code with the Sonar data and it was OK... did you run `createDataPartition` every time? I would do it like this: start with an example dataset, make sure it is reproducible, then move up to more complicated stuff like the train/test split. – StupidWolf Jul 25 '20 at 16:37
  • Okay, I could not make it work with that code. But `nfolds = 10; foldid = 1 + (1:nrow(Xtrain) %% nfolds)` apparently helped, rather than the `foldid` I showed in my example. Thank you. – Thomas Jul 27 '20 at 07:29