My situation:
- small sample size: 116
- binary outcome variable
- long list of explanatory variables: 50
- the explanatory variables were not chosen off the top of my head; their selection was based on the literature.
Following a suggestion to a previous question of mine, I have run LASSO (using R's glmnet package) in order to select the subset of explanatory variables that best explains variation in my binary outcome variable.
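For concreteness, this is essentially the call I am using (a minimal sketch; `x` stands for my 116 x 50 numeric model matrix and `y` for the binary outcome, both placeholder names):

```r
library(glmnet)

## Sketch of my setup: x is a 116 x 50 numeric model matrix (factors expanded,
## e.g. via model.matrix), y is the binary outcome.
set.seed(1)  # the fold assignment is random, so fix the seed for reproducibility
cv_fit <- cv.glmnet(x, y, family = "binomial", nfolds = 10)  # or nfolds = 5

cv_fit$lambda.min               # lambda minimizing cross-validated deviance
coef(cv_fit, s = "lambda.min")  # coefficients of the selected model
```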
I have noticed that I get very different values of lambda.min from k-fold cross-validation (the cv.glmnet command) depending on the value I assign to k. I have tried the default (10) and 5. Which value of k would be most appropriate, considering my sample size?
In my specific case, is it necessary to repeat cross-validation, say 100 times, in order to reduce randomness and allow averaging the error curves, as suggested in this post? If so: I have tried the code suggested in that post but got error messages; could anyone suggest working code?
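For reference, this is the kind of repeated cross-validation I am trying to get working (a sketch only, continuing with the placeholder `x` and `y` from above; the averaging scheme is my reading of the linked post, not code taken from it):

```r
library(glmnet)

## Fit once without CV to get a fixed lambda sequence, so that every
## repetition evaluates the same grid and the error curves can be averaged.
lambdas <- glmnet(x, y, family = "binomial")$lambda

n_reps  <- 100
cvm_mat <- sapply(seq_len(n_reps), function(i) {
  ## Each call draws new random folds; cvm is the CV deviance per lambda.
  cv.glmnet(x, y, family = "binomial", lambda = lambdas, nfolds = 10)$cvm
})

mean_cvm   <- rowMeans(cvm_mat)              # averaged error curve
lambda_min <- lambdas[which.min(mean_cvm)]   # lambda.min from the average
```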
UPDATE 1: I have managed to use the foldid option in cv.glmnet, as suggested in the comments below, by organizing my x-matrix so that all 32 observations belonging to one of my outcome classes appear in rows 1-32, and by using the following code (a runnable version appears in the sketch below): foldid=c(sample(rep(seq(10),length=32)),sample(rep(seq(10),length=84))). However, when I ran cv.glmnet, only one of the levels of a categorical variable with four levels was included in the model. So, following a suggestion to a previous question of mine, I tried to run group lasso using R's gglasso package, and now I am facing this issue.