
I have a time-series data set with about 2 million observations and 31 variables, which I reduce to a few thousand observations by applying a threshold to my dependent variable.

I am using lasso regression in R to do some variable selection. After pondering over questions and answers on SO and other sources, it is not clear what value is suitable for the tuning parameter (lambda); it is a topic of much debate.

My question is not about which value I should use, but about interpreting the huge difference between the results from the cross-validated lambda.min and lambda.1se values.

With lambda.min, I get all 31 non-zero coefficients.

library(glmnet)

x = model.matrix(Comp ~ ., tcsinfy01.1)[, -1]
y = tcsinfy01.1$Comp

grid = seq(.0000005, 0, length = 1000)  # (defined but never used below)

set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
test = -train
y.test = y[test]

lasso.mod = glmnet(x[train, ], y[train], alpha = 1)  # lasso fit on the training half

set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)  # cross-validation on the full data

lam = cv.out$lambda.min
lam
## [1] 7.387039e-08
lasso.pred = predict(lasso.mod,s=lam,newx = x[test,])
mean((lasso.pred-y.test)^2)
## [1] 4.23711e-09

out = glmnet(x,y,alpha = 1)
lasso.coef = predict(out,type = "coefficients", s = lam)[1:31,]
lasso.coef[lasso.coef!=0]
##   (Intercept)        Comp.1         Nifty        Comp.2       Nifty.1 
## -1.766904e-06 -3.166030e-01  4.927288e-01 -1.401834e-01  3.109382e-01 
##        Comp.3       Nifty.2        Comp.4       Nifty.3        Comp.5 
## -8.069235e-02  3.046442e-01 -8.724700e-02  6.245969e-02 -5.054189e-01 
##       Nifty.4        Comp.6       Nifty.5        Comp.7       Nifty.6 
##  2.079961e-01 -5.335412e-01  2.182253e-01 -2.521104e-01  2.297411e-01 
##        Comp.8       Nifty.7        Comp.9       Nifty.8       Comp.10 
## -1.784449e-01  1.415963e-01 -1.674522e-01  2.150763e-01 -2.700476e-01 
##       Nifty.9            OC          OC.1          OC.2          OC.3 
##  2.844290e-02  6.967335e-02  4.397470e-02  3.363156e-02  8.879313e-03 
##          OC.4          OC.5          OC.6          OC.7          OC.8 
##  4.686098e-02  5.376459e-02  3.644949e-02  8.292445e-02  2.580695e-02 
##          OC.9 
##  8.108197e-02

While with lambda.1se I get only 8.

x = model.matrix(Comp ~ ., tcsinfy01.1)[, -1]
y = tcsinfy01.1$Comp

grid = seq(.0000005, 0, length = 1000)  # (defined but never used below)

set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
test = -train
y.test = y[test]

lasso.mod = glmnet(x[train, ], y[train], alpha = 1)
plot(lasso.mod)

set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)
plot(cv.out)


lam = cv.out$lambda.1se
lam
## [1] 2.781173e-06
lasso.pred = predict(lasso.mod,s=lam,newx = x[test,])
mean((lasso.pred-y.test)^2)
## [1] 4.350575e-09

out = glmnet(x,y,alpha = 1)
lasso.coef = predict(out,type = "coefficients", s = lam)[1:31,]

lasso.coef[lasso.coef!=0]
##   (Intercept)        Comp.1         Nifty       Nifty.1        Comp.5 
## -2.212212e-06 -1.934390e-01  2.333719e-01  9.445154e-03 -3.519495e-01 
##        Comp.6        Comp.7       Comp.10 
## -3.697492e-01 -2.455037e-02 -8.737498e-02

I'm not sure what to make of this. Am I doing something wrong? Should I try other lambda values? If both results are valid, which one should I go by?

Why is there such a big difference in variable selection between the two values?
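For context on the comparison itself: lambda.1se is by definition the largest lambda whose cross-validated error is within one standard error of the minimum, so it is always at least as large as lambda.min and therefore shrinks at least as aggressively. Here is a self-contained sketch on simulated data (not my data set; all names and numbers below are made up for illustration) showing the same pattern of fewer non-zero coefficients at lambda.1se:

```r
## Illustrative only: simulated data with 31 predictors, of which
## only 5 truly matter. Not the asker's tcsinfy01.1 data.
library(glmnet)

set.seed(1)
n <- 500; p <- 31
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))   # 5 true signals, 26 noise variables
Y <- as.vector(X %*% beta + rnorm(n, sd = 3))

cv <- cv.glmnet(X, Y, alpha = 1)

cv$lambda.min <= cv$lambda.1se                      # TRUE, by definition
sum(as.vector(coef(cv, s = "lambda.min")) != 0)     # typically more non-zeros...
sum(as.vector(coef(cv, s = "lambda.1se")) != 0)     # ...than at lambda.1se
```

So a large gap in the number of selected variables between the two lambdas is not by itself a sign of a coding error; it usually indicates a flat cross-validation curve near the minimum.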

asked by UtdMan, edited by kjetil b halvorsen
    Possible duplicate of [How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?](https://stats.stackexchange.com/questions/26528/how-to-estimate-shrinkage-parameter-in-lasso-or-ridge-regression-with-50k-varia) – COOLSerdash Sep 10 '17 at 06:53

0 Answers