I have a time-series data set with about 2 million observations and 31 variables, which I reduce to a few thousand observations by applying a threshold to my dependent variable.
I am using lasso regression in R to do some variable selection. After pondering questions and answers on SO and other sources, it is not clear what value is suitable for the tuning parameter (lambda); it's a topic of much debate.
My question is not about what value I should use, but about interpreting the huge difference between the results from using the lambda.min value from cross-validation and the lambda.1se value.
With lambda.min, I get all 31 non-zero coefficients.
x = model.matrix(Comp ~ ., tcsinfy01.1)[, -1]
y = tcsinfy01.1$Comp
set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
test = -train
y.test = y[test]
lasso.mod = glmnet(x[train, ], y[train], alpha = 1)  # fit on the training half
set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)
lam = cv.out$lambda.min
lam
## [1] 7.387039e-08
lasso.pred = predict(lasso.mod, s = lam, newx = x[test, ])
mean((lasso.pred-y.test)^2)
## [1] 4.23711e-09
out = glmnet(x, y, alpha = 1)
lasso.coef = predict(out, type = "coefficients", s = lam)[1:31, ]
lasso.coef[lasso.coef != 0]
## (Intercept) Comp.1 Nifty Comp.2 Nifty.1
## -1.766904e-06 -3.166030e-01 4.927288e-01 -1.401834e-01 3.109382e-01
## Comp.3 Nifty.2 Comp.4 Nifty.3 Comp.5
## -8.069235e-02 3.046442e-01 -8.724700e-02 6.245969e-02 -5.054189e-01
## Nifty.4 Comp.6 Nifty.5 Comp.7 Nifty.6
## 2.079961e-01 -5.335412e-01 2.182253e-01 -2.521104e-01 2.297411e-01
## Comp.8 Nifty.7 Comp.9 Nifty.8 Comp.10
## -1.784449e-01 1.415963e-01 -1.674522e-01 2.150763e-01 -2.700476e-01
## Nifty.9 OC OC.1 OC.2 OC.3
## 2.844290e-02 6.967335e-02 4.397470e-02 3.363156e-02 8.879313e-03
## OC.4 OC.5 OC.6 OC.7 OC.8
## 4.686098e-02 5.376459e-02 3.644949e-02 8.292445e-02 2.580695e-02
## OC.9
## 8.108197e-02
With lambda.1se, I get only 8.
## same data setup and train/test split as above
lasso.mod = glmnet(x[train, ], y[train], alpha = 1)
plot(lasso.mod)
set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)
plot(cv.out)
plot(cv.out)
lam = cv.out$lambda.1se
lam
## [1] 2.781173e-06
lasso.pred = predict(lasso.mod, s = lam, newx = x[test, ])
mean((lasso.pred-y.test)^2)
## [1] 4.350575e-09
out = glmnet(x, y, alpha = 1)
lasso.coef = predict(out, type = "coefficients", s = lam)[1:31, ]
lasso.coef[lasso.coef != 0]
## (Intercept) Comp.1 Nifty Nifty.1 Comp.5
## -2.212212e-06 -1.934390e-01 2.333719e-01 9.445154e-03 -3.519495e-01
## Comp.6 Comp.7 Comp.10
## -3.697492e-01 -2.455037e-02 -8.737498e-02
I'm not sure what to make of this. Am I doing something wrong? Should I try other lambda values? If these results are correct, which one should I go by? Why is there such a big difference in variable selection between the two values?
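For context, here is a minimal sketch (assuming the cv.out object fitted above) of how I understand the two values to differ: lambda.1se is, roughly, the largest lambda whose cross-validated error is within one standard error of the minimum CV error, so it applies a stronger penalty than lambda.min and therefore zeroes out more coefficients.

```r
# Count non-zero coefficients (intercept included) under each rule:
sum(coef(cv.out, s = "lambda.min") != 0)
sum(coef(cv.out, s = "lambda.1se") != 0)

# Roughly how lambda.1se is chosen: the largest lambda whose CV error
# is within one standard error of the minimum CV error.
i.min  = which.min(cv.out$cvm)
cutoff = cv.out$cvm[i.min] + cv.out$cvsd[i.min]
max(cv.out$lambda[cv.out$cvm <= cutoff])  # should be close to cv.out$lambda.1se
```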