Preamble:
While investigating different cross-validation strategies for small-sample datasets with a relatively large number of features, I came across a peculiar result. In a simple leave-one-out cross-validation (LOOCV) setup with scaled AND centered data, the LASSO makes seemingly good predictions on random data, but only when the model is fit without an intercept. Additionally, this only happens for certain types of randomly generated responses; see Y1 and Y2 in the example below.
Another wrinkle is that I have only been able to reproduce this with the glmnet package; it does not happen with the Lasso function of the HDCI package, which I believe is just a glmnet wrapper.
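One structural detail may be worth spelling out (this is an observation about the setup, not a claimed explanation): because X2 is centered over all 49 rows, removing row i forces the column means of the remaining 48 training rows to be exactly -X2[i,]/48, so the held-out row is fully encoded in the training fold. This can be checked after running the setup code below:

# Columns of X2 sum to zero over all 49 rows, so the 48 training rows
# must sum to -X2[i,]; their column means are therefore -X2[i,]/48.
i <- 1
all.equal(colMeans(X2[-i,]), -X2[i,]/48)   # TRUE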
Code:
library(HDCI)
library(glmnet)
library(Metrics)
# Responses "predictable" from random data (note the non-zero means)
Y1 <- rnorm(49, mean = 10, sd = 5)
Y2 <- runif(49, min = 1, max = 40)
# Responses NOT predictable from random data
Y3 <- rnorm(49)
Y4 <- runif(49)
X1 <- scale(matrix(rnorm(49 * 1848), nrow = 49, ncol = 1848), center = FALSE)  # scaled only
X2 <- scale(matrix(rnorm(49 * 1848), nrow = 49, ncol = 1848))                  # scaled AND centered
preds.glmnet <- vector()
preds.lasso <- vector()
for (i in seq(49)) {
  temp.X <- X2  # centered and scaled; the effect goes away with X1
  Y <- Y1       # "predictable"; swap in Y3 or Y4 and the effect disappears
  # HDCI's Lasso wrapper
  fit <- Lasso(temp.X[-i, ], Y[-i], lambda = 4e-2, intercept = FALSE)
  preds.lasso[i] <- mypredict(fit, newx = t(temp.X[i, ]))
  # glmnet directly
  fit <- glmnet(temp.X[-i, ], Y[-i], alpha = 1, lambda = 4e-2, family = "gaussian", intercept = FALSE)
  preds.glmnet[i] <- predict(fit, newx = t(temp.X[i, ]))
}
ct <- cor.test(preds.glmnet, Y)
ct$estimate
ct$p.value
rmse(Y, preds.glmnet)
ct <- cor.test(preds.lasso, Y)
ct$estimate
ct$p.value
rmse(Y, preds.lasso)
Question:
Why does this combination of LOOCV, centering/scaling, a random response, a suppressed intercept, and glmnet lead to such strongly correlated predictions? I understand that adding more observations, removing the centering, adding back the intercept, or switching to k-fold CV avoids the issue, but I'm more interested in the cause than in workarounds. (For reference, a sketch of fold-wise preprocessing that avoids the global centering is included below.)
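The sketch below is my own illustration, not part of the original setup: it redoes the centering and scaling inside each LOOCV fold, so the held-out row never influences the preprocessing constants. X2raw stands for a hypothetical raw (unscaled) copy of the random matrix.

# Hypothetical raw matrix; in the original code only the scaled X2 is kept.
X2raw <- matrix(rnorm(49 * 1848), nrow = 49, ncol = 1848)
preds.foldwise <- vector()
for (i in seq(49)) {
  # Derive centering/scaling constants from the 48 training rows only...
  ctr <- colMeans(X2raw[-i, ])
  sdv <- apply(X2raw[-i, ], 2, sd)
  Xtr <- scale(X2raw[-i, ], center = ctr, scale = sdv)
  # ...and apply those SAME constants to the held-out row.
  Xte <- (X2raw[i, ] - ctr) / sdv
  fit <- glmnet(Xtr, Y1[-i], alpha = 1, lambda = 4e-2, family = "gaussian", intercept = FALSE)
  preds.foldwise[i] <- predict(fit, newx = t(Xte))
}
cor.test(preds.foldwise, Y1)$estimate  # no leakage from global centering here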