
In lasso or ridge regression, one has to specify a shrinkage parameter, often denoted $\lambda$ or $\alpha$. This value is usually chosen via cross validation: trying a range of candidate values on training data and seeing which yields the best score (e.g. $R^2$) on held-out data. What is the range of values one should check? Is it $(0,1)$?

rhombidodecahedron
    Possible duplicate of [Choosing the range and grid density for regularization parameter in LASSO](http://stats.stackexchange.com/questions/174897/choosing-the-range-and-grid-density-for-regularization-parameter-in-lasso) – Alex Oct 20 '16 at 23:05
    In fact, the optimal ridge parameter can be 0 or even negative. Some discussion: on stats.SE https://stats.stackexchange.com/questions/331264/understanding-negative-ridge-regression with a paper here https://arxiv.org/abs/1805.10939 – Sycorax Jul 28 '20 at 21:11

2 Answers


You don't really need to bother. In most packages (like glmnet), if you do not specify $\lambda$, the software generates its own sequence, which is often the recommended approach. The reason I stress this is that while running the LASSO the solver generates a whole sequence of $\lambda$ values anyway, so, counterintuitive as it may seem, supplying a single $\lambda$ value can actually slow the solver down considerably (when you provide an exact parameter, the solver resorts to solving a semidefinite program, which can be slow even for reasonably 'simple' cases).

As for the exact value of $\lambda$, you can potentially choose whatever you want from $[0,\infty)$. Note that if your $\lambda$ is too large, the penalty dominates and all coefficients are shrunk to zero. If the penalty is too small, you will overfit the model, and that will not be the best cross-validated solution.
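One useful reference point for the upper end of that range (a sketch, not from the answer above; the $1/(2n)$ loss scaling is glmnet's convention and the variable names are illustrative): for the lasso with standardized predictors and centered response, the smallest $\lambda$ that forces every coefficient to zero is $\lambda_{\max} = \max_j |x_j^\top y| / n$, so a grid only needs to span from $\lambda_{\max}$ down to a small fraction of it. In base R:

```r
# Sketch: the smallest lambda that zeroes out every lasso coefficient
# (lambda_max), assuming glmnet's 1/(2n) loss scaling.
set.seed(1)
n <- 100; p <- 5
x <- scale(matrix(rnorm(n * p), n, p))  # standardized predictors
y <- rnorm(n)
y <- y - mean(y)                        # centered response

lambda_max <- max(abs(crossprod(x, y))) / n

# A typical grid then runs from lambda_max down to a small fraction of it,
# equally spaced on the log scale:
grid <- exp(seq(log(lambda_max), log(0.001 * lambda_max), length.out = 100))
```

This is essentially how glmnet picks the top of its own default sequence, which is why letting it build the grid usually works well.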

Sid
    Hi Sid, the OP appears aware of the fact you mention in your post. It also does not appear to answer the question. :-) – cardinal Aug 15 '14 at 19:27

For those trying to figure this out:

I have found that there is a great difference between letting glmnet calculate $\lambda$ itself and creating a range of values (a grid) for it to choose from.

Here is an example predicting Apps (the number of applications) in the College data set from ISLR:

library(ISLR)     # for the College data set
library(glmnet)

# Don't forget to set seed
set.seed(1)
train <- sample(1:nrow(College), round(0.75 * nrow(College)))

# Matrices
xmat.train <- model.matrix(Apps~.-1,data=College[train,])
xmat.test <- model.matrix(Apps~.-1, data= College[-train,])

y <- College$Apps[train]

# Create a grid of values for the scope of lambda (optional):
grid <- 10 ^ seq(10,-2,length = 100)

# Add the grid here as lambda (optional)
ridge.fit <- glmnet(xmat.train, y, alpha = 0, lambda=grid)
cv.ridge <- cv.glmnet(xmat.train, y, alpha =0, lambda=grid)

bestlam <- cv.ridge$lambda.min
cat("\nBestlam (with grid):",bestlam)

pred <- predict(ridge.fit, s = bestlam, newx= xmat.test)
cat("\nWith Grid:", mean((College$Apps[-train]-pred)^2))

# Again, but without the grid (allowing glmnet to figure lambda out)
ridge.fit <- glmnet(xmat.train, y, alpha = 0)
cv.ridge <- cv.glmnet(xmat.train, y, alpha =0)

bestlam <- cv.ridge$lambda.min
cat("\n\nBestlam (no grid):",bestlam)

pred <- predict(ridge.fit, s = bestlam, newx= xmat.test)
cat("\nWithout Grid:", mean((College$Apps[-train]-pred)^2))

You can run this yourself, and you can change grid as well; I've seen examples ranging from grid <- 10 ^ seq(10, -2, length = 100) to grid <- 10 ^ seq(3, -2, by = -.1).
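Those two grids look similar but cover very different territory, which is part of why results change with the grid. Comparing them directly in base R:

```r
# The two example grids, side by side (base R only)
grid_a <- 10 ^ seq(10, -2, length = 100)  # 100 values from 1e10 down to 1e-2
grid_b <- 10 ^ seq(3, -2, by = -0.1)      # 51 values from 1e3 down to 1e-2

max(grid_a)  # 1e10: seven orders of magnitude above grid_b's top
max(grid_b)  # 1e3
```

If the cross-validated optimum sits near the edge of the grid, it is worth widening the range, since the true best $\lambda$ may lie outside it.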

My best guess is that $\lambda$ can be restricted to certain values, and it is up to us to figure out an appropriate range.

I have also found this guide quite helpful: https://drsimonj.svbtle.com/ridge-regression-with-glmnet

fas