
In lasso or ridge regression, one has to specify a shrinkage parameter, often denoted $\lambda$ or $\alpha$. This value is usually chosen via cross validation: trying a range of candidate values on training data and seeing which yields the best score (e.g. $R^2$) on held-out data. What is the range of values one should check? Is it $(0,1)$?

rhombidodecahedron
    Possible duplicate of [Choosing the range and grid density for regularization parameter in LASSO](http://stats.stackexchange.com/questions/174897/choosing-the-range-and-grid-density-for-regularization-parameter-in-lasso) – Alex Oct 20 '16 at 23:05
    In fact, the optimal ridge parameter can be 0 or even negative. Some discussion: on stats.SE https://stats.stackexchange.com/questions/331264/understanding-negative-ridge-regression with a paper here https://arxiv.org/abs/1805.10939 – Sycorax Jul 28 '20 at 21:11

2 Answers


You don't really need to bother. In most packages (like glmnet), if you do not specify $\lambda$, the software generates its own sequence, which is often the recommended approach. The reason I stress this is that while running the LASSO the solver generates a whole sequence of $\lambda$ values anyway, so, counterintuitive as it may seem, supplying a single $\lambda$ value can actually slow the solver down considerably (when you provide an exact parameter, the solver resorts to solving a semidefinite program, which can be slow even for reasonably 'simple' cases).

As for the exact value of $\lambda$, you can potentially choose whatever you want from $[0,\infty)$. Note that if your $\lambda$ is too large, the penalty dominates and all coefficients are shrunk to zero. If the penalty is too small, you will overfit the model, and that will not be the best cross-validated solution.
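One useful reference point for the upper end of that range (a sketch, not from the answer above; the $1/(2n)$ loss scaling is glmnet's convention and the variable names are illustrative): for the lasso with standardized predictors and centered response, the smallest $\lambda$ that forces every coefficient to zero is $\lambda_{\max} = \max_j |x_j^\top y| / n$, so a grid only needs to span from $\lambda_{\max}$ down to a small fraction of it. In base R:

```r
# Sketch: the smallest lambda that zeroes out every lasso coefficient
# (lambda_max), assuming glmnet's 1/(2n) loss scaling.
set.seed(1)
n <- 100; p <- 5
x <- scale(matrix(rnorm(n * p), n, p))  # standardized predictors
y <- rnorm(n)
y <- y - mean(y)                        # centered response

lambda_max <- max(abs(crossprod(x, y))) / n

# A typical grid then runs from lambda_max down to a small fraction of it,
# equally spaced on the log scale:
grid <- exp(seq(log(lambda_max), log(0.001 * lambda_max), length.out = 100))
```

This is essentially how glmnet picks the top of its own default sequence, which is why letting it build the grid usually works well.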

Sid
    Hi Sid, the OP appears aware of the fact you mention in your post. It also does not appear to answer the question. :-) – cardinal Aug 15 '14 at 19:27

For those trying to figure this out:

I have found that there is a great difference between letting glmnet calculate $\lambda$ itself and creating a range of values (a grid) for it to choose from.

Here is an example predicting Apps (the number of applications) in the College data set from ISLR:

library(ISLR)     # for the College data set
library(glmnet)

# Don't forget to set seed
set.seed(1)
train <- sample(1:nrow(College), round(0.75 * nrow(College)))

# Matrices
xmat.train <- model.matrix(Apps~.-1,data=College[train,])
xmat.test <- model.matrix(Apps~.-1, data= College[-train,])

y <- College$Apps[train]

# Create a grid of values for the scope of lambda (optional):
grid <- 10 ^ seq(10,-2,length = 100)

# Add the grid here as lambda (optional)
ridge.fit <- glmnet(xmat.train, y, alpha = 0, lambda=grid)
cv.ridge <- cv.glmnet(xmat.train, y, alpha =0, lambda=grid)

bestlam <- cv.ridge$lambda.min
cat("\nBestlam (with grid):",bestlam)

pred <- predict(ridge.fit, s = bestlam, newx= xmat.test)
cat("\nWith Grid:", mean((College$Apps[-train]-pred)^2))

# Again, but without the grid (allowing glmnet to figure lambda out)
ridge.fit <- glmnet(xmat.train, y, alpha = 0)
cv.ridge <- cv.glmnet(xmat.train, y, alpha =0)

bestlam <- cv.ridge$lambda.min
cat("\n\nBestlam (no grid):",bestlam)

pred <- predict(ridge.fit, s = bestlam, newx= xmat.test)
cat("\nWithout Grid:", mean((College$Apps[-train]-pred)^2))

You can run this yourself, and you can change grid as well; I've seen examples ranging from grid <- 10 ^ seq(10, -2, length = 100) to grid <- 10 ^ seq(3, -2, by = -.1).
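Those two grids look similar but cover very different territory, which is part of why results change with the grid. Comparing them directly in base R:

```r
# The two example grids, side by side (base R only)
grid_a <- 10 ^ seq(10, -2, length = 100)  # 100 values from 1e10 down to 1e-2
grid_b <- 10 ^ seq(3, -2, by = -0.1)      # 51 values from 1e3 down to 1e-2

max(grid_a)  # 1e10: seven orders of magnitude above grid_b's top
max(grid_b)  # 1e3
```

If the cross-validated optimum sits near the edge of the grid, it is worth widening the range, since the true best $\lambda$ may lie outside it.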

My best guess is that $\lambda$ can be restricted to certain values, and it is up to us to figure out an appropriate range.

I have also found this guide quite helpful: https://drsimonj.svbtle.com/ridge-regression-with-glmnet

fas