10

Consider a regression model

$$ y = X \beta + \varepsilon. $$

I will use ridge regression to estimate $\beta$. Ridge regression contains a tuning parameter (the penalty intensity) $\lambda$. If I were given a grid of candidate $\lambda$ values, I would use cross validation to select the optimal $\lambda$. However, the grid is not given, so I need to design it first. For that I need to choose, among other things, a maximum value $\lambda_{max}$.

Question: How do I sensibly choose $\lambda_{max}$ in ridge regression?

There needs to be a balance between

  • a $\lambda_{max}$ that is "too large", leading to wasteful computations when evaluating the performance of (possibly many) models that are penalized too harshly;
  • a $\lambda_{max}$ that is "too small" leading to a forgone opportunity to penalize more intensely and get better performance.

(Note that the answer is simple in the case of LASSO; there you take $\lambda_{max}$ such that all coefficients are set exactly to zero for any $\lambda \geq \lambda_{max}$.)
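(With the common parameterization $\frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$ and a centered $y$ with any intercept left unpenalized, that value is $\lambda_{max} = \max_j |x_j^{\top} y|/n$; other scalings of the lasso objective rescale it accordingly.)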

amoeba
Richard Hardy
  • A similar question for LASSO is [here](http://stats.stackexchange.com/questions/174897/choosing-the-range-and-grid-density-for-regularization-parameter-in-lasso); a slightly less similar question is [here](http://stats.stackexchange.com/questions/76827/how-is-lambda-tuning-parameter-in-lasso-logistic-regression-generated/). – Richard Hardy Aug 03 '16 at 13:30
  • 1
    It took me several re-reads to figure out exactly what you were asking there. Can one not actually take the limiting value (since all the coefficients will be set to zero -- you can figure out the fit easily enough)? Of course you can't then use exponentially-distanced points, but one might (for example) use points uniform in the inverse of $\lambda$, or one might use a convenient quantile function to place the points. – Glen_b Aug 03 '16 at 13:48
  • 1
    @Glen, thank you. I reformulated the question; hopefully it is clearer now. Actually, I *would* probably take the limiting value if only I knew it. This is what the question is about. Do you have an idea what the limiting value is? I thought it is $+\infty$... – Richard Hardy Aug 03 '16 at 14:08
  • 6
    Once, I tried reading the glmnet source code to answer this question. It did not go well. – Matthew Drury Aug 03 '16 at 14:11
  • @MatthewDrury, I tried it, too. Once I faced the part coded in Fortran, I gave up. But if I remember correctly, the $\lambda_{max}$ is selected as if the regularization was by LASSO (in which case it is easy), even though it is by ridge? – Richard Hardy Aug 03 '16 at 14:15
  • I don't know, my investigation was inconclusive. I think my state of knowledge is about the same as yours: for LASSO it's pretty easy (and explained in the glmnet paper), but they don't say anything about what they do when you're doing pure ridge regression. – Matthew Drury Aug 03 '16 at 14:18
  • 4
    The effect of $\lambda$ in the ridge estimator is that it shrinks singular values $s_i$ of $X$ via terms like $s_i^2/(s_i^2+\lambda)$. This suggests that selecting $\lambda$ much larger than $s_1^2$ will shrink everything very strongly. I suspect that $\lambda=\|X\|_F^2=\sum s_i^2$ will be too big for all practical purposes. I usually normalize my lambdas by the squared norm of $X$ and have a grid that goes from $0$ to $1$. – amoeba Aug 03 '16 at 14:29
  • @amoeba, If you wrote it up and expanded a little bit (like including a step-by-step instruction), I would consider accepting that as an answer. Or actually, no extra instruction might be needed, your comment probably has it all. – Richard Hardy Aug 03 '16 at 14:33
  • 1
    Yes, it's infinity. All of my previous comment relates to that fact. – Glen_b Aug 03 '16 at 21:03
  • Hey @RichardHardy, I have just added an important update to my answer. Do you think it now resolves your question, or not really? – amoeba Aug 08 '16 at 00:14
  • @amoeba, thank you for the answer, I am still digesting it (was a bit busy with other things). But don't worry, sooner or later I accept answers to most of the questions I post. – Richard Hardy Aug 08 '16 at 05:19
  • No problem and no pressure, I just wanted to draw your attention to the update that I made. – amoeba Aug 08 '16 at 12:06
  • You might consider avoiding a grid search and optimizing the cross-validation criterion directly. LOOCV is frequently a better proxy for out-of-sample prediction error than k-fold CV (see Figure 1 in [Rahnama Rad and Maleki](https://arxiv.org/abs/1801.10243)), and both LOOCV and GCV have closed-form expressions that can be differentiated with respect to $\lambda$ and efficiently optimized by packages such as [peak-engines](https://github.com/rnburn/peak-engines#ridge-regression-parameter-optimization). – Ryan Burn Aug 02 '20 at 17:55
  • @rnickb, hmm, I do not immediately see how this should help with the original problem. I can do LOOCV for a fixed $\lambda$, but this does not help me choose the grid of $\lambda$. Or does it? – Richard Hardy Aug 02 '20 at 18:44
  • @RichardHardy - you don't need a grid. The [package](https://github.com/rnburn/peak-engines) I pointed to finds the $\lambda$ that optimizes the LOOCV for you using a second-order optimizer. There's no need for a $\lambda_{max}$ or a search grid; just run `RidgeRegressionModel().fit(X, y).alpha_` and it finds the optimal $\lambda$. – Ryan Burn Aug 02 '20 at 20:21
  • You can see this [notebook](https://github.com/rnburn/peak-engines/blob/master/example/ridge_regression/alpha.ipynb) for a graphical representation of what it does or read this [blog post](https://towardsdatascience.com/how-to-do-ridge-regression-better-34ecb6ee3b12) for more details. Grid search can be viewed as a robust but low-information and imperfect optimizer. When you can compute the gradient and Hessian of your objective (in this case LOOCV or GCV) you don't need it and can do much better (see the sketch after these comments). – Ryan Burn Aug 02 '20 at 20:25
  • @rnickb, thank you, that makes sense. – Richard Hardy Aug 03 '20 at 10:54
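
A minimal sketch of the direct-optimization idea from the comments above, using plain NumPy/SciPy on made-up data rather than the peak-engines package (the data and all names here are illustrative assumptions): it minimizes the closed-form ridge LOOCV error over $\log\lambda$ with a bounded scalar optimizer, so no $\lambda$ grid is needed.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))               # toy design matrix (assumed centered/scaled)
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

def loocv_error(lam):
    """Closed-form leave-one-out CV error for ridge: mean((e_i / (1 - h_ii))^2)."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # hat matrix at this lambda
    e = y - H @ y                                             # in-sample residuals
    return np.mean((e / (1.0 - np.diag(H))) ** 2)

# optimize over log(lambda) within generous bounds instead of evaluating a grid
res = minimize_scalar(lambda t: loocv_error(np.exp(t)), bounds=(-10.0, 20.0), method="bounded")
print(np.exp(res.x), res.fun)                 # selected lambda and its LOOCV error
```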

2 Answers

12

The effect of $\lambda$ in the ridge regression estimator is that it "inflates" singular values $s_i$ of $X$ via terms like $(s^2_i+\lambda)/s_i$. Specifically, if the SVD of the design matrix is $X=USV^\top$, then $$\hat\beta_\mathrm{ridge} = V \frac{S}{S^2+\lambda I} U^\top y.$$ This is explained multiple times on our website, see e.g. @whuber's detailed exposition here: The proof of shrinking coefficients using ridge regression through "spectral decomposition".

This suggests that selecting $\lambda$ much larger than $s_\mathrm{max}^2$ will shrink everything very strongly. I suspect that $$\lambda=\|X\|_F^2=\sum s_i^2$$ will be too big for all practical purposes.
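
To make this concrete, here is a minimal NumPy sketch (toy data; the variable names are my own, not from any package) that computes the ridge solution through the SVD and shows how strongly each singular direction is shrunk at $\lambda = \|X\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.standard_normal((n, p))          # toy design matrix (assumed centered)
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T

def ridge_svd(lam):
    """Ridge estimate via the SVD: beta = V diag(s / (s^2 + lam)) U^T y."""
    return Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

lam_big = np.sum(s**2)                   # = ||X||_F^2, the suggested "too big" value
shrinkage = s**2 / (s**2 + lam_big)      # multiplier applied to each OLS component
print(shrinkage)                         # every factor is at most 1/2, usually far smaller
print(ridge_svd(lam_big))                # heavily shrunken coefficient vector
```

The `ridge_svd` helper agrees with the usual closed form $(X^\top X+\lambda I)^{-1}X^\top y$, so it can be checked against `np.linalg.solve(X.T @ X + lam_big * np.eye(p), X.T @ y)`.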

I usually normalize my lambdas by the squared Frobenius norm of $X$ and have a cross-validation grid that goes from $0$ to $1$ (on a log scale).
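
For instance, a sketch of that recipe (again toy data and my own helper names; the grid size and fold construction are arbitrary choices, not a recommendation from this answer): normalized penalties $\kappa$ on a log scale are mapped back to $\lambda = \kappa\,\|X\|_F^2$ and compared by plain $k$-fold cross-validation. The grid starts at a small positive $\kappa$, since a log-spaced grid cannot include $0$ exactly; the unpenalized fit can be added separately if desired.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# kappa in (0, 1] on a log scale, mapped back to lambda = kappa * ||X||_F^2
frob2 = np.sum(np.linalg.svd(X, compute_uv=False) ** 2)   # ||X||_F^2 = sum of s_i^2
kappas = np.logspace(-6, 0, 30)
lambdas = kappas * frob2

# one fixed 5-fold split, shared by all candidate lambdas
folds = np.array_split(rng.permutation(n), 5)

def cv_error(lam):
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        b = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(p),
                            X[train].T @ y[train])
        errs.append(np.mean((y[test] - X[test] @ b) ** 2))
    return np.mean(errs)

errors = [cv_error(lam) for lam in lambdas]
best = int(np.argmin(errors))
print(kappas[best], lambdas[best])       # selected normalized and absolute penalty
```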


Having said that, no value of lambda can be seen as truly "maximum", in contrast to the lasso case. Imagine that the predictors are exactly orthogonal to the response, i.e. that the true $\beta=0$. Any finite value of $\lambda$ at any finite sample size $n$ will yield $\hat \beta \ne 0$, and hence the estimate could still benefit from stronger shrinkage.
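
A quick numerical illustration of this point (made-up data with the response generated independently of $X$, so the true $\beta = 0$): even an enormous but finite $\lambda$ leaves the ridge coefficients nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                # unrelated to X: the true beta is 0

for lam in [1e2, 1e6, 1e12]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(lam, np.linalg.norm(beta))      # shrinks toward 0 but never reaches it exactly
```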

amoeba
  • Sorry, I abandoned the topic for the time being and I did not have enough time to think deeper about it. I am giving it an upvote now, but I would like to postpone accepting the answer until I have time to convince myself it gives what I really need (I have some reservations, but currently I do not have time to explore them in detail). I hope this is fine with you. – Richard Hardy Nov 02 '16 at 10:10
  • No problem @RichardHardy. – amoeba Nov 02 '16 at 10:16
  • What's wrong with compactifying lambda to the $[0,1]$ range as I specified in my other question? In the end, what matters is the kind of grid you place on this range. I have seen three different grids on the internet: linear, log, and sqrt. But I think it should be related to the geometry of the problem at hand. Otherwise it is very ad hoc. – Cagdas Ozgenc Feb 08 '17 at 11:20
  • @CagdasOzgenc Here I suggested a principled way to choose maximal $\lambda$. If you use $\kappa$ instead of $\lambda$, then the maximum value of $\lambda$ will be a function of the minimal value of $\kappa$ that you end up using. I don't see any principled way to choose such a minimal value. – amoeba Feb 08 '17 at 11:24
  • The minimal value of $\kappa$ is 0, practically making the maximum lambda whatever number your software allows. What matters is the step size in that region. – Cagdas Ozgenc Feb 08 '17 at 11:42
  • 1
    @CagdasOzgenc Zero does not make sense, it corresponds to infinite lambda. I am talking about your minimal non-zero kappa, i.e. the step size. As I said, I don't see how you can choose it. Any choice IMHO will be more ad hoc than what I am suggesting here. – amoeba Feb 08 '17 at 11:51
  • I understand your point. So the methodology you are suggesting takes the largest lambda to be the squared Frobenius norm of the input matrix, or of the covariance matrix of inputs? – Cagdas Ozgenc Feb 08 '17 at 12:01
  • @CagdasOzgenc Yes, it's just the sum of all eigenvalues of covariance matrix (times $N$). Glmnet is probably taking the largest eigenvalue (and I think it scales by $N$ too). – amoeba Feb 08 '17 at 12:16
  • Why do you multiply it by N? (N is the number of variables, I assume.) You don't mention it in the answer. – Cagdas Ozgenc Feb 08 '17 at 12:21
  • @CagdasOzgenc I say it's the sum of squared singular values of X. Eigenvalues of cov(X) are equal to squared singular values of X divided by N, because cov(X)=X'X/N. – amoeba Feb 08 '17 at 12:22
1

Maybe not quite answering your question, but instead of using ridge regression with a fixed penalization of your coefficients, it would be better to use iterated adaptive ridge regression, as the latter approximates L0-penalized regression (aka best subset), where the log-likelihood of a GLM model is penalized based on a multiple of the number of nonzero coefficients in the model - see Frommlet & Nuel (2016). This has the advantage that you don't have to tune the regularization level $\lambda$ at all. Instead, you can set the regularization level a priori to $\lambda = 2$ if you would like to directly optimize AIC (roughly coinciding with minimizing prediction error), or to $\lambda = \log(n)$ to optimize BIC (resulting in asymptotically optimal model selection in terms of selection consistency). This is what is done in the l0ara R package. To me this makes more sense than first optimizing your coefficients under one objective (e.g. ridge), only to then tune the regularization level of that model based on some other criterion (e.g. minimizing cross-validation prediction error, AIC or BIC). The other advantage of L0-penalized regression over ridge regression or LASSO regression is that it gives you unbiased estimates, so you can get rid of the bias-variance tradeoff that plagues most penalized regression approaches. Plus, like ridge, it also works for high-dimensional problems with $p>n$.
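
To give an idea of what "iterated adaptive ridge" means in practice, here is a rough NumPy sketch of the general scheme on toy data (my own simplified formulation, not the l0ara or L0glm implementation; the weight update and the thresholding rule are illustrative assumptions): each pass solves a weighted ridge problem whose per-coefficient weights come from the previous iterate, which mimics an L0 penalty.

```python
import numpy as np

def iterated_adaptive_ridge(X, y, lam=2.0, delta=1e-5, n_iter=100):
    """Sketch of iterated adaptive ridge approximating L0-penalized least squares.

    Each iteration solves a ridge problem with per-coefficient penalty weights
    w_j = 1 / (beta_j^2 + delta^2) taken from the previous estimate, so that the
    effective penalty approaches lam * (number of nonzero coefficients)."""
    n, p = X.shape
    w = np.ones(p)
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
        w = 1.0 / (beta**2 + delta**2)
    beta[np.abs(beta) < np.sqrt(delta)] = 0.0   # crude hard-threshold of near-zero coefficients
    return beta

# toy usage: lam = 2 mimics an AIC-type penalty, lam = log(n) a BIC-type penalty
# (on the Gaussian deviance scale with unit error variance -- an assumption of this sketch)
rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[3.0, -2.0], np.zeros(p - 2)])
y = X @ beta_true + rng.standard_normal(n)
print(iterated_adaptive_ridge(X, y, lam=np.log(n)))
```

The point of this sketch is only the structure of the iteration; the l0ara and L0glm packages mentioned in this answer and its comments handle GLM likelihoods, convergence checks, and the zero-thresholding rule more carefully.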

If you would like to stick with regular ridge regression, then this presentation gives a good overview of the strategies you can use to tune the ridge penalization factor. Information criteria such as AIC or BIC can also be used to tune regularisation, and they each asymptotically approximate a particular form of cross validation:

  • AIC approximately minimizes the prediction error and is asymptotically equivalent to leave-one-out cross-validation (LOOCV) (Stone 1977); LOOCV in turn is approximated by generalized cross-validation (GCV), but LOOCV should always be better than GCV. AIC is not consistent though, which means that even with a very large amount of data ($n$ going to infinity), and even if the true model is among the candidate models, the probability of selecting the true model based on the AIC criterion does not approach 1.
  • BIC is an approximation to the integrated marginal likelihood $P(D|M,A)$ ($D$ = data, $M$ = model, $A$ = assumptions), which under a flat prior is equivalent to seeking the model that maximizes $P(M|D,A)$. Its advantage is that it is consistent, which means that with a very large amount of data ($n$ going to infinity), and if the true model is among the candidate models, the probability of selecting the true model based on the BIC criterion approaches 1. This comes at a slight cost to prediction performance, though, if $n$ is small. BIC is also equivalent to leave-$k$-out cross-validation (LKOCV) with $k=n[1-1/(\log(n)-1)]$, where $n$ is the sample size (Shao 1997). There are many different versions of the BIC, though, which come down to making different approximations of the marginal likelihood or assuming different priors. E.g. instead of using a prior uniform over all possible models as in the original BIC, EBIC uses a prior uniform over models of fixed size (Chen & Chen 2008), whereas BICq uses a Bernoulli distribution specifying the prior probability for each parameter to be included.

Note that the LOOCV error can also be calculated analytically from the residuals and the diagonal of the hat matrix, without having to actually carry out any cross-validation. This is always available as an exact alternative to the AIC, which only asymptotically approximates the LOOCV error.
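
As a small illustration of that last point (plain NumPy on made-up data; not tied to any particular package), the LOOCV and GCV errors of a ridge fit can be read directly off the residuals and the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 80, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

def ridge_loocv_gcv(lam):
    """Analytic LOOCV and GCV errors for ridge at penalty lam.

    LOOCV: mean((e_i / (1 - h_ii))^2), using residuals e and hat-matrix diagonal h_ii.
    GCV:   mean((e_i / (1 - trace(H)/n))^2), i.e. h_ii replaced by its average."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    e = y - H @ y
    h = np.diag(H)
    loocv = np.mean((e / (1.0 - h)) ** 2)
    gcv = np.mean((e / (1.0 - np.mean(h))) ** 2)
    return loocv, gcv

for lam in [0.1, 1.0, 10.0]:
    print(lam, ridge_loocv_gcv(lam))
```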

References

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39, 44–47.

Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221–242.

Tom Wenseleers
  • Thank you for your extensive answer! Could you make it more obvious which part of it addresses my question? – Richard Hardy Jun 27 '19 at 09:51
  • Well, in a way it's maybe not answering your question - rather I would argue not to use ridge but iterated adaptive ridge instead, as you then don't have to tune the regularization parameter at all; you can just set it a priori to conform to either maximizing AIC (if you set lambda=2) or BIC (if you set lambda=log(n)). Makes sense? – Tom Wenseleers Jun 27 '19 at 09:54
  • Better now, thank you. I still wonder whether the answer fits the question. An alternative could be to post this answer on a new thread dedicated to the question of when one should choose iterated adaptive ridge regression instead of ridge regression. I would be happy to upvote elements of such a thread. Meanwhile, for those who are going to choose ridge regression, an answer to the precise question I have posed here might still be of interest. – Richard Hardy Jun 27 '19 at 10:54
  • Yes I know what I wrote doesn't exactly answer your question - but I thought it might still be useful to have it here for reference, and it was just too long for a comment... – Tom Wenseleers Jun 27 '19 at 11:34
  • 1
    An alternative: you could post this as a separate thread and then post a comment here with a link to it. – Richard Hardy Jun 27 '19 at 11:50
  • Problem is that I would have to post a thread with the answer already included (as I cannot see any benefits of using plain ridge over adaptive ridge or L0 penalized regression). But I'm working on an R package that fits ridge, adaptive ridge or L0 penalized GLMs with or without positivity constraints on the fitted coefficients and with inference also included. That should be done by next week. Once that's on github I'll use that to update my answer here to give a more targeted answer to your question... Hope that's OK... Just hold on a bit... – Tom Wenseleers Jun 28 '19 at 07:26
  • 1
    "LOOCV in turn is approximated by generalized cross validation (GCV), but LOOCV should always be better than GCV." -- That's not quite right. GCV is a LOOCV of a rotation of the original regression problem. $\bf{X'} = \bf{Q} \bf{X}$, $\bf{y'} = \bf{Q} \bf{y}$ where $\bf{Q} \bf{Q}^H = \bf{I}$ and the rotation matrix is chosen so as to spread out variance. It avoids certain problem cases that LOOCV has and in many cases is better than LOOCV. For more details and examples where GCV is better, see this [blog post](https://rb.gy/bwhsqq) and [Golub and Wahba](https://rb.gy/f6fxjw). – Ryan Burn Aug 02 '20 at 18:12
  • @TomWenseleers, how are the R package and the update doing? Had forgotten about this thread but found it anew, read the comments again and hence am posting this. – Richard Hardy Oct 07 '21 at 11:51
  • @RichardHardy It's been on the back burner for a while, but I will have some bioinformatics students work on this this semester to finish it all off. This is a really old version, and it still needs a lot of work (bug fixes, but also implementing more efficient solvers and additional functionality, etc.): https://github.com/tomwenseleers/L0glm. There is also the l0ara package, which you could look into (that one is already on CRAN): https://github.com/wguo1017/l0ara – Tom Wenseleers Oct 08 '21 at 14:06
  • @TomWenseleers, thank you! – Richard Hardy Oct 08 '21 at 14:18