16

Penalized regression estimators such as LASSO and ridge are said to correspond to Bayesian estimators with certain priors. I guess (as I do not know enough about Bayesian statistics) that for a fixed tuning parameter, there exists a concrete corresponding prior.

Now a frequentist would optimize the tuning parameter by cross validation. Is there a Bayesian equivalent of doing so, and is it used at all? Or does the Bayesian approach effectively fix the tuning parameter before seeing the data? (I guess the latter would be detrimental to predictive performance.)

Richard Hardy
  • I imagine that a fully Bayesian approach would start with a given prior and not modify it, yes. But there is also an [tag:empirical-bayes] approach that optimizes over hyperparameter values: e.g. see https://stats.stackexchange.com/questions/24799. – amoeba Sep 21 '18 at 12:10
  • Additional question (could be part of main Q): Do there exist some prior on the regularization parameter that somehow replaces the cross-validation process, somehow? – kjetil b halvorsen Sep 21 '18 at 12:21
  • Bayesians can put a prior on the tuning parameter, as it usually corresponds to a variance parameter. This is usually what is done to avoid CV in order to stay fully-Bayes. Alternatively, you can use REML to optimize the regularization parameter. – guy Sep 21 '18 at 12:49
  • @amoeba, thank you, this is roughly what I expected. The link to the other thread was helpful, too. – Richard Hardy Sep 21 '18 at 14:31
  • @kjetilbhalvorsen, great question. Not sure if it should be appended here or posted separately, though. – Richard Hardy Sep 21 '18 at 14:34
  • @guy can you explain better the connection between a hyper-prior and k-fold CV? Is there a prior that would induce a similar behavior? –  Dec 07 '18 at 05:46
  • PS: to those aiming for the bounty, note my comment: *I want to see an explicit answer that shows a prior that induces a MAP estimate equivalent to frequentist cross-validation.* –  Dec 07 '18 at 06:49
  • @statslearner2 Did you see the link I gave in the 1st comment above? This might be useful for you. – amoeba Dec 07 '18 at 08:49
  • @statslearner2 related https://andrewgelman.com/2004/11/08/crossvalidation/ https://stats.stackexchange.com/questions/343420/bayesian-thinking-about-overfitting – Sextus Empiricus Dec 07 '18 at 11:00
  • @amoeba I've read them, but they do not address this question. –  Dec 08 '18 at 21:44
  • @statslearner2 I think it does address Richard's question very well. Your bounty seems to be focused on a more narrow aspect (about a hyperprior) than Richard's Q. – amoeba Dec 09 '18 at 00:35

2 Answers

18

Penalized regression estimators such as LASSO and ridge are said to correspond to Bayesian estimators with certain priors.

Yes, that is correct. Whenever we have an optimisation problem involving maximisation of the log-likelihood function minus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the penalty function is taken to be the negative logarithm of a prior kernel.$^\dagger$ To see this, suppose we have a penalty function $w$ with a tuning parameter $\lambda$. The objective function in these cases can be written as:

$$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta|\lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta|\mathbf{x}, \lambda) + \text{const}, \\[6pt] \end{aligned} \end{equation}$$

where we use the prior $\pi(\theta|\lambda) \propto \exp ( -w(\theta|\lambda))$. Observe here that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you are undertaking classical optimisation with a fixed tuning parameter, this is equivalent to undertaking Bayesian MAP estimation with a fixed hyperparameter. For LASSO and Ridge regression the penalty functions and corresponding prior-equivalents are:

$$\begin{equation} \begin{aligned} \text{LASSO Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Laplace} \Big( 0, \frac{1}{\lambda} \Big) = \prod_{k=1}^m \frac{\lambda}{2} \cdot \exp ( -\lambda |\theta_k| ), \\[6pt] \text{Ridge Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Normal} \Big( 0, \frac{1}{2\lambda} \Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \cdot \exp ( -\lambda \theta_k^2 ). \\[6pt] \end{aligned} \end{equation}$$

The former method penalises the regression coefficients according to their absolute magnitude, which is the equivalent of imposing a Laplace prior located at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is the equivalent of imposing a normal prior located at zero.
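
To make the correspondence concrete, here is a minimal numerical sketch (Python; the Gaussian linear model with unit error variance and the simulated data are illustrative assumptions, not part of the answer above). Under the parameterisation above, the MAP estimate obtained by maximising $\ell_\mathbf{x}(\theta) - \lambda\|\theta\|^2$ coincides with the closed-form penalised least-squares (ridge) solution.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data; design, coefficients and noise level are illustrative assumptions.
rng = np.random.default_rng(0)
n, m, lam = 100, 5, 2.0
X = rng.normal(size=(n, m))
y = X @ rng.normal(size=m) + rng.normal(size=n)

# Negative MAP objective: -(Gaussian log-likelihood with sigma^2 = 1) + lam * ||theta||^2,
# i.e. a Normal(0, 1/(2*lam)) prior on each coefficient as in the table above.
def neg_objective(theta):
    return 0.5 * np.sum((y - X @ theta) ** 2) + lam * np.sum(theta ** 2)

theta_map = minimize(neg_objective, np.zeros(m)).x

# Closed-form minimiser of the same penalised least-squares problem:
# theta = (X'X + 2*lam*I)^{-1} X'y  (the factor 2 reflects sigma^2 = 1 above).
theta_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(m), X.T @ y)

print(np.allclose(theta_map, theta_ridge, atol=1e-4))  # True, up to optimiser tolerance
```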

Now a frequentist would optimize the tuning parameter by cross validation. Is there a Bayesian equivalent of doing so, and is it used at all?

So long as the frequentist method can be posed as an optimisation problem (rather than, say, involving a hypothesis test or something of that kind) there will be a Bayesian analogy using an equivalent prior. Just as the frequentist may treat the tuning parameter $\lambda$ as unknown and estimate it from the data, the Bayesian may similarly treat the hyperparameter $\lambda$ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function:

$$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - h(\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \cdot \exp ( -h(\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda)}{\iint L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda) \, d\theta \, d\lambda} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta, \lambda|\mathbf{x}) + \text{const}, \\[6pt] \end{aligned} \end{equation}$$

where $\pi(\lambda) \propto \exp(-h(\lambda))$ and any $\lambda$-dependent normalising constant of $\exp(-w(\theta|\lambda))$ is taken to be absorbed into $h(\lambda)$.

This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating it as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$.)
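
As a rough illustration of this fully Bayesian route (a sketch, not part of the original answer: the Gaussian likelihood with unit error variance and the Exponential(1) hyperprior on $\lambda$ are assumptions chosen purely for concreteness), the joint posterior mode in $(\theta, \lambda)$ can be found by direct optimisation. The $\lambda$-dependent normalising constant of the Normal$(0, 1/(2\lambda))$ prior is included explicitly, since $\lambda$ is no longer fixed.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_joint(params, X, y):
    # Optimise over log(lambda) so lambda stays positive; this is only a monotone
    # reparameterisation of the argument, so the location of the joint mode is unchanged.
    theta, log_lam = params[:-1], params[-1]
    lam = np.exp(log_lam)
    m = theta.size
    log_lik = -0.5 * np.sum((y - X @ theta) ** 2)                         # Gaussian errors, sigma^2 = 1
    log_prior = 0.5 * m * np.log(lam / np.pi) - lam * np.sum(theta ** 2)  # Normal(0, 1/(2*lam)) per coefficient
    log_hyper = -lam                                                      # Exponential(1) hyperprior on lambda (assumed)
    return -(log_lik + log_prior + log_hyper)

# Illustrative simulated data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

res = minimize(neg_log_joint, np.zeros(6), args=(X, y))
theta_map, lam_map = res.x[:-1], np.exp(res.x[-1])
```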

(Comment from statslearner2 below) I'm looking for numerical equivalent MAP estimates. For instance, for a fixed penalty Ridge there is a gaussian prior that will give me the MAP estimate exactly equal the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me the MAP estimate which is similar to the CV-ridge estimate?

Before proceeding to look at $K$-fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $\mathbf{x}$. If you are willing to allow improper priors then the scope encompasses any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.

In the above form of model, involving a penalty function with a tuning parameter, $K$-fold cross-validation is commonly used to estimate the tuning parameter $\lambda$. For this method you partition the data vector $\mathbf{x}$ into $K$ sub-vectors $\mathbf{x}_1,...,\mathbf{x}_K$. For each fold $k=1,...,K$ you fit the model with the "training" data $\mathbf{x}_{-k}$ and then measure the fit of the model with the "testing" data $\mathbf{x}_k$. In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared to the actual testing data to give a measure of "loss":

$$\begin{matrix} \text{Estimator} & & \hat{\theta}(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Predictions} & & \hat{\mathbf{x}}_k(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Testing loss} & & \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda). \\[6pt] \end{matrix}$$

The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:

$$\mathscr{L}(\mathbf{x}, \lambda) = \sum_k \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda)$$

One then estimates the tuning parameter by minimising the overall loss measure:

$$\hat{\lambda} \equiv \hat{\lambda}(\mathbf{x}) \equiv \underset{\lambda}{\text{arg min }} \mathscr{L}(\mathbf{x}, \lambda).$$
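
For concreteness, here is a minimal sketch of this procedure for the ridge case (the data-generating setup is assumed, squared prediction error on the held-out fold serves as the testing loss $\mathscr{L}_k$, and a grid search stands in for the minimisation over $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_loss(lam, X, y, K=5):
    # Aggregate K-fold testing loss L(x, lambda); alpha = 2*lam matches the penalty
    # lam * ||theta||^2 under a unit error variance, as in the first section.
    loss = 0.0
    for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        fit = Ridge(alpha=2 * lam).fit(X[train], y[train])     # estimator theta_hat(x_{-k}, lambda)
        loss += np.sum((y[test] - fit.predict(X[test])) ** 2)  # testing loss on fold k
    return loss

# Illustrative simulated data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

grid = np.logspace(-3, 3, 50)
lam_hat = grid[np.argmin([cv_loss(lam, X, y) for lam in grid])]  # argmin of L(x, lambda)
```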

We can see that this is an optimisation problem, and so we now have two separate optimisation problems (i.e., the one described in the sections above for $\theta$, and the one described here for $\lambda$). Since the latter optimisation does not involve $\theta$, we can combine these optimisations into a single problem, with some technicalities that I discuss below. To do this, consider the optimisation problem with objective function:

$$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda), \\[6pt] \end{aligned} \end{equation}$$

where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$-fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ renders the optimisation problem ill-defined, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.

From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$-fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

$$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda)}{\iint L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda) \, d\theta \, d\lambda} \Bigg) + \text{const} \\[6pt] &= \ln \pi^*(\theta, \lambda|\mathbf{x}) + \text{const}, \\[6pt] \end{aligned} \end{equation}$$

where $L_\mathbf{x}^*(\theta, \lambda) \propto \exp( \ell_\mathbf{x}(\theta) - \delta \mathscr{L}(\mathbf{x}, \lambda))$ and $\pi (\theta, \lambda) \propto \exp( -w(\theta|\lambda))$, with a fixed (and very large) hyperparameter $\delta$, and where $\pi^*(\theta, \lambda|\mathbf{x})$ denotes the posterior formed from this adjusted likelihood and prior.
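
As a sanity check on this construction (again a sketch under an assumed ridge setup, reusing the same illustrative data-generating process and CV loss as the snippet above), one can profile $\mathcal{H}_\mathbf{x}(\theta, \lambda)$ over $\theta$ on a grid of $\lambda$ values. Because $\mathscr{L}(\mathbf{x}, \lambda)$ does not involve $\theta$, taking $\delta$ large makes the maximising $\lambda$ coincide with the cross-validation choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_loss(lam, X, y, K=5):
    # Aggregate K-fold testing loss L(x, lambda), as in the earlier sketch.
    loss = 0.0
    for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        fit = Ridge(alpha=2 * lam).fit(X[train], y[train])
        loss += np.sum((y[test] - fit.predict(X[test])) ** 2)
    return loss

def profiled_H(lam, X, y, delta):
    # max_theta H(theta, lambda): for the ridge penalty, theta can be profiled out in closed form.
    m = X.shape[1]
    theta_hat = np.linalg.solve(X.T @ X + 2 * lam * np.eye(m), X.T @ y)  # MAP of theta given lambda
    log_lik = -0.5 * np.sum((y - X @ theta_hat) ** 2)
    return log_lik - lam * np.sum(theta_hat ** 2) - delta * cv_loss(lam, X, y)

# Illustrative simulated data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

grid = np.logspace(-3, 3, 50)
lam_cv = grid[np.argmin([cv_loss(lam, X, y) for lam in grid])]
lam_joint = grid[np.argmax([profiled_H(lam, X, y, delta=1e6) for lam in grid])]
print(lam_cv, lam_joint)  # the two coincide once delta is large
```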

(Note: For a related question looking at logistic ridge regression framed in Bayesian terms see here.)


$^\dagger$ This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.

Ben
  • Can you explain how the hyper-prior is going to result in a similar MAP estimate as k-fold CV? This is not clear to me. –  Dec 07 '18 at 06:41
  • Also, can you make you answer specific to LASSO and Ridge, and give the specific parameterizations of the priors that would give the "k-fold-CV-like" behavior? –  Dec 07 '18 at 06:43
  • @statslearner: It won't be the same as frequentist cross-validation tests. Bayesians use their own methods, which are analogous to certain types of classical optimisation, but are not analogous to classical hypothesis testing. I have edited the question to make this clearer. – Ben Dec 07 '18 at 06:55
  • I'm looking for numerical equivalent MAP estimates. For instance, for a fixed penalty Ridge there is a gaussian prior that will give me the MAP estimate exactly equal the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me the MAP estimate which is similar to the cv-ridge estimate? –  Dec 07 '18 at 06:57
  • Okay, I think I understand what you want now. Let me think about it and I'll update this answer later if I have anything useful to say on that. – Ben Dec 07 '18 at 07:02
  • Ok +1 already, but for the bounty I'm looking for these more precise answers. –  Dec 07 '18 at 07:04
  • **1.** I do not get how (since frequentists generally use classical hypothesis tests, etc., which have no Bayesian equivalent) connects to the rest of what I or you are saying; parameter tuning has nothing to do with hypothesis tests, or does it? **2.** Do I understand you correctly that there is no Bayesian equivalent to frequentist regularized estimation when the tuning parameter is selected by cross validation? What about empirical Bayes that amoeba mentions in the comments to the OP? – Richard Hardy Dec 08 '18 at 17:09
  • **3.** Since regularization with cross validation seems to be quite effective for, say, prediction, doesn't point **2.** suggest that the Bayesian approach is somehow inferior? – Richard Hardy Dec 08 '18 at 17:10
  • @statslearner2: I have now added a section relating directly to the MAP analogy to k-fold cross-validation. – Ben Dec 09 '18 at 02:06
  • @RichardHardy My experience from selecting hyperparameters in VARs by empirical Bayes is that as long as you’re in a good neighborhood, the specific choice isn’t important for predictive accuracy. I would suspect that in the same arguably loosely-defined fashion CV helps you find a good neighborhood. But the precise value may be of less importance. – hejseb Dec 09 '18 at 06:29
  • Thanks Ben, could you connect this more with the specific case of Lasso and Ridge? What types of priors would give us a "CV-like" result in these cases? –  Dec 11 '18 at 23:05
  • @statslearner2: I have now added the specific forms of LASSO and Ridge regressions in the first section. – Ben Dec 11 '18 at 23:38
  • @RichardHardy: Thanks for these detailed comments. **1.** I have now amended this part to be clearer on what I meant (i.e., that there is no Bayesian equivalent *if* the frequentist decides to use a hypothesis test). **2.** I have now answered this in the third section. **3.** I don't know how you jump from "quite effective" to "superior to Bayesian methods". If you want to assert the superiority of the frequentist methods over Bayesian methods, I think that would need to be established by detailed comparisons of properties, simulations, etc. It is possible that both methods are effective. – Ben Dec 11 '18 at 23:44
  • @Ben, thanks for your explicit answer and the subsequent clarifications. You have once again done a wonderful job! Regarding **3.**, yes, it was quite a jump; it certainly is not a strict logical conclusion. But looking at your points w.r.t. **2.** (that a Bayesian method can approximate the frequentist penalized optimization with cross validation), I no longer think that Bayesian must be "inferior". The last quibble on my side is, could you perhaps explain how the last, complicated formula could arise in practice in the Bayesian paradigm? Is it something people would normally use or not? – Richard Hardy Dec 12 '18 at 12:24
  • @Ben (ctd) My problem is that I know little about Bayes. Once it gets technical, I may easily lose the perspective. So I wonder whether this complicated analogy (the last formula) is something that is just a technical possibility or rather something that people routinely use. In other words, I am interested in whether the idea behind cross validation (here in the context of penalized estimation) is resounding in the Bayesian world, whether its advantages are utilized there. Perhaps this could be a separate question, but a short description will suffice for this particular case. – Richard Hardy Dec 12 '18 at 12:28
  • @RichardHardy: It depends what you mean by *use*. The general idea of establishing Bayesian equivalents to other optimisation methods in classical statistics is just to check that a particular frequentist procedure also falls within the scope of Bayesian analysis, and can be given a Bayesian interpretation. Usually this doesn't involve any difference in its use. Often it will just mean that an analyst continues to use the frequentist procedure, but now does so knowing that it can be given a Bayesian interpretation as well. ... – Ben Dec 12 '18 at 23:29
  • ... That last formula really just establishes that model fitting with K-fold CV can be framed equivalently as a Bayesian MAP estimator, but this involves a different likelihood function in the Bayesian analysis than in the frequentist analysis. In particular, if you're doing classical model fitting via MLE with a penalty function, plus K-fold cross-validation of a tuning parameter, that is equivalent to doing Bayesian MAP with a particular prior that encompasses the penalty function (possibly improper), and a likelihood function that is adjusted to incorporate the K-fold-CV. – Ben Dec 12 '18 at 23:32
  • Ben, what do you think about empirical Bayes approach of optimizing the hyper-parameters? See my 1st comment under the Q. I was under impression that this is a pretty standard thing to do in a Bayesian setting instead of cross-validation. It seems to directly answer @Richard's question on what is a Bayesian equivalent of CV. Of course empirical Bayes is not a fully Bayesian procedure, but I often see it used in Bayesian contexts. What's your take on it? – amoeba Dec 13 '18 at 00:05
  • @amoeba: Yeah, I think you're right. As you rightly point out, that is not a pure Bayesian approach. I guess it would be a way of sneaking the data back into the prior for the hyper-parameter, so it might give you a closer analogy to the frequentist procedure, but at the expense of dropping the pure Bayesian approach. – Ben Dec 13 '18 at 00:38
6

Indeed, most penalized regression methods correspond to placing a particular type of prior on the regression coefficients. For example, you get the LASSO using a Laplace prior, and ridge regression using a normal prior. The tuning parameters are the “hyperparameters” under the Bayesian formulation, on which you can place an additional prior in order to estimate them; for example, in the case of ridge regression it is often assumed that the inverse variance of the normal distribution has a $\chi^2$ prior. However, as one would expect, the resulting inferences can be sensitive to the choice of the prior distributions for these hyperparameters. For example, for the horseshoe prior there are theoretical results indicating that the prior placed on the hyperparameters should reflect the number of non-zero coefficients you expect to have.
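
As a rough numerical sketch of that hierarchical setup (not taken from the linked paper: the Gaussian likelihood with unit error variance and the degrees of freedom of the $\chi^2$ hyperprior are illustrative assumptions), the joint posterior mode of the coefficients and the prior precision can be found by direct optimisation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2, norm

def neg_log_joint(params, X, y, df=3.0):
    theta, log_tau = params[:-1], params[-1]  # tau = prior precision (inverse variance) of the coefficients
    tau = np.exp(log_tau)
    log_lik = -0.5 * np.sum((y - X @ theta) ** 2)                     # Gaussian errors, sigma^2 = 1
    log_prior = np.sum(norm.logpdf(theta, scale=1.0 / np.sqrt(tau)))  # Normal(0, 1/tau) per coefficient
    log_hyper = chi2.logpdf(tau, df)                                  # chi^2 hyperprior on the precision (illustrative df)
    return -(log_lik + log_prior + log_hyper)

# Illustrative simulated data.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

res = minimize(neg_log_joint, np.zeros(6), args=(X, y))
theta_map, tau_map = res.x[:-1], np.exp(res.x[-1])
```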

A nice overview of the links between penalized regression and Bayesian priors is given, for example, by Mallick and Yi.

Dimitris Rizopoulos
  • Thank you for your answer! The linked paper is quite readable, which is nice. – Richard Hardy Sep 21 '18 at 14:45
  • This does not answer the question, can you elaborate to explain how does the hyper-prior relate to k-fold CV? –  Dec 07 '18 at 05:42