
It's well known that linear regression with a squared $l^2$ penalty (ridge regression) is equivalent to finding the MAP estimate under a Gaussian prior on the coefficients. Similarly, using an $l^1$ penalty (the lasso) is equivalent to placing a Laplace prior on them.
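As a quick sketch of the first equivalence (standard material; the notation $\sigma^2$, $\tau^2$ below is illustrative): with likelihood $y \mid \beta \sim N(X\beta, \sigma^2 I)$ and independent priors $\beta_j \sim N(0, \tau^2)$,

$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const},$$

so the MAP estimate is exactly the ridge solution with $\lambda = \sigma^2/\tau^2$. Swapping in a Laplace prior $p(\beta_j) \propto e^{-|\beta_j|/b}$ replaces the second term with $\|\beta\|_1/b$ and yields the lasso.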

It's not uncommon to use a weighted combination of $l^1$ and $l^2$ regularization (the elastic net). Can we say that this is equivalent to some prior distribution over the coefficients (intuitively, it seems that it must be)? Can we give this distribution a nice analytic form (maybe a mixture of Gaussian and Laplacian)? If not, why not?
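For concreteness, the penalized objective in question can be written in the standard form

$$\hat\beta = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2, \qquad \lambda_1, \lambda_2 \ge 0.$$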

Michael Curry
  • see this paper: http://www.tandfonline.com/doi/abs/10.1198/jasa.2011.tm09241 (If this isn't properly answered in a week or two, I'll post (more or less) a summary of that paper) – user795305 Jun 02 '17 at 17:10
  • I should add that any time frequentists have a penalty $\mathrm{pen}$, a Bayesian can interpret that as a (possibly improper) prior $e^{-\mathrm{pen}}$ under a standard Gaussian model (made concrete in the sketch after these comments). – user795305 Jun 02 '17 at 17:12
  • thanks, this paper and its citations answer my question perfectly! – Michael Curry Jun 02 '17 at 17:15
  • Great! Do you mind pointing out which citations you mean? (I'm planning on reading this paper soon and am interested in your comments) – user795305 Jun 02 '17 at 17:20
  • Zou and Hastie 2005, which it looks like is the paper introducing the elastic net. They give an interpretation in terms of a prior. – Michael Curry Jun 02 '17 at 17:25
  • Okay, cool! I think their Bayesian interpretation ties into my second comment. – user795305 Jun 02 '17 at 18:22
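
To make the $e^{-\mathrm{pen}}$ comment concrete (a sketch, with the noise variance $\sigma^2$ treated as fixed): under $y \mid \beta \sim N(X\beta, \sigma^2 I)$, a prior $\pi(\beta) \propto e^{-\mathrm{pen}(\beta)}$ gives

$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \mathrm{pen}(\beta) + \text{const},$$

so MAP estimation and penalized least squares coincide, up to a rescaling of the penalty that can be absorbed into its tuning parameters. For the elastic net, $\mathrm{pen}(\beta) = \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$, and the implied prior factors across coordinates into terms proportional to a Laplace density times a Gaussian density: a compromise between the two priors rather than a mixture of them. This is the prior interpretation given in Zou and Hastie (2005), as noted in the comments.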

1 Answer


Ben's comment is likely sufficient, but I provide some more references, one of which predates the paper Ben referenced.

A Bayesian elastic net representation was proposed by Kyung et al. in their Section 3.1. Although the prior for the regression coefficients $\beta$ was correct, the authors wrote down the mixture representation incorrectly.

A corrected Bayesian model for the elastic net was recently proposed by Roy and Chakraborty (their Equation 6). The authors also present a Gibbs sampler for the posterior distribution and show that it converges to its stationary distribution at a geometric rate. For this reason, these references might turn out to be useful, in addition to the Hans paper that Ben linked in the comments.
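As a numerical illustration of this penalty-prior correspondence, here is a minimal self-contained sketch (my own illustration, not code from any of the papers cited here, and much simpler than their Gibbs samplers): assuming a unit noise variance and scikit-learn's documented elastic-net objective, it checks that the elastic net solution coincides with the MAP estimate under the implied prior.

```python
# Minimal sketch: the elastic net estimate equals the MAP estimate under the
# implied prior.  Illustrative only; not taken from the cited papers.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 1.0])
y = X @ beta_true + rng.standard_normal(n)

alpha, l1_ratio = 0.1, 0.5  # illustrative tuning parameters

# Frequentist side: scikit-learn minimizes
#   ||y - X b||^2 / (2n) + alpha * l1_ratio * ||b||_1
#                        + 0.5 * alpha * (1 - l1_ratio) * ||b||_2^2.
enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False).fit(X, y)

# Bayesian side: with a N(X b, I) likelihood and prior
#   pi(b) proportional to exp(-n*alpha*l1_ratio*||b||_1
#                             - 0.5*n*alpha*(1 - l1_ratio)*||b||_2^2),
# the negative log posterior is n times scikit-learn's objective, plus a constant.
def neg_log_posterior(b):
    nll = 0.5 * np.sum((y - X @ b) ** 2)  # Gaussian likelihood, sigma^2 = 1
    neg_log_prior = n * alpha * (l1_ratio * np.sum(np.abs(b))
                                 + 0.5 * (1 - l1_ratio) * np.sum(b ** 2))
    return nll + neg_log_prior

map_est = minimize(neg_log_posterior, np.zeros(p), method="Powell").x

print(np.round(enet.coef_, 3))  # the two estimates agree
print(np.round(map_est, 3))     # up to optimizer tolerance
```

This only checks point estimation; the cited papers concern full posterior inference, which is where the corrected mixture representation and the Gibbs samplers matter.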

Greenparker
  • (+1) Great answer! – user795305 Jun 05 '17 at 02:33
  • For anyone in the future: the papers are all worth looking at, but the Hans paper gives you some Gibbs samplers for various distributions as well as a hierarchical representation of the prior that can be translated easily to Stan. – Michael Curry Jun 28 '17 at 15:16
  • Would you point out the mistakes in Kyung et al. Section 3.1? – Albert Chen Nov 30 '20 at 15:33
  • With respect to the elastic net, the errors in Section 3.1 are described in [this](https://projecteuclid.org/download/pdfview_1/euclid.ba/1473276258) paper on page 757. – Greenparker Dec 01 '20 at 02:43