In Bayesian statistics, regularization corresponds to the choice of a prior. For ElasticNet the prior takes the form (Li and Lin, 2010)
$$
\pi(\boldsymbol\beta) \propto \exp\left\{ -\lambda_1 \| \boldsymbol\beta \|_1 - \lambda_2 \| \boldsymbol\beta \|_2^2 \right\}
$$
This distribution is unnormalized. The paper that you refer to by Hans (2011) "broadens the scope of the Bayesian connection by providing a complete characterization of a class of prior distributions that generate the elastic net estimate as the posterior mode." The author proposes a normalized prior distribution that can be considered as an equivalent of ElasticNet regularization. Details and proofs can be found in the paper.
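To spell out the connection between the prior and the penalty: assuming a Gaussian likelihood $y \mid \boldsymbol\beta \sim \mathcal{N}(X\boldsymbol\beta, \sigma^2 I)$ (the standard regression setting, stated here only to make the derivation explicit), the posterior mode is
$$
\hat{\boldsymbol\beta} = \arg\max_{\boldsymbol\beta} \left[ \log p(y \mid \boldsymbol\beta) + \log \pi(\boldsymbol\beta) \right] = \arg\min_{\boldsymbol\beta} \left[ \frac{1}{2\sigma^2} \| y - X\boldsymbol\beta \|_2^2 + \lambda_1 \| \boldsymbol\beta \|_1 + \lambda_2 \| \boldsymbol\beta \|_2^2 \right],
$$
which is the elastic net objective with the penalty weights rescaled by $2\sigma^2$.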
$\ell_0$ regularization can be thought of (Polson and Sun, 2017) as using a prior that is a mixture of a Dirac delta centered at zero, $\delta_0$, and a Gaussian
$$
\pi(\beta_i) = (1 - \theta)\, \delta_0(\beta_i) + \theta\, \mathcal{N}(\beta_i \mid 0, \sigma_\beta^2)
$$
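As a minimal sketch of what sampling from this spike-and-slab prior looks like (the values of $\theta$ and $\sigma_\beta$ below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, arbitrary settings: theta is the prior inclusion
# probability, sigma_beta the standard deviation of the Gaussian slab.
theta, sigma_beta, p = 0.2, 1.0, 100_000

# Draw from the mixture: with probability (1 - theta) a coefficient is
# exactly zero (the Dirac spike), otherwise it comes from N(0, sigma_beta^2).
include = rng.random(p) < theta
beta = np.where(include, rng.normal(0.0, sigma_beta, p), 0.0)

print(np.mean(beta == 0.0))  # close to 1 - theta = 0.8: exact zeros
```

Unlike, say, a Laplace (Lasso) prior, draws from this mixture contain exact zeros with positive probability, which is what makes it a genuinely sparse prior.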
You asked in the comments if $\pi(\boldsymbol\beta)\propto \exp\{-\lambda \|\boldsymbol\beta\|_0\}$ would be a proper prior equivalent to $\ell_0$ regularization. First, recall that $\ell_0$ is not a proper norm. Second, think of what this prior would do: since a single point has zero probability under a continuous distribution, it would put zero probability mass on parameters equal to exactly zero, and a constant density on all the other values. It would be an improper prior, and it would not do much in the way of regularization either, because it would be essentially uniform over all non-zero values. That is why the prior above has two components: a point mass for exact zeros, $(1 - \theta)\, \delta_0$, and a non-uniform component for all the other values, $\theta\, \mathcal{N}(0, \sigma_\beta^2)$. It works differently than $\ell_0$ regularization, but $\ell_0$ regularization itself is almost never used, because it is problematic even from an optimization point of view (the problem is combinatorial and non-convex).
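To make the impropriety concrete: the set of vectors in $\mathbb{R}^p$ with at least one component exactly equal to zero has Lebesgue measure zero, so $\|\boldsymbol\beta\|_0 = p$ almost everywhere and
$$
\int_{\mathbb{R}^p} \exp\{-\lambda \|\boldsymbol\beta\|_0\} \, d\boldsymbol\beta = e^{-\lambda p} \int_{\mathbb{R}^p} d\boldsymbol\beta = \infty,
$$
so no normalizing constant can make this a proper distribution.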
Notice however that there are multiple priors that can lead to sparse solutions (see van Erp et al., 2019), and the priors that correspond most closely to the penalties do not necessarily perform as well as the traditional penalized estimators. The priors may be mathematically equivalent to the penalties, but different estimation methods, implementations, and other technical nuances can lead to differences in the results, so in practice other priors may be preferable.