
We are all familiar with the notion, well documented in the literature, that LASSO optimization (for simplicity, confine attention here to the case of linear regression) $$ {\rm loss} = \| y - X \beta \|_2^2 + \lambda \| \beta \|_1 $$ is equivalent to the linear model with Gaussian errors in which the parameters are given the Laplace prior
$$ \pi(\beta) \propto \exp(-\lambda \| \beta \|_1 ). $$ We are also aware that the higher one sets the tuning parameter $\lambda$, the larger the fraction of parameters that get set to zero. This being said, I have the following thought question:

Consider that from the Bayesian point of view we can calculate the posterior probability that, say, the non-zero parameter estimates lie in any given collection of intervals and that the parameters set to zero by the LASSO equal zero. What has me confused is: given that the Laplace prior is continuous (in fact absolutely continuous), how can there be any mass on any set that is a product of intervals and singletons at $\{0\}$?
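To make the frequentist half of this concrete, here is a minimal numpy sketch of how LASSO produces exact zeros. In the orthonormal-design case the LASSO solution is soft-thresholding of the OLS estimates; the OLS values and the $\lambda$ grid below are made-up illustrative numbers:

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO solution for an orthonormal design: shrink each OLS
    coefficient toward zero, clipping to exactly 0 once |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols = np.array([3.0, 1.5, 0.4, -0.2, -2.5])  # hypothetical OLS estimates
for lam in (0.0, 0.5, 2.0):
    beta = soft_threshold(ols, lam)
    print("lambda =", lam, "->", beta, "| exact zeros:", int(np.sum(beta == 0)))
```

Raising `lam` sends more components of the estimate to exactly zero, which is the behavior whose Bayesian counterpart the question asks about.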

Haitao Du
Grant Izmirlian
  • What makes you think that the posterior isn't also a continuous pdf? The fact that the maximum of the posterior happens to occur at a point that happens to have lots of 0 components doesn't mean by itself that the posterior isn't a continuous pdf. – Brian Borchers Dec 21 '15 at 23:37
  • The posterior is a continuous PDF. Viewed as constrained maximum likelihood estimation: if we imagine repeated draws from the same data distribution, when the true model has zeros at multiple regression coefficients and the tuning constant is large enough, then the CMLE will always have the same components set to zero, and the non-zero parameters will spread out into corresponding confidence intervals. From the Bayesian perspective this is equivalent to having positive probability for such sets. My question is how this can be so for a continuous distribution. – Grant Izmirlian Dec 22 '15 at 00:30
  • The CMLE solution coincides with the MAP estimate. There's really nothing more to be said. – Sycorax Dec 22 '15 at 02:13
  • The CMLE solution isn't a sample from the posterior. – Brian Borchers Dec 22 '15 at 06:15
  • I didn't say that the CMLE is a sample from the posterior, but there is a correspondence between frequentist confidence intervals and posterior probability. My original question is that there is an apparent contradiction: a continuous distribution on $\mathbb{R}^d$, say, appears to assign positive mass to sets of lower dimension. – Grant Izmirlian Dec 22 '15 at 20:57
  • There is no contradiction because the posterior does not put mass on sets of lower dimension. – Xi'an Jan 09 '16 at 14:26
  • This is the fifth time I am reading your question, and it is still not clear to me. From my understanding of the problem, you are confusing a point estimation method with an interval estimation technique. A Bayesian never estimates a parameter as exactly zero. – TPArrow Sep 04 '16 at 14:37

1 Answer


As the comments above point out, the Bayesian interpretation of LASSO does not take the expected value of the posterior distribution, which is what you would do if you were a purist. If that were the case, then you would be right: under the continuous posterior there is zero probability that any parameter is exactly zero.

In reality, the Bayesian interpretation of LASSO takes the MAP (Maximum A Posteriori) estimator of the posterior. It sounds like you are familiar with this, but for anyone who is not: this is basically Bayesian maximum likelihood, where you use the value at which the posterior density is maximized (i.e., the mode) as your estimator for the parameters in LASSO. Since the Laplace prior density rises exponentially toward zero from the negative direction and falls off exponentially in the positive direction (it is sharply peaked at zero), unless your data strongly suggest that a coefficient takes some other significant value, the mode of your posterior is likely to be exactly 0.

Long story short: your intuition seems to be based on the mean of the posterior, but the Bayesian interpretation of LASSO is based on taking its mode.
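This mode-versus-mean distinction can be sketched numerically in a toy one-dimensional model (the data value `y = 0.5` and penalty `lam = 2.0` are assumed values chosen so that the thresholding kicks in). The unnormalized posterior is a Gaussian likelihood times a Laplace prior, evaluated on a grid; the true mode is exactly 0 here because $|y| < \lambda$, while the posterior mean is small but nonzero:

```python
import numpy as np

# Toy model: y ~ N(beta, 1), prior density proportional to exp(-lam*|beta|).
y, lam = 0.5, 2.0  # assumed illustrative values

grid = np.linspace(-5.0, 5.0, 200001)
log_post = -0.5 * (y - grid) ** 2 - lam * np.abs(grid)  # up to a constant
post = np.exp(log_post - log_post.max())
post /= post.sum()  # normalize over the grid

map_est = grid[np.argmax(post)]   # the true mode is exactly 0 (|y| < lam)
mean_est = np.sum(grid * post)    # the mean is nonzero: no point mass at 0
print("MAP estimate:", map_est, "| posterior mean:", mean_est)
```

The MAP estimate lands at zero, reproducing the LASSO behavior, while the continuous posterior itself assigns no mass to the singleton $\{0\}$ and has a nonzero mean, which is exactly the resolution of the apparent contradiction in the question.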

www3