I searched for Bayesian Ridge Regression on the Internet, but most of the results I got are about Bayesian Linear Regression. I wonder if they are the same thing, because the formulas look quite similar.
1 Answer
Ridge regression uses regularization with the $L_2$ norm, while Bayesian regression is a regression model defined in probabilistic terms, with explicit priors on the parameters. The choice of priors can have a regularizing effect, e.g. using Laplace priors for the coefficients is equivalent to $L_1$ regularization. They are not the same, because ridge regression is a particular kind of regression model, whereas the Bayesian approach is a general way of defining and estimating statistical models that can be applied to many different models.
The ridge regression model is defined as
$$ \underset{\beta}{\operatorname{arg\,min}}\; \|y - X\beta\|^2_2 + \lambda \|\beta\|^2_2 $$
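As a quick numerical sketch of this objective (the data, variable names, and the value of $\lambda$ below are made up for illustration), the minimizer has the closed form $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$:

```python
import numpy as np

# Synthetic data, no intercept; names and values are made up for illustration
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.5, 0.0, -2.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # ridge penalty lambda, picked arbitrarily
# Closed-form minimizer of ||y - X beta||^2 + lam * ||beta||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```

Fitting `sklearn.linear_model.Ridge(alpha=lam, fit_intercept=False)` on the same `X, y` should reproduce these coefficients.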
In the Bayesian setting, we estimate the posterior distribution using Bayes' theorem
$$ p(\theta|X) \propto p(X|\theta)\,p(\theta) $$
Ridge regression corresponds to assuming a Normal likelihood and Normal priors for the parameters. After dropping the normalizing constant, the log-density of the normal distribution is
$$\begin{align} \log p(x|\mu,\sigma) &= \log\Big[\frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\Big] \\ &= \log\Big[\frac{1}{\sigma \sqrt{2\pi} }\Big] + \log\Big[e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\Big] \\ &\propto -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 \\ &\propto -\frac{1}{\sigma^2} \|x - \mu\|^2_2 \end{align}$$
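A small numerical check of this proportionality (the values of $\mu$, $\sigma$, and $x$ here are arbitrary): the log-density differs from $-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2$ only by a constant that does not depend on $x$.

```python
import numpy as np
from scipy.stats import norm

# Arbitrary values, just to check the proportionality numerically
mu, sigma = 2.0, 1.5
x = np.array([0.0, 1.0, 3.5])

log_pdf = norm.logpdf(x, loc=mu, scale=sigma)
kernel = -0.5 * ((x - mu) / sigma) ** 2  # the part that depends on x

# The difference is the same constant -log(sigma * sqrt(2*pi)) for every x
print(log_pdf - kernel)
```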
Now you can see that maximizing the normal log-likelihood with normal priors is equivalent to minimizing the squared loss with a ridge penalty
$$\begin{align} \underset{\beta}{\operatorname{arg\,max}}& \; \log\mathcal{N}(y|X\beta, \sigma) + \log\mathcal{N}(\beta|0, \tau) \\ = \underset{\beta}{\operatorname{arg\,min}}&\; -\Big\{\log\mathcal{N}(y|X\beta, \sigma) + \log\mathcal{N}(\beta|0, \tau)\Big\} \\ = \underset{\beta}{\operatorname{arg\,min}}&\; \frac{1}{\sigma^2}\|y - X\beta\|^2_2 + \frac{1}{\tau^2} \|\beta\|^2_2 \end{align}$$
Rescaling the objective by $\sigma^2$ shows that the implied ridge penalty is $\lambda = \sigma^2/\tau^2$.
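To make the correspondence concrete, here is a minimal numerical sketch (synthetic data; $\sigma$ and $\tau$ are treated as known and their values are arbitrary): the MAP solution of the Gaussian model and the ridge solution with $\lambda = \sigma^2/\tau^2$ coincide.

```python
import numpy as np

# Synthetic data; sigma and tau are treated as known and chosen arbitrarily
rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=n)

sigma, tau = 1.0, 0.5
lam = sigma**2 / tau**2  # implied ridge penalty

# Ridge estimate with lambda = sigma^2 / tau^2 ...
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
# ... equals the MAP estimate from the normal equations of the Gaussian model
beta_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(p) / tau**2, X.T @ y / sigma**2)
print(np.allclose(beta_ridge, beta_map))  # True
```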
For more on ridge regression and regularization, see the threads: Why does ridge estimate become better than OLS by adding a constant to the diagonal?, What problem do shrinkage methods solve?, When should I use lasso vs ridge?, and Why is ridge regression called "ridge", why is it needed, and what happens when $\lambda$ goes to infinity?, among many others.

- Thanks for the answer! I tried to understand what the advantages of the $L_2$ norm are; the explanation on Scikit is a bit complicated for me. It would be nice if you could point out the problem with normal Ordinary Least Squares – Thien Feb 13 '18 at 10:43
- @Thien see the edit for some links – Tim Feb 13 '18 at 10:51
- So, is it called Bayesian Ridge Regression if we use Normal priors for the coefficients (which is equivalent to $L_2$)? But you didn't clarify how `Bayesian Ridge Regression` is different from `Ridge Regression`; I think they are the same after reading your answer. – Mithril Jul 10 '20 at 09:49
- @Mithril the difference is that Ridge Regression minimizes a loss, while the Bayesian version maximizes the posterior probability by fitting a probabilistic model. So it is the same as the difference between [Bayesian linear regression vs linear regression](https://stats.stackexchange.com/questions/252577/bayes-regression-how-is-it-done-in-comparison-to-standard-regression/252608#252608) or any other Bayesian counterpart of a classical model. – Tim Jul 10 '20 at 09:57
- Thanks for the answer! I can understand how Bayesian Ridge Regression (BRR) is based on normal priors and that its result is equivalent to ridge regression (with the L2 norm). But should we always get the same results with both approaches? Or does it depend on the weight that the prior information carries in BRR...? – marb_021 Sep 11 '21 at 12:07
- @marb_021 in principle yes, but due to implementation details or the use of different optimization algorithms you could see some differences. Also, the results would be the same when using MLE and MAP, but not when using the full Bayesian approach. – Tim Sep 11 '21 at 12:22