
According to scikit-learn, by using the probabilistic model:

$p(y|X,\omega,\alpha) = \mathcal{N}(y|X\omega,\alpha)$

with $\omega$ given by a spherical Gaussian: $p(\omega|\lambda) = \mathcal{N}(\omega|0,\lambda^{-1}\mathbf{I_p})$

it becomes a Bayesian model of ridge regression. So can I say that the prediction of this model on unseen data $X^*$ is a probability distribution over $y$ with mean $\mu = X^*\omega$ and variance $\sigma^2 = \alpha$, or is it $\sigma^2 = \lambda^{-1}\mathbf{I_p}$? What exactly do $\alpha$ and $\lambda$ do in these equations?
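One way to probe this concretely (a minimal sketch, assuming scikit-learn is installed; the toy data and `X_new` are made up for illustration) is that scikit-learn's `BayesianRidge` can return both the predictive mean and the predictive standard deviation:

```python
# Sketch: fit a Bayesian ridge model on toy data and ask for the
# predictive distribution (mean and std) on new inputs.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.3, size=100)

model = BayesianRidge().fit(X, y)

X_new = rng.normal(size=(5, 3))
# return_std=True gives the std of the predictive distribution per point
mean, std = model.predict(X_new, return_std=True)

print(mean.shape, std.shape)          # one mean and one std per new point
print(model.alpha_, model.lambda_)    # fitted noise and weight precisions
```

Note that the fitted `alpha_` and `lambda_` attributes are precisions, not variances, which is part of what the question is getting at.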

Thien
  • What is meant by $\omega \sim$ Spherical Gaussian? Is $\mathbf{I}_p$ the $p \times p$ identity matrix? I don't think that's a good probability model for the model coefficients which may in fact have some non-zero covariance. – AdamO Feb 14 '18 at 15:52
  • Yes, $\mathbf{I_p}$ is the identity matrix. I just pulled it from the theory section of the scikit-learn documentation. I didn't understand it very well, which is why I asked here. – Thien Feb 14 '18 at 15:57

1 Answer


What the description in the sklearn documentation says is that this is a regression model with an extra regularization prior on the coefficients. The model is

$$\begin{align} y &\sim \mathcal{N}(\mu, \alpha^{-1}) \\ \mu &= X\omega \\ \omega &\sim \mathcal{N}(0, \lambda^{-1}\mathbf{I}_p) \\ \alpha &\sim \mathcal{G}(\alpha_1, \alpha_2) \\ \lambda &\sim \mathcal{G}(\lambda_1, \lambda_2) \end{align}$$

So $y$ follows a normal distribution (the likelihood) parametrized by mean $\mu = X\omega$ and variance $\alpha^{-1}$. We choose Gamma priors for the noise precision $\alpha$ and the regularizing parameter $\lambda$, with hyperparameters $\alpha_1, \alpha_2, \lambda_1, \lambda_2$. The regression parameters $\omega$ have independent Gaussian priors with mean $0$ and variance $\lambda^{-1}$, so $\lambda$ serves as a regularization parameter: it is a precision, so the larger $\lambda$ is, the more the $\omega$ values are a priori concentrated around zero.
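To see the "precision" point numerically, here is a stdlib-only sketch that samples $\omega \sim \mathcal{N}(0, \lambda^{-1})$ for a weak and a strong $\lambda$ (the specific values 0.1 and 100 are arbitrary, chosen only for illustration):

```python
# Sketch: omega ~ N(0, 1/lambda), so a larger precision lambda
# concentrates the prior draws around zero.
import random
import statistics

random.seed(0)

def prior_draws(lam, n=10_000):
    # variance = 1/lambda, so the standard deviation is lambda**-0.5
    return [random.gauss(0.0, lam ** -0.5) for _ in range(n)]

spread_weak = statistics.stdev(prior_draws(lam=0.1))    # weak regularization
spread_strong = statistics.stdev(prior_draws(lam=100))  # strong regularization

print(spread_weak, spread_strong)  # roughly sqrt(10) ≈ 3.16 vs 0.1
```

The draws under $\lambda = 100$ are tightly clustered around zero, which is exactly the shrinkage that makes this a Bayesian formulation of ridge regression.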

Tim