
Most textbooks (also this blog) cover the fact that ridge regression,

$$ \hat y = X \hat \beta; \\ \hat \beta = \underset{\beta}{\text{argmin}}\ \ (y-X\beta)^T(y-X\beta) + \lambda \|\beta\|_2^2; $$

can be interpreted as the MAP estimate of a Bayesian model with independent $N(0, \tau^2)$ priors on the $\beta$ parameters and Gaussian noise with variance $\sigma^2$, where

$$ \lambda = \frac{\sigma^2}{\tau^2}. $$

What is the equivalent Bayesian interpretation of logistic ridge regression?

$$ \hat y = \text{logit}^{-1}(X \hat \beta); \\ \hat \beta = \underset{\beta}{\text{argmin}}\ \ -y\ \log(\hat y) - (1-y)\ \log(1 - \hat y) + \lambda \|\beta\|_2^2; $$

I'm looking for this both out of theoretical interest and because I would like to use stochastic gradient descent to estimate MAP logistic regression parameters in a context (JavaScript) where I don't have access to any specialised solvers.
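
For concreteness, this is the kind of estimator I have in mind, written as a minimal sketch in TypeScript (which compiles to plain JavaScript). All function names and defaults here are placeholders rather than anything from a library; it simply runs stochastic gradient descent on the penalised negative log-likelihood above:

```typescript
// Minimal sketch: MAP logistic ridge regression fitted by stochastic
// gradient descent on the penalised negative log-likelihood.
// Names and defaults are illustrative placeholders, not from any library.

function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, j) => sum + ai * b[j], 0);
}

// Minimise  -sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ] + lambda * ||beta||^2
// by cycling through the observations and taking one gradient step per observation.
function sgdLogisticRidge(
  X: number[][],        // n rows of p features (prepend a 1 per row for an intercept)
  y: number[],          // n labels in {0, 1}
  lambda: number,       // ridge penalty (larger => stronger shrinkage)
  learningRate = 0.01,
  epochs = 200
): number[] {
  const n = X.length;
  const p = X[0].length;
  const beta: number[] = new Array(p).fill(0);

  for (let epoch = 0; epoch < epochs; epoch++) {
    for (let i = 0; i < n; i++) {
      const mu = sigmoid(dot(beta, X[i]));
      for (let j = 0; j < p; j++) {
        // Per-observation gradient: (mu_i - y_i) * x_ij for the likelihood part,
        // plus (2 * lambda / n) * beta_j so the penalty is spread over the n steps.
        const grad = (mu - y[i]) * X[i][j] + (2 * lambda / n) * beta[j];
        beta[j] -= learningRate * grad;
      }
    }
  }
  return beta;
}
```

(I believe the per-observation likelihood gradient is $(\mu_i - y_i)\,x_i$; the main thing I want to pin down is which prior a given $\lambda$ corresponds to.)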

Eoin

2 Answers


As a preliminary note, your equation for the logistic case appears to deal with a single data point (there is no sum over observations) and no intercept term. I will generalise this to the case where you observe $n$ data points, so that the log-likelihood function is a sum over these $n$ observations. (I will write the explanatory variables for each observation as a vector $\mathbf{x}_i$, so the expressions below cover one or more regressors.) For a logistic regression of this kind you have the observable values $Y_i|\mathbf{x}_i \sim \text{Bern}(\mu_i)$ with true mean values:

$$\mu_i \equiv \mathbb{E}(Y_i|\mathbf{x}_i) = \text{logistic}(\boldsymbol{\beta}^\text{T} \mathbf{x}_i) = \frac{e^{\boldsymbol{\beta}^\text{T} \mathbf{x}_i}}{1+e^{\boldsymbol{\beta}^\text{T} \mathbf{x}_i}}.$$

The log-likelihood function is given by:

$$\begin{align} \ell(\mathbf{y}|\mathbf{x},\boldsymbol{\beta}) &= \sum_{i=1}^n \log \text{Bern}(y_i|\mu_i) \\[6pt] &= \sum_{i=1}^n y_i \log (\mu_i) + \sum_{i=1}^n (1-y_i) \log (1-\mu_i) \\[6pt] &= \sum_{i=1}^n y_i \boldsymbol{\beta}^\text{T} \mathbf{x}_i - \sum_{i=1}^n y_i \log(1+e^{\boldsymbol{\beta}^\text{T} \mathbf{x}_i}) - \sum_{i=1}^n (1-y_i) \log(1+e^{\boldsymbol{\beta}^\text{T} \mathbf{x}_i}) \\[6pt] &= \sum_{i=1}^n y_i \boldsymbol{\beta}^\text{T} \mathbf{x}_i - \sum_{i=1}^n \log(1+e^{\boldsymbol{\beta}^\text{T} \mathbf{x}_i}). \\[6pt] \end{align}$$
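
Since the question mentions gradient-based estimation, it is worth noting the gradient that follows directly from the last line above:

$$\nabla_{\boldsymbol{\beta}} \, \ell(\mathbf{y}|\mathbf{x},\boldsymbol{\beta}) = \sum_{i=1}^n y_i \mathbf{x}_i - \sum_{i=1}^n \frac{e^{\boldsymbol{\beta}^\text{T} \mathbf{x}_i}}{1+e^{\boldsymbol{\beta}^\text{T} \mathbf{x}_i}} \, \mathbf{x}_i = \sum_{i=1}^n (y_i - \mu_i) \, \mathbf{x}_i.$$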

Logistic ridge regression operates by using an estimation method that imposes a penalty on the parameter $\boldsymbol{\beta}$ proportional to its squared norm. (Note that you have stated this slightly incorrectly in your question.) It estimates the parameter $\boldsymbol{\beta} = (\beta_1,...,\beta_K)$ via the optimisation problem:

$$\begin{align} \hat{\boldsymbol{\beta}}_\text{Ridge} &= \underset{\boldsymbol{\beta} \in \mathbb{R}^K}{\text{argmax}} \ \ \ \ \ell(\mathbf{y}|\mathbf{x},\boldsymbol{\beta}) - \lambda ||\boldsymbol{\beta}||^2. \\[6pt] \end{align}$$
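
For the stochastic gradient descent mentioned in the question, the corresponding gradient simply adds the derivative of the penalty term to the log-likelihood gradient above:

$$\nabla_{\boldsymbol{\beta}} \Big[ \ell(\mathbf{y}|\mathbf{x},\boldsymbol{\beta}) - \lambda ||\boldsymbol{\beta}||^2 \Big] = \sum_{i=1}^n (y_i - \mu_i) \, \mathbf{x}_i - 2 \lambda \boldsymbol{\beta}.$$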

Since the log-posterior is the sum of the log-likelihood and log-prior, the MAP estimator is:

$$\begin{align} \hat{\boldsymbol{\beta}}_\text{MAP} &= \underset{\boldsymbol{\beta} \in \mathbb{R}^K}{\text{argmax}} \ \ \ \ \ell(\mathbf{y}|\mathbf{x},\boldsymbol{\beta}) + \log \pi(\boldsymbol{\beta}). \\[6pt] \end{align}$$

We obtain the result $\hat{\boldsymbol{\beta}}_\text{Ridge} = \hat{\boldsymbol{\beta}}_\text{MAP}$ by using the prior kernel $\pi(\boldsymbol{\beta}) \propto \exp(- \lambda ||\boldsymbol{\beta}||^2)$ so that $\log \pi (\boldsymbol{\beta}) = - \lambda ||\boldsymbol{\beta}||^2 + \text{const}$ in the above equation. Normalising this kernel (integrating to find the constant of proportionality) gives the prior distribution:

$$\pi(\boldsymbol{\beta}) = \prod_k \mathcal{N} \bigg( \beta_k \bigg| 0, \frac{1}{2\lambda} \bigg).$$

Thus, we see that ridge logistic regression is equivalent to MAP estimation if, a priori, the individual $\beta_k$ parameters are IID normal random variables with zero mean. The variance of this normal prior is a one-to-one function of the "penalty" hyperparameter in the ridge logistic regression --- a larger penalty in the ridge regression corresponds to a smaller prior variance.
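
To make the mapping concrete: a prior $\beta_k \sim \mathcal{N}(0, \tau^2)$ on each coefficient corresponds to the penalty

$$\lambda = \frac{1}{2\tau^2},$$

so, for example, a prior standard deviation of $\tau = 1$ gives $\lambda = \tfrac{1}{2}$, and $\tau = 0.1$ gives $\lambda = 50$.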

(Note: For a related question showing LASSO and ridge regression framed in Bayesian terms see here.)

Ben

To look for the equivalence, one should compare the form of

$$\hat{\beta} = \underset{\beta}{\text{argmin}} -y\log(\hat{y}) - (1-y)\log(1-\hat{y}) + \lambda||\beta||_2^2,$$

with the posterior distribution, whilst keeping a general expression for the prior. The posterior distribution has the form $$\pi(\beta|x) \propto \pi(\beta)L(\beta;x),$$ where $\pi(\beta)$ is the prior and $L(\beta;x)$ is the likelihood. Here $\beta$ is $p\times1$ and $x$ represents the data, with each observation $x_i$ also $p\times1$. In logistic regression the model for the data is Bernoulli (more generally, Binomial), so $$y_i|\beta,x_i \sim Bernoulli(p_i), \quad \text{where } p_i = \frac{\exp\{\beta^Tx_i\}}{1 + \exp\{\beta^Tx_i\}}.$$ Letting $f(\cdot)$ denote the density function, the posterior for $\beta$ becomes

\begin{align*} \pi(\beta|x)&\propto\pi(\beta)\prod_{i=1}^{n}f(x_i|\beta) \\ &= \pi(\beta)\prod_{i=1}^{n}p_i^{y_i}(1-p_i)^{1-y_i}. \end{align*}

The maximum a posteriori (MAP) estimate of $\beta$ is the mode of its posterior distribution, and since $\log$ is monotone, $$\hat{\beta}_{MAP} = \underset{\beta}{\text{argmax}}\ \pi(\beta|x) = \underset{\beta}{\text{argmax}}\ \log\pi(\beta|x).$$

Taking logs, $$\log\pi(\beta|x) = \log\pi(\beta) + \sum_{i=1}^n\big\{y_i\log p_i + (1-y_i)\log(1-p_i)\big\} + \text{const},$$ and noting that $\hat{\beta}_{MAP} = \underset{\beta}{\text{argmax}}\ \log\pi(\beta|x) = \underset{\beta}{\text{argmin}}\big\{-\log\pi(\beta|x)\big\}$, comparison with the ridge objective above shows that we require \begin{align*} \log\pi(\beta) &= - \lambda||\beta||_2^2 + \text{const} \\ \Rightarrow \pi(\beta) &\propto \exp\{-\lambda||\beta||_2^2\}. \end{align*}

This can be seen as taking independent normal priors with mean zero and variance $\frac{1}{2\lambda}$: $$\beta_j \sim N\left(0,\frac{1}{2\lambda}\right), \quad j=1,\dots,p.$$
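
To make the last step explicit, the kernel factorises over the coordinates, and each factor can be matched with a normal density: for each $j$,

$$\exp\{-\lambda\beta_j^2\} = \exp\bigg\{-\frac{\beta_j^2}{2 \cdot \frac{1}{2\lambda}}\bigg\},$$

which is the kernel of a $N\big(0, \frac{1}{2\lambda}\big)$ density.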

ztkpat001
  • Thanks. Do you have any idea of a reference for the conclusion, specifically that the penalty $\lambda$ matches the prior precision $\frac{1}{\sigma^2}$? – Eoin Jul 06 '20 at 17:55
  • Sure. This comes from comparing the form of $\exp\{-\lambda\beta_j^2\}$ with that of a normal density and then matching the constants so that the forms agree. In this case we are only interested in $\beta$, and because the relationship holds only up to proportionality we can multiply by any constant. Apologies, I had previously made a mistake with $\frac{1}{\lambda}$; this has been fixed to $\frac{1}{2\lambda}$. – ztkpat001 Jul 06 '20 at 19:32