
I must confess that I hadn't previously heard of that term in any of my classes, undergrad or grad.

What does it mean for a logistic regression to be Bayesian? I'm looking for an explanation with a transition from regular logistic to Bayesian logistic similar to the following:

This is the equation in the linear regression model: $E(y) = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$.

This is the equation in the logistic regression model: $\ln(\frac{E(y)}{1-E(y)}) = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$. This is used when $y$ is binary (a categorical variable with two levels).

What we have done is change $E(y)$ to $\ln(\frac{E(y)}{1-E(y)})$.
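Equivalently, if I solve that equation for $E(y)$, I get

$$E(y) = \frac{e^{\beta_0 + \beta_1x_1 + ... + \beta_nx_n}}{1 + e^{\beta_0 + \beta_1x_1 + ... + \beta_nx_n}},$$

so the linear combination gets squashed into $[0,1]$.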

So what's done to the logistic regression model in Bayesian logistic regression? I'm guessing it's not something to do with the equation.

This book preview seems to define it, but I don't really understand. What is all this prior, likelihood stuff? What is $\alpha$? Could someone please explain that part of the book, or the Bayesian logit model, in another way?

Note: This has been asked before, but I don't think it was answered very well.

BCLC
  • I do not want to put this in an answer because I think @Tim has most of it covered. The only thing missing from that otherwise great answer is that, in Bayesian logistic regression and Bayesian generalized linear models (GLMs) more generally, prior distributions are not only placed over the coefficients, but over the variances and covariances of those coefficients. This is incredibly important to mention because one of the key advantages of a Bayesian approach to GLMs is the greater tractability of specifying, and in many cases also fitting, complex models for the covariance of the coefficients. – Brash Equilibrium Jul 27 '15 at 04:59
  • @BrashEquilibrium: you are mentioning a possible hierarchical extension of the standard Bayesian modelling for a logit model. In [our book](http://www.amazon.com/gp/product/1441922865/ref=as_li_ss_tl?ie=UTF8&tag=chrprobboo-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1441922865), we use for instance a g-prior on the $\beta$'s, a prior whose fixed covariance matrix is derived from the covariates $X$. – Xi'an Aug 02 '15 at 09:03
  • Fair enough on the g prior. – Brash Equilibrium Aug 02 '15 at 15:07
  • That said, there is still a prior on the covariances! If you don't discuss it, you aren't describing how logistic regression works completely. – Brash Equilibrium Aug 02 '15 at 15:19

2 Answers


Logistic regression can be described as a linear combination

$$ \eta = \beta_0 + \beta_1 X_1 + ... + \beta_k X_k $$

that is passed through the link function $g$:

$$ g(E(Y)) = \eta $$

where the link function is the logit function; inverting it gives

$$ E(Y|X,\beta) = p = \text{logit}^{-1}( \eta ) $$

where $Y$ takes only values in $\{0,1\}$ and the inverse logit function transforms the linear combination $\eta$ into the $[0,1]$ range. This is where classical logistic regression ends.
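Explicitly, the logit link and its inverse are

$$ \text{logit}(p) = \ln \frac{p}{1-p}, \qquad \text{logit}^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}. $$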

However, if you recall that $E(Y) = P(Y = 1)$ for variables that take only values in $\{0,1\}$, then $E(Y | X,\beta)$ can be considered as $P(Y = 1 | X,\beta)$. In this case, the output of the inverse logit function can be thought of as the conditional probability of "success", i.e. $P(Y=1|X,\beta)$. The Bernoulli distribution describes the probability of observing a binary outcome with some parameter $p$, so we can describe $Y$ as

$$ y_i \sim \text{Bernoulli}(p) $$

So with logistic regression we look for some parameters $\beta$ that, together with the independent variables $X$, form a linear combination $\eta$. In classical linear regression $E(Y|X,\beta) = \eta$ (we take the link function to be the identity function), but to model a $Y$ that takes values in $\{0,1\}$ we need to transform $\eta$ so that it fits in the $[0,1]$ range.

Now, to estimate logistic regression in a Bayesian way you pick some priors for the $\beta_i$ parameters, as with linear regression (see Kruschke et al., 2012), then use the inverse logit function to transform the linear combination $\eta$, and use its output as the $p$ parameter of the Bernoulli distribution that describes your $Y$ variable. So, yes, you actually use the equation and the logit link function the same way as in the frequentist case, and the rest (e.g. choosing priors) works as it does when estimating linear regression the Bayesian way.
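To make the "prior, likelihood, posterior" vocabulary concrete: the full Bayesian specification of the model consists of a prior $p(\beta)$ over the coefficients, the Bernoulli likelihood of the data given those coefficients, and the posterior obtained from Bayes theorem,

$$ p(\beta \mid y, X) \;\propto\; \underbrace{\prod_{i=1}^{N} p_i^{\,y_i} (1 - p_i)^{\,1 - y_i}}_{\text{likelihood}} \times \underbrace{p(\beta)}_{\text{prior}}, \qquad p_i = \text{logit}^{-1}(\beta_0 + \beta_1 X_{i1} + ... + \beta_k X_{ik}), $$

and all inference about $\beta$ is based on this posterior.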

The simplest approach to choosing priors is to use Normal distributions for the $\beta_i$'s (but you can also use other distributions, e.g. $t$ or Laplace distributions for a more robust model), with parameters $\mu_i$ and $\sigma_i^2$ that are preset or taken from hierarchical priors. Now, having the model definition, you can use software such as JAGS to perform Markov chain Monte Carlo simulation for you to estimate the model. Below I post JAGS code for a simple logistic model (check here for more examples).

model {
   # setting up priors
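   # (note: dnorm in JAGS is parametrized by mean and precision,
   # so .0001 corresponds to a variance of 10,000, i.e. a vague prior)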
   a ~ dnorm(0, .0001)
   b ~ dnorm(0, .0001)

   for (i in 1:N) {
      # passing the linear combination through logit function
      logit(p[i]) <- a + b * x[i]

      # likelihood function
      y[i] ~ dbern(p[i])
   }
}

As you can see, the code translates directly to the model definition. What the software does is draw some values from the Normal priors for a and b, then use those values to compute p, and finally use the likelihood function to assess how likely your data are given those parameters (this is where you use Bayes theorem; see here for a more detailed description).

The basic logistic regression model can be extended to model dependence between the coefficients using a hierarchical model (including hyperpriors). In this case you can draw the $\beta_i$'s from a multivariate Normal distribution, which enables us to include information about their covariance $\boldsymbol{\Sigma}$

$$ \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \sim \mathrm{MVN} \left( \begin{bmatrix} \mu_0 \\ \mu_1 \\ \vdots \\ \mu_k \end{bmatrix}, \begin{bmatrix} \sigma^2_0 & \sigma_{0,1} & \ldots & \sigma_{0,k} \\ \sigma_{1,0} & \sigma^2_1 & \ldots &\sigma_{1,k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k,0} & \sigma_{k,1} & \ldots & \sigma^2_k \end{bmatrix} \right)$$

...but this is going into details, so let's stop right here.
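That said, for the curious, below is a minimal JAGS sketch of what such a hierarchical model can look like. Here X is an N-by-K design matrix whose first column is all ones (for the intercept), mu0 is a prior mean vector and R is a K-by-K scale matrix supplied as data, and the precision matrix Omega gets a Wishart hyperprior; these names and choices are only illustrative, not taken from the question or any particular reference.

model {
   # multivariate Normal prior over the whole coefficient vector;
   # Omega is a precision (inverse covariance) matrix with a Wishart hyperprior
   beta[1:K] ~ dmnorm(mu0[], Omega[,])
   Omega[1:K, 1:K] ~ dwish(R[,], K + 1)

   for (i in 1:N) {
      # linear combination of all K predictors (first column of X is 1)
      logit(p[i]) <- inprod(beta[], X[i,])

      # likelihood
      y[i] ~ dbern(p[i])
   }
}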

The "Bayesian" part in here is choosing priors, using Bayes theorem and defining model in probabilistic terms. See here for definition of "Bayesian model" and here for some general intuition on Bayesian approach. What you can also notice is that defining models is pretty straightforward and flexible with this approach.


Kruschke, J. K., Aguinis, H., & Joo, H. (2012). The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15(4), 722-752.

Gelman, A., Jakulin, A., Pittau, G.M., and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383.

Tim
  • You need proofs for the variances, not only the coefficients. – Brash Equilibrium Jul 25 '15 at 04:02
  • Thanks Tim. logit = $\eta$ ? – BCLC Jul 25 '15 at 09:22
  • @BCLC no, for logistic regression logit is used as the link function $g$, while $\eta$ is a linear combination $\eta = \beta_0 + \beta_1 X_1$; e.g. for linear regression $g$ is the identity function, so $E(Y) = \eta$. This is just the standard specification of a [GLM](https://en.wikipedia.org/wiki/Generalized_linear_model). – Tim Jul 25 '15 at 11:38
  • @Tim Thanks. I actually don't really know what a 'prior' is. Is that what's in the Kruschke page? – BCLC Jul 25 '15 at 11:39
  • @BCLC check the links in my answer, they provide an introduction to Bayesian statistics in general. This is a much broader topic than the one mentioned in your initial question, but you can find a nice introduction in the references I provided in my answer. – Tim Jul 25 '15 at 11:41
  • @Tim Oh right. That's part of Bayesian which is in the latter links. The Kruschke is...? – BCLC Jul 25 '15 at 11:48
  • @Tim I made a typo there. Proofs is supposed to read priors. Basically, the coefficients aren't the only unknown parameters. The multivariate normal distribution also has a variance-covariance matrix, and typically we don't assume it is known. – Brash Equilibrium Jul 25 '15 at 13:36
  • Basically any regression model involves a multivariate normal distribution over the coefficient vector and their covariance matrix. Often people assume a constant variance with an inverse gamma or half-Cauchy prior. – Brash Equilibrium Jul 25 '15 at 13:59
  • We are only talking about different things if in your description you were explicitly considering the special case of a prior in which there is zero covariance AND zero variance among the betas, which would be an incomplete answer to the question about what Bayesian logistic regression is indeed. – Brash Equilibrium Jul 25 '15 at 21:04
  • That's because you are not reading the part of those textbooks where they describe the prior distribution over the variances – Brash Equilibrium Jul 26 '15 at 14:31
  • Or because the Stan manual glosses over it because it is not an intoxicating textbook. – Brash Equilibrium Jul 26 '15 at 14:32
  • Or because it is not a complete textbook with respect to the specification of variance priors in regression models, which is actually REALLY important – Brash Equilibrium Jul 26 '15 at 14:33
  • @Tim it's not that I'm offering a better alternative model but rather that any Bayesian regression model requires the specification of a prior over the variance component. Many discussions of Bayesian GLMs gloss over this point because it was already discussed in a prior chapter on linear models. – Brash Equilibrium Jul 26 '15 at 20:01
  • @Tim challenge accepted – Brash Equilibrium Jul 26 '15 at 20:01
  • Thanks for the edit Tim, I think. Wish you would've informed of such in a comment. Thanks for your input too @BrashEquilibrium, I think. PS Now I think I understand how user127662 felt [here](http://math.stackexchange.com/questions/943396/probability-random-variables-and-probability-distributions/943408#comment1944175_943408). – BCLC Aug 01 '15 at 14:19
  • Tim, why do you say $P(Y = 1)$ instead of $P(Y = 1 | X, \beta)$ or [$P(Y = 1 | X)$](http://stats.stackexchange.com/a/20527) ? – BCLC Aug 02 '15 at 08:44
  • Tim, why is $P(Y=1)$ the output? You mean $P(Y=1) = \mathrm{logit}(E(Y)) = \ln(\frac{E(Y)}{1-E(Y)})$? All I can think of in relation to that is $P(Y=1) = E(1_{\{Y=1\}})$ – BCLC Aug 02 '15 at 09:05
  • Tim, why $p = logit(\eta)$ ? You mean $p = \eta = logit(E(Y))$? – BCLC Aug 02 '15 at 09:12
  • @BCLC By $P(Y=1)$ I described in here only what $Y$ is. As for the last question: $p = \mathrm{logit}(\eta)$. Your questions are not really related to the Bayesian model, but to GLMs in general, so I would recommend some handbook on GLMs, e.g. one of those http://stats.stackexchange.com/q/94371/35989 – Tim Aug 02 '15 at 13:23
  • @Tim Thanks...so p = logit(logit(E(y))) ? – BCLC Aug 03 '15 at 03:08
  • @BCLC now I get what was not clear for you in my answer... there was a mistake in my answer. It should be: p is the inverse logit of $\eta$. So $p = E(Y) = \mathrm{logit}^{-1}(\eta)$. – Tim Aug 03 '15 at 06:33
  • @Tim Okay, I think I get why P(Y=1) is the output if it is indeed P(Y=1) instead of the alternatives, but why $P(Y=1)$ and not $P(Y=1 | X, \beta)$ or $P(Y=1 | X)$? Now I'm giving you a bounty. Edit: Need 23 hrs hahahaha. Didn't know that – BCLC Aug 04 '15 at 19:36
  • @BCLC You model $P(Y=1|X,\beta)$, yes, this is what the logistic model describes; by $P(Y=1)$ I meant simply that Y *alone* is a variable with two states {0,1} and you are interested in learning something about $P(Y=1)$. So it was simply a description of what the Y variable alone is. – Tim Aug 04 '15 at 19:40
  • @Tim Wait, actually I'm not sure I get why that is the output or maybe I forgot. You said 'Now if you recall that Y takes only values in {0,1}, then the logit function output could be thought as probability of "success", i.e. $P(Y=1)$.' Mathematically, does this mean $P(Y=1) = logit(E[Y])$ ? If so, is that because Y is either 0 or 1? If not, what then? Guess I didn't know logit/logistic regression as well as I thought I did... – BCLC Aug 05 '15 at 16:42
  • @BCLC I made some small edits and corrections, check if it is clear, it should be now. – Tim Aug 05 '15 at 16:58
  • "The "Bayesian" part here is choosing priors, using Bayes theorem, and defining the model in probabilistic terms." A good reference here is Gelman et al., "A weakly informative default prior distribution for logistic and other regression models" http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf – Dalton Hance Aug 05 '15 at 17:05
  • @DaltonHance good point, thanks for reminding me about this one, I'll add it to the references. – Tim Aug 05 '15 at 17:11
  • Thank you very much [Tim and](https://en.wikipedia.org/wiki/Timothy_Dalton) @DaltonHance. – BCLC Aug 05 '15 at 18:19
  • Hey Tim, you mentioned 'Now, to estimate logistic regression in Bayesian way'...how would one do it in a frequentist way? OLS or MLE, I guess? Any others? – BCLC Sep 11 '15 at 16:39

What is all this prior, likelihood stuff?

That's what makes it Bayesian. The generative model for the data is the same; the difference is that a Bayesian analysis chooses some prior distribution for parameters of interest, and calculates or approximates a posterior distribution, upon which all inference is based. Bayes rule relates the two: The posterior is proportional to likelihood times prior.
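In symbols, writing $\boldsymbol\beta$ for the parameters of interest and $y$ for the data,

$$ p(\boldsymbol\beta \mid y) = \frac{p(y \mid \boldsymbol\beta)\, p(\boldsymbol\beta)}{p(y)} \;\propto\; p(y \mid \boldsymbol\beta)\, p(\boldsymbol\beta), $$

where $p(\boldsymbol\beta)$ is the prior, $p(y \mid \boldsymbol\beta)$ is the likelihood, and $p(y)$ is a normalizing constant that does not depend on $\boldsymbol\beta$.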

Intuitively, this prior allows an analyst to express, mathematically, subject-matter expertise or preexisting findings. For instance, the text you reference notes that the prior for $\bf\beta$ is a multivariate normal. Perhaps prior studies suggest a certain range of parameters that can be expressed with certain normal parameters. (With flexibility comes responsibility: one should be able to justify the prior to a skeptical audience.) In more elaborate models, one can use domain expertise to tune certain latent parameters. For example, see the liver example referenced in this answer.
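As a hypothetical illustration (the numbers are invented, not from the referenced text): if earlier studies suggested that a one-unit increase in $x_1$ roughly doubles the odds of success, one might center the prior for that coefficient near $\ln 2 \approx 0.7$ with moderate uncertainty,

$$ \beta_1 \sim \mathcal{N}(0.7,\ 0.5^2), $$

while keeping vaguer priors, say $\mathcal{N}(0,\ 10^2)$, on coefficients about which little is known.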

Some frequentist models can be related to a Bayesian counterpart with a specific prior, though I'm unsure which corresponds in this case.

Sean Easter
  • SeanEaster, 'prior' is the word used for assumed distribution? For instance we assume the X's or $\beta$'s (if you mean $\beta$ as in $\beta_1, \beta_2, ..., \beta_n$, do you mean instead $X_1$, $X_2$, ..., $X_n$? I don't think the $\beta$'s have distributions...?) are normal but then we try to fit them into another distribution? What exactly do you mean by 'approximates' ? I have a feeling it's not the same as 'fits' – BCLC Aug 02 '15 at 08:52
  • @BCLC To answer those, I'll start with the bare process of Bayesian inference and define the terms as I go: Bayesians treat *all* parameters of interest as random variables and update their beliefs about these parameters in light of data. The *prior distribution* expresses their belief about the parameters before analyzing the data; the *posterior distribution* (by Bayes rule, the normalized product of prior and likelihood) summarizes uncertain belief about the parameters in light of the prior and data. Calculating the posterior is where the fitting takes place. – Sean Easter Aug 02 '15 at 14:18
  • @BCLC Thus why the $\beta$ parameters have a distribution. In other (generally simple) Bayesian models, posterior distributions may have a closed-form expression. (In a Bernoulli random variable with a beta prior on $p$, the posterior of $p$ is a beta distribution, for example.) But when posteriors cannot be expressed analytically, we *approximate* them, generally using MCMC methods. – Sean Easter Aug 02 '15 at 14:21
  • Okay I think I understand you better after reading [An Essay towards solving a Problem in the Doctrine of Chances](https://en.wikipedia.org/wiki/An_Essay_towards_solving_a_Problem_in_the_Doctrine_of_Chances). Thanks SeanEaster – BCLC Aug 11 '15 at 10:04
  • By normalized, you mean [this](https://en.wikipedia.org/wiki/Normalizing_constant)? So the P(B) [here](https://upload.wikimedia.org/math/8/9/7/89752ad8a5356154acf633669f3681fb.png) ? – BCLC Aug 11 '15 at 10:09
  • Yep. In many cases, that $P(B)$ would be impossible to calculate analytically. – Sean Easter Aug 11 '15 at 13:17
  • Ah right. Read about that on Wiki, I think. Thanks – BCLC Aug 11 '15 at 13:29