I'm looking for an intuitive explanation of Bayesian Logistic Regression (I'm using it for texts, if that's relevant). It seems that this article presents it, but it's way too mathy for me.
- Two good resources are both of Andrew Gelman's textbooks: *Bayesian Data Analysis* (3rd ed.) and the one on hierarchical modeling. The former is a general Bayesian reference; the latter is a general regression reference from a Bayesian POV. – Sycorax Jun 09 '14 at 17:14
- What things were you looking for it to explain? – Glen_b Jun 10 '14 at 00:58
- Bayesian logistic regression is Bayesian statistics applied to the logistic regression model. So, same question as @Glen_b: are you looking for an explanation of Bayesian statistics, of logistic regression, or of something more precise (such as the choice of priors for the logistic regression model)? – Stéphane Laurent Jun 10 '14 at 23:02
- Yes, @StéphaneLaurent: I have general knowledge of logistic regression and Naive Bayes classification separately. My question is how they are combined. You wrote that it's "Bayesian statistics for the logistic regression model" - please elaborate on that. Thank you very much. – Cheshie Jun 11 '14 at 09:13
- This is a broad question. Take a look at [/wiki/Bayesian_inference#Formal_description_of_Bayesian_inference](http://en.wikipedia.org/wiki/Bayesian_inference#Formal_description_of_Bayesian_inference), for instance. – Stéphane Laurent Jun 11 '14 at 09:22
- @StéphaneLaurent - I know the question is broad; that's why I asked for a general answer. I tried to answer it myself; if you know this stuff, do you mind skimming my answer to see if I'm generally in the right direction? I'd really appreciate it... (and thanks for the link, but I didn't really understand it). – Cheshie Jun 11 '14 at 11:35
- What explanation of Bayesian logistic regression could be more general than a textbook on Bayesian regression methods? – Sycorax Jul 24 '15 at 19:55
- @Cheshie Do not confuse Naive Bayes with Bayesian methods. Naive Bayes is not (in its simplest form) Bayesian. Unfortunately, one guy developed a lot of the theory for probability and statistics, so lots of things have his name. Bayesian methods have to do with representing your uncertainty about a fixed value (typically a parameter in a model) using a probability distribution. – jlimahaverford Nov 05 '15 at 19:08
2 Answers
In short, there's no complicated definition of "Bayesian logistic regression" that cannot be inferred from its parent terms. It's Bayesian in that its inferences are based on the posterior, and begin with a specified prior for model parameters; it's logistic regression in that we're fitting the $\beta$ coefficients of a logistic regression.
This choice of prior is often discussed as a means of expressing existing, uncertain knowledge about the problem; in the case of the linked paper, a Laplace prior is chosen because it yields a posterior with a particular desirable property: it can be tuned for the desired level of sparsity.
In inferential applications, one would base inferences on a numerical approximation of the model parameters' full posterior. Since this is a machine learning setting where predictions are of primary interest, other concerns (e.g., memory and computational efficiency) enter, and point estimates will do.
As in many Bayesian models, no analytic expression exists for this posterior. CLG, a cyclic coordinate descent algorithm, is the basis of the authors' approach to fitting the parameters numerically.
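For concreteness, here is a minimal sketch of the point-estimate route (not the authors' CLG implementation): with a Laplace prior, the MAP estimate coincides with L1-penalized logistic regression, so a generic L1 solver produces the same kind of sparse fit. The data below are made up for illustration.

```python
# Hypothetical illustration: MAP estimation under a Laplace prior is equivalent
# to L1-penalized logistic regression, which scikit-learn can fit directly.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # 200 "documents", 10 features
true_beta = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0, 0, 0.5])
y = (X @ true_beta + rng.normal(size=200) > 0).astype(int)

# C is the inverse regularization strength; it plays the role of the Laplace
# prior's scale (larger C = flatter prior = less shrinkage and less sparsity).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print(clf.coef_)   # several coefficients are driven exactly to zero (sparsity)
```

A fully Bayesian treatment would instead approximate the whole posterior (e.g., by MCMC or a Laplace approximation) rather than stopping at a point estimate.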

I'll try to answer my own question:
The probability that $y=+1$ is $p(y=+1 \mid X, \beta) = \sigma(\beta^T X)$, where $\sigma$ is the logistic (sigmoid) function.
To estimate the $\beta$ values, you'd normally use something like plain logistic regression. In this case, they assumed a Gaussian distribution on the parameters $\beta$, with mean 0 and a positive variance (chosen to avoid overfitting), and thus they have the prior probability on $\beta$, i.e., $p(\beta)$.
They then compute the posterior density $L(\beta) = p(\beta)\, p(X \mid \beta, y)$, plugging in the $p(\beta)$ we got from the Gaussian distribution above. The MAP estimate is then the $\beta$ that maximizes $L(\beta)$. Something like that.
They also did something with Laplace priors to avoid overfitting, and used cyclic coordinate descent as the optimization method.
Hope it's correct.
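To make the recipe above concrete, here is a rough sketch of the MAP computation with a Gaussian prior on toy data; the data, prior variance, and optimizer are assumptions for illustration, not the paper's actual setup.

```python
# Toy sketch: MAP estimation for logistic regression with a Gaussian prior
# beta ~ N(0, tau^2 I). Maximizing p(beta) * p(y | X, beta) is the same as
# minimizing the negative log of that product.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit   # the logistic function sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = rng.binomial(1, expit(X @ np.array([1.0, -2.0, 0.5, 0.0])))   # labels in {0, 1}

def neg_log_posterior(beta, X, y, tau=1.0):
    s = 2 * y - 1                                      # recode labels to {-1, +1}
    # log-likelihood: sum_i log sigma(s_i * x_i^T beta), written in a stable form
    log_lik = -np.sum(np.logaddexp(0.0, -s * (X @ beta)))
    log_prior = -np.sum(beta**2) / (2 * tau**2)        # Gaussian prior, up to a constant
    return -(log_lik + log_prior)

beta_map = minimize(neg_log_posterior, np.zeros(X.shape[1]), args=(X, y)).x
print(beta_map)   # the beta that maximizes the (unnormalized) posterior
```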

- This is very close to correct. In the third paragraph you should have $p(y \mid \beta, X)$ instead. $L(\beta)$ is not a density because it is not normalized; by Bayes' law we would need to divide by $p(y \mid X)$, which is often difficult to compute. That is why we find where this function is *maximized*, which does not depend on the normalization. Just for the sake of terminology: Gaussian priors lead to L2 regularization and a method called ridge regression; Laplace priors lead to L1 regularization and a method called the LASSO. – jlimahaverford Nov 05 '15 at 19:39
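For reference, the correspondence mentioned in this comment can be written out explicitly (standard parameterizations assumed). The MAP estimate maximizes the log-prior plus log-likelihood, since the normalizing term $p(y \mid X)$ does not involve $\beta$:

$$\hat{\beta}_{\text{MAP}} = \arg\max_\beta \big[ \log p(y \mid X, \beta) + \log p(\beta) \big]$$

- Gaussian prior $\beta_j \sim \mathcal{N}(0, \tau^2)$: $\log p(\beta) = -\frac{1}{2\tau^2} \sum_j \beta_j^2 + \text{const}$, i.e., an L2 (ridge) penalty.
- Laplace prior $\beta_j \sim \text{Laplace}(0, b)$: $\log p(\beta) = -\frac{1}{b} \sum_j |\beta_j| + \text{const}$, i.e., an L1 (LASSO) penalty.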