Suppose we have $n$ observations. For example, consider $n$ people who each have their blood pressure ($x_1$), pulse ($x_2$), and blood glucose ($x_3$) levels measured. So there are $3$ explanatory variables measured for each person. The outcome variable is the presence or absence of obesity ($Y$). In this case, does logistic regression assume that the data are distributed as $\text{Bernoulli}(p_i)$? For example, for the first person, do we measure $x_1,x_2,x_3$ and compute $p_1$ (the probability of observing this)?
-
Obesity is not a binary variable and it is inappropriate to analyze it as such. – Frank Harrell Jul 25 '13 at 13:08
-
@FrankHarrell: The presence or absence of obesity is a binary variable. – NebulousReveal Jul 25 '13 at 13:16
-
Obesity, as an artificial clinical category, is a binary variable, but it makes no biological sense to use it as an outcome. Your real outcome is BMI or weight. – Aghila Jul 25 '13 at 16:13
2 Answers
Yes: the model is $\operatorname{logit} p_i = \beta_0 +\beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}$.
That's true for bog-standard logistic regression anyway - the term is sometimes used where there's an extra parameter for dispersion, or for an estimating equation approach for which the Bernoulli model isn't assumed.
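As a minimal sketch of what the model above says for one person: the linear predictor is mapped through the inverse logit to give $p_i$, and the observed $Y_i$ then contributes a Bernoulli log-likelihood term. The coefficient values in `beta` and the helper names are hypothetical, chosen only for illustration; they are not estimates from any data.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients (beta_0, beta_1, beta_2, beta_3); illustration only.
beta = (-12.0, 0.05, 0.06, 0.5)

def prob_obese(x1, x2, x3):
    """p_i implied by logit(p_i) = b0 + b1*x1 + b2*x2 + b3*x3."""
    eta = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3
    return inv_logit(eta)

def bernoulli_loglik(y, p):
    """Log-likelihood contribution of one observation Y_i = y under Bernoulli(p)."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)

# One person: blood pressure 90, pulse 80, glucose 6 (units as in the question).
p1 = prob_obese(90, 80, 6)
print(p1, bernoulli_loglik(1, p1), bernoulli_loglik(0, p1))
```

Each person gets their own `p1`-style probability from their own covariates, which is all that "the data are $\text{Bernoulli}(p_i)$" means here.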
Re your comment: $\sum_{i=1}^{m_j} Y_{ij}$ has a binomial distribution $\operatorname{Bin}(m_j,p_j)$ for groups of $m_j$ people (from the original $n$) who have the same covariate pattern—the same blood pressure, pulse rate & glucose levels—& therefore the same probability $p_j$ of obesity. If no-one has the same covariate pattern, then there are $n$ groups, each with $m_j=1$, i.e. $n$ different Bernoulli distributions. To be clear, for each individual person $Y_i\sim\operatorname{Bin}(1,p_i)\equiv\operatorname{Bern}(p_i)$, & as @Frank says, there's no real need to consider people grouped together by covariate pattern, though it's sometimes useful for diagnostics.
To be really clear, if your model says this:–
Tom: 90 mmHg, 80 /min, 6 mmol/l => 60% chance of obesity
Dick: 90 mmHg, 80 /min, 6 mmol/l => 60%
Harry: 60 mmHg, 60 /min, 5 mmol/l => 20%
you can write this:–
$$Y_{\mathrm{Tom}}+Y_{\mathrm{Dick}}\sim \operatorname{Bin}(2,60\%)$$ $$Y_{\mathrm{Harry}}\sim \operatorname{Bin}(1,20\%)\equiv\operatorname{Bern}(20\%)$$
or this:–
$$Y_{\mathrm{Tom}}\sim \operatorname{Bin}(1,60\%)\equiv\operatorname{Bern}(60\%)$$ $$Y_{\mathrm{Dick}}\sim \operatorname{Bin}(1,60\%)\equiv\operatorname{Bern}(60\%)$$ $$Y_{\mathrm{Harry}}\sim \operatorname{Bin}(1,20\%)\equiv\operatorname{Bern}(20\%)$$
Note that $Y_{\mathrm{Tom}}+Y_{\mathrm{Dick}}+Y_{\mathrm{Harry}}$ is not binomially distributed because there's not a common probability for each person.
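The Tom, Dick and Harry example can be checked numerically. The sketch below assumes the three outcomes are independent (the usual logistic regression assumption) and computes the exact distribution of a sum of Bernoulli variables by convolution; the function names are made up for this illustration.

```python
from math import comb

def sum_of_bernoullis_pmf(probs):
    """Exact pmf of a sum of independent Bernoulli(p) variables, by convolution."""
    pmf = [1.0]  # the empty sum is 0 with probability 1
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1 - p)   # this person is not obese
            new[k + 1] += mass * p     # this person is obese
        pmf = new
    return pmf

def binomial_pmf(n, p):
    """pmf of Bin(n, p) over 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Tom and Dick share p = 0.6, so their sum matches Bin(2, 0.6).
tom_dick = sum_of_bernoullis_pmf([0.6, 0.6])
print(tom_dick, binomial_pmf(2, 0.6))

# Adding Harry (p = 0.2) breaks the common-probability requirement.
all_three = sum_of_bernoullis_pmf([0.6, 0.6, 0.2])

# If the sum were Bin(3, p), then P(3) = p^3 and P(0) = (1 - p)^3
# would have to imply the same p. They do not.
p_from_top = all_three[3] ** (1 / 3)
p_from_bottom = 1 - all_three[0] ** (1 / 3)
print(p_from_top, p_from_bottom)
```

The two implied values of $p$ disagree, so no single $\operatorname{Bin}(3,p)$ reproduces the distribution of the three-person sum.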

-
But what exactly does it mean that the data are distributed as $\text{Bernoulli}(p_i)$? Each person has a different probability of being obese depending on the values of their covariates? – NebulousReveal Jul 17 '13 at 14:47
-
@HongOoi: So if we say that the data are distributed as $B(n_{x_i},p_{x_i})$ (binomial distribution) what would this mean? For example, for the first person, the distribution would be $B(n_{x_1}, p_{x_1})$. Does $n_{x_1}$ represent the number of covariates observed for person $1$ (i.e. $n_{x_1}$ can be $0,1,2$ or $3$ in our example)? – NebulousReveal Jul 17 '13 at 14:52
-
Not at all. Just consider one trial per person and keep it simple. – Frank Harrell Jul 17 '13 at 15:13
-
@FrankHarrell: Trial in this case means explanatory variable? Measuring $\textbf{x}_1 = (x_1,x_2,x_3)$ for one person would be $1$ trial. – NebulousReveal Jul 17 '13 at 15:20
-
No, please read above. A trial is an observation of $Y=0, 1$ for one subject. – Frank Harrell Jul 17 '13 at 16:19
As @Scortchi correctly notes, the answer is yes. However, I think this is not quite the right question.
I suspect what you are wondering about is the way that the probability, $p_i$, is related to the explanatory variables. In generalized linear models, this is done via a link function. The default link function for binary GLiMs is the logit. However, if BMI is normally distributed but was categorized as obese / not obese for the study, then your response variable depends on a hidden Gaussian variable, and a different link function is appropriate (namely the probit). For more on this topic, you may want to read my answer here: difference-between-logit-and-probit-models.
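To make the hidden-variable argument concrete, here is a rough sketch assuming BMI is normally distributed around a person-specific mean and that "obese" means BMI above a fixed cutoff. The values of `CUTOFF` and `SD` and the function names are hypothetical, picked only for illustration. Under those assumptions the probability of obesity is a normal CDF of the standardized distance to the cutoff, which is the probit form rather than the logit.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF, i.e. the inverse probit link."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_logit(eta):
    """Inverse logit link, shown for comparison."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical values, for illustration only.
CUTOFF = 30.0  # BMI threshold used to label someone obese
SD = 4.0       # standard deviation of BMI around its person-specific mean

def prob_obese_probit(mean_bmi):
    """P(BMI > CUTOFF) when BMI ~ Normal(mean_bmi, SD^2)."""
    return normal_cdf((mean_bmi - CUTOFF) / SD)

def simulate_prob_obese(mean_bmi, n=50_000, seed=0):
    """Monte Carlo check: draw latent BMI values and dichotomize them."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(mean_bmi, SD) > CUTOFF for _ in range(n))
    return hits / n

exact = prob_obese_probit(32.0)
approx = simulate_prob_obese(32.0)
print(exact, approx)

# Both links give 0.5 at zero, but they differ away from it.
for eta in (0.0, 1.0, 2.0):
    print(eta, normal_cdf(eta), inv_logit(eta))
```

The simulated proportion of dichotomized draws should sit close to the probit probability, which is the sense in which thresholding a Gaussian variable leads to a probit model.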

-
Thanks. What would $B(n_{x_i}, p_{x_i})$ (binomial distribution) mean in our example? In our example, would $n_{x_i} = 3$ for all $n$ people because all of the covariates are measured? – NebulousReveal Jul 17 '13 at 15:00
-
You have 3 explanatory variables, so you need those subscripts to be vectors. But for simplicity's sake, imagine you have only 1 variable $X$. $\mathcal B(n_{x_i},~p_{x_i})$ is the conditional distribution of $Y$ (i.e., `obesity`) when $X=x_i$: given that there are $n_{x_i}$ such people & their probability of being obese is $p_{x_i}$, that is the distribution that describes the number of obese people you will see. – gung - Reinstate Monica Jul 17 '13 at 15:09
-
So it would be $B(n_{(x_{1}, x_2, x_3)}, p_{(x_1,x_2,x_3)})$? Or we could write it as $B(n_{\textbf{x}_i},p_i)$. – NebulousReveal Jul 17 '13 at 15:12
-
I suppose you could write it $\mathcal{Bin}(n_{\bf x_i},~p_{\bf x_i})$. I'm generally not too concerned about this. The important points about what I said above are the same whether you have 1 explanatory variable, 3, or 70. – gung - Reinstate Monica Jul 17 '13 at 15:21