
Suppose we have $n$ observations. For example, consider $n$ people who each have their blood pressure ($x_1$), pulse ($x_2$), and blood glucose ($x_3$) levels measured. So there are $3$ explanatory variables measured for each person. The outcome variable is presence or absence of obesity ($Y$). In this case, does logistic regression assume that the data are distributed as $\text{Bernoulli}(p_i)$? For example, for the first person, we measure $x_1,x_2,x_3$ and compute $p_1$, the probability that this person is obese?

NebulousReveal

2 Answers


Yes: the model is $\operatorname{logit} p_i = \beta_0 +\beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}$.

That's true for bog-standard logistic regression, anyway: the term is sometimes also used where there's an extra dispersion parameter, or for an estimating-equation approach in which the Bernoulli model isn't assumed.
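If it helps to see the assumption concretely, here's a minimal simulation sketch (Python/numpy, with made-up coefficient values and covariate ranges) of what the model asserts: each person's covariates pass through the linear predictor and the inverse logit to give their own $p_i$, and the outcome is a single Bernoulli draw with that probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coefficients (illustrative only, not estimates from real data):
# intercept, blood pressure (x1), pulse (x2), blood glucose (x3)
beta = np.array([-12.0, 0.05, 0.02, 0.6])

n = 5
X = np.column_stack([
    np.ones(n),                # intercept column
    rng.normal(120, 15, n),    # x1: blood pressure (mmHg)
    rng.normal(75, 10, n),     # x2: pulse (/min)
    rng.normal(5.5, 1.0, n),   # x3: blood glucose (mmol/l)
])

eta = X @ beta                  # linear predictor: logit(p_i)
p = 1.0 / (1.0 + np.exp(-eta))  # inverse logit gives each person's p_i
y = rng.binomial(1, p)          # Y_i ~ Bernoulli(p_i), one draw per person

print(np.round(p, 3), y)
```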

Re your comment: $\sum_{i=1}^{m_j} Y_{ij}$ has a binomial distribution $\operatorname{Bin}(m_j,p_j)$ for groups of $m_j$ people (from the original $n$) who have the same covariate pattern—the same blood pressure, pulse rate & glucose levels—& therefore the same probability $p_j$ of obesity. If no-one has the same covariate pattern, then there are $n$ groups, each with $m_j=1$, i.e. $n$ different Bernoulli distributions. To be clear, for each individual person $Y_i\sim\operatorname{Bin}(1,p_i)\equiv\operatorname{Bern}(p_i)$, & as @Frank says, there's no real need to consider people grouped together by covariate pattern, though it's sometimes useful for diagnostics.

To be really clear, if your model says this:–

Tom: 90 mmHg, 80 /min, 6 mmol/l => 60% chance of obesity

Dick: 90 mmHg, 80 /min, 6 mmol/l => 60%

Harry: 60 mmHg, 60 /min, 5 mmol/l => 20%

you can write this:–

$$Y_{\mathrm{Tom}}+Y_{\mathrm{Dick}}\sim \operatorname{Bin}(2,60\%)$$ $$Y_{\mathrm{Harry}}\sim \operatorname{Bin}(1,20\%)\equiv\operatorname{Bern}(20\%)$$

or this:–

$$Y_{\mathrm{Tom}}\sim \operatorname{Bin}(1,60\%)\equiv\operatorname{Bern}(60\%)$$ $$Y_{\mathrm{Dick}}\sim \operatorname{Bin}(1,60\%)\equiv\operatorname{Bern}(60\%)$$ $$Y_{\mathrm{Harry}}\sim \operatorname{Bin}(1,20\%)\equiv\operatorname{Bern}(20\%)$$

Note that $Y_{\mathrm{Tom}}+Y_{\mathrm{Dick}}+Y_{\mathrm{Harry}}$ is not binomially distributed because there's not a common probability for each person.
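If you want a quick numerical check of why the two bookkeeping choices are equivalent, here's a small sketch (Python/scipy, using the illustrative 60% figure above): the grouped binomial log-probability for Tom & Dick differs from the log-probability of any one ordering of their individual Bernoulli outcomes only by $\log\binom{2}{k}$, which doesn't involve $p$, so both views give the same likelihood for the model parameters.

```python
from math import comb, log
from scipy.stats import bernoulli, binom

p = 0.6                          # common probability for Tom & Dick
for k in (0, 1, 2):              # k = number of obese people among the two
    grouped = binom.logpmf(k, 2, p)
    # log-probability of one particular ordering with k people obese
    # and (2 - k) not obese
    one_ordering = k * bernoulli.logpmf(1, p) + (2 - k) * bernoulli.logpmf(0, p)
    # the binomial pmf just adds the log of the number of orderings,
    # which doesn't depend on p
    assert abs(grouped - (one_ordering + log(comb(2, k)))) < 1e-12
print("binomial pmf = Bernoulli product * C(2, k) for k = 0, 1, 2")
```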

Scortchi - Reinstate Monica
  • But what does it mean exactly that the data are distributed as $\text{Bernoulli}(p_i)$? Each person has a different probability of being obese depending on the values of their covariates? – NebulousReveal Jul 17 '13 at 14:47
  • 2
  • @guest43434 Yes, that's exactly what it means. – Hong Ooi Jul 17 '13 at 14:47
  • @HongOoi: So if we say that the data are distributed as $B(n_{x_i},p_{x_i})$ (binomial distribution) what would this mean? For example, for the first person, the distribution would be $B(n_{x_1}, p_{x_1})$. Does $n_{x_1}$ represent the number of covariates observed for person $1$ (i.e. $n_{x_1}$ can be $0,1,2$ or $3$ in our example)? – NebulousReveal Jul 17 '13 at 14:52
  • 1
  • Not at all. Just consider one trial per person and keep it simple. – Frank Harrell Jul 17 '13 at 15:13
  • @FrankHarrell: Trial in this case means explanatory variable? Measuring $\textbf{x}_1 = (x_1,x_2,x_3)$ for one person would be $1$ trial. – NebulousReveal Jul 17 '13 at 15:20
  • No, please read above. A trial is an observation of $Y=0, 1$ for one subject. – Frank Harrell Jul 17 '13 at 16:19

As @Scortchi correctly notes, the answer is yes. However, I think this is not quite the right question.

I suspect what you are wondering about is the way that the probability, $p_i$, is related to the explanatory variables. In generalized linear models, this is done via a link function. The default link function for binary GLiMs is the logit. However, if BMI is normally distributed but was categorized as obese / not obese for the study, then your response variable depends on a hidden Gaussian variable, and a different link function is appropriate (namely, the probit). For more on this topic, you may want to read my answer here: difference-between-logit-and-probit-models.
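For a concrete (purely simulated) illustration of that latent-variable story, the sketch below (Python, assuming numpy and statsmodels are available) generates a hidden Gaussian "BMI-like" score, dichotomises it at a threshold, and fits both links: the probit coefficients land near the latent model's coefficients, while the logit coefficients come out larger by a roughly constant factor.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000

x = rng.normal(0, 1, n)                        # a single standardised covariate
latent = 0.5 + 1.2 * x + rng.normal(0, 1, n)   # hidden Gaussian "BMI-like" score
y = (latent > 0).astype(int)                   # dichotomised: 1 = obese, 0 = not

X = sm.add_constant(x)
probit_fit = sm.Probit(y, X).fit(disp=0)
logit_fit = sm.Logit(y, X).fit(disp=0)

# Probit coefficients should be close to the latent (0.5, 1.2);
# logit coefficients will be larger by roughly a factor of 1.6-1.8.
print("probit:", np.round(probit_fit.params, 2))
print("logit: ", np.round(logit_fit.params, 2))
```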

gung - Reinstate Monica
  • Thanks. What would $B(n_{x_i}, p_{x_i})$ (a binomial distribution) mean in our example? Would $n_{x_i} = 3$ for all $n$ people, because all of the covariates are measured? – NebulousReveal Jul 17 '13 at 15:00
  • 1
  • You have 3 explanatory variables, so you need those subscripts to be vectors. But for simplicity's sake, imagine you have only 1 variable $X$. $\mathcal B(n_{x_i},~p_{x_i})$ is the conditional distribution of $Y$ (ie, `obesity`) when $X=x_i$: given that there are $n_{x_i}$ such people & their probability of being obese is $p_{x_i}$, that is the distribution that describes the number of obese people you will see. – gung - Reinstate Monica Jul 17 '13 at 15:09
  • So it would be $B(n_{(x_{1}, x_2, x_3)}, p_{(x_1,x_2,x_3)})$? Or we could write it as $B(n_{\textbf{x}_i},p_i)$. – NebulousReveal Jul 17 '13 at 15:12
  • I suppose you could write it $\mathcal{Bin}(n_{\bf x_i},~p_{\bf x_i})$. I'm generally not too concerned about this. The important points about what I said above are the same whether you have 1 explanatory variable, 3, or 70. – gung - Reinstate Monica Jul 17 '13 at 15:21