How does the logit link handle binomial (1/0) data?

Question

I have a data set that contains a continuous explanatory variable and a binary response recorded as successes (1) and failures (0). For example,

require(stats)

test.data <- data.frame(variable = runif(1000,100,200))
make.data <- function(x){ 
  if(runif(1,0,1) <= ((x + runif(1,-50,50) - 100)/100)){1} else {0} 
}
test.data$response <- sapply(test.data$variable, make.data)

head(test.data)
#  variable response
#1 171.4345        1
#2 186.9876        0
#3 122.4847        0
#4 189.0977        1
#5 109.0487        0
#6 157.7554        1

It's easy enough to run a glm on this data and get valid results, e.g.

glm.test <- glm(response ~ variable, data = test.data, family = binomial("logit"))

Somehow, the logit link function built into glm seems to be able to cope with responses that are entirely zeros and ones. If I apply the link function manually, e.g.

logit_func <- make.link("logit")$linkfun

test.data$link_response <- sapply(test.data$response, logit_func)

For obvious reasons I get a returned array of +Inf and -Inf.

head(test.data)
#  variable response link_response
#1 185.1213        1           Inf
#2 150.7970        1           Inf
#3 178.1121        0          -Inf
#4 127.2224        1           Inf
#5 132.4209        0          -Inf
#6 195.1341        1           Inf

So my question is: what is the link function built into glm doing that the standard logit link function is not? How could I emulate it?

There's an excellent discussion of what link functions do [here](http://stats.stackexchange.com/a/30909/17230). They're not transformations applied to the response. — Scortchi - Reinstate Monica, Aug 26 '15 at 16:36

score 4 · Accepted Answer · edited Aug 26 '15 at 20:05

$\newcommand{\variable}{\rm variable}$The link function connects the parameter of the distribution (in this example, $p$ of the Bernoulli distribution) to the linear predictor $\eta$ (in this example, $b_0+b_1\times\variable$):

$\log(p_i/(1-p_i))=b_0+b_1\times\variable_i$

That $p_i$ then generates the outcomes $0$ and $1$ through the Bernoulli probability function $p_i^{y_i}(1-p_i)^{1-y_i}$.

The link function does not map directly from or to the response.
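To see this in R, here is a minimal sketch on simulated data (the coefficients -6 and 0.04 and the names `x`, `y`, `p_true` are assumptions made for illustration, not the asker's `test.data`). It checks that the logit is applied to the fitted probabilities, which lie strictly between 0 and 1, and never to the raw 0/1 responses:

```r
# Simulated data; the generating coefficients are illustrative assumptions
set.seed(1)
x <- runif(500, 100, 200)
p_true <- plogis(-6 + 0.04 * x)            # assumed true success probabilities
y <- rbinom(500, size = 1, prob = p_true)  # 0/1 responses

fit <- glm(y ~ x, family = binomial("logit"))

eta   <- predict(fit, type = "link")  # linear predictor b0 + b1 * x
p_hat <- fitted(fit)                  # estimated probabilities p_i

# The link function applied to the fitted probabilities recovers eta
logit <- make.link("logit")$linkfun
all.equal(unname(logit(p_hat)), unname(eta))

# No infinities: the logit never sees the 0/1 values themselves
all(is.finite(eta))
```

Applying `logit` to `y` instead of `p_hat` reproduces the `Inf`/`-Inf` column from the question.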

score 3 · Answer 2 · edited Jul 02 '21 at 16:30

In the case of logistic regression, you have a response variable $y_i$ that is 0/1 and (in the univariate case) one explanatory variable.

So you have, for each case in your sample, a binary outcome $y_i$ and a value for $x_i$.

The idea is that the outcome 0/1 is the outcome of a Bernoulli variable with success probability $p_i$ that depends on $x_i$.

For the case $i$ the probability that the outcome is $y_i \in \{0,1\}$ is $p_i^{y_i}(1-p_i)^{1-y_i}$. Indeed, the probability of $y_i=1$ is $p_i$, obtained by substituting $y_i=1$ in $p_i^{y_i}(1-p_i)^{1-y_i}$, and the probability of $y_i=0$ is $1-p_i$, obtained by substituting $y_i=0$ in $p_i^{y_i}(1-p_i)^{1-y_i}$.

It is further assumed that the probability $p_i$ depends on $x_i$, i.e. $p_i(x_i)$. So for case $i$, the probability to observe the outcome $y_i$ is equal to $p_i(x_i)^{y_i}(1-p_i(x_i))^{1-y_i}$.

The probability to observe all the $y_i$ for all cases in your sample is thus (if all the observations are independent)

$\prod_{i=1}^n p_i(x_i)^{y_i}(1-p_i(x_i))^{1-y_i}$.

This is where the link function comes in: it is assumed that the dependence of $p_i$ on $x_i$ has a very particular functional form, namely

$p_i(x_i)=\frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}$.
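As a side note, this functional form is the logit link written the other way round. Writing $\eta_i=\beta_0+\beta_1x_i$ (the symbol $\eta_i$ is introduced here only as shorthand), solving for $\eta_i$ gives

$1-p_i(x_i)=\frac{e^{-\eta_i}}{1+e^{-\eta_i}} \quad\Rightarrow\quad \frac{p_i(x_i)}{1-p_i(x_i)}=e^{\eta_i} \quad\Rightarrow\quad \log\frac{p_i(x_i)}{1-p_i(x_i)}=\beta_0+\beta_1x_i$,

so the logit acts on the probability $p_i$, not on the observed $y_i$.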

So the probability to observe all $y_i$ for all cases in the sample is given by:

$\prod_{i=1}^n \left(\frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}\right)^{y_i}\left(1-\frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}\right)^{1-y_i}$.

Note that the parameters $\beta_0$ and $\beta_1$ are what we are looking for, and if we see this probability as a function of these unknown parameters, then we get the likelihood function:

$L(\beta_0, \beta_1)=\prod_{i=1}^n \left(\frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}\right)^{y_i}\left(1-\frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}\right)^{1-y_i}$.

As the $x_i$ and $y_i$ are known from our sample, we can find the values $\hat{\beta}_0$ and $\hat{\beta}_1$ that maximise the likelihood.

So, as you observed (and as @Scortchi said), you cannot simply transform the zeros and ones; instead, you have to model each case in your sample as a Bernoulli variable with a success probability that depends on $x_i$, and estimate the parameters by maximum likelihood.
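One way to emulate this by hand is to maximise the Bernoulli log-likelihood directly. The sketch below uses simulated data with assumed coefficients, and it standardises the predictor purely to keep the optimiser well conditioned (that step is my addition, not part of the derivation above). `glm` itself uses iteratively reweighted least squares rather than `optim`, so this only illustrates that both target the same likelihood:

```r
# Simulated data; the generating coefficients are illustrative assumptions
set.seed(42)
x <- runif(1000, 100, 200)
y <- rbinom(1000, size = 1, prob = plogis(-6 + 0.04 * x))

xs <- as.numeric(scale(x))  # standardised predictor (for numerical stability only)

# Negative log-likelihood: -sum( y*log(p) + (1 - y)*log(1 - p) ),
# with p = 1 / (1 + exp(-eta)); note that log(1 - p) = log(plogis(-eta))
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * xs
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

opt <- optim(c(0, 0), neg_loglik, method = "BFGS",
             control = list(reltol = 1e-12))
fit <- glm(y ~ xs, family = binomial("logit"))

# The two sets of estimates should agree to several decimal places
rbind(optim = opt$par, glm = unname(coef(fit)))
```

Nowhere in `neg_loglik` is the logit applied to `y`; the 0/1 values only select which of $\log p_i$ or $\log(1-p_i)$ contributes to the sum.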

This is very good as is, but can you please consider explicitly stating the regression equation to tie it up with the rest? — Antoni Parellada, Aug 26 '15 at 20:38
Never mind, it doesn't really apply... Sorry for the confusion. — Antoni Parellada, Aug 26 '15 at 21:09

score 1 · Answer 3 · answered Aug 27 '15 at 03:32

The existing answers are both right, but let me come at this another way. Let's start from the beginning. You have $X$ data and $Y$ data, and you wonder how they are related. So why not just run a simple regression? Of course, in OLS regression, the residuals are supposed to be normally distributed (and these won't be), but that is generally the least of the assumptions. Lots of times our residuals are not exactly normal, and we don't mind, so what's the big deal here? (We also assume homoscedasticity, which won't hold here, and that is a somewhat more important assumption, but I still don't think that fully captures the problem with using OLS regression.)

The real problem, I'd argue, is that OLS regression means we aren't thinking about our data the right way. For example, if you fit a linear regression, the fitted line will eventually go outside of the $[0, 1]$ interval. This may not happen within the range of $X$ values in your dataset, but the fitted line nonetheless implies predicted $Y$ values outside that interval with some $X$ values and there is nothing that says those $X$ values are impermissible.
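A quick sketch of that point, on simulated data with assumed coefficients (the prediction points 0 and 300 are chosen arbitrarily outside the observed range of 100 to 200):

```r
# Simulated data; the generating coefficients are illustrative assumptions
set.seed(7)
x <- runif(1000, 100, 200)
y <- rbinom(1000, size = 1, prob = plogis(-6 + 0.04 * x))

ols_fit   <- lm(y ~ x)                                   # linear probability model
logit_fit <- glm(y ~ x, family = binomial("logit"))      # logistic regression

new_x <- data.frame(x = c(0, 150, 300))  # 0 and 300 lie outside the observed range

ols_pred   <- predict(ols_fit, new_x)
logit_pred <- predict(logit_fit, new_x, type = "response")

# OLS predictions fall below 0 and above 1; the logistic ones stay inside (0, 1)
cbind(new_x, ols = ols_pred, logit = logit_pred)
```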

When we fit an OLS regression, what we are actually doing is predicting $\mu$, the parameter that controls the behavior of the normal distribution. (Not coincidentally, $\mu$ is the mean of a normal distribution.) Since your data are $0$'s and $1$'s, they are distributed as Bernoulli / binomial, and if we think clearly about this, we need to model $\pi$, the parameter that controls the behavior of a Bernoulli ($\pi$, in turn, is the mean of a Bernoulli).

The problem is that the right-hand side / structural part of your regression equation, $\beta_0 + \beta_1X$, can ultimately range from negative infinity to positive infinity, as noted above. But $\pi$ can only range from $0$ to $1$. Thus, we need to transform $\pi$ so that the transformed version (that's the logit) can range from negative infinity to positive infinity. That is, the transformation is being applied to the parameter, not the data.
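In base R the logit and its inverse are available as `qlogis()` and `plogis()`. A small sketch (the grid values are arbitrary) of how the transformation behaves on a parameter versus on raw 0/1 data:

```r
# Values of the parameter pi strictly between 0 and 1
pi_grid <- c(0.001, 0.1, 0.5, 0.9, 0.999)
qlogis(pi_grid)      # logit of the parameter: finite values on the whole real line

# The raw data values 0 and 1 are the boundary cases
qlogis(c(0, 1))      # -Inf and Inf, as seen in the question

# Any real-valued linear predictor maps back inside (0, 1)
eta <- c(-20, -2, 0, 2, 20)
plogis(eta)
```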
