
Suppose I have a $2 \times 2$ table that looks like:

            Disease       No Disease
Treatment         55                67
Control           42                34

I would like to do a logistic regression in R on this table. I understand that the standard way is to use the glm function with a cbind function in the response. In other words, the code looks like:

glm(formula = cbind(c(55,67),c(42,34)) ~ as.factor(c(1, 0)), family = binomial())

I am wondering why R requires us to use the cbind function and why simply using proportions is not sufficient. Is there a way to write this out explicitly as a formula? What would it look like in the form of:

$$ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X $$

where $X = 1$ if we have treatment and $X=0$ for control?

Right now it seems like I am regressing on a matrix as the dependent variable.

kjetil b halvorsen
user321627
  • You can fit binary logistic models just by using 0/1 as the outcome. Only when you have binomial with 2+ trials do you need to do this cbind business. – gammer Feb 02 '17 at 03:56
  • @gammer Could I ask what you mean by a binomial with $2+$ trials? Don't all models have $2+$ trials? Thanks! – user321627 Feb 02 '17 at 04:26
  • I just mean if `y` is binary (i.e. binomial with 1 trial) then you can just do `glm(y~x,family=binomial)`, etc., but if `y` is binomial with `n`(>1 for at least one case) trials then you need to do the whole `glm(cbind(y,n-y)~x,family=binomial)` thing – gammer Feb 02 '17 at 04:30
  • As an alternative to the `cbind` option above, I created a huge vector of $1$ or $0$'s. My data matrix converts the table into `DATA – user321627 Feb 02 '17 at 04:44
  • The models are equivalent so that's good that the coefficient and p-value are the same...Regarding the deviance, in the binomial (non-binary) case, you've set it up as only 2 binomial measurements (grouped by the two values of a single binary predictor). So, the observed frequencies can exactly match the modeled frequencies (i.e. your model is saturated), so the residual deviance is zero. That's not possible in the binary outcome formulation because the fitted probabilities, unless you have complete separation, will not be exactly "1" or "0" ever. – gammer Feb 02 '17 at 05:04
  • Generally, if you have a different binomial measure for each level of a categorical predictor (as was the case with your first formulation), you will get zero residual deviance because you're fitting the saturated model. – gammer Feb 02 '17 at 05:05
  • I see, so you are saying that the deviances don't really matter here and rather it's only the coefficients we should care about? – user321627 Feb 02 '17 at 05:13
  • The deviances matter if you're comparing models but not otherwise. In the case of the first model, it doesn't really make sense to compare models when one of the two models fits perfectly in a trivial way and the other has no predictors. You should analyze the data as it arose in the first place, so if each of the trials is a single trial from a different person (for example), analyze it as binary data. But if, for example, each person did ten tasks and you measured the number of successes out of ten, then model `cbind(y,10-y)` as the outcome variable. – gammer Feb 02 '17 at 05:17
  • See also https://stats.stackexchange.com/questions/97515/what-does-likelihood-is-only-defined-up-to-a-multiplicative-constant-of-proport/97522#97522 – kjetil b halvorsen May 15 '21 at 21:57

2 Answers


First I show how you can specify a formula using aggregated data with proportions and weights. Then I show how you could specify a formula after dis-aggregating your data to individual observations.

The documentation in `glm` indicates that:

"For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes"

I create new columns `total` and `proportion_disease` in `df` for the 'number of trials' and 'proportion of successes', respectively.

library(dplyr)
df <- tibble(treatment_status = c("treatment", "no_treatment"),
             disease = c(55, 42),
             no_disease = c(67, 34)) %>% 
  mutate(total = no_disease + disease,
         proportion_disease = disease / total)

model_weighted <- glm(proportion_disease ~ treatment_status, data = df, family = binomial("logit"), weights = total)

The above weighted approach takes in aggregated data and will give the same solution as the cbind method but allows you to specify a formula. (Below is equivalent to the Original Poster's method, but with cbind(c(55,42), c(67,34)) rather than cbind(c(55,67), c(42,34)), so that 'Disease' rather than 'Treatment' is the response variable.)

model_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status, data = df, family = binomial("logit"))  

You could also just dis-aggregate your data into individual observations and pass these into glm (allowing you to specify a formula as well).

df_expanded <- tibble(disease_status = c(1, 1, 0, 0), 
                      treatment_status = rep(c("treatment", "control"), 2)) %>%
  .[c(rep(1, 55), rep(2, 42), rep(3, 67), rep(4, 34)), ]  # repeat each of the 4 rows by its cell count

model_expanded <- glm(disease_status ~ treatment_status, data = df_expanded, family = binomial("logit"))

Let's compare these now by passing each model into summary. model_weighted and model_cbinded both produce the exact same results. model_expanded produces the same coefficients and standard errors, though outputs different degrees of freedom, deviance, AIC, etc. (corresponding with the number of rows/observations).

    > lapply(list(model_weighted, model_cbinded, model_expanded), summary)
[[1]]

Call:
glm(formula = proportion_disease ~ treatment_status, family = binomial("logit"), 
    data = df, weights = total)

Deviance Residuals: 
[1]  0  0

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2


[[2]]

Call:
glm(formula = cbind(disease, no_disease) ~ treatment_status, 
    family = binomial("logit"), data = df)

Deviance Residuals: 
[1]  0  0

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2


[[3]]

Call:
glm(formula = disease_status ~ treatment_status, family = binomial("logit"), 
    data = df_expanded)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.268  -1.095  -1.095   1.262   1.262  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 274.41  on 197  degrees of freedom
Residual deviance: 272.46  on 196  degrees of freedom
AIC: 276.46

Number of Fisher Scoring iterations: 3

(See R-bloggers for a discussion of the `weights` parameter of `glm` in the regression context.)

Bryan Shalloway
  • I think this answer could be improved by directly addressing the original question - what, explicitly, is the functional form of the regression equation here? – Silverfish Jan 29 '18 at 03:42
  • Thanks, made a few changes to hopefully make it more straight-forward. In `model_weighted`, the `proportional_disease ~ treatment_status` is the functional form and shows the output is the same as using `cbind` method (provided you've inputted `weights` and have `family = binomial`). OR you can dis-aggregate the data and feed in individual 1/0 observations (model_expanded approach)... though AIC, deviance, etc. do not come out identical. – Bryan Shalloway Jan 29 '18 at 05:15

Binomial distribution versus Bernoulli distribution

The binomial distribution can be viewed as a combination of multiple Bernoulli distributions with the same probability parameter $p$. Their probability mass functions differ only by a constant factor.

For observations $x_i \in \lbrace 0,1 \rbrace$, where $k$ is the number of observations equal to $1$:

  • PMF of a single Bernoulli experiment: $$f(x_1) = p^{x_1} (1-p)^{1-x_1}$$
  • PMF of $n$ Bernoulli experiments: $$f(x_1,x_2,\dots,x_n) = \prod_i p^{x_i} (1-p)^{1-x_i} = p^{k} (1-p)^{n-k}$$
  • PMF of the binomial distribution: $$f(x_1,x_2,\dots,x_n) = f(k,n) = {n \choose k} p^{k} (1-p)^{n-k}$$

The difference is that the binomial distribution 'ignores' the order of the observations $x_i$ and only looks at the total numbers of $x_i=1$ and $x_i=0$.
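As a quick numerical check (a small sketch added here, using arbitrary made-up values of $p$ and $x$, not the data from the question), the two likelihoods differ only by the constant factor ${n \choose k}$:

    ## Bernoulli product likelihood vs. binomial likelihood (arbitrary example values)
    p <- 0.3
    x <- c(1, 0, 1, 1, 0)   # five hypothetical 0/1 observations
    n <- length(x)
    k <- sum(x)

    bernoulli_lik <- prod(dbinom(x, size = 1, prob = p))  # p^k * (1 - p)^(n - k)
    binomial_lik  <- dbinom(k, size = n, prob = p)        # choose(n, k) * p^k * (1 - p)^(n - k)

    binomial_lik / bernoulli_lik                          # equals choose(n, k) = 10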

Influence of sample size on likelihood distribution and error

why simply using proportions is not sufficient

The maximum likelihood estimate maximizes $p^k(1-p)^{n-k}$, which happens at $p_{max} = k/n$, the proportion of observations with $x_i=1$. So for finding the maximum likelihood estimate, the proportion is indeed sufficient.

However, when we wish to quantify the error of the estimate (for instance a standard error or confidence interval), the number of observations becomes important.
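For example (a minimal sketch, not part of the original answer; the proportion $0.8$ and the trial counts are arbitrary), fitting the same observed proportion as an intercept-only binomial GLM with different numbers of trials gives the same point estimate but different standard errors:

    ## same observed proportion, different numbers of trials supplied via 'weights'
    y_prop <- 0.8
    fit_small <- glm(y_prop ~ 1, family = binomial, weights = 10)    #   8 successes out of 10
    fit_large <- glm(y_prop ~ 1, family = binomial, weights = 1000)  # 800 successes out of 1000

    summary(fit_small)$coefficients  # same estimate, Std. Error on the logit scale ~ 0.79
    summary(fit_large)$coefficients  # same estimate, Std. Error on the logit scale ~ 0.079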

An intuitive way to see how the sample size matters is to look at the sampling distribution of the outcome of the experiment as a function of the sample size. In the image below we plot the binomial distribution for $p=0.8$ with different size parameters $n$. On the left you see that the curves shift to the right for larger $n$, which is obvious as you would expect more counts when the sample size $n$ is increased. But the distance from the expected count $np$ becomes relatively smaller. You see this on the right, where we divide the outcome (the counts) by the total sample size $n$, which gives a relative rate. The observed relative rate is more likely to be close to the true rate ($0.8$ in this graph) when the size $n$ is larger. (A related question is: How to estimate a probability of an event to occur based on its count?)

[Figure: distribution of $k$ as a function of $n$]
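A rough sketch of code that would produce a similar pair of plots (this is not the code behind the original figure; the particular values of $n$ are arbitrary):

    ## binomial distributions for p = 0.8 on the count scale and the relative-rate scale
    p <- 0.8
    par(mfrow = c(1, 2))

    plot(NULL, xlim = c(0, 110), ylim = c(0, 0.25),
         xlab = "count k", ylab = "probability", main = "counts")
    for (n in c(20, 50, 100)) lines(0:n, dbinom(0:n, n, p), type = "h")

    plot(NULL, xlim = c(0, 1), ylim = c(0, 0.25),
         xlab = "relative rate k/n", ylab = "probability", main = "relative rates")
    for (n in c(20, 50, 100)) lines((0:n) / n, dbinom(0:n, n, p), type = "h")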

Input for the GLM function

I am wondering why R requires us to use the cbind function

You are not required to use the cbind function. There are two or three methods (two or three depending on how you count: there are two ways to present the data, and the second of them can be fed to the glm function in two different ways).

  • We can use glm as if it were modelling a Bernoulli variable. In that case the data consist of $55+67+42+34 = 198$ rows, one row for each single observation (each observation can be considered a Bernoulli distributed variable).

    So the data would look like a long table

    $$\begin{array}{rrr} \text{ID} & \text{Observation of disease} & \text{Treatment}\\ 1 & 1 & 1\\ 2 & 1 & 1\\ 3 & 1 & 1\\ \vdots & \vdots & \vdots \\ 53 & 1 & 1\\ 54 & 1 & 1\\ 55 & 1 & 1\\ 56 & 0 & 1\\ 57 & 0 & 1\\ 58 & 0 & 1\\ \vdots & \vdots & \vdots \\ 120 & 0 & 1\\ 121 & 0 & 1\\ 122 & 0 & 1\\ 123 & 1 & 0\\ 124 & 1 & 0\\ 125 & 1 & 0\\ \vdots & \vdots & \vdots \\ 162 & 1 & 0\\ 163 & 1 & 0\\ 164 & 1 & 0\\ 165 & 0 & 0\\ 166 & 0 & 0\\ 167 & 0 & 0\\ \vdots & \vdots & \vdots \\ 196 & 0 & 0\\ 197 & 0 & 0\\ 198 & 0 & 0\\ \end{array}$$

     data <- data.frame(ID = 1:198, 
                        observation = c(rep(1,55),rep(0,67),rep(1,42),rep(0,34)),
                        treatment = c(rep(1,55+67), rep(0,42+34)))
     glm1 <- glm(observation ~ treatment, family = binomial, data = data) 
    
  • The previous method is often used when the independent variable is continuous. When the independent variable is categorical, we can group all the 1s and 0s within each category. In that case the data are not individual 0s and 1s but the counts of 1s and 0s.

     data <- data.frame(disease = c(55,42),
                        healthy = c(67,34),
                        treatment = c(1,0))
    
     ### the 'cbind' method gives glm the counts in each category
     glm2 <- glm(cbind(disease,healthy) ~ treatment, family = binomial, data = data)
    
     ### this method gives glm the fraction and uses 'weights' to give the number of trials
     glm3 <- glm(disease/(disease+healthy) ~ treatment, family = binomial, data = data, weights = disease+healthy)
    

The summaries of the above three different calls to glm give the same coefficient estimates. There is a difference in statistics like deviance and AIC; that is because the likelihood functions differ by a factor ${n \choose k}$, but that is not important, because the relevant measures for comparing models are likelihood ratios.

> summary(glm1)

Call:
glm(formula = observation ~ treatment, family = binomial, data = data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.268  -1.095  -1.095   1.262   1.262  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2113     0.2307   0.916    0.360
treatment    -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 274.41  on 197  degrees of freedom
Residual deviance: 272.46  on 196  degrees of freedom
AIC: 276.46

Number of Fisher Scoring iterations: 3

> summary(glm2)

Call:
glm(formula = cbind(disease, healthy) ~ treatment, family = binomial, 
    data = data)

Deviance Residuals: 
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2113     0.2307   0.916    0.360
treatment    -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2

> summary(glm3)

Call:
glm(formula = disease/(disease + healthy) ~ treatment, family = binomial, 
    data = data, weights = disease + healthy)

Deviance Residuals: 
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2113     0.2307   0.916    0.360
treatment    -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2

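As a check of that constant (a small sketch appended here, not part of the original answer), the log-likelihoods of glm1 and glm2 differ by exactly the sum of the log binomial coefficients of the two groups, which also accounts for the difference in AIC ($276.46 - 14.03 \approx 2 \times 131.2$):

    ## glm1 and glm2 have the same fitted probabilities, so their log-likelihoods
    ## differ only by the log binomial coefficients of the two groups
    as.numeric(logLik(glm2)) - as.numeric(logLik(glm1))   # ~ 131.2
    lchoose(55 + 67, 55) + lchoose(42 + 34, 42)           # the same constant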
Sextus Empiricus
  • Do you know how to generate the long table in your first method from a long dataframe in R? – Antoni Parellada Sep 13 '21 at 02:57
  • @AntoniParellada I know that you can create similar long tables with `gl` and `model.matrix`, but that creates tables with equal numbers in each category. Maybe there is some smart function that generates these long tables with unequal numbers, but I simply used a `rep` command to create the table. – Sextus Empiricus Sep 13 '21 at 06:52
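One such function (a small sketch added here, not suggested in the thread) is `tidyr::uncount()`, which expands a table of counts into one row per observation:

    ## expand the 2x2 cell counts into 198 individual rows
    library(tidyr)
    counts <- data.frame(observation = c(1, 0, 1, 0),
                         treatment   = c(1, 1, 0, 0),
                         n           = c(55, 67, 42, 34))
    data_long <- uncount(counts, n)
    nrow(data_long)  # 198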