
I was recently working on a homework assignment on binary GLMs and the following question came up when comparing solutions with a classmate. The data for the problem was given as a contingency table, say

            Case  Control
Exposed        1        2
Unexposed      3        4

for a small example. In order to make analysis in R easier, we both decided to turn this into a data frame format. I created the data frame like this:

df1 <- data.frame(
    # one row per observation: 3 exposed (1 case, 2 controls)
    # and 7 unexposed (3 cases, 4 controls)
    exposure = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
    outcome  = c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

whereas my classmate did this:

df2 <- data.frame(
  # one row per cell of the contingency table, with its count
  count    = c(1, 2, 3, 4),
  exposure = c(1, 1, 0, 0),
  outcome  = c(1, 0, 1, 0)
)

So when running models I did something like

glm(outcome ~ exposure, data = df1, family = binomial)

and my classmate did

glm(outcome ~ exposure, data = df2, family = binomial, weights = count)

and I am not sure which of these is correct. Our models produced the same parameter estimates, standard errors, and deviance; the only difference was that the residual degrees of freedom for the weighted model were much smaller than for the unweighted model.
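
Here is a minimal comparison of the two fits (using the df1 and df2 definitions above) showing what I mean:

m1 <- glm(outcome ~ exposure, data = df1, family = binomial)
m2 <- glm(outcome ~ exposure, data = df2, family = binomial, weights = count)

coef(m1); coef(m2)          # identical coefficient estimates
deviance(m1); deviance(m2)  # identical residual deviance
df.residual(m1)             # 8 = 10 rows minus 2 parameters
df.residual(m2)             # 2 = 4 rows minus 2 parameters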

So my overall question is: which of these approaches is correct? Is this the correct way to use weights in a GLM? If using weights is wrong, is there a way to have R account for the repeated observations in the degrees of freedom?

To share my thoughts: I think that specifying weights tells R what the entries of the $W$ matrix in a weighted regression should be. My linear algebra is a bit rusty, but I believe it just so happens that this produces the same estimates as having the repeated observations in the $X$ matrix of an unweighted regression. Thus R performs the weighted regression and gets the same estimates, but only counts degrees of freedom for the rows that are actually present. That is, the "weights" in a GLM are just sampling weights, not weights that can replace physically observing the same thing multiple times.
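
As a sanity check on this idea, here is a quick sketch (using the data frames above) that expands df2 by its count column; if I am right, this should reproduce the unweighted fit, including the larger residual degrees of freedom:

# repeat each row of df2 'count' times to get one row per observation
df2_long <- df2[rep(seq_len(nrow(df2)), times = df2$count), c("exposure", "outcome")]

m_long <- glm(outcome ~ exposure, data = df2_long, family = binomial)
all.equal(coef(m_long), coef(m1))  # TRUE: identical estimates
df.residual(m_long)                # 8, matching the unweighted model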

wz-billings

1 Answer


There is no difference. Both are correct (as can be seen by the fact that the outputs are the same). One treats the data as 10 Bernoulli observations, and the other treats them as grouped binomial data, the two exposure groups collectively constituting the same 10 Bernoulli trials.
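
You can check this directly in R; both specifications give the same deviances, and only the degrees of freedom differ:

m1 <- glm(outcome ~ exposure, data = df1, family = binomial)
m2 <- glm(outcome ~ exposure, data = df2, family = binomial, weights = count)

# Both report null deviance 13.46 and residual deviance 13.38;
# the dfs are 9 & 8 (Bernoulli) vs. 3 & 2 (binomial).
anova(m1, test = "Chisq")  # deviance drop .08 on 1 df, p = .78
anova(m2, test = "Chisq")  # the same test, hence the same p-value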


Update: The question seems to be about which residual df is correct to use to assess the possibility of overdispersion.

Overdispersion doesn't really apply to Bernoulli data (see: [Overdispersion in logistic regression](https://stats.stackexchange.com/q/91597/)). It does potentially apply to binomial data that really are binomial. This was a canned example, and it's not clear whether it is more correct to consider these data as Bernoulli or binomial. Overdispersion also doesn't apply with only one categorical regressor (in this case, Exposed). Even with a continuous $X$ and binomial data, inferring overdispersion can be ambiguous; it may just be that polynomial or interaction terms are needed (cf. [Test logistic regression model using residual deviance and degrees of freedom](https://stats.stackexchange.com/a/248978/7290) and [Understanding lack of fit in logistic regression](https://stats.stackexchange.com/a/233826/7290)).
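
For completeness, a sketch of the fully aggregated specification (the variable names here are my own): grouping the table by exposure gives two genuinely binomial observations, and with a single binary regressor the model is then saturated, so there is no residual deviance left with which to assess overdispersion in the first place:

# the same 2x2 table, one row per exposure group
tab <- data.frame(
  exposure = c(1, 0),
  case     = c(1, 3),  # Case column
  control  = c(2, 4)   # Control column
)

m3 <- glm(cbind(case, control) ~ exposure, data = tab, family = binomial)
summary(m3)  # same coefficients as before; residual deviance 0 on 0 df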

gung - Reinstate Monica
  • They objectively compute different degrees of freedom. Is this just a quirk of R not taking the weights into account for df calculation? – wz-billings Apr 10 '21 at 15:53
  • @wz-billings, I'm not sure what you mean by "objectively". The same set of Bernoulli trials can equivalently be considered a set of Bernoulli data w/ 1 df, or a set of binomial data w/ a different df. There is no difference. Both are correct (as can be seen by the fact that the outputs are the same). – gung - Reinstate Monica Apr 10 '21 at 17:40
  • When I run the two models they generate different degrees of freedom in R. – wz-billings Apr 11 '21 at 03:28
  • @wz-billings, right, because, again, the degrees of freedom for the same numbers, when treated as Bernoulli, is different from the degrees of freedom for them when treated as binomial. But you get the same results both ways, because both are correct. There is no difference. You are free to say they are either Bernoulli or binomial. – gung - Reinstate Monica Apr 11 '21 at 06:07
  • Sorry, I am confused: if the degrees of freedom are different, then the models are not exactly equivalent, are they? Does this mean that if I specify the data as binomial, I can still use the larger degrees of freedom for inference, regardless of the R output? My main confusion here is whether R should update the degrees of freedom with the weights, or whether the different degrees-of-freedom outputs are intended. If the two models are equivalent, I don't understand why they don't have the same degrees of freedom. – wz-billings Apr 12 '21 at 01:07
  • @wz-billings, the degrees of freedom that are correct depends on whether the data are specified as Bernoulli or binomial. You use the binomial df w/ the binomial specification, & the Bernoulli df w/ the Bernoulli specification. The betas are the same in both models, as are the p-values. – gung - Reinstate Monica Apr 12 '21 at 11:16
  • Then I don't understand how they can both be correct. Different degrees of freedom will have a big impact on inference about the model's deviance. In this case, the Bernoulli model would give a p-value of about 0.1 for the deviance, while the binomial model would give a p-value of about 0.001 for the deviance. I know that if I compare two models, then this doesn't matter and it works out either way. But which result for the overdispersion of this model do I accept? – wz-billings Apr 12 '21 at 20:59
  • @wz-billings, have you tried running the models? Both give the same Null deviance: `13.46`; both give the same Residual deviance: `13.38`. The dfs, when treated as Bernoulli, are `9` & `8`; when treated as binomial, are `3` & `2`. Thus, the chi-squared tests are the same (the difference b/t deviances, `.08`, assessed against the chi-squared distribution on the difference b/t dfs, `1`, yields the same p-value: `.78`). The p-values for both variables are also the same. – gung - Reinstate Monica Apr 13 '21 at 00:57
  • Yes, I have tried running the models; I said that I understand that when I compare two models (e.g. the null model with the current model), the df doesn't matter. But I am confused about how to interpret the distribution of the deviance on its own: the deviance should be $\chi^2_{n-p}$, and we can use this as a hypothesis test to determine whether there is overdispersion in the model. Sorry if I did not make this clear in my previous comment, but I am not sure which interpretation of the deviance is correct. Does overdispersion really depend on how the trials are conceptualized? – wz-billings Apr 14 '21 at 16:54
  • @wz-billings, I see. I didn't understand that's what your question was. Overdispersion doesn't really apply for Bernoulli data (see: [Overdispersion in logistic regression](https://stats.stackexchange.com/q/91597/)). It does potentially apply to binomial data that really are binomial. This was a canned example & it's not clear that it is more correct to consider these data as Bernoulli or binomial. – gung - Reinstate Monica Apr 14 '21 at 17:53
  • Overdispersion also doesn't apply w/ only 1 categorical regressor (Exposed). Even w/ a continuous X & binomial data, inferring overdispersion can be ambiguous; it may just be that polynomial or interaction terms are needed (cf. [Test logistic regression model using residual deviance and degrees of freedom](https://stats.stackexchange.com/a/248978/7290) & [Understanding lack of fit in logistic regression](https://stats.stackexchange.com/a/233826/7290)). – gung - Reinstate Monica Apr 14 '21 at 17:57