
What's the difference between a GLM (logistic regression) with a binary response variable that includes subject and time as covariates, and the analogous GEE model, which takes into account the correlation between measurements at multiple time points?

My GLM looks like:

Y(binary) ~ A + B1X1(subject id) + B2X2(time) 
              + B3X3(interesting continuous covariate)

with logit link function.

I'm looking for a simple (aimed at the social scientist) explanation of how and why time is treated differently in the two models and what the implications would be for interpretation.

chl
N26
  • I found these responses to related questions ([What is the difference between generalized estimating equations and GLMM?](http://stats.stackexchange.com/a/17403/930), [When to use generalized estimating equations vs. mixed effects models?](http://stats.stackexchange.com/a/16415/930)) very comprehensive, although they are about GLM *with random effects* vs. GEE. – chl Jan 24 '12 at 11:28
  • Do you really want to fit subject id as a continuous covariate? It seems strange to have the response variable be an increasing or decreasing function of id. – guest Jan 25 '12 at 08:51
  • Population averaged effects vs. subject specific effects. – Will Mar 24 '12 at 04:51
  • Here's a link to an article discussing the differences between the two: http://aje.oxfordjournals.org/content/147/7/694.full.pdf+html – Will Mar 24 '12 at 04:57
  • In addition to the questions @chl links to above, this question also discusses these ideas: [Difference between generalized linear models & generalized linear mixed models in SPSS](http://stats.stackexchange.com/questions/32419/). – gung - Reinstate Monica Oct 19 '12 at 01:56

1 Answer


There may be a better and more detailed answer out there, but I can give you some simple, quick thoughts. It appears that you are talking about using a Generalized Linear Model (e.g., a typical logistic regression) to fit data gathered from some subjects at multiple time points. At first blush, I see two glaring problems with this approach.

First, this model assumes that your data are independent given the covariates (that is, after accounting for a dummy code for each subject, akin to an individual intercept term, and a linear time trend that is the same for everybody). This is wildly unlikely to be true. Instead, there will almost certainly be autocorrelation: for example, two observations of the same individual that are close in time will tend to be more similar than two observations further apart in time, even after accounting for time. (They may well be independent if you also included a subject ID x time interaction, i.e., a unique time trend for everybody, but this would exacerbate the next problem.)
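To make the autocorrelation point concrete, here is a minimal simulation sketch. It is only an illustration under assumptions that are mine, not the asker's: each subject gets a latent continuous AR(1) series (not a binary outcome), the correlation parameter `rho = 0.7` is made up, and the helper names `simulate_subject` and `pearson` are hypothetical.

```python
import random

def simulate_subject(n_times, rho, rng):
    """Stationary AR(1) series: each value is rho * previous value + fresh noise."""
    x = rng.gauss(0.0, 1.0)
    series = [x]
    innovation_sd = (1.0 - rho ** 2) ** 0.5  # keeps the variance at 1 over time
    for _ in range(n_times - 1):
        x = rho * x + rng.gauss(0.0, innovation_sd)
        series.append(x)
    return series

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)  # fixed seed so the run is reproducible
subjects = [simulate_subject(4, 0.7, rng) for _ in range(5000)]  # hypothetical rho

first = [s[0] for s in subjects]
lag1 = pearson(first, [s[1] for s in subjects])  # adjacent time points
lag3 = pearson(first, [s[3] for s in subjects])  # three steps apart
print(round(lag1, 2), round(lag3, 2))
```

In theory the two correlations should land near 0.7 and 0.7^3 (about 0.34), so measurements that are closer in time are more alike. A model that treats all rows as independent ignores exactly this pattern.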

Second, you are going to burn up an enormous number of degrees of freedom estimating a parameter for each participant. You are likely to have relatively few degrees of freedom left with which to try to accurately estimate your parameters of interest (of course, this depends on how many measurements you have per person).
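As a rough back-of-the-envelope illustration of the degrees-of-freedom cost, here is a small sketch. The study size (50 subjects, 4 measurements each) is invented for the example, the function name `residual_df` is hypothetical, and it assumes subject is dummy coded as described above.

```python
def residual_df(n_subjects, n_timepoints, n_other_terms):
    """Residual degrees of freedom when every subject gets its own intercept."""
    n_obs = n_subjects * n_timepoints
    # Overall intercept plus (n_subjects - 1) dummies = n_subjects parameters,
    # plus the remaining slopes (here: time and X3).
    n_params = n_subjects + n_other_terms
    return n_obs - n_params

# Hypothetical study: 50 subjects, 4 measurements each, slopes for time and X3.
with_dummies = residual_df(50, 4, n_other_terms=2)
print(with_dummies)  # 200 observations - 52 parameters = 148

# For comparison, a marginal mean model has only intercept, time, and X3.
marginal_mean_params = 3
print(marginal_mean_params)
```

With only two measurements per person the same calculation leaves 48 residual degrees of freedom out of 100 observations, which is why the number of measurements per subject matters so much here.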

Ironically, the first problem means that your confidence intervals are too narrow, whereas the second means your CIs will be much wider than they would have been if you hadn't spent most of your degrees of freedom on subject terms. However, I wouldn't count on these two balancing each other out. For what it's worth, I believe that your parameter estimates would be unbiased (although I may be wrong here; with a separate parameter for each subject and only a few observations per subject, logistic regression estimates can be biased by the incidental parameters problem).

Using Generalized Estimating Equations (GEE) is appropriate in this case. When you fit a model using GEE, you specify a working correlation structure (such as AR(1)), and it can be quite reasonable to assume that your data are independent conditional on both your covariates and the correlation structure you specified. In addition, GEE estimates the population-mean association, so you needn't burn a degree of freedom for each participant; in essence you are averaging over them.
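For readers who have not seen an AR(1) correlation structure written out, here is a minimal sketch of what it looks like for four time points. The value `rho = 0.5` is purely hypothetical (a GEE routine would estimate it from the data), and the function name is my own, not part of any library.

```python
def ar1_working_correlation(n_times, rho):
    """AR(1) structure: correlation between times j and k is rho ** |j - k|."""
    return [[rho ** abs(j - k) for k in range(n_times)] for j in range(n_times)]

R = ar1_working_correlation(4, 0.5)  # hypothetical rho
for row in R:
    print(row)
# [1.0, 0.5, 0.25, 0.125]
# [0.5, 1.0, 0.5, 0.25]
# [0.25, 0.5, 1.0, 0.5]
# [0.125, 0.25, 0.5, 1.0]
```

The whole within-subject correlation pattern is described by the single parameter rho, instead of one parameter per subject.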

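As for the interpretation, as far as I am aware, its form would be the same in both cases: given that the other factors remain constant, a one-unit change in X3 is associated with a B3 change in the log odds of 'success'. One distinction, raised in the comments on the question, is worth keeping in mind: the GEE coefficient is a population-averaged effect, whereas a coefficient from a model with subject-specific terms is conditional on the subject, and with a logit link the two generally differ in magnitude.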
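Since log odds are hard to think in, here is a small sketch of how a coefficient translates into an odds ratio and a change in probability. The value `b3 = 0.5` and the baseline probability of 0.5 are made up for illustration, and the helper names are hypothetical.

```python
import math

def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit change in the covariate."""
    return math.exp(coef * delta)

def shifted_probability(p0, coef, delta=1.0):
    """Probability after moving the covariate by `delta`, starting from baseline p0."""
    log_odds = math.log(p0 / (1.0 - p0)) + coef * delta
    return 1.0 / (1.0 + math.exp(-log_odds))

b3 = 0.5  # hypothetical estimate of B3 on the log-odds scale
print(round(odds_ratio(b3), 3))                # 1.649
print(round(shifted_probability(0.5, b3), 3))  # 0.622
```

So a B3 of 0.5 would mean the odds of 'success' are multiplied by about 1.65 for each one-unit increase in X3, which moves a baseline probability of 0.5 to about 0.62.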

gung - Reinstate Monica