
I have a data set with a binomial response and am trying to determine the best way to model it. Based on this post, I believe I need to use a GEE rather than a mixed-effects model, since I am interested in population-average effects.

I am trying to model whether a salesperson will stay employed for at least 1 year. I have monthly observations of each salesperson's performance, as well as factor-level variables describing the salesperson's boss. The data set looks somewhat like this:

Sales_ID Boss_ID   Date  Total_Sales   Dept_A_Sales  Boss_factor_1 Emplyed_1yr
      1       1  1/1/92         1000            100              A           Y
      1       1  2/1/92          900             90              A           Y

There can be up to 12 rows per Sales_ID under each Boss_ID. Emplyed_1yr is constant across all observations of a given Boss_ID, Sales_ID combination.

There is a hierarchical structure here, since each salesperson is under a boss. There is also correlation between the variables. The boss's factor variables never change across the observations of a Sales_ID under a given Boss_ID; I'm not sure whether that makes a difference. The data will also be unbalanced: Sales_IDs that are employed for a full year will have 12 observations, while anyone who leaves within the first year will have only as many observations as months they were employed.

So far, from what I've read on GEE, the model will need to be set up somewhat like this:

geeglm(Emplyed_1yr ~ Total_Sales + Dept_A_Sales + Boss_factor_1, id = Sales_ID, family = binomial(link = "logit"), data = dat, corstr = "exchangeable")

What I'm not sure about is whether the mixed effects need to be included somehow. Also, is it correct to leave the data in longitudinal form, or should the rows be collapsed, perhaps by taking the average Total_Sales and Dept_A_Sales? And is the unbalanced data going to cause any issues?

Kristofersen
  • The Huber-White standard errors used in GEE adjust for the within-cluster (or individual, etc.) correlation. You cannot include random effects in a GEE because they are not individual-specific models. – GoF_Logistic Mar 01 '17 at 16:28
  • @GoF_Logistic Sorry, going to show off how little I understand about this. So in the geeglm is the ID to identify how each group fits together? e.g. two different sales people in their first month should both have an ID = 1? And this is not a unique ID for each Sales_ID/boss_ID? – Kristofersen Mar 01 '17 at 16:37
  • I've never used geeglm, but I assume ID should refer to the cluster IDs. If the pockets of correlated data correspond to people with the same value for "month", then the answer to your "e.g." is yes. – GoF_Logistic Mar 01 '17 at 16:40
  • @GoF_Logistic okay, ID is cluster ID, so that does make sense. So then for the correlation structure is that the correlation between clusters then? or is that the variable correlation? – Kristofersen Mar 01 '17 at 16:42
  • It's the working correlation structure, within cluster. Correlations between clusters are not modeled – GoF_Logistic Mar 01 '17 at 16:49

1 Answer


GEEs do not account for random effects. Random effects are estimated only in mixed models, where they provide the individual-level inference you allude to. Population-level inference primarily concerns "integrating" over, or marginalizing over, the random effects in some fashion. The rigorous formulation of this is somewhat involved, but it has been done. See Miglioretti and Heagerty.

When the link function is collapsible, like an identity or log link, the relation between conditional and marginal parameters is clearer. Logit curves, however, are non-collapsible, so the relation is less clear, except that most of the time the marginal parameters are attenuated (closer to zero) relative to the conditional ones, providing more conservative inference about the nature of the population-level effect.
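The attenuation can be seen in a small simulation. This is only a sketch on made-up data: the cluster counts, the conditional slope `beta_cond`, and the random-intercept standard deviation `sigma_b` are all assumptions chosen for illustration, and the closing formula is a commonly used approximation for the logit link with normal random intercepts, not an exact result.

```r
set.seed(1)

# Hypothetical setup: 2000 clusters (think salespeople), 6 rows each
n_clusters <- 2000
n_obs      <- 6
beta_cond  <- 1.0   # assumed conditional (cluster-specific) log-odds slope
sigma_b    <- 2.0   # assumed SD of the random intercept

id <- rep(seq_len(n_clusters), each = n_obs)
b  <- rnorm(n_clusters, sd = sigma_b)[id]   # one random intercept per cluster
x  <- rnorm(length(id))
y  <- rbinom(length(id), size = 1, prob = plogis(-0.5 + beta_cond * x + b))

# Logistic regression that ignores the clusters targets the marginal
# (population-averaged) slope; its point estimate is what a GEE with a
# working independence structure would also report.
fit_marg  <- glm(y ~ x, family = binomial)
beta_marg <- unname(coef(fit_marg)["x"])

# Approximate attenuation for a logit link with normal random intercepts:
# beta_marg ~ beta_cond / sqrt(1 + c^2 * sigma_b^2), c = 16 * sqrt(3) / (15 * pi)
c2          <- (16 * sqrt(3) / (15 * pi))^2
beta_approx <- beta_cond / sqrt(1 + c2 * sigma_b^2)

print(c(conditional = beta_cond, marginal_fit = beta_marg, approx = beta_approx))
```

The fitted marginal slope should come out noticeably smaller than the conditional slope used to generate the data, which is the attenuation described above.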

In a GEE, if you specify an exchangeable correlation structure, the fit alternates, in an EM-like fashion, between weighted least squares for the regression coefficients and estimation of the within-cluster correlation. This will produce different estimates from those under working independence, but the SEs for these estimates are right! In my experience, however, GEE GLMs work best with a working independence correlation structure, which does not violate any assumptions of independence because of the robust error estimation.
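To make the working-independence point concrete, here is a hand-rolled sketch in base R, again on simulated data. The point estimates come from an ordinary `glm`, and the cluster-robust (Huber-White) covariance is assembled manually. All variable names and simulation settings are hypothetical, and no finite-sample correction is applied; a GEE package run with an independence working correlation should report the same kind of robust standard errors.

```r
set.seed(2)

# Hypothetical clustered data: 300 clusters of 8 rows, with a
# cluster-level binary covariate z (constant within a cluster)
n_clusters <- 300
n_obs      <- 8
id <- rep(seq_len(n_clusters), each = n_obs)
z  <- rbinom(n_clusters, 1, 0.5)[id]
b  <- rnorm(n_clusters, sd = 1.5)[id]       # induces within-cluster correlation
y  <- rbinom(length(id), 1, plogis(-0.3 + 0.8 * z + b))

# Working independence: point estimates are just the ordinary logistic fit
fit <- glm(y ~ z, family = binomial)

X  <- model.matrix(fit)
mu <- fitted(fit)

# "Bread": inverse of X'WX with W = mu * (1 - mu), the model-based covariance
bread <- solve(crossprod(X * sqrt(mu * (1 - mu))))

# "Meat": for the canonical logit link the score of row i is x_i * (y_i - mu_i);
# scores are summed within each cluster before taking the outer product
cluster_scores <- rowsum(X * (y - mu), group = id)
meat <- crossprod(cluster_scores)

V_robust  <- bread %*% meat %*% bread
se_naive  <- sqrt(diag(vcov(fit)))   # assumes all rows are independent
se_robust <- sqrt(diag(V_robust))    # allows arbitrary correlation within clusters

print(round(cbind(naive = se_naive, robust = se_robust), 3))
```

With positively correlated rows and a covariate that is constant within each cluster, the robust standard error for `z` should be clearly larger than the naive one, which is why the independence working structure remains safe to use.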

AdamO