I have a data set with a binomial response and am trying to determine the best way to model it. Based on this post, I believe I need a GEE rather than a mixed-effects model, since I am interested in population-average effects.
I am trying to model whether a salesperson will stay employed for at least 1 year. I have monthly observations of each salesperson's performance, as well as factor-level variables describing the salesperson's boss. The data set looks something like this:
Sales_ID  Boss_ID  Date    Total_Sales  Dept_A_Sales  Boss_factor_1  Emplyed_1yr
1         1        1/1/92  1000         100           A              Y
1         1        2/1/92  900          90            A              Y
There can be up to 12 rows per Sales_ID under each Boss_ID. Emplyed_1yr does not change across the observations for a given Boss_ID, Sales_ID combination.
There is a hierarchical structure here, since the salesperson is under the boss. There is also correlation between the variables. The boss's factor variables never change across the observations of a Sales_ID under a given Boss_ID; I'm not sure whether that makes a difference. The data will also be unbalanced: a Sales_ID employed for the full year has 12 observations, while anyone who leaves within the first year has only as many observations as months they were employed.
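The structure described above can be checked directly. This is only a minimal sketch in base R on a made-up toy data frame (the name `dat`, the values, and the shortened number of months are all assumptions for illustration, not the real data). It counts rows per Sales_ID to show the imbalance and confirms that the response is constant within each salesperson.

```r
# Hypothetical toy data mirroring the layout above (values are made up,
# and the number of months is shortened for readability).
dat <- data.frame(
  Sales_ID      = c(1, 1, 1, 2, 2),
  Boss_ID       = c(1, 1, 1, 1, 1),
  Total_Sales   = c(1000, 900, 950, 400, 300),
  Boss_factor_1 = c("A", "A", "A", "A", "A"),
  Emplyed_1yr   = c("Y", "Y", "Y", "N", "N")
)

# Rows per salesperson: unequal counts show the imbalance.
n_obs <- table(dat$Sales_ID)
print(n_obs)

# Number of distinct outcome values within each salesperson;
# a value of 1 everywhere means the response never varies within a cluster.
n_outcomes <- tapply(dat$Emplyed_1yr, dat$Sales_ID,
                     function(x) length(unique(x)))
print(n_outcomes)
```

If every entry of `n_outcomes` is 1, the response carries one piece of information per salesperson, however many monthly rows that salesperson has.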
So far, from what I've read on GEE, the model will need to be set up somewhat like this:
geeglm(Emplyed_1yr ~ Total_Sales + Dept_A_Sales + Boss_factor_1, id = Sales_ID, family = binomial(link = "logit"), data = dat, corstr = "exchangeable")
What I'm not sure about is whether the mixed effects need to be included somehow. Also, is it correct to leave the data in longitudinal form, or should the rows be collapsed, perhaps by taking the average Total_Sales and Dept_A_Sales per salesperson? And is the unbalanced data going to cause any issues?
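To make the collapsing option concrete, here is a minimal sketch in base R. Everything in it is an assumption for illustration: the data are simulated, the names `dat`, `collapsed`, and `fit` are hypothetical, the response is coded 0/1 rather than Y/N, and fitting a plain logistic regression to the collapsed rows is just one possible way to use them, not a claim that it is the right model.

```r
# Simulated monthly data in the same layout as above (all values made up).
set.seed(1)
n_people <- 40
months   <- sample(c(3, 6, 9, 12, 12, 12), n_people, replace = TRUE)

dat <- data.frame(
  Sales_ID = rep(seq_len(n_people), times = months),
  Boss_ID  = rep(rep(1:4, length.out = n_people), times = months)
)
dat$Total_Sales   <- round(rnorm(nrow(dat), mean = 1000, sd = 200))
dat$Dept_A_Sales  <- round(dat$Total_Sales * runif(nrow(dat), 0.05, 0.15))
dat$Boss_factor_1 <- ifelse(dat$Boss_ID %% 2 == 0, "A", "B")
# Outcome is constant within a salesperson: 1 only if all 12 months exist.
dat$Emplyed_1yr   <- as.integer(rep(months, times = months) == 12)

# Collapse to one row per salesperson by averaging the monthly sales.
# The grouping variables are constant within Sales_ID, so each
# salesperson yields exactly one row.
collapsed <- aggregate(
  cbind(Total_Sales, Dept_A_Sales) ~
    Sales_ID + Boss_ID + Boss_factor_1 + Emplyed_1yr,
  data = dat, FUN = mean
)

# One possible model on the collapsed data: ordinary logistic regression.
fit <- glm(Emplyed_1yr ~ Total_Sales + Dept_A_Sales + Boss_factor_1,
           family = binomial(link = "logit"), data = collapsed)
print(summary(fit))
```

After collapsing, there is one row per Sales_ID, so the repeated-measures correlation within a salesperson is gone, although salespeople who share a Boss_ID would still be grouped.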