
I have data from a two-stage selection. Stage 1 was a random selection of groups (classes); stage 2 was a complete survey within each group. Because our groups are very different, this resulted in widely varying numbers of observations per group (class), ranging from 1 (most often) to 83.

We could, of course, calculate averages for each group and then run our analyses at the group level (level 2). However, we found a large within-group variance for both the independent and the dependent variables. Working with averages would therefore not do justice to the (legitimate) extreme cases on level 1.

  • No. of groups/clusters: ~ 1000
  • No. of level 1 observations: ~ 2500
  • No. of independent variables (only on level 1): ~ 16 (cross-correlations < .6, mostly < .3)
    • 14 of them are dichotomous (boolean)
    • 2 of them are interval-scaled, ranging from 0 to ~2
  • Working with a Poisson (negative binomial) model, as the dependent variable's distribution is over-dispersed (long-tailed)

Actually, I don't want to say anything about the groups in this step of the analysis. We found that the level 1 observations are not independent of their groups, but still very different within each group. I would like to analyse the relations between a series of independent variables on level 1 and a dependent variable on level 1.

Well, of course I don't want to inflate the alpha (Type I) error by working with a single-level model while actually having multi-level data. So I tried a mixed-effects model with the group and (to account for the over-dispersion) the observation ID as random effects, and the independent variables as fixed effects. Unfortunately, the model won't converge unless I remove IVs. I already tried a different optimizer (working with R and lme4), but as that did not help, I assume there is a general problem with my data.
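For reference, a minimal sketch of the model described above, using toy data (all object and variable names here are placeholders, not from the actual data set). The observation-level random effect (OLRE) is one common device to absorb overdispersion in a Poisson GLMM:

```r
library(lme4)

## Toy data standing in for the real data set (all names are placeholders).
set.seed(1)
n  <- 500
df <- data.frame(
  group = factor(sample(1:100, n, replace = TRUE)),  # unbalanced groups
  x1    = rbinom(n, 1, 0.5),                         # dichotomous IV
  x2    = runif(n, 0, 2)                             # interval-scaled IV
)
df$y <- rpois(n, lambda = exp(0.2 + 0.5 * df$x1 + 0.3 * df$x2))

## Observation-level random effect (OLRE): one random intercept per row,
## which soaks up extra-Poisson variation.
df$obs_id <- factor(seq_len(nrow(df)))

m <- glmer(y ~ x1 + x2 + (1 | group) + (1 | obs_id),
           family = poisson, data = df,
           control = glmerControl(optimizer = "bobyqa"))

## Negative-binomial alternative that models the overdispersion directly,
## without the OLRE:
## m_nb <- glmer.nb(y ~ x1 + x2 + (1 | group), data = df)
```

With only ~2500 observations spread over ~1000 groups, a model with 16 fixed effects plus two variance components is asking a lot of the data, which may itself explain the convergence failures.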

Here are some questions...

  1. Can I do a multi-level model (random intercept) at all with such different numbers of observations per group? I already know that I'll overestimate the group-level effect.
  2. Did I miss anything obvious – for example, having too many fixed effects for a multi-level model?
  3. Is there another option for working with such data? I considered down-weighting the observations from groups with many observations ...
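To make the weighting idea in item 3 concrete, here is a minimal base-R sketch (toy data; all names are placeholders) in which each group receives equal total weight:

```r
## Down-weight observations from large groups so every group sums to
## weight 1. `df` and `group` are placeholder names, not the real data.
df <- data.frame(group = c("a", "a", "a", "b", "c", "c"))

group_sizes <- table(df$group)                              # obs per group
df$w <- 1 / as.numeric(group_sizes[as.character(df$group)])

tapply(df$w, df$group, sum)  # each group sums to 1
```

Caveat: the `weights` argument of `glm()`/`glmer()` is treated as prior (precision) weights, not survey design weights, so standard errors under this scheme need careful interpretation; the survey package handles design weights properly.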

Any other comments on my problem are, of course, welcome. I am still far from the point where I would consider myself experienced with multi-level models, and this seems to be a non-textbook situation where experience would be really helpful.

kjetil b halvorsen
BurninLeo
  • Maybe it is my ignorance, but how would taking the observation ID as a random effect account for the over-dispersion? – T.E.G. May 02 '18 at 17:38
  • Well, actually that is a strategy I found in several posts (e.g. https://stats.stackexchange.com/a/9670/17749). I did not yet understand the reason myself. Removing the observation number/ID from the formula (well, I tried a lot of modifications ...) does not change anything about the general problems I encounter. Therefore, I assume it's at least not detrimental. – BurninLeo May 03 '18 at 07:48
