We want to model a plant functional group's presence probability (TRUE/FALSE) in six different environmental conditions, based on data taken from over 200 studies, with a binomial glmer from the lme4 package. Most studies have several observations/plots, and various parameters differ between studies, such as sampling method and design, observed taxonomic groups, location, etc. We would like to include study as a random factor to account for these uncontrolled differences.
However, many studies sampled in only one or a few environmental conditions, i.e. most random effect levels contain only some of the fixed effect levels. The model including study as a random effect seems to massively underestimate the functional group's occurrence probability, and possibly also yields misleading effect sizes.
I read this relatively recent question; however, that concerned a clear split-plot design, which is not the case here: Do the levels of a random effect need to be present in all the levels of a fixed effect?
Is it valid to use random effect levels that cover only part of the fixed effect levels? What alternatives are there? Am I applying the random effects incorrectly?
When I remove all random effect levels that contain fewer than 4 out of 6 fixed effect levels (see the construction of excludestudy below), the magnitudes of the model estimates start to make sense, but they are still on the low side. Plus, I discard more than 90% of my data.
I would be very thankful if you could point me towards a solution or explanation here.
To illustrate the story:
# Packages used throughout: lme4 (glmer), emmeans, multcomp (cld), magrittr (%>%)
library(lme4); library(emmeans); library(multcomp); library(magrittr)

# Normal glm for overview
glm(Present ~ Environment, data = db, family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.0562 0.00382 Inf 0.0492 0.0642 1
B 0.0561 0.00468 Inf 0.0476 0.0660 1
C 0.0416 0.00559 Inf 0.0319 0.0541 1
D 0.1221 0.00863 Inf 0.1062 0.1401 2
E 0.0923 0.00827 Inf 0.0773 0.1099 2
F 0.6261 0.03136 Inf 0.5629 0.6852 3
# glmer including study as a random effect. Clear underestimation of the model
# parameters. Can the effect sizes be trusted?
glmer(Present ~ Environment + (1 | Study), data = db, family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 2.97e-05 2.52e-05 Inf 5.65e-06 0.000157 1
B 5.71e-05 4.83e-05 Inf 1.09e-05 0.000299 2
C 6.50e-05 5.78e-05 Inf 1.13e-05 0.000372 123
D 1.17e-04 9.88e-05 Inf 2.22e-05 0.000613 3
E 2.56e-04 2.19e-04 Inf 4.78e-05 0.001371 4
F 4.68e-04 4.18e-04 Inf 8.11e-05 0.002693 4
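To see the conditional-vs-marginal gap, a quick check (a sketch; m1 is just a scratch name for the same fitted model): emmeans predicts for a study whose random intercept is 0, i.e. a median study, whereas the average of the fitted probabilities over all observations stays close to the raw occurrence rate.
# Sketch: store the model and compare fitted probabilities with the raw rate
m1 <- glmer(Present ~ Environment + (1 | Study), data = db, family = binomial)
mean(predict(m1, type = "response"))  # mean fitted probability, incl. study effects
mean(db$Present)                      # raw overall occurrence rate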
# Contingency table showing the number of observations per study and
# environmental condition. Most studies cover only a few environmental conditions.
xtabs(~ Study + Environment, db) %>% head(n = 10)
Environment
Study A B C D E F
study_1 0 0 20 0 0 0
study_2 1 1 1 2 1 0
study_3 1 1 1 2 1 0
study_4 17 0 0 0 0 0
study_5 2 0 0 3 0 0
study_6 180 0 0 0 0 0
study_7 3 0 0 0 0 3
study_8 90 0 0 0 0 0
study_9 0 0 30 0 0 0
study_10 0 0 0 0 0 40
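For reproducibility, the excludestudy vector used below can be built from this contingency table — a minimal sketch, assuming db has the Study and Environment columns shown above:
# Names of studies covering fewer than 4 of the 6 environmental conditions
tab <- xtabs(~ Study + Environment, db)
excludestudy <- rownames(tab)[rowSums(tab > 0) < 4]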
# glmer excluding studies with fewer than 4 environmental conditions.
# The magnitude of the model estimates starts to become realistic.
glmer(Present ~ Environment + (1 | Study), data = db[!db$Study %in% excludestudy, ], family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.00791 0.00862 Inf 0.000927 0.0642 1
B 0.01171 0.01287 Inf 0.001339 0.0948 1
C 0.01908 0.02167 Inf 0.002006 0.1584 12
D 0.03319 0.03511 Inf 0.004002 0.2267 2
E 0.08290 0.08260 Inf 0.010635 0.4319 3
F 0.11860 0.11807 Inf 0.014490 0.5518 3
# glm excluding these studies, for reference.
glm(Present ~ Environment, data = db[!db$Study %in% excludestudy, ], family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.1325 0.0144 Inf 0.1066 0.1634 1
B 0.0635 0.0137 Inf 0.0413 0.0964 2
C 0.1690 0.0445 Inf 0.0986 0.2745 123
D 0.1433 0.0191 Inf 0.1097 0.1850 1
E 0.3200 0.0311 Inf 0.2623 0.3837 3
F 0.8351 0.0377 Inf 0.7476 0.8964 4
UPDATE
Following the suggestions of @RussLenth and https://cran.r-project.org/web/packages/emmeans/vignettes/transformations.html#bias-adj, I added a bias adjustment for the back-transformation via emmeans, based on the random-effect SD extracted via VarCorr. The probabilities now look much more realistic; however, I'm not quite happy with them, as they still seem very low compared with the first glm above. See below.
m <- glmer(Present ~ Environment + (1 | Study), data = db, family = binomial)
VarCorr(m)
Groups Name Std.Dev.
Study (Intercept) 8.3418
m %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(bias.adjust = TRUE, sigma = 8.3418)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.00106 0.000902 Inf 0.000202 0.0056 1
B 0.00204 0.001728 Inf 0.000390 0.0107 2
C 0.00232 0.002070 Inf 0.000406 0.0133 123
D 0.00418 0.003534 Inf 0.000796 0.0219 3
E 0.00915 0.007836 Inf 0.001709 0.0489 4
F 0.01672 0.014925 Inf 0.002902 0.0956 4
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
Bias adjustment applied based on sigma = 8.3418
P value adjustment: tukey method for comparing a family of 6 estimates
Tests are performed on the log odds ratio scale
significance level used: alpha = 0.05
NOTE: Compact letter displays can be misleading
because they show NON-findings rather than findings.
Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
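For context, the sigma-based bias adjustment is a second-order approximation to the mean of a logit-normal distribution; the exact average over the random-intercept distribution can also be computed by numerical integration — a sketch reusing m from above (sig and eta are scratch names):
# Exact mean of plogis(eta + u) with u ~ N(0, sigma^2), per environment
sig <- attr(VarCorr(m)$Study, "stddev")           # random-intercept SD (~8.34)
eta <- summary(emmeans(m, ~ Environment))$emmean  # logit-scale estimates
sapply(eta, function(e)
  integrate(function(u) plogis(e + u) * dnorm(u, sd = sig), -Inf, Inf)$value)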
My questions now are:
1. Can I trust the pairwise comparisons from cld based on a model whose random effect levels are missing large parts of the fixed effect levels? If so, I could use them to report significant differences between environments and show the mean and SD of each environment.
2. Is there another way to correct the back-transformed probabilities? I guess just increasing the sigma value until everything fits nicely is not really a valid option?