We want to model a plant functional group's presence probability (TRUE/FALSE) in six different environmental conditions, based on data taken from over 200 studies, with a binomial glmer from the lme4 package. Most studies have several observations/plots, and various parameters differ between studies, such as sampling method and design, observed taxonomic groups, location, etc. We would like to include study as a random factor to account for these uncontrolled differences.
However, many studies sampled in only one or a few environmental conditions, i.e. most random effect levels contain only some of the fixed effect levels. The model including study as a random effect seems to massively underestimate the functional group's occurrence probability, and possibly also yields misleading effect sizes.
I read this relatively recent question; however, that concerned a clear split-plot design, which is not the case here: Do the levels of a random effect need to be present in all the levels of a fixed effect?
Is it valid to use random effect levels that cover only part of the fixed effect levels? What alternatives are there? Am I applying the random effects incorrectly?
When I remove all random effect levels that contain fewer than 4 out of 6 fixed effect levels (see the construction of excludestudy below), the magnitudes of the model estimates start to make sense, but they are still on the low side. Plus, I discard more than 90% of my data.
I would be very thankful if you could point me towards a solution or explanation here.
To illustrate the story:
# Packages used throughout: lme4 (glmer), emmeans, multcomp (cld), magrittr (%>%)
library(lme4); library(emmeans); library(multcomp); library(magrittr)

# Normal glm for overview
glm(Present ~ Environment, data = db, family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.0562 0.00382 Inf 0.0492 0.0642 1
B 0.0561 0.00468 Inf 0.0476 0.0660 1
C 0.0416 0.00559 Inf 0.0319 0.0541 1
D 0.1221 0.00863 Inf 0.1062 0.1401 2
E 0.0923 0.00827 Inf 0.0773 0.1099 2
F 0.6261 0.03136 Inf 0.5629 0.6852 3
# glmer including study as a random effect. Clear underestimation of the model
# parameters. Can the effect sizes be trusted?
glmer(Present ~ Environment + (1 | Study), data = db, family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 2.97e-05 2.52e-05 Inf 5.65e-06 0.000157 1
B 5.71e-05 4.83e-05 Inf 1.09e-05 0.000299 2
C 6.50e-05 5.78e-05 Inf 1.13e-05 0.000372 123
D 1.17e-04 9.88e-05 Inf 2.22e-05 0.000613 3
E 2.56e-04 2.19e-04 Inf 4.78e-05 0.001371 4
F 4.68e-04 4.18e-04 Inf 8.11e-05 0.002693 4
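To see the conditional-vs-marginal gap, a quick check (a sketch; m1 is just a scratch name for the same fitted model): emmeans predicts for a study whose random intercept is 0, i.e. a median study, whereas the average of the fitted probabilities over all observations stays close to the raw occurrence rate.
# Sketch: store the model and compare fitted probabilities with the raw rate
m1 <- glmer(Present ~ Environment + (1 | Study), data = db, family = binomial)
mean(predict(m1, type = "response"))  # mean fitted probability, incl. study effects
mean(db$Present)                      # raw overall occurrence rate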
# Contingency table showing the number of observations per study and
# environmental condition. Most studies cover only a few environmental conditions.
xtabs(~ Study + Environment, db) %>% head(n = 10)
Environment
Study A B C D E F
study_1 0 0 20 0 0 0
study_2 1 1 1 2 1 0
study_3 1 1 1 2 1 0
study_4 17 0 0 0 0 0
study_5 2 0 0 3 0 0
study_6 180 0 0 0 0 0
study_7 3 0 0 0 0 3
study_8 90 0 0 0 0 0
study_9 0 0 30 0 0 0
study_10 0 0 0 0 0 40
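For reproducibility, the excludestudy vector used below can be built from this contingency table — a minimal sketch, assuming db has the Study and Environment columns shown above:
# Names of studies covering fewer than 4 of the 6 environmental conditions
tab <- xtabs(~ Study + Environment, db)
excludestudy <- rownames(tab)[rowSums(tab > 0) < 4]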
# glmer excluding studies with fewer than 4 environmental conditions.
# The magnitude of the model estimates starts to become realistic.
glmer(Present ~ Environment + (1 | Study), data = db[!db$Study %in% excludestudy, ], family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.00791 0.00862 Inf 0.000927 0.0642 1
B 0.01171 0.01287 Inf 0.001339 0.0948 1
C 0.01908 0.02167 Inf 0.002006 0.1584 12
D 0.03319 0.03511 Inf 0.004002 0.2267 2
E 0.08290 0.08260 Inf 0.010635 0.4319 3
F 0.11860 0.11807 Inf 0.014490 0.5518 3
# glm excluding these studies, for reference.
glm(Present ~ Environment, data = db[!db$Study %in% excludestudy, ], family = binomial) %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(sort = FALSE)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.1325 0.0144 Inf 0.1066 0.1634 1
B 0.0635 0.0137 Inf 0.0413 0.0964 2
C 0.1690 0.0445 Inf 0.0986 0.2745 123
D 0.1433 0.0191 Inf 0.1097 0.1850 1
E 0.3200 0.0311 Inf 0.2623 0.3837 3
F 0.8351 0.0377 Inf 0.7476 0.8964 4
UPDATE
Following the suggestions of @RussLenth and https://cran.r-project.org/web/packages/emmeans/vignettes/transformations.html#bias-adj, I added a bias adjustment for the back-transformation via emmeans, based on the random-effect SD extracted via VarCorr. The probabilities now look much more realistic; however, I'm not quite happy with them, as they still seem very low compared with the first glm above. See below.
m <- glmer(Present ~ Environment + (1 | Study), data = db, family = binomial)
VarCorr(m)
Groups Name Std.Dev.
Study (Intercept) 8.3418
m %>%
  emmeans(specs = ~ Environment, type = "response") %>%
  cld(bias.adjust = TRUE, sigma = 8.3418)
Environment prob SE df asymp.LCL asymp.UCL .group
A 0.00106 0.000902 Inf 0.000202 0.0056 1
B 0.00204 0.001728 Inf 0.000390 0.0107 2
C 0.00232 0.002070 Inf 0.000406 0.0133 123
D 0.00418 0.003534 Inf 0.000796 0.0219 3
E 0.00915 0.007836 Inf 0.001709 0.0489 4
F 0.01672 0.014925 Inf 0.002902 0.0956 4
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
Bias adjustment applied based on sigma = 8.3418
P value adjustment: tukey method for comparing a family of 6 estimates
Tests are performed on the log odds ratio scale
significance level used: alpha = 0.05
NOTE: Compact letter displays can be misleading
because they show NON-findings rather than findings.
Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
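For context, the sigma-based bias adjustment is a second-order approximation to the mean of a logit-normal distribution; the exact average over the random-intercept distribution can also be computed by numerical integration — a sketch reusing m from above (sig and eta are scratch names):
# Exact mean of plogis(eta + u) with u ~ N(0, sigma^2), per environment
sig <- attr(VarCorr(m)$Study, "stddev")           # random-intercept SD (~8.34)
eta <- summary(emmeans(m, ~ Environment))$emmean  # logit-scale estimates
sapply(eta, function(e)
  integrate(function(u) plogis(e + u) * dnorm(u, sd = sig), -Inf, Inf)$value)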
My questions now are:
1. Can I trust the pairwise comparisons from cld based on a model whose random effect levels are missing large parts of the fixed effect levels? If so, I could use them to report significant differences between environments and show the mean and SD of each environment.
2. Is there another way to correct the back-transformed probabilities? I guess just increasing the sigma value until everything fits nicely is not really a valid option?