How many groups are needed to reliably estimate variance parameters of random effects in a GLMM?

Question

I am looking at panel data with a binary outcome in each year. The model I build will ultimately be used for prediction. The cross sections are quite tall (~100,000 non-cases and ~5,000-20,000 cases), but there are only a few years. I suspect that there may be both a time-varying intercept and one time-varying coefficient, so I am using lme4::glmer. My questions are as follows:

1. Can I estimate the three co-variance parameters despite a low number of years (levels of the random effect factor)? In particular, does it matter that I have tall cross sections (a large number of repeated measures)?
2. Should I suspect that the results are unreliable? Are there general guidelines for the number of levels needed to estimate co-variance parameters (references would be welcome)? Again, does it matter that I have tall cross sections (a large number of repeated measures)?

Here is an example to make this clearer:

##### 
# simulate data
set.seed(71005391)
n_lvls    <- 16  # few number of year levels
n_per_lvl <- 1e4 # tall cross section
n <- n_lvls * n_per_lvl

# simulate covariates and random effects
X <- matrix(rnorm(n * 2), n, 2)

year   <- as.integer(gl(n_lvls, n_per_lvl))
Q      <- matrix(c(.3, .133, .133, .2), 2)
ranefs <- matrix(rnorm(n_lvls * 2), ncol = 2) %*% chol(Q)

# compute linear predictor and simulate outcome
lps <- 
  # fixed effects
  -1 +  X[, 1] + X[, 2] +  
  # random effects
  ranefs[year, 1] + ranefs[year, 2] * X[, 2]
df <- data.frame(Y = 1 / (1 + exp(-lps)) > runif(n), X, year = year)

#####
# fit model
library(lme4)
fit <- glmer(Y ~ X1 + X2 + (X2 | year), data = df, family = binomial())

# show estimates
list(est = VarCorr(fit), actual_var = Q, actual_cor = cov2cor(Q))
#R> $est
#R>  Groups Name        Std.Dev. Corr 
#R>  year   (Intercept) 0.43682       
#R>         X2          0.43690  0.709
#R> 
#R> $actual_var
#R>       [,1]  [,2]
#R> [1,] 0.300 0.133
#R> [2,] 0.133 0.200
#R> 
#R> $actual_cor
#R>           [,1]      [,2]
#R> [1,] 1.0000000 0.5429702
#R> [2,] 0.5429702 1.0000000

Here are profile likelihood confidence intervals for the variance parameters:

conf. <- confint.merMod(
  fit, method = "profile", quiet = FALSE, oldNames = FALSE,
  parm = "theta_", parallel = "snow", ncpus = 7, verbose = TRUE)
#R> Computing profile confidence intervals ...
#R> Warning messages:
#R> 1: In if (parm == "theta_") { :
#R>   the condition has length > 1 and only the first element will be used
#R> 2: In if (parm == "beta_") { :
#R>   the condition has length > 1 and only the first element will be used
conf.
#R>                             2.5 %    97.5 %
#R> sd_(Intercept)|year     0.3195899 0.6469012
#R> cor_X2.(Intercept)|year 0.3642638 0.8832762
#R> sd_X2|year              0.3194962 0.6472190

My intuition is that we would need many levels to estimate the co-variance parameters, just as we need many observations to estimate the coefficients in a pure fixed effects model. Three co-variance parameters relative to the 16 levels above does seem like a lot. However, I gather that the large amount of information in each cross section may help?
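As a rough, best-case illustration of that intuition (a sketch, not part of the analysis above): suppose the 16 year-level effects were observed without any error, which is the limit of an infinitely tall cross section. Assuming normally distributed effects, the sample variance then follows a scaled chi-square distribution with 15 degrees of freedom, so the number of levels alone caps the precision. The names `n_lvls_demo`, `sd_ratio` and `rel_se_var` are illustrative only.

```r
# Best case: the random effects themselves are observed exactly, so the only
# uncertainty left comes from having few levels (assumes normal effects).
n_lvls_demo <- 16
df_chi <- n_lvls_demo - 1  # one degree of freedom is spent on the mean

# central 95% range of (estimated SD / true SD)
sd_ratio <- sqrt(qchisq(c(.025, .975), df_chi) / df_chi)
round(sd_ratio, 2)
#R> [1] 0.65 1.35

# approximate relative standard error of the variance estimate
rel_se_var <- sqrt(2 / df_chi)
round(rel_se_var, 2)
#R> [1] 0.37
```

Under these assumptions the estimated standard deviation would fall roughly between 0.65 and 1.35 times its true value no matter how tall each cross section is. More observations per year can only move the estimate closer to this limit.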

(+1) From what I gather people do in practice, 16 levels is actually quite a lot :-) — amoeba, Mar 05 '18 at 20:25
I would vote to close as a duplicate of this https://stats.stackexchange.com/questions/37647 but it's not possible to close a bountied question. See if that thread answers your Q. — amoeba, Mar 07 '18 at 13:20
@amoeba The question [you link to](stats.stackexchange.com/questions/37647) is different in that there are few observations for each level there. E.g., "_Does the fact that I have quite a few repeated measurements for each subject help in this regard (I don't see how it matters)?_" Further, my question concerns two random effects and not one. — Benjamin Christoffersen, Mar 07 '18 at 22:40
OK, I appreciate these two differences but perhaps you could edit the question to reflect that you have already studied https://stats.stackexchange.com/questions/37647 and have some additional questions. I think the number of repeated measures does not matter at all. The question about more than 1 random effects makes sense; I guess one cannot reliably estimate variance-covariance matrix of e.g. 100 random effects with only 16 groups. — amoeba, Mar 07 '18 at 22:49

Martin Modrák · Answer 1 · 2018-03-07T13:14:21.130

You can actually dodge the question with a fully Bayesian approach. If you go full Bayes, the question is not "can I estimate X?" (which is basically always true) but "how precisely can I estimate X?". A reliable estimate of the uncertainty in X is part of the result of fitting the model. In this sense, fitting an overly complex Bayesian model is safe: you fit the model, and if the posterior uncertainty in some of the parameters is too large, you know you need more data or a simpler model. This is in contrast to lme4 where, AFAIK, estimates can be unreliable and an overly complex model may overfit.

Also note that the posterior uncertainty is not simply a function of the size of the dataset, but also of its content: if the groups (years) are very similar, the uncertainty will be smaller than if they differ a lot. Further, how much uncertainty is acceptable depends on your intended use of the model's results, so I don't think you can make a good general rule for how much data you need.

If you want to use full Bayes, rstanarm provides methods that are almost drop-in replacements for lme4 (see the vignette). It is, however, possible that rstanarm will be too slow for your dataset (hard to guess without actually running it). If so, INLA will give you almost the same results with much less computing power required (and should be able to handle your model with little or no modification).
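A minimal sketch of what the swap could look like, assuming rstanarm is installed. The small simulated data set `d_demo`, the sampler settings, and the use of the default priors are illustrative assumptions, not a recommendation for the real data.

```r
# Hypothetical small data set with the same structure as in the question
set.seed(1)
n_lvls_demo <- 16; n_per_demo <- 50
year_demo <- rep(seq_len(n_lvls_demo), each = n_per_demo)
x_demo <- rnorm(n_lvls_demo * n_per_demo)
u_demo <- rnorm(n_lvls_demo, sd = .5)  # random intercepts per year
d_demo <- data.frame(
  Y = as.integer(plogis(-1 + x_demo + u_demo[year_demo]) > runif(length(x_demo))),
  X2 = x_demo, year = year_demo)

if (requireNamespace("rstanarm", quietly = TRUE)) {
  # same formula syntax as lme4::glmer; default priors are used
  bfit <- rstanarm::stan_glmer(
    Y ~ X2 + (X2 | year), data = d_demo, family = binomial(),
    chains = 2, iter = 1000, seed = 1, refresh = 0)
  # posterior intervals for the elements of the random-effect covariance matrix
  print(rstanarm::posterior_interval(bfit, prob = 0.95, regex_pars = "^Sigma"))
}
```

Wide intervals for the `Sigma` elements would be the signal, in the sense described above, that the data do not pin down the co-variance parameters well enough for the intended use.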

The classical reference is Gelman et al.: Bayesian Data Analysis, 3rd Edition

Thanks for your reply. I know I can put a prior on my co-variance matrix. I do not see how your answer relates to: 1) _Can I estimate the three co-variance parameters despite a low number of years?_ It may partly answer 2) _Should I suspect that the results are unreliable? Are there general guidelines for the number of needed levels to estimate co-variance parameters?_ in that I can look at the posterior of the co-variance parameters. — Benjamin Christoffersen, Mar 07 '18 at 12:20
Sorry, I guess my point was not clear. With a fully Bayesian approach, the question is not "can I estimate X?" (which is basically always true) but "how precisely can I estimate X?" and a reliable answer to the latter question is part of the result of fitting the model. In this sense, fitting an overly complex Bayesian model is safe: you fit the model, and if the posterior uncertainty is too large (for your intended use), you know you need more data or a simpler model. This is in contrast to lme4 where, AFAIK, estimates can be unreliable and an overly complex model may overfit. — Martin Modrák, Mar 07 '18 at 13:07

score 2 · Answer 2 · answered Mar 08 '18 at 08:25

Following this answer, one can look at Ben Bolker's GLMM FAQ. The following comment from it is relevant here:

Treating factors with small numbers of levels as random will in the best case lead to very small and/or imprecise estimates of random effects; in the worst case it will lead to various numerical difficulties such as lack of convergence, zero variance estimates, etc.. (A small simulation exercise shows that at least the estimates of the standard deviation are downwardly biased in this case; it’s not clear whether/how this bias would affect the point estimates of fixed effects or their estimated confidence intervals.) In the classical method-of-moments approach these problems may not arise (because the sums of squares are always well defined as long as there are at least two units), but the underlying problems of lack of power are there nevertheless.

The following simulation exercise confirms the downward bias when we have large cross sections:

##### 
# simulate data
set.seed(71005391)
n_lvls    <- 16  # few number of year levels
n_per_lvl <- 1e4 # tall cross section
n <- n_lvls * n_per_lvl
Q <- matrix(c(.3, .133, .133, .2), 2)
year   <- as.integer(gl(n_lvls, n_per_lvl))

# run simulation
library(parallel)
cl <- makeCluster(6)

clusterExport(cl, names(environment())[names(environment()) != "cl"])
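# give each worker its own reproducible RNG stream; set.seed() on the
# master does not seed the workers
clusterSetRNGStream(cl, 4438182)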
out <- parLapply(cl, 1:1000, function(...){
  library(lme4)

  # simulate covariates and random effects
  X <- matrix(rnorm(n * 2), n, 2)
  ranefs <- matrix(rnorm(n_lvls * 2), ncol = 2) %*% chol(Q)

  # compute linear predictor and simulate outcome
  lps <- 
    # fixed effects
    -3 +  X[, 1] + X[, 2] +  
    # random effects
    ranefs[year, 1] + ranefs[year, 2] * X[, 2]
  df <- data.frame(Y = 1 / (1 + exp(-lps)) > runif(n), X, year = year)

  #####
  # fit model  
  fit <- glmer(Y ~ X1 + X2 + (X2 | year), data = df, family = binomial())

  matrix(unlist(VarCorr(fit)), 2)
})

stopCluster(cl)

# make histograms and look at bias
covars <- array(
  unlist(out), dim = c(nrow(out[[1]]), ncol(out[[1]]), length(out)))
sfun <- function(x, actual) {
  r <- list(mean = mean(x), sem = sd(x)/sqrt(length(x)))
  r <- with(r, c(r, list(lwr = mean - 2 * sem, upr = mean + 2 * sem, 
                         bias = (mean - actual) / actual)))
  unlist(r)
}

for(v in list(c(1, 1), c(2, 1), c(2, 2))){
  lab <- bquote(Q[.(v[1])][.(v[2])])
  x <- apply(covars, 3, "[", i = v[1], j = v[2])
  actual <- Q[v[1], v[2]]
  hist(x, breaks = 25, xlim = range(x, 0), xlab = lab, main = lab)
  abline(v = actual, lty = 2, col = "red")
  cat("Stats for element ", v[1], ", ", v[2], "\n", sep = "")
  print(sfun(x, actual))
}
#R> Stats for element 1, 1
#R>         mean          sem          lwr          upr         bias 
#R>  0.269797815  0.003323067  0.263151682  0.276443949 -0.100673949 
#R> Stats for element 2, 1
#R>         mean          sem          lwr          upr         bias 
#R>  0.113846305  0.002184928  0.109476449  0.118216162 -0.144012742 
#R> Stats for element 2, 2
#R>         mean          sem          lwr          upr         bias 
#R>  0.186357075  0.002265719  0.181825637  0.190888512 -0.068214626

The three histograms are shown below. The red lines are the actual values

So I figure this somewhat answers question 1,

Can I estimate the three co-variance parameters despite a low number of years (levels of the random effect factor)?

with a "yes", but there is a downward bias and the uncertainty does seem large.

Decreasing n_per_lvl to 50 yields

#R> Stats for element 1, 1
#R>        mean         sem         lwr         upr        bias 
#R> 0.328784127 0.008340519 0.312103088 0.345465165 0.095947088 
#R> Stats for element 2, 1
#R>        mean         sem         lwr         upr        bias 
#R>  0.11253616  0.00428645  0.10396326  0.12110906 -0.15386345 
#R> Stats for element 2, 2
#R>        mean         sem         lwr         upr        bias 
#R> 0.237205976 0.006270416 0.224665143 0.249746809 0.186029879

and here is one of the histograms

So the size of the cross section does seem to matter.

I still have not addressed the second question.

A similar simulation with `n_per_lvl = 1e4` and only a random intercept effect with variance 0.4 also yields a bias. The result is `0.380294401 0.004070553 0.372153295 0.388435507 -0.049263998` (the last figure is the bias). — Benjamin Christoffersen, Mar 08 '18 at 21:00
