The general approach to simulating data for a mixed model is as follows:
- Create the variable(s) for the fixed effects
- Create the variable(s) for the group(s)
- Specify the fixed effects coefficients; these form a column vector $\beta$
- Create a model matrix, $X$, for the fixed effects
- Simulate the random effects from the given variances and covariances. In mixed model theory these are typically multivariate normal, but there is no requirement for this when we are simulating the data.
- Create a model matrix, $Z$, for the random effects
- Simulate a residual error, $e$, from some distribution. This is typically a normally distributed variable with a given (constant) variance, but again, when we are simulating the data we could use any distribution we want: the variance could be a function of the fixed effects, the errors could be autocorrelated, or they could arise from a more complex process.
- Use the general mixed model formula: $y = X \beta + Zu + e$ to simulate the outcome $y$
This completes the necessary steps to simulate data for a mixed model.
The above steps are deliberately general. Unfortunately the devil is in the details. Step 6, in particular, can be very tricky. The only way to properly understand it all is to actually do it. I will go through an example from start to finish with a small dataset, without the need for any software or package. To start, let us have:
- One grouping variable, $G$, with 3 levels: A, B and C
- A fixed effect for $a$, a continuous variable, taking the values from 1 to 4.
- Random intercepts for $G$, and random slopes for $a$ with a correlation between them of $\rho$
- A balanced design such that each group has every value of $a$ exactly once, so that we have 12 observations in total.
Following steps 1 and 2 above, the dataset will be:
```
   G a
1  A 1
2  B 1
3  C 1
4  A 2
5  B 2
6  C 2
7  A 3
8  B 3
9  C 3
10 A 4
11 B 4
12 C 4
```
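As a minimal sketch, this balanced layout can be built in base R; `expand.grid` varies its first argument fastest, which reproduces the row order above:

```r
# Sketch: build the balanced design in base R.
# expand.grid() cycles G within each value of a, matching the
# row order shown above.
dt <- expand.grid(G = c("A", "B", "C"), a = 1:4)
dt
```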
In step 3 we specify the fixed effects coefficients. Here we will have a fixed intercept as well as the fixed effect for $a$, so there will be two values; let us say they are 3.1 and 1.8. Thus
$$
\beta = \begin{bmatrix}
3.1 \\
1.8
\end{bmatrix}
$$
In step 4, we form the model matrix $X$ for the fixed effects. Its purpose is to map the fixed effect coefficients onto the outcome variable: each row of $X$ multiplies $\beta$ to give that row's fixed-effects contribution to the outcome $y$. So the first column of $X$ will be all 1s for the intercept, so that each row receives the same value (3.1), and the second column will contain the values of $a$, which will be multiplied by the fixed effect coefficient for $a$ (1.8). Thus we will have:
$$
X = \begin{bmatrix}
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 2 \\
1 & 2 \\
1 & 2 \\
1 & 3 \\
1 & 3 \\
1 & 3 \\
1 & 4 \\
1 & 4 \\
1 & 4
\end{bmatrix}
$$
It is then easy to see that when we form the product $X\beta$, $X$ maps the correct values into the result. For example, for row 1 we will have $1 \times 3.1 + 1 \times 1.8 = 4.9$ and for the last row we will have $1 \times 3.1 + 4 \times 1.8 = 10.3$.
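As a quick check (a sketch assuming the `dt` data frame built above), `model.matrix()` produces exactly this $X$, and the product recovers the two hand-computed rows:

```r
# Sketch: X via model.matrix(), then the fixed-effects contribution.
X <- model.matrix(~ a, dt)  # intercept column of 1s, plus a
beta <- c(3.1, 1.8)
X %*% beta  # row 1: 4.9, row 12: 10.3, as computed by hand
```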
In step 5 we simulate the random effects. For simplicity, let us assume they follow a multivariate normal distribution. Let us say that the random intercepts have a standard deviation of 2.1 and the random slopes a standard deviation of 1.8, with a correlation, $\rho$, of 0.5 between them, and that both have a mean of zero. The covariance between intercepts and slopes is then $\rho \times 2.1 \times 1.8 = 1.89$, so the random effects will be distributed:
$$
u \sim \mathcal{N}\left(0, \begin{bmatrix}
2.1^2 & 1.89 \\
1.89 & 1.8^2
\end{bmatrix} \right)
$$
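As a sketch in R, this covariance matrix can be assembled directly from the standard deviations and the correlation:

```r
# Sketch: covariance matrix from SDs and correlation.
s1 <- 2.1; s2 <- 1.8; rho <- 0.5
covmat <- matrix(c(s1^2,          rho * s1 * s2,
                   rho * s1 * s2, s2^2), 2, 2)
covmat           # diagonal: 4.41, 3.24; off-diagonal: 1.89
cov2cor(covmat)  # recovers rho = 0.5 off the diagonal
```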
So we need to sample 3 times from this distribution (corresponding to the 3 levels in our grouping variable, $G$), and let us say that we obtain:
$$
u = \begin{bmatrix}
2.4 & 0.8 \\
-0.9 & 1.3 \\
-1.5 & -2.1
\end{bmatrix}
$$
where the first column contains the random intercepts, which we will call $u_1$, and the second column contains the random slopes, which we will call $u_2$.
Now for the tricky part. In step 6 we form the model matrix $Z$ for the random effects. As with $X$, the purpose of this matrix is to map the correct values of the random effects $u$ to the outcome for each row in the data. Since we have one set of random intercepts and one set of random slopes, it is convenient to split $Z$ in two. First we consider the random intercepts. Each group has its own intercept, and these are in $u_1$:
$$
u_1 = \begin{bmatrix}
2.4 \\
-0.9 \\
-1.5
\end{bmatrix}
$$
So group A has an intercept of 2.4, group B has an intercept of -0.9 and group C has an intercept of -1.5. Now we need to bear in mind the structure of the dataset. It is reproduced again here:
```
   G a
1  A 1
2  B 1
3  C 1
4  A 2
5  B 2
6  C 2
7  A 3
8  B 3
9  C 3
10 A 4
11 B 4
12 C 4
```
It should therefore be easy to see that $Z_1$ has to have the following structure to match that of the dataset and map the correct values into the result:
$$
Z_1= \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$
so that when we form the product $Z_1 u_1$ we get, for example, for the first row (group A): $(2.4 \times 1) + (-0.9 \times 0) + (-1.5 \times 0) = 2.4$, and likewise for rows 4, 7 and 10. Applying the same logic for groups B and C, we can see that they always receive -0.9 and -1.5 respectively.
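As a sketch (reusing `dt` from above), $Z_1$ is just the indicator (dummy) matrix for $G$, which `model.matrix()` produces when the intercept is suppressed:

```r
# Sketch: Z1 as the indicator matrix for G.
Z1 <- model.matrix(~ 0 + G, dt)
u1 <- c(2.4, -0.9, -1.5)  # random intercepts from the example
Z1 %*% u1  # rows for A get 2.4, B get -0.9, C get -1.5
```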
For the random slopes things get a little more tricky. We have
$$
u_2 = \begin{bmatrix}
0.8 \\
1.3 \\
-2.1
\end{bmatrix}
$$
So the random slope for group A for variable $a$ is 0.8. This is a linear slope, so it means that the values of $a$ must be multiplied by 0.8. For group B the values of $a$ must be multiplied by 1.3, and for group C they must be multiplied by -2.1. Again, noting the structure of the dataset above, $Z_2$ will accomplish this mapping with the following structure:
$$
Z_2 = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
2 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 2 \\
3 & 0 & 0 \\
0 & 3 & 0 \\
0 & 0 & 3 \\
4 & 0 & 0 \\
0 & 4 & 0 \\
0 & 0 & 4
\end{bmatrix}
$$
If we again consider group A, which has a random slope of 0.8: the first row, where $a=1$, contributes $0.8 \times 1 + 1.3 \times 0 + (-2.1 \times 0) = 0.8$; the 4th row, where $a=2$, contributes $0.8 \times 2 + 1.3 \times 0 + (-2.1 \times 0) = 1.6$; the 7th row, where $a=3$, contributes $0.8 \times 3 + 1.3 \times 0 + (-2.1 \times 0) = 2.4$; and the 10th row, where $a=4$, contributes $0.8 \times 4 + 1.3 \times 0 + (-2.1 \times 0) = 3.2$. Again, the same logic applies for groups B and C.
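Likewise, a sketch of $Z_2$: it is the same indicator matrix with each row scaled by that row's value of $a$:

```r
# Sketch: Z2 scales the indicator matrix by a, row by row.
Z2 <- model.matrix(~ 0 + G, dt) * dt$a
u2 <- c(0.8, 1.3, -2.1)  # random slopes from the example
Z2 %*% u2  # e.g. rows 1, 4, 7, 10 (group A): 0.8, 1.6, 2.4, 3.2
```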
If we wish, we could combine $Z_1$ and $Z_2$ to form $Z$, and $u_1$ and $u_2$ to form $u$, and this could be done in many ways (one way is sketched below). But all we really have to do to complete the simulation is to sample from some distribution to obtain $e$ and then compute $y = X\beta + Z_1u_1 + Z_2u_2 + e$.
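For example, one simple way (a sketch, assuming the objects from the previous snippets) is to column-bind $Z_1$ and $Z_2$ and stack $u_1$ and $u_2$ in the matching order:

```r
# Sketch: one of many ways to combine the pieces.
Z <- cbind(Z1, Z2)          # 12 x 6
u <- c(u1, u2)              # intercepts first, then slopes
e <- rnorm(nrow(dt), 0, 2)  # residual error; SD of 2 is an arbitrary choice
y <- X %*% beta + Z %*% u + e
```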
Edit: to address Erik's request for R code to demonstrate the above.
I would never suggest forming $Z$ by hand / from scratch in all but the simplest of models. But here I will do so, and also check that the resulting data are consistent with using software to create $Z$:
```r
set.seed(15)
n.group <- 3  # number of groups
dt <- expand.grid(G = LETTERS[1:n.group], a = 1:4)
X <- model.matrix(~ a, dt)              # model matrix for fixed effects
betas <- c(3.1, 1.8)                    # fixed effects coefficient vector
Z1 <- model.matrix(~ 0 + G, dt)         # model matrix for random intercepts
Z2 <- model.matrix(~ 0 + G, dt) * dt$a  # model matrix for random slopes
```
Here I have created $Z_1$ and $Z_2$ "manually", as per the main part of my answer.
```r
s1 <- 2.1   # SD of random intercepts
s2 <- 1.8   # SD of random slopes
rho <- 0.5  # correlation between intercepts and slopes
cormat <- matrix(c(s1, rho, rho, s2), 2, 2)  # SD/correlation matrix (SDs on the diagonal)
covmat <- lme4::sdcor2cov(cormat)            # covariance matrix (needed for mvrnorm)
umat <- MASS::mvrnorm(n.group, c(0, 0), covmat, empirical = TRUE)  # simulate the random effects
u1 <- umat[, 1]
u2 <- umat[, 2]
e <- rnorm(nrow(dt), 0, 2)  # residual error
dt$Y_manual <- X %*% betas + Z1 %*% u1 + Z2 %*% u2 + e
```
So we have simulated Y from manually created $Z$ matrices. Now let's use lme4 to create $Z$:
```r
library(lme4)
lForm <- lFormula(Y_manual ~ a + (a | G), dt)  # lme4's function to process a model formula
Z <- t(as.matrix(lForm$reTrms$Zt))             # extract the Z matrix
u <- c(rbind(umat[, 1], umat[, 2]))  # lme4 orders the random effects interleaved by group (intercept, slope)
dt$Y <- X %*% betas + Z %*% u + e
dt
```
```
   G a   Y_manual          Y
1  A 1   4.347903   4.347903
2  B 1   4.039412   4.039412
3  C 1   8.275563   8.275563
4  A 2   4.788965   4.788965
5  B 2   3.301834   3.301834
6  C 2  10.839260  10.839260
7  A 3   9.906717   9.906717
8  B 3  -1.159811  -1.159811
9  C 3  17.517209  17.517209
10 A 4  12.205023  12.205023
11 B 4   1.017939   1.017939
12 C 4  17.692258  17.692258
```
So, as we can see, we obtain exactly the same simulated values for the outcome with the manual method and by using lme4's lFormula function.
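Since both columns were computed with the same random draws, the match can also be verified programmatically:

```r
# Check that the manual and lFormula-based simulations agree.
all.equal(as.numeric(dt$Y), as.numeric(dt$Y_manual))  # TRUE
```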
Now let's try actually fitting the model:
```r
m0 <- lmer(Y ~ a + (a | G), dt)
summary(m0)
```
```
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 G        (Intercept) 1.852    1.361
          a           6.338    2.518    -0.44
 Residual             3.038    1.743
Number of obs: 12, groups:  G, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)    3.557      1.462   2.433
a              1.670      1.522   1.097
```
Surprisingly, it converges without a warning, and the estimates are not too bad considering the sample size!
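To go beyond a single dataset, one could wrap the simulation in a function and repeat it. A minimal sketch, reusing the objects defined above (with only 3 groups, some replicates may trigger singular-fit or convergence warnings):

```r
# Sketch: repeat the simulation to see how the fixed-effect
# estimates behave on average.
sim_once <- function() {
  u <- MASS::mvrnorm(n.group, c(0, 0), covmat)  # fresh random effects
  e <- rnorm(nrow(dt), 0, 2)                    # fresh residual error
  dt$Y <- X %*% betas + Z1 %*% u[, 1] + Z2 %*% u[, 2] + e
  fixef(lmer(Y ~ a + (a | G), dt))
}
est <- replicate(200, sim_once())
rowMeans(est)  # should be close to c(3.1, 1.8)
```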