The general approach to simulating data for a mixed model is as follows:
- Create the variable(s) for the fixed effects
- Create the variable(s) for the group(s)
- Specify the fixed effects coefficients; these form a column vector $\beta$
- Create a model matrix, $X$, for the fixed effects
- Simulate the random effects from the given variances and covariances. In mixed model theory these are typically multivariate normal, but there is no requirement for this when we are simulating the data.
- Create a model matrix, $Z$, for the random effects
- Simulate a residual error, $e$, from some distribution. This is typically a normally distributed variable with a given (constant) variance, but again, when we are simulating the data we could use any distribution we want: the variance could be a function of the fixed effects, the errors could be autocorrelated, or they could arise from a more complex process.
- Use the general mixed model formula: $y = X \beta + Zu + e$ to simulate the outcome $y$
This completes the necessary steps to simulate data for a mixed model.
The above steps are deliberately general. Unfortunately the devil is in the details. Step 6, in particular, can be very tricky. The only way to properly understand it all is to actually do it. I will go through an example from start to finish with a small dataset, without the need for any software or package. To start, let us have:
- One grouping variable, $G$, with 3 levels: A, B and C
- A fixed effect for $a$, a continuous variable, taking the values from 1 to 4.
- Random intercepts for $G$, and random slopes for $a$ with a correlation between them of $\rho$
- A balanced design such that each group has every value of $a$ exactly once, so that we have 12 observations in total.
Following steps 1 and 2 above, the dataset will be:
```
   G a
1  A 1
2  B 1
3  C 1
4  A 2
5  B 2
6  C 2
7  A 3
8  B 3
9  C 3
10 A 4
11 B 4
12 C 4
```
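As a minimal sketch, this balanced layout can be built in base R; `expand.grid` varies its first argument fastest, which reproduces the row order above:

```r
# Sketch: build the balanced design in base R.
# expand.grid() cycles G within each value of a, matching the
# row order shown above.
dt <- expand.grid(G = c("A", "B", "C"), a = 1:4)
dt
```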
In step 3 we specify the fixed effects coefficients. Here we will have a fixed intercept as well as the fixed effect for $a$, so there will be two values; let us say they are 3.1 and 1.8. Thus
$$
\beta = \begin{bmatrix}
3.1 \\
1.8
\end{bmatrix}
$$
In step 4, we form the model matrix $X$ for the fixed effects. Its purpose is to map the fixed effect coefficients onto the outcome variable: each row of $X$ multiplies $\beta$ to give that row's fixed-effects contribution to the outcome $y$. So the first column of $X$ will be all 1s for the intercept, so that each row receives the same value (3.1), and the second column will contain the values of $a$, which will be multiplied by the fixed effect coefficient for $a$ (1.8). Thus we will have:
$$
X = \begin{bmatrix}
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 2 \\
1 & 2 \\
1 & 2 \\
1 & 3 \\
1 & 3 \\
1 & 3 \\
1 & 4 \\
1 & 4 \\
1 & 4
\end{bmatrix}
$$
It is then easy to see that when we form the product $X\beta$, $X$ maps the correct values into the result. For example, for row 1 we will have $1 \times 3.1 + 1 \times 1.8 = 4.9$ and for the last row we will have $1 \times 3.1 + 4 \times 1.8 = 10.3$.
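As a quick check (a sketch assuming the `dt` data frame built above), `model.matrix()` produces exactly this $X$, and the product recovers the two hand-computed rows:

```r
# Sketch: X via model.matrix(), then the fixed-effects contribution.
X <- model.matrix(~ a, dt)  # intercept column of 1s, plus a
beta <- c(3.1, 1.8)
X %*% beta  # row 1: 4.9, row 12: 10.3, as computed by hand
```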
In step 5 we simulate the random effects. For simplicity, let us assume they follow a multivariate normal distribution. Let us say that the random intercepts have a standard deviation of 2.1 and the random slopes a standard deviation of 1.8, with a correlation, $\rho$, of 0.5 between them, and that both have a mean of zero. The covariance between intercepts and slopes is then $\rho \times 2.1 \times 1.8 = 1.89$, so the random effects will be distributed:
$$
u \sim \mathcal{N}\left(0, \begin{bmatrix}
2.1^2 & 1.89 \\
1.89 & 1.8^2
\end{bmatrix} \right)
$$
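As a sketch in R, this covariance matrix can be assembled directly from the standard deviations and the correlation:

```r
# Sketch: covariance matrix from SDs and correlation.
s1 <- 2.1; s2 <- 1.8; rho <- 0.5
covmat <- matrix(c(s1^2,          rho * s1 * s2,
                   rho * s1 * s2, s2^2), 2, 2)
covmat           # diagonal: 4.41, 3.24; off-diagonal: 1.89
cov2cor(covmat)  # recovers rho = 0.5 off the diagonal
```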
So we need to sample 3 times from this distribution (corresponding to the 3 levels in our grouping variable, $G$), and let us say that we obtain:
$$
u = \begin{bmatrix}
2.4 & 0.8 \\
-0.9 & 1.3 \\
-1.5 & -2.1
\end{bmatrix}
$$
where the first column contains the random intercepts, which we will call $u_1$, and the second column contains the random slopes, which we will call $u_2$.
Now for the tricky part. In step 6 we form the model matrix $Z$ for the random effects. As with $X$, the purpose of this matrix is to map the correct values of the random effects $u$ to the outcome for each row in the data. Since we have one set of random intercepts and one set of random slopes, it is convenient to split $Z$ in two. First we consider the random intercepts. Each group has its own intercept, and these are in $u_1$:
$$
u_1 = \begin{bmatrix}
2.4 \\
-0.9 \\
-1.5
\end{bmatrix}
$$
So group A has an intercept of 2.4, group B has an intercept of -0.9 and group C has an intercept of -1.5. Now we need to bear in mind the structure of the dataset. It is reproduced again here:
```
   G a
1  A 1
2  B 1
3  C 1
4  A 2
5  B 2
6  C 2
7  A 3
8  B 3
9  C 3
10 A 4
11 B 4
12 C 4
```
It should therefore be easy to see that $Z_1$ has to have the following structure to match that of the dataset and map the correct values into the result:
$$
Z_1= \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$
so that when we form the product $Z_1 u_1$ we get, for example, for the first row (group A): $(2.4 \times 1) + (-0.9 \times 0) + (-1.5 \times 0) = 2.4$, and likewise for rows 4, 7 and 10. Applying the same logic for groups B and C, we can see that they always receive -0.9 and -1.5 respectively.
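As a sketch (reusing `dt` from above), $Z_1$ is just the indicator (dummy) matrix for $G$, which `model.matrix()` produces when the intercept is suppressed:

```r
# Sketch: Z1 as the indicator matrix for G.
Z1 <- model.matrix(~ 0 + G, dt)
u1 <- c(2.4, -0.9, -1.5)  # random intercepts from the example
Z1 %*% u1  # rows for A get 2.4, B get -0.9, C get -1.5
```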
For the random slopes things get a little more tricky. We have
$$
u_2 = \begin{bmatrix}
0.8 \\
1.3 \\
-2.1
\end{bmatrix}
$$
So the random slope for group A for variable $a$ is 0.8. This is a linear slope, so it means that the values of $a$ must be multiplied by 0.8. For group B the values of $a$ must be multiplied by 1.3, and for group C they must be multiplied by -2.1. Again, noting the structure of the dataset above, $Z_2$ will accomplish this mapping with the following structure:
$$
Z_2 = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
2 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 2 \\
3 & 0 & 0 \\
0 & 3 & 0 \\
0 & 0 & 3 \\
4 & 0 & 0 \\
0 & 4 & 0 \\
0 & 0 & 4
\end{bmatrix}
$$
If we again consider group A, which has a random slope of 0.8: the first row, where $a=1$, contributes $0.8 \times 1 + 1.3 \times 0 + (-2.1 \times 0) = 0.8$; the 4th row, where $a=2$, contributes $0.8 \times 2 + 1.3 \times 0 + (-2.1 \times 0) = 1.6$; the 7th row, where $a=3$, contributes $0.8 \times 3 + 1.3 \times 0 + (-2.1 \times 0) = 2.4$; and the 10th row, where $a=4$, contributes $0.8 \times 4 + 1.3 \times 0 + (-2.1 \times 0) = 3.2$. Again, the same logic applies for groups B and C.
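Likewise, a sketch of $Z_2$: it is the same indicator matrix with each row scaled by that row's value of $a$:

```r
# Sketch: Z2 scales the indicator matrix by a, row by row.
Z2 <- model.matrix(~ 0 + G, dt) * dt$a
u2 <- c(0.8, 1.3, -2.1)  # random slopes from the example
Z2 %*% u2  # e.g. rows 1, 4, 7, 10 (group A): 0.8, 1.6, 2.4, 3.2
```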
If we wish, we could combine $Z_1$ and $Z_2$ to form $Z$, and $u_1$ and $u_2$ to form $u$, and this could be done in many ways (one way is sketched below). But all we really have to do to complete the simulation is to sample from some distribution to obtain $e$ and then compute $y = X\beta + Z_1u_1 + Z_2u_2 + e$.
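For example, one simple way (a sketch, assuming the objects from the previous snippets) is to column-bind $Z_1$ and $Z_2$ and stack $u_1$ and $u_2$ in the matching order:

```r
# Sketch: one of many ways to combine the pieces.
Z <- cbind(Z1, Z2)          # 12 x 6
u <- c(u1, u2)              # intercepts first, then slopes
e <- rnorm(nrow(dt), 0, 2)  # residual error; SD of 2 is an arbitrary choice
y <- X %*% beta + Z %*% u + e
```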
Edit: to address Erik's request for R code to demonstrate the above.
I would never suggest forming $Z$ by hand / from scratch in all but the simplest of models. But here I will do so, and also check that the resulting data are consistent with using software to create $Z$:
```r
set.seed(15)
n.group <- 3  # number of groups
dt <- expand.grid(G = LETTERS[1:n.group], a = 1:4)
X <- model.matrix(~ a, dt)              # model matrix for fixed effects
betas <- c(3.1, 1.8)                    # fixed effects coefficient vector
Z1 <- model.matrix(~ 0 + G, dt)         # model matrix for random intercepts
Z2 <- model.matrix(~ 0 + G, dt) * dt$a  # model matrix for random slopes
```
Here I have created $Z_1$ and $Z_2$ "manually", as per the main part of my answer.
```r
s1 <- 2.1   # SD of random intercepts
s2 <- 1.8   # SD of random slopes
rho <- 0.5  # correlation between intercepts and slopes
cormat <- matrix(c(s1, rho, rho, s2), 2, 2)  # SD/correlation matrix (SDs on the diagonal)
covmat <- lme4::sdcor2cov(cormat)            # covariance matrix (needed for mvrnorm)
umat <- MASS::mvrnorm(n.group, c(0, 0), covmat, empirical = TRUE)  # simulate the random effects
u1 <- umat[, 1]
u2 <- umat[, 2]
e <- rnorm(nrow(dt), 0, 2)  # residual error
dt$Y_manual <- X %*% betas + Z1 %*% u1 + Z2 %*% u2 + e
```
So we have simulated Y from manually created $Z$ matrices. Now let's use lme4 to create $Z$:
```r
library(lme4)
lForm <- lFormula(Y_manual ~ a + (a | G), dt)  # lme4's function to process a model formula
Z <- t(as.matrix(lForm$reTrms$Zt))             # extract the Z matrix
u <- c(rbind(umat[, 1], umat[, 2]))  # lme4 orders the random effects interleaved by group (intercept, slope)
dt$Y <- X %*% betas + Z %*% u + e
dt
```
```
   G a   Y_manual          Y
1  A 1   4.347903   4.347903
2  B 1   4.039412   4.039412
3  C 1   8.275563   8.275563
4  A 2   4.788965   4.788965
5  B 2   3.301834   3.301834
6  C 2  10.839260  10.839260
7  A 3   9.906717   9.906717
8  B 3  -1.159811  -1.159811
9  C 3  17.517209  17.517209
10 A 4  12.205023  12.205023
11 B 4   1.017939   1.017939
12 C 4  17.692258  17.692258
```
So, as we can see, we obtain exactly the same simulated values for the outcome with the manual method and by using lme4's lFormula function.
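Since both columns were computed with the same random draws, the match can also be verified programmatically:

```r
# Check that the manual and lFormula-based simulations agree.
all.equal(as.numeric(dt$Y), as.numeric(dt$Y_manual))  # TRUE
```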
Now let's try actually fitting the model:
```r
m0 <- lmer(Y ~ a + (a | G), dt)
summary(m0)
```
```
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 G        (Intercept) 1.852    1.361
          a           6.338    2.518    -0.44
 Residual             3.038    1.743
Number of obs: 12, groups:  G, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)    3.557      1.462   2.433
a              1.670      1.522   1.097
```
Surprisingly, it converges without a warning, and the estimates are not too bad considering the sample size!
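To go beyond a single dataset, one could wrap the simulation in a function and repeat it. A minimal sketch, reusing the objects defined above (with only 3 groups, some replicates may trigger singular-fit or convergence warnings):

```r
# Sketch: repeat the simulation to see how the fixed-effect
# estimates behave on average.
sim_once <- function() {
  u <- MASS::mvrnorm(n.group, c(0, 0), covmat)  # fresh random effects
  e <- rnorm(nrow(dt), 0, 2)                    # fresh residual error
  dt$Y <- X %*% betas + Z1 %*% u[, 1] + Z2 %*% u[, 2] + e
  fixef(lmer(Y ~ a + (a | G), dt))
}
est <- replicate(200, sim_once())
rowMeans(est)  # should be close to c(3.1, 1.8)
```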