How to simulate data based on a linear mixed model fit object in R?

Question

Assume, I have a data set, which is similar to

require(nlme) 
?Orthodont

and my model is

fm2 <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1)

How can I use the model fit object fm2 to generate several datasets, which have sample sizes 300, 400, 500, ... ?

I read this great answer on r-sig-mixed-models help but it seems incomplete.

I would try to go for `library(boot)` aiming to bootstrap the results from `lme` fit. This link could probably help https://stat.ethz.ch/pipermail/r-help/2004-November/060444.html — Dmitrij Celov, May 25 '11 at 15:27
There are two ways of getting more data, simulating from the multivariate distribution and bootstrapping. If you simulate from the multivariate distribution you risk adding assumptions about the data generating process. In general, if I have a bit of data I use the bootstrap, if I have none I simulate from a multivariate distribution. — Seth, May 30 '12 at 14:23
Here is a lead http://glmm.wikidot.com/basic-glmm-simulation — Etienne Low-Décarie, Jun 18 '12 at 15:44
Ben Bolker's a big wheel on these... here's his approach http://rpubs.com/bbolker/4187 — Wes McClintick, Dec 20 '13 at 21:23

score 7 · Accepted Answer · edited Apr 13 '17 at 12:44

Note: the simulated data using simulate.lme does not match elements of the original data structure or model fit (eg. variance, effect size...) nor does it creation of data de novo for experimental design testing.

require(nlme) 

?nlme::simulate.lme

fit <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1) 

orthSim <-  simulate.lme(fit, nsim = 1)

This produces a simulated fitting (with a possible alternative model).

This is thanks to an answer by @Momo on one of my questions: Is there a general method for simulating data from a formula or analysis available?

If you require the simulated data, you will need to create a new function from the simulate.lme function.

simulate.lme.data<-edit(simulate.lme)

add the following line right before the last bracket

return(base2)

You can then create as much data as you want:

orthSimdata <-  simulate.lme.data(fit, nsim = 1)

Note this is from my (possibly mis-)interpretation of the un-commented code in simulate.lme.

Though this is useful, this seems to do little less than add gaussian noise to your existing data.

This can not be used to directly simulate data de novo. I currently create the start data by adding the numeric value of the factors levels of my experimental design data frame (eg. response=as.numeric(factor1)+as.numeric(factor2)+as.numeric(factor1)*as.numeric(factor1)+rnorm(sd=2)...).

score 3 · Answer 2 · answered May 16 '12 at 15:48

Here is one approach that takes all the values from fm2. You could add more arguments to the function to allow you to change values in the simulations.

library(nlme)

fm2 <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1) 

simfun <- function(n) {
    # n is the number of subjects, total rows will be 4*n

    # sig.b0 is the st. dev. of the random intercepts
    # it might be easier to just copy from the output
    sig.b0 <- exp(unlist(fm2$modelStruct$reStruct))*fm2$sigma
    b0 <- rnorm(n, 0, sig.b0)

    sex <- rbinom(n, 1, 0.5)  # assign sex at random

    fe <- fixef(fm2)
    my.df <- data.frame( Subject=rep(1:n, each=4), 
        int = fe[1] + rep(b0, each=4), 
        Sex=rep(sex,each=4), age=rep( c(8,10,12,14), n ) )
    my.df$distance <- my.df$int + fe[2] * my.df$age + 
        fe[3]*my.df$Sex + rnorm(n*4, 0, fm2$sigma)

    my.df$int <- NULL
    my.df$Sex <- factor( my.df$Sex, levels=0:1,
        labels=c('Male','Female') )
    my.df
}

Orthodont2 <- simfun(100)

Thank you for this. One question though, when you're simulating the response with `my.df$distance — Adrian, Aug 04 '17 at 20:53
I have tried both ways. (1) simulating the outcome without taking into account the SE of the fixed effects (like in your original response) and (2) simulating the outcome while taking into account the SE of the fixed effects. Interestingly, (1) looks more similar to my actual data set than (2). I wonder why? — Adrian, Aug 04 '17 at 20:56

Seth · Answer 3 · 2012-05-16T17:16:04.093

I would probably just sample randomly with replacement from the Subjects in your data until I had the right sample size. This is the bootstrap method. It is simpler than identifying the multivariate distribution of the variables and then sampling from it. Also the bootstrap does not make additional assumptions about the multivariate structure of the data.

first set the number of participants in your big simulated study

nits=300

get the unique participants in the small study

sub=unique(Orthodont$Subject)

sample the unique participants randomly with replacement

subs=sample(sub,nits,rep=T)

make an empty data frame

df=Orthodont[-(1:dim(Orthodont)[1]),]

loop through the sample size and bind it together.

for( i in 1:nits) {  
df=rbind(df,Orthodont[which(Orthodont$Subject==subs[i]),])
}

This last for loop is slow, there is prolly a better way of writing it.

Now you can run your regression on the bigger dataset and watch your confidence intervals get smaller.

boot and simulate.lme are both simple in their way. I wrote this because it explicates the process better than a call to a single function and makes minimal assumptions about the data structure. — Seth, May 30 '12 at 13:59

How to simulate data based on a linear mixed model fit object in R?

3 Answers3

Linked