Simulate data for power analysis of logistic regression model - include covariance between variables?

Question

I've tried to simulate data for a power analysis of a logistic regression. The results look reasonable: power = 90% for a sample of 6000 persons. But I feel that the analysis lacks something. So my question is: when generating the data, should I specify how the variables are correlated (their covariance), beyond defining their linear relationship with the outcome as I have done in the example below? And if so, where in the code do I do that?

I know other questions look like this but I'm not confident that they answer this question.

## Simulate one data set and return the p-value of the interaction term
simfunktion <- function() {
  antal <- 6000        # number of persons in each simulated sample
  beta0 <- log(0.16)   # log odds of the outcome in the reference group
  beta1 <- log(1.1)    # log odds ratio per unit of smoking
  beta2 <- log(1.1)    # log odds ratio per unit of SNP (gene variation)
  beta3 <- log(1.2)    # log odds ratio for the smoking*SNP interaction
  ## Smoking variable, with probabilities defined according to empirical studies
  ## (sample() rescales the prob weights, so they need not sum to 1)
  smoking  <- sample(x = 0:2, size = antal, replace = TRUE, prob = c(40,25,40))
  ## SNP variable, with probabilities defined according to empirical studies
  SNP      <- sample(x = 0:2, size = antal, replace = TRUE, prob = c(40,40,20))
  ## Probability of the outcome given the model (inverse logit)
  pi.x     <- plogis(beta0 + beta1*smoking + beta2*SNP + beta3*smoking*SNP)
  ## Binomial events given the probabilities
  sim.y    <- rbinom(n = antal, size = 1, prob = pi.x)
  sim.data <- data.frame(sim.y, smoking, SNP)
  ## Fit the model once
  glm1     <- glm(sim.y ~ smoking + SNP + smoking:SNP,
                  data = sim.data, family = binomial)
  ## Extract the p-value of the interaction term
  summary(glm1)$coefficients["smoking:SNP", "Pr(>|z|)"]
}
## Estimated power: share of simulated data sets with a significant interaction
pvalue     <- replicate(100, simfunktion())
mean(pvalue < 0.05)

Readers here may be interested in the following general overview of topics related to power analyses via simulation: [Simulation of logistic regression power analysis: designed experiments](http://stats.stackexchange.com/questions/35940/). — gung - Reinstate Monica, Jun 20 '13 at 17:11

score 5 · Accepted Answer · edited Apr 13 '17 at 12:44

Let me throw out some thoughts, and we'll see if something helps you.

Some preliminaries:

I find it a bit odd that you are defining your true betas as the log of some number; is that because you are using reported odds ratios? (If so, this is perfectly appropriate.)
It's important to realize that power analyses based on effects reported in the literature tend to be optimistic, because published effect estimates often overstate the true effect. I discuss that here: Desired effect size vs. expected effect size; you may also want to read this thread: Logistic regression model manipulation.
I notice that you are simulating the distribution of your covariates. That's not typically done; in general, we assume that our covariates are a set of known constants. However, if you will be doing observational (e.g., epidemiological) research, the covariate values will vary from sample to sample, and this strategy is appropriate.
Your covariates have the values 0, 1, 2. Are these levels of a factor, or are they equal interval? I ask because they look like the former, but the data generating process treats them as the latter.
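
To make the last point concrete, here is a minimal sketch of how the two readings differ once a model is fit. The data are made up purely for illustration (the sample size, the flat outcome probability of 0.2, and the object names `fit_num` and `fit_fac` are my assumptions, not part of the question's design). Treated as equal-interval, each covariate gets one slope and the interaction is a single coefficient; treated as factors, each covariate gets two dummy variables and the interaction has four coefficients.

```r
set.seed(2)  # only so the illustration is reproducible

n       <- 600                              # arbitrary illustrative sample size
smoking <- sample(0:2, n, replace = TRUE)
SNP     <- sample(0:2, n, replace = TRUE)
y       <- rbinom(n, size = 1, prob = 0.2)  # outcome unrelated to covariates
d       <- data.frame(y, smoking, SNP)

# Equal-interval reading: intercept + 2 slopes + 1 interaction coefficient
fit_num <- glm(y ~ smoking * SNP, family = binomial, data = d)

# Factor reading: intercept + 2 + 2 dummies + 4 interaction coefficients
fit_fac <- glm(y ~ factor(smoking) * factor(SNP), family = binomial, data = d)

length(coef(fit_num))  # 4
length(coef(fit_fac))  # 9
```

If the factor reading is the right one, the data generating process would need a separate true coefficient for each of those terms, and the interaction would be tested with a 4-degree-of-freedom test rather than a single Wald p-value.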

Your main question:

If your covariates will be correlated with each other, your power will decrease, so yes, you should definitely try to incorporate that information into the simulation.
You want to do that where you are generating the distribution of your covariates.
Generating correlated data is slightly more complicated. There is a general thread on this topic on CV here: How to generate correlated random numbers (given means, variances, and degree of correlation).
Your situation is potentially a little more complicated still, because your values are only 0, 1, 2. This can be done using copulas (search the site; I think there are some informative threads). Alternatively, you could list the probabilities of the nine possible combinations of smoking and SNP, use sample() to draw a combination for each person (similarly to what you have already), and then assign the associated values.
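
The second approach (sampling from the nine combinations) could be sketched as follows. The joint probabilities in `joint_prob` are hypothetical numbers I chose for illustration; they are not taken from any study and should be replaced with values from your own sources. The names `combos`, `joint_prob` and `idx` are likewise just placeholders.

```r
set.seed(1)  # only so the illustration is reproducible

n <- 6000

# All nine (smoking, SNP) combinations; smoking varies fastest across rows.
combos <- expand.grid(smoking = 0:2, SNP = 0:2)

# HYPOTHETICAL joint probabilities, one per row of `combos`.
joint_prob <- c(0.20, 0.10, 0.10,   # SNP = 0, smoking = 0, 1, 2
                0.12, 0.10, 0.18,   # SNP = 1, smoking = 0, 1, 2
                0.03, 0.05, 0.12)   # SNP = 2, smoking = 0, 1, 2
stopifnot(isTRUE(all.equal(sum(joint_prob), 1)))

# Draw one combination (row index) per person, then look up both values.
idx     <- sample(nrow(combos), size = n, replace = TRUE, prob = joint_prob)
smoking <- combos$smoking[idx]
SNP     <- combos$SNP[idx]

# Check the simulated joint distribution and the induced correlation.
round(table(smoking, SNP) / n, 3)
cor(smoking, SNP)
```

Inside your simulation function, these lines would replace the two independent sample() calls; everything after that (computing the probabilities, drawing the outcome, fitting the model) can stay as it is.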

Excellent info and links, enough to get me going. And yes you are right this is for epidemiological studies :) — Rasmus Larsen, Jun 23 '13 at 18:10
