I've tried to simulate data for a power analysis of a logistic regression. The results of the power analysis look reasonable: power=90% for a sample of 6000 persons. But I feel that the analysis lacks something. So, my question is: when generating the data should I include something about how the variables are correlated, or their covariance, other than just defining their linear relationship as I have done in the example below, and if so where do I write that into the code?
I know other questions look like this but I'm not confident that they answer this question.
library(plyr) # functions
## Define Function
simfunktion <- function() {
# Number in each sample
antal <- 6000
beta0 <- log(0.16) # logit in reference group
beta1 <- log(1.1) # logit given smoking
beta2 <- log(1.1) # logit given SNP(genevariation)
beta3 <- log(1.2) # logit for interactioncoefficient for SNP*rygning
## Smoking variable, with probabilities defined according to empirical studies.
smoking <- sample(x = 0:2, size = antal, replace = TRUE, prob = c(40,25,40))
## SNP variables with probabilities defined according to empirical studies
SNP <- sample(x = 0:2, size = antal, replace = TRUE, prob = c(40,40,20))
## calculated probabilites given the model:
pi.x <- exp( beta0 + beta1*smoking + beta2*SNP + beta3*smoking*SNP) /
( 1 + exp(beta0 + beta1*smoking + beta2*SNP + beta3*smoking*SNP) )
## binoial events given the probabilities:
sim.y <- rbinom( n = antal, size = 1, prob = pi.x)
sim.data <- data.frame(sim.y, smoking, SNP)
#################### p-value of the interaction is extracted:
## the model is run:
glm1 <- glm( data = sim.data, formula = sim.y ~ smoking + SNP + smoking:SNP,
family=binomial )
## p-value of the interactionterm is extracted:
summary( glm( data = sim.data, formula = sim.y ~ smoking + SNP + smoking:SNP,
family=binomial ))$coef[4,4]
}
pvalue <- as.vector(replicate( 100 , simfunktion()))
mean(pvalue < 0.05)