
For my bachelor's thesis, I am running a simulation study to compare analysis methods.

The generated data follow a pre-post design: two groups, each measured at two time points.

#parameters
n <- 100   # sample size; not defined in the original post, assumed here so the code runs
b0 <- 0  
b1 <- .2

#treatment
X <- matrix(0,ncol=1,nrow=n)  
X[1:(n/2)] <- 0  
X[((n/2)+1):n] <- 1  

#mean structure
mean <- matrix(0,ncol=2,nrow=n)  
mean[,1] <- b0 + X  
mean[,2] <- b0 + b1*X 

When generating the data, I would like to include a systematic bias caused by the lack of randomization. Say males score higher on a particular variable and are over-represented in the treatment group. The gender effect should also differ between the pre-test and the post-test. How would you build this bias into the data generation?

I hope this is enough information; if not, I am happy to provide more.

hh32

1 Answer

I'm not entirely sure I understand your question, but here is my attempt at an answer. I'll simulate a treatment group and a control group, with males over-represented in the treatment group and females over-represented in the control group. Males have a higher pre-test score, and each post-test score is drawn as a random variable whose expected value is half of that individual's pre-test score. Treatment has no effect on the post-test score. The outcome measure in the statistical test that follows is the individual difference between pre-test and post-test. Because males start higher, they will have a larger difference than females, so if gender is not taken into account in the analysis, the treatment will appear to be associated with a larger difference in test scores.

First, I create groups with different proportions of males and females (gender is coded 0 = female, 1 = male):

set.seed(1)
group.size <- 150
trt <- c(rep(0, group.size), rep(1,group.size))
gender <- c(rbinom(group.size,1,0.4), rbinom(group.size,1,0.6))
prop.table(table(trt, gender), margin=1)
   gender
trt         0         1
  0 0.5933333 0.4066667
  1 0.3333333 0.6666667

Now the pre-test and post-test scores are simulated. The pre-test score depends on gender (mean 10 for females and 14 for males, with a standard deviation of 2). Each post-test score is drawn around half of that individual's pre-test score, with a standard deviation of 1:

pre.test <- rnorm(group.size*2, 10+gender*4,2)
post.test <- rnorm(group.size*2, pre.test/2, 1)

tapply(pre.test, gender, mean)
       0         1 
9.981824 14.089030 

tapply(post.test, gender, mean)
       0        1 
4.900903 6.994718 
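
Why this setup produces a gender effect on the difference can be written out directly from the simulation parameters above. This is only a short worked derivation, writing g for gender with g = 1 for males:

```latex
\begin{aligned}
E[\text{pre} \mid g] &= 10 + 4g,\\
E[\text{post} \mid g] &= \tfrac{1}{2}\,E[\text{pre} \mid g] = 5 + 2g,\\
E[\text{pre} - \text{post} \mid g] &= (10 + 4g) - (5 + 2g) = 5 + 2g.
\end{aligned}
```

So the expected difference is 5 for females and 7 for males, which is close to the simulated means above (9.98 - 4.90 = 5.08 and 14.09 - 6.99 = 7.09). With a probability of being male of 0.6 in the treatment group versus 0.4 in the control group, a comparison that ignores gender is expected to show a spurious treatment effect of about 2 * (0.6 - 0.4) = 0.4.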

The individual difference (pre-test minus post-test) is calculated and then regressed on treatment in a linear model, with the difference as the dependent variable and treatment as the only independent variable:

diff <- pre.test - post.test
summary(lm(diff ~ trt))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.8741     0.1447  40.591  < 2e-16 ***
trt           0.5748     0.2047   2.808  0.00531 ** 

The treatment coefficient is clearly significant: the treated group's mean difference is about 10% larger than the control group's (0.57 on top of an intercept of 5.87). However, when gender is taken into account:

summary(lm(diff ~ trt + gender))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.06113    0.14159  35.744   <2e-16 ***
trt          0.05502    0.17807   0.309    0.758    
gender       1.99901    0.17855  11.196   <2e-16 ***

As you can see, the effect of treatment virtually disappears and is far from statistical significance.
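
To connect this back to the mean-structure notation in the question, the same idea could be sketched as a small generator function. This is only an illustration under stated assumptions: the function name `simulate_prepost`, its arguments, and all default values are hypothetical. Letting gender shift the pre-test and post-test means by separate amounts (`g_pre`, `g_post`) is one possible way to make the gender effect differ between the two time points. Unlike the code above, the post-test here is drawn around a group mean rather than around each individual's pre-test score.

```r
# Hypothetical generator; names and default values are illustrative assumptions.
# gender: 0 = female, 1 = male; trt: 0 = control, 1 = treatment.
simulate_prepost <- function(n = 300, p_male = c(0.4, 0.6),
                             b0 = 10, b1 = 0,
                             g_pre = 4, g_post = 2,
                             sd_pre = 2, sd_post = 1) {
  stopifnot(n %% 2 == 0, length(p_male) == 2)
  trt <- rep(c(0, 1), each = n / 2)
  # Non-random allocation: the probability of being male depends on the group
  gender <- rbinom(n, 1, p_male[trt + 1])
  # Mean structure: gender shifts pre and post by different amounts,
  # and the treatment effect b1 enters only at post-test
  mean_pre  <- b0 + g_pre * gender
  mean_post <- b0 + g_post * gender + b1 * trt
  data.frame(trt = trt, gender = gender,
             pre  = rnorm(n, mean_pre,  sd_pre),
             post = rnorm(n, mean_post, sd_post))
}

set.seed(1)
d <- simulate_prepost()
d$change <- d$post - d$pre
coef(lm(change ~ trt, data = d))           # treatment estimate confounded by gender
coef(lm(change ~ trt + gender, data = d))  # treatment estimate adjusted for gender
```

With `b1 = 0` there is no true treatment effect, so any non-zero `trt` coefficient in the first model reflects the gender imbalance alone. Setting `g_pre` equal to `g_post` removes the bias in the change score, which can serve as a sanity check for the generator.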

I hope this was an answer to your question.

JonB