Simulate data that arise from mediation with specified conditions

Question

I am looking to create a function that simulates data arising from a mediation process, where a predictor (X) has an indirect effect on the outcome (Y) through the mediator (M).

I consulted the answers to the following questions:

I would like the function to simulate:

the mediator and outcome if the user inputs the predictor,
the predictor and outcome if the user inputs the mediator, or
the predictor and mediator if the user inputs the outcome

I would like the user to be able to specify various conditions for simulating the data arising from mediation, including the correlation between X and Y, the correlation between X and M, the correlation between M and Y, and the proportion of the effect mediated. The proportion of the effect mediated (Pm) is the ratio of the indirect effect (ab) to the total effect (Wen & Fan, 2015). I would like the function to simulate the data that would yield a mediation model with the conditions specified by the user.

For instance, I would like the function to estimate:

the total effect if the user inputs the correlation between X and M, the correlation between M and Y, and proportionMediated (Pm)
proportionMediated if the user inputs the correlation between X and M, the correlation between M and Y, and the correlation between X and Y
the correlation between X and M and the correlation between M and Y (assuming they are equal) if the user inputs the correlation between X and Y and proportionMediated
the correlation between X and M if the user inputs the correlation between M and Y, the correlation between X and Y, and proportionMediated
the correlation between M and Y if the user inputs the correlation between X and M, the correlation between X and Y, and proportionMediated

I used the answer to the first link (above) in writing the beginnings of a function:

simulateIndirectEffect <- function(x, m, y, a, b, cTotal, proportionMediated, seed){
  if(missing(seed)){
    seed <- round(runif(1, 0, 1000)*100)
  }

  if(missing(cTotal) == TRUE){
    cTotal <- (a * b) / proportionMediated
  } else if(missing(proportionMediated) == TRUE){
    proportionMediated <- (a * b) / cTotal
  } else if(missing(a) == TRUE & missing(b) == TRUE){
    a <- sqrt(proportionMediated * cTotal)
    b <- sqrt(proportionMediated * cTotal)
  } else if(missing(a) == TRUE){
    a <- (proportionMediated * cTotal) / b
  } else if(missing(b) == TRUE){
    b <- (proportionMediated * cTotal) / a
  }

  ab <- a * b
  cPrime <- cTotal - ab

  if(missing(x) == FALSE){
    sampleSize <- length(x)

    set.seed(seed + 1)
    m <- a*x + sqrt(1-a^2) * rnorm(sampleSize) #what should I change error term to?
    error <- 1 - (cPrime^2 + b^2 + 2*a*cPrime*b)

    set.seed(seed + 2)
    y <- cPrime*x + b*m + error*rnorm(sampleSize) #what should I change error term to?
  } else if(missing(m) == FALSE){
    sampleSize <- length(m)

    set.seed(seed + 1)
    #x <- #Not sure what to put here

    set.seed(seed + 2)
    #y <- #Not sure what to put here
  } else if(missing(y) == FALSE){
    sampleSize <- length(y)

    set.seed(seed + 1)
    #x <- #Not sure what to put here

    set.seed(seed + 2)
    #m <- #Not sure what to put here
  }

  simulatedData <- as.data.frame(cbind(x, m, y))

  return(simulatedData)
}

I have three questions:

How can we simulate m and y given x (and the conditions specified) in the above function?
How can we simulate x and y given m (and the conditions specified) in the above function?
How can we simulate x and m given y (and the conditions specified) in the above function?

Note that the function above does not appear to simulate the mediation data per the conditions specified. For instance, when I simulate data based on a total effect of .6 and a proportion of the effect mediated of .4, my correlations are way too high. I want my correlation between x and y to be .6 (i.e., the total effect), but it is .99 in the simulated data (see below). I suspect that using rnorm() to generate a random variable with a mean of 0 and SD of 1 is too small to add to the error term, but am not sure what to use instead.

> predictor <- rnorm(1000, mean = 50, sd = 10)
> myData <- simulateIndirectEffect(x = predictor, cTotal = .6, proportionMediated = .4, seed = 12345)
> round(cor(myData), 2)
     x    m    y
x 1.00 0.98 0.99
m 0.98 1.00 0.99
y 0.99 0.99 1.00

References:

Wen, Z., & Fan, X. (2015). Monotonicity of effect sizes: Questioning kappa-squared as mediation effect size measure. Psychological Methods, 20, 193-203. doi: 10.1037/met0000029

POC · Answer 1 · 2020-03-26T02:01:26.863

My answer is partly based on Caron & Valois (2018), which offers R code to simulate what you would like. I will use the usual convention for the explanation.

Let start by indicating $a$ as the correlation between $X$ to $M$, $b$ as the partial correlation between $M$ to $Y$ (while controlling for $X$) and $c'$ as the direct correlation between $X$ to $Y$.

We know that the total effect $c$ (the path between $X$ and $Y$ when not controlling for $M$ is $c'+ab=c$, so as a proportion : $c' = \frac{ab}{\rho}-ab$, where $\rho$ is the proportion or percentage you want. The total effect would be $c'+ab=c$ or $\frac{c'}{\rho}+\frac{ab}{\rho}=100\%$. If $\rho=100\%$, then it is a full mediation.

The correlation between $X$ and $M$ is $a$, the correlation between $M$ and $Y$ is $=b(1-a^2)+ac$ (remember $b$ is a partial correlation so we have to "unpartial" the value), and the correlation between $X$ and $Y$ is the total effect, $c'+ab=c$

Here is some R code that may help you :

> n = 100000
> a = .4
> b = .4
> ab = a*b
> rho = .75
> cp = ab/rho-ab #cp = c'
> x = rnorm(n)
> m = a*x + sqrt(1-a^2)*rnorm(n)
> ey = 1-(cp^2 +b^2 + 2*a*cp*b)
> y = cp*x + b*m + ey*rnorm(n)
> res1 = summary(lm(m~x))
> res1

Call:
lm(formula = m ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4982 -0.6181 -0.0007  0.6164  4.0573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.001670   0.002893   0.577    0.564    
x           0.403221   0.002902 138.964   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9148 on 99998 degrees of freedom
Multiple R-squared:  0.1619,    Adjusted R-squared:  0.1618 
F-statistic: 1.931e+04 on 1 and 99998 DF,  p-value: < 2.2e-16

> res2 = summary(lm(y~x+m))
> res2

Call:
lm(formula = y ~ x + m)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4128 -0.5526 -0.0013  0.5546  4.0239 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0005503  0.0025881   0.213    0.832    
x           0.0541750  0.0028354  19.107   <2e-16 ***
m           0.4011823  0.0028290 141.810   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8184 on 99997 degrees of freedom
Multiple R-squared:  0.2128,    Adjusted R-squared:  0.2128 
F-statistic: 1.352e+04 on 2 and 99997 DF,  p-value: < 2.2e-16

> A = res1$coefficients[2,1]
> B = res2$coefficients[3,1]
> CP = res2$coefficients[2,1]
> ab
[1] 0.16
> A*B
[1] 0.1617652
> cp
[1] 0.05333333
> CP
[1] 0.05417502

The actual example used standardized coefficients, you can then unstandardize the variables by adding means and multiplying by their standard deviation.

References

Caron, P.-O., & Valois, P. (2018). A computational description of simple mediation analysis. The Quantitative Methods for Psychology, 14, 147-158. doi:10.20982/tqmp.14.2.p147

Simulate data that arise from mediation with specified conditions

1 Answers1