1

I am looking to create a function that simulates data arising from a mediation process, where a predictor (X) has an indirect effect on the outcome (Y) through the mediator (M).

I consulted the answers to the following questions:

I would like the function to simulate:

  • the mediator and outcome if the user inputs the predictor,

  • the predictor and outcome if the user inputs the mediator, or

  • the predictor and mediator if the user inputs the outcome

I would like the user to be able to specify various conditions for simulating the data arising from mediation, including the correlation between X and Y, the correlation between X and M, the correlation between M and Y, and the proportion of the effect mediated. The proportion of the effect mediated (Pm) is the ratio of the indirect effect (ab) to the total effect (Wen & Fan, 2015). I would like the function to simulate the data that would yield a mediation model with the conditions specified by the user.

For instance, I would like the function to estimate:

  • the total effect if the user inputs the correlation between X and M, the correlation between M and Y, and proportionMediated (Pm)

  • proportionMediated if the user inputs the correlation between X and M, the correlation between M and Y, and the correlation between X and Y

  • the correlation between X and M and the correlation between M and Y (assuming they are equal) if the user inputs the correlation between X and Y and proportionMediated

  • the correlation between X and M if the user inputs the correlation between M and Y, the correlation between X and Y, and proportionMediated

  • the correlation between M and Y if the user inputs the correlation between X and M, the correlation between X and Y, and proportionMediated

I used the answer to the first link (above) in writing the beginnings of a function:

simulateIndirectEffect <- function(x, m, y, a, b, cTotal, proportionMediated, seed){
  if(missing(seed)){
    seed <- round(runif(1, 0, 1000)*100)
  }

  if(missing(cTotal) == TRUE){
    cTotal <- (a * b) / proportionMediated
  } else if(missing(proportionMediated) == TRUE){
    proportionMediated <- (a * b) / cTotal
  } else if(missing(a) == TRUE & missing(b) == TRUE){
    a <- sqrt(proportionMediated * cTotal)
    b <- sqrt(proportionMediated * cTotal)
  } else if(missing(a) == TRUE){
    a <- (proportionMediated * cTotal) / b
  } else if(missing(b) == TRUE){
    b <- (proportionMediated * cTotal) / a
  }

  ab <- a * b
  cPrime <- cTotal - ab

  if(missing(x) == FALSE){
    sampleSize <- length(x)

    set.seed(seed + 1)
    m <- a*x + sqrt(1-a^2) * rnorm(sampleSize) #what should I change error term to?
    error <- 1 - (cPrime^2 + b^2 + 2*a*cPrime*b)

    set.seed(seed + 2)
    y <- cPrime*x + b*m + error*rnorm(sampleSize) #what should I change error term to?
  } else if(missing(m) == FALSE){
    sampleSize <- length(m)

    set.seed(seed + 1)
    #x <- #Not sure what to put here

    set.seed(seed + 2)
    #y <- #Not sure what to put here
  } else if(missing(y) == FALSE){
    sampleSize <- length(y)

    set.seed(seed + 1)
    #x <- #Not sure what to put here

    set.seed(seed + 2)
    #m <- #Not sure what to put here
  }

  simulatedData <- as.data.frame(cbind(x, m, y))

  return(simulatedData)
}

I have three questions:

  1. How can we simulate m and y given x (and the conditions specified) in the above function?
  2. How can we simulate x and y given m (and the conditions specified) in the above function?
  3. How can we simulate x and m given y (and the conditions specified) in the above function?

Note that the function above does not appear to simulate the mediation data per the conditions specified. For instance, when I simulate data based on a total effect of .6 and a proportion of the effect mediated of .4, my correlations are way too high. I want my correlation between x and y to be .6 (i.e., the total effect), but it is .99 in the simulated data (see below). I suspect that using rnorm() to generate a random variable with a mean of 0 and SD of 1 is too small to add to the error term, but am not sure what to use instead.

> predictor <- rnorm(1000, mean = 50, sd = 10)
> myData <- simulateIndirectEffect(x = predictor, cTotal = .6, proportionMediated = .4, seed = 12345)
> round(cor(myData), 2)
     x    m    y
x 1.00 0.98 0.99
m 0.98 1.00 0.99
y 0.99 0.99 1.00

References:

Wen, Z., & Fan, X. (2015). Monotonicity of effect sizes: Questioning kappa-squared as mediation effect size measure. Psychological Methods, 20, 193-203. doi: 10.1037/met0000029

itpetersen
  • 447
  • 8
  • 17

1 Answers1

1

My answer is partly based on Caron & Valois (2018), which offers R code to simulate what you would like. I will use the usual convention for the explanation.

Let start by indicating $a$ as the correlation between $X$ to $M$, $b$ as the partial correlation between $M$ to $Y$ (while controlling for $X$) and $c'$ as the direct correlation between $X$ to $Y$.

We know that the total effect $c$ (the path between $X$ and $Y$ when not controlling for $M$ is $c'+ab=c$, so as a proportion : $c' = \frac{ab}{\rho}-ab$, where $\rho$ is the proportion or percentage you want. The total effect would be $c'+ab=c$ or $\frac{c'}{\rho}+\frac{ab}{\rho}=100\%$. If $\rho=100\%$, then it is a full mediation.

The correlation between $X$ and $M$ is $a$, the correlation between $M$ and $Y$ is $=b(1-a^2)+ac$ (remember $b$ is a partial correlation so we have to "unpartial" the value), and the correlation between $X$ and $Y$ is the total effect, $c'+ab=c$

Here is some R code that may help you :

> n = 100000
> a = .4
> b = .4
> ab = a*b
> rho = .75
> cp = ab/rho-ab #cp = c'
> x = rnorm(n)
> m = a*x + sqrt(1-a^2)*rnorm(n)
> ey = 1-(cp^2 +b^2 + 2*a*cp*b)
> y = cp*x + b*m + ey*rnorm(n)
> res1 = summary(lm(m~x))
> res1

Call:
lm(formula = m ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4982 -0.6181 -0.0007  0.6164  4.0573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.001670   0.002893   0.577    0.564    
x           0.403221   0.002902 138.964   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9148 on 99998 degrees of freedom
Multiple R-squared:  0.1619,    Adjusted R-squared:  0.1618 
F-statistic: 1.931e+04 on 1 and 99998 DF,  p-value: < 2.2e-16

> res2 = summary(lm(y~x+m))
> res2

Call:
lm(formula = y ~ x + m)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4128 -0.5526 -0.0013  0.5546  4.0239 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0005503  0.0025881   0.213    0.832    
x           0.0541750  0.0028354  19.107   <2e-16 ***
m           0.4011823  0.0028290 141.810   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8184 on 99997 degrees of freedom
Multiple R-squared:  0.2128,    Adjusted R-squared:  0.2128 
F-statistic: 1.352e+04 on 2 and 99997 DF,  p-value: < 2.2e-16

> A = res1$coefficients[2,1]
> B = res2$coefficients[3,1]
> CP = res2$coefficients[2,1]
> ab
[1] 0.16
> A*B
[1] 0.1617652
> cp
[1] 0.05333333
> CP
[1] 0.05417502

The actual example used standardized coefficients, you can then unstandardize the variables by adding means and multiplying by their standard deviation.

References

Caron, P.-O., & Valois, P. (2018). A computational description of simple mediation analysis. The Quantitative Methods for Psychology, 14, 147-158. doi:10.20982/tqmp.14.2.p147

POC
  • 346
  • 1
  • 8
  • 23