Sample Size Estimation/Power Analysis Using Simulation in R

Question

I am looking for a way to estimate the number of observations needed for a regression analysis. My hypothesised model is y_post ~ y_pre + x1*x2. My interest lies in whether the interaction term x1:x2 is statistically significant. I assumed a small-to-medium effect size for this x1:x2 term.

My question is: assuming a small-to-medium effect size, what is the sample size needed to detect the effect with a power of 80% and alpha 0.05?

After looking into similar questions, I couldn't find a good, systematic reference book on power analysis. Could anyone suggest a good reference book or book chapter on how to conduct a sample size estimation using simulation in R? I want to learn more about simulation so that, when I encounter different experimental designs in the future, I can simulate the required sample size myself.

My background is in psychology. It would be wonderful if the references were also from this field (although this is not necessary).

Many similar questions remain unanswered or have no satisfactory answer (e.g., Power Analysis By Simulation; How to simulate a custom power analysis of an lm model (using R); Simulating responses from a factorial experiment for power analysis). Your help would be very much appreciated. Thank you.

What effect size do you intend to use? There are quite a few effect sizes available for regression models; the coefficients themselves are among them. To simulate a regression model, you probably need to assume quite a few things, such as the distribution of the predictors, the coefficients, and the residual variance. Once you have decided on those assumptions, the simulations themselves are not that hard. — COOLSerdash, Mar 13 '19 at 12:13
@COOLSerdash Thank you for your comment. I am not very sure about the different effect size measures. Let's assume I'm referring to partial eta squared. Thank you for pointing out the parameters I will need to assume. However, as is often the case in real life, most of those parameters are unknown when I am doing an a priori power analysis (but it is almost a must to do the power analysis prior to data collection nowadays). Do you have any suggestions? Thanks. — JetLag, Mar 13 '19 at 12:20
To be honest, I would just use G*Power instead of simulations in your case. — COOLSerdash, Mar 13 '19 at 12:24
@COOLSerdash Thank you. I think I'll just turn to G*Power then. Thanks. — JetLag, Mar 13 '19 at 12:25
@COOLSerdash: do you want to post your comment(s) as an answer? [Better to have a short answer than no answer at all.](https://stats.meta.stackexchange.com/a/5326/1352) Anyone who has a better answer can post it. — Stephan Kolassa, Nov 04 '21 at 08:49

COOLSerdash · Answer 1 · 2021-11-04T09:33:22.627

What effect size do you intend to use? There are quite a few effect sizes available for regression models; the coefficients themselves are among them. To simulate a regression model, you need to assume several things, such as the distribution of the predictors, the coefficients, and the residual variance. Once you have decided on those assumptions, the simulations themselves are not that hard.

Another possibility is using G*Power. The drawback is that it is quite restrictive concerning effect sizes.
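
For readers who want the G*Power-style calculation inside R, here is a minimal base-R sketch using the noncentral F distribution. It is not part of the original answer and rests on several assumptions: the effect size is expressed as Cohen's $f^2$ (a partial eta squared can be converted via $f^2 = \eta_p^2 / (1 - \eta_p^2)$), the noncentrality parameter follows the convention $\lambda = f^2 N$, and the value `f2 = 0.05` is a hypothetical stand-in for "small-to-medium" (Cohen's benchmarks are 0.02 for small and 0.15 for medium). The function name `power_f2` is likewise only illustrative.

```r
# Analytic power for testing one coefficient (here: the x1:x2 interaction)
# in a linear model with 4 predictors (y_pre, x1, x2, x1:x2).
power_f2 <- function(n, f2, n_tested = 1, n_predictors = 4, alpha = 0.05) {
  df1 <- n_tested                    # numerator df: number of tested terms
  df2 <- n - n_predictors - 1        # denominator df: residual df
  ncp <- f2 * n                      # ASSUMED convention: lambda = f2 * N
  f_crit <- qf(1 - alpha, df1, df2)  # critical value under the null
  pf(f_crit, df1, df2, ncp = ncp, lower.tail = FALSE)
}

f2 <- 0.05                           # HYPOTHETICAL "small-to-medium" effect
n_grid <- 10:1000                    # candidate sample sizes

# Smallest n on the grid whose analytic power reaches 80%
n_required <- n_grid[which(power_f2(n_grid, f2) >= 0.80)[1]]
n_required
```

Changing `f2` shows how strongly the required sample size depends on the assumed effect size, which is why that assumption deserves more attention than the choice of tool.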

To give a specific example of how you could use simulations to assess the power in a regression model, let's assume that the true model is as follows: $$ \texttt{y_post}_i = 10 + 0.85\times\texttt{y_pre}_i -0.5\times\texttt{x}_{1,i} + 0.6\times\texttt{x}_{2,i} + 0.1\times\texttt{x}_{1,i}\times\texttt{x}_{2,i} + \epsilon_i $$

Furthermore, I'm assuming the following distributions for the involved variables:

$$ \begin{array}{l|l} \text{Variable} & \text{Distribution} \\ \hline \texttt{y_post} & \operatorname{N}(100, 15^2) \\ \texttt{y_pre} & \operatorname{N}(115, 14^2) \\ \texttt{x1} & \operatorname{U}(10, 50) \\ \texttt{x2} & \operatorname{N}(15, 7.5^2) \\ \epsilon & \operatorname{N}(0, 25^2) \end{array} $$

Finally, I'm assessing the power for a sample size of $100$ using $1000$ replications.
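
To make the logic explicit before handing it to a package, the same setup can be sketched by hand in base R. This is only an illustration under the model and distributions stated above, not the code that produced the results reported in this answer: the helper name `simulate_p_value` and the seed are arbitrary, only the predictors and the error term are drawn (`y_post` is then computed from the model equation), and the resulting estimate need not match the simglm output exactly.

```r
# Base-R sketch: estimate the power for the x1:x2 coefficient at n = 100.
set.seed(142857)  # arbitrary seed, for reproducibility only

# Simulate one data set, fit the model, return the interaction's p-value
simulate_p_value <- function(n, b_int = 0.1, sigma = 25) {
  y_pre <- rnorm(n, mean = 115, sd = 14)
  x1    <- runif(n, min = 10, max = 50)
  x2    <- rnorm(n, mean = 15, sd = 7.5)
  # Outcome generated from the assumed true model plus normal noise
  y_post <- 10 + 0.85 * y_pre - 0.5 * x1 + 0.6 * x2 + b_int * x1 * x2 +
    rnorm(n, mean = 0, sd = sigma)
  fit <- lm(y_post ~ y_pre + x1 * x2)
  summary(fit)$coefficients["x1:x2", "Pr(>|t|)"]
}

# Repeat 1000 times; power = share of replications with p < alpha
p_values  <- replicate(1000, simulate_p_value(n = 100))
power_hat <- mean(p_values < 0.05)
power_hat
```

The estimated power is simply the proportion of simulated data sets in which the interaction is significant at the chosen alpha level, so any other design can be handled by changing the data-generating lines and the model formula.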

The R package simglm makes it easy to set up the simulations:

library(tidyverse)
library(simglm)
library(future)

# The simulation setup
sim_args <- list(
  formula = y_post ~ 1 + y_pre + x1*x2 # The formula
  , fixed = list( # Regression variables
    y_post = list(var_type = "continuous", dist = "rnorm", mean = 100, sd = 15)
    , y_pre = list(var_type = "continuous", dist = "rnorm", mean = 115, sd = 14)
    , x1 = list(var_type = "continuous", dist = "runif", min = 10, max = 50)
    , x2 = list(var_type = "continuous", dist = "rnorm", mean = 15, sd = 7.5)
  )
  , sample_size = 100
  , error = list(variance = 25^2) # Residual variance
  , reg_weights = c( # The coefficients
    10    # Intercept
    , 0.85 # y_pre
    , -0.5 # x1
    , 0.6 # x2
    , 0.1 # x1:x2
  )
  , replications = 1000 # Number of replications
  , extract_coefficients = TRUE
  , model_fit = list(formula = y_post ~ 1 + y_pre + x1*x2, model_function = "lm")
  , power = list( # Hypothesis test for the coefficients
    dist = "qt"
    , alpha = 0.05
    , opts = list(df = 100 - 5)
  )
)

Let's perform the simulations and inspect the power:

plan(multisession, workers = 10)

res <- replicate_simulation(sim_args, future.seed = 142857) %>%
  compute_statistics(sim_args, type_1_error = FALSE, precision = FALSE)

res
  term        avg_estimate power avg_test_stat crit_value_power replications
  <chr>              <dbl> <dbl>         <dbl>            <dbl>        <dbl>
1 (Intercept)       7.57   0.072         0.118             1.99         1000
2 x1               -0.481  0.175        -0.932             1.99         1000
3 x1:x2             0.0994 1             6.18              1.99         1000
4 x2                0.663  0.072         0.360             1.99         1000
5 y_pre             0.854  0.999         5.03              1.99         1000

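Under these assumptions, the estimated power for the interaction term x1:x2 is $1$ (third column, power): with a sample size of $100$, the test rejected the null hypothesis in every one of the $1000$ replications.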

By using the argument vary_arguments in the setup, we could vary the sample size (see the vignette "Simulation Argument Details for 'simglm'" of the package).
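
The idea of varying the sample size can also be sketched without the package. The following base-R loop is an illustration under the same assumed model and distributions as above; the helper `power_at_n`, the grid of sample sizes, the seed, and the reduced number of replications per grid point are all choices made for this sketch rather than part of the original answer.

```r
# Base-R sketch: power curve over a grid of sample sizes.
set.seed(2021)  # arbitrary seed, for reproducibility only

# Estimated power for the x1:x2 coefficient at sample size n
power_at_n <- function(n, n_rep = 500, alpha = 0.05) {
  p_values <- replicate(n_rep, {
    y_pre <- rnorm(n, mean = 115, sd = 14)
    x1    <- runif(n, min = 10, max = 50)
    x2    <- rnorm(n, mean = 15, sd = 7.5)
    y_post <- 10 + 0.85 * y_pre - 0.5 * x1 + 0.6 * x2 + 0.1 * x1 * x2 +
      rnorm(n, mean = 0, sd = 25)
    coef(summary(lm(y_post ~ y_pre + x1 * x2)))["x1:x2", "Pr(>|t|)"]
  })
  mean(p_values < alpha)
}

sample_sizes <- seq(40, 160, by = 20)  # illustrative grid
power_curve <- data.frame(
  n     = sample_sizes,
  power = vapply(sample_sizes, power_at_n, numeric(1))
)
power_curve

# Smallest grid value whose estimated power reaches the 80% target
n_needed <- min(power_curve$n[power_curve$power >= 0.80])
n_needed
```

Because each power estimate carries Monte Carlo error, it is sensible to refine the grid around the first sample size that crosses 80% and to increase `n_rep` there before settling on a number.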
