I want to generate a data set with a pre-specified significance level. Suppose we have two covariates, x1 and x2, and an outcome variable y. We fit a linear regression model as follows:

create_data <- function(beta_1 = 0.01,
                        beta_2 = 0.04,
                        n = 1000,
                        seed = 2020) {
  set.seed(seed)
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- beta_1 * x1 + beta_2 * x2 + rnorm(n, sd = 1)
  data.frame(y = y, x1 = x1, x2 = x2)
}
dat <- create_data()
fit_full <- lm(y ~ ., data = dat)
summary(fit_full)

##  Call:
##  lm(formula = y ~ ., data = dat)
## 
##  Residuals:
##       Min       1Q   Median       3Q      Max 
##  -3.02487 -0.65281 -0.00333  0.66791  2.91865 
## 
##  Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
##  (Intercept) -0.04601    0.03192  -1.441    0.150  
##  x1           0.03745    0.03081   1.216    0.224  
##  x2           0.05338    0.03175   1.681    0.093 .
##  ---
##  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##  
##  Residual standard error: 1.009 on 997 degrees of freedom
##  Multiple R-squared:  0.004443,  Adjusted R-squared:  0.002446 
##  F-statistic: 2.225 on 2 and 997 DF,  p-value: 0.1086

I want to generate a data set for which the overall p-value (the F-test p-value, 0.1086 in the example above) is close to 0.05. Since this is for a simulation, I want a nominal significance level of 0.05 in the long run.
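To make the target quantity concrete, here is a minimal sketch of how the overall p-value could be extracted from a fit and tracked over repeated data sets. The helper names `overall_p` and `sim_overall_p` are hypothetical, as are the sample size of 200 and the 500 replications. Reading "nominal significance in the long run" as the rejection rate when both coefficients are zero is an assumption, not something settled above.

```r
# Hypothetical helper: recompute the overall F-test p-value that
# summary() prints on its last line, from the stored F statistic.
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# Hypothetical generator mirroring create_data(), but without a fixed
# seed inside, so that repeated calls give different data sets.
sim_overall_p <- function(beta_1 = 0, beta_2 = 0, n = 200) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- beta_1 * x1 + beta_2 * x2 + rnorm(n, sd = 1)
  overall_p(lm(y ~ x1 + x2))
}

set.seed(2020)
# Assumption: both coefficients are zero, so the F-test null holds and
# the p-values should be roughly uniform on [0, 1].
p_null <- replicate(500, sim_overall_p())
reject_rate <- mean(p_null < 0.05)  # share of data sets with p < 0.05
reject_rate
```

With nonzero coefficients the same loop would estimate power rather than the significance level, so the two cases should be kept apart in any simulation built on this.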
