I want to generate a data set with a pre-specified significance level. Let's say we have two covariates, x1 and x2, and an outcome variable y. We fit a linear regression model as follows:
create_data <- function(beta_1 = 0.01,
                        beta_2 = 0.04,
                        n = 1000,
                        seed = 2020) {
  set.seed(seed)
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- beta_1 * x1 + beta_2 * x2 + rnorm(n, sd = 1)
  data.frame(y = y, x1 = x1, x2 = x2)
}
dat <- create_data()
fit_full <- lm(y ~ ., data = dat)
summary(fit_full)
## Call:
## lm(formula = y ~ ., data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3.02487 -0.65281 -0.00333  0.66791  2.91865
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04601    0.03192  -1.441    0.150
## x1           0.03745    0.03081   1.216    0.224
## x2           0.05338    0.03175   1.681    0.093 .
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.009 on 997 degrees of freedom
## Multiple R-squared: 0.004443, Adjusted R-squared: 0.002446
## F-statistic: 2.225 on 2 and 997 DF, p-value: 0.1086
I want to generate a data set for which the overall p-value (the p-value of the F-statistic) is close to 0.05; in the example above it is 0.1086. This is needed for a simulation, so I want a nominal significance level of 0.05 in the long run.
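To make the "long run" part concrete, here is a minimal sketch of how the overall p-value could be pulled out of each fit and tracked across many seeds. It rests on one assumption about what I mean: that "nominal significance of 0.05" is the rejection rate of the overall F-test when both slopes are truly zero. The helper name `overall_p` and the choice of 500 replications are hypothetical, and `create_data()` is repeated only so the snippet runs on its own.

```r
# Same data-generating function as above, repeated so this sketch is self-contained.
create_data <- function(beta_1 = 0.01, beta_2 = 0.04, n = 1000, seed = 2020) {
  set.seed(seed)
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- beta_1 * x1 + beta_2 * x2 + rnorm(n, sd = 1)
  data.frame(y = y, x1 = x1, x2 = x2)
}

# Hypothetical helper: overall F-test p-value of an lm fit.
# summary(fit)$fstatistic is a named vector with elements value, numdf, dendf.
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# Assumption: the long-run target is the rejection rate under the null,
# so both slopes are set to zero and only the seed varies.
n_rep <- 500
p_null <- vapply(seq_len(n_rep), function(s) {
  dat_s <- create_data(beta_1 = 0, beta_2 = 0, seed = s)
  overall_p(lm(y ~ ., data = dat_s))
}, numeric(1))

# Share of replications with overall p-value below 0.05.
rejection_rate <- mean(p_null < 0.05)
rejection_rate
```

With nonzero `beta_1` and `beta_2` the same loop estimates power rather than the type I error rate, so the rejection rate would no longer be expected to sit near 0.05.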