tl;dr: Starting with a dataset generated under the null, I resampled cases with replacement and conducted a hypothesis test on each resampled dataset. These hypothesis tests reject the null more than 5% of the time.
In the very simple simulation below, I generate datasets with $X \sim N(0,1)$ and $Y \sim N(0,1)$ independent of each other, and I fit a simple OLS model to each. Then, for each dataset, I generate 1000 new datasets by resampling rows of the original dataset with replacement (an algorithm specifically described in Davison & Hinkley's classic text as being appropriate for linear regression). For each of those, I fit the same OLS model. Ultimately, about 16% of the hypothesis tests within the bootstrap samples reject the null, whereas we should get 5% (as we do in the original datasets).
I suspected the repeated observations were causing inflated associations, so for comparison I tried two other approaches in the code below (commented out). In Method 2, I fix $X$, then replace $Y$ with resampled residuals from the OLS model on the original dataset. In Method 3, I draw a random subsample without replacement. Both of these alternatives work, i.e., their hypothesis tests reject the null 5% of the time.
My question: Am I right that repeated observations are the culprit? If so, given that this is a standard approach to bootstrapping, where exactly are we violating standard bootstrap theory?
Update #1: More simulations
I tried an even simpler scenario, an intercept-only regression model for $Y$, and the same problem occurs. (A minimal sketch of that variant appears after the main code below.)
# note: simulation takes 5-10 min on my laptop; can reduce boot.reps
# and n.sims.run if wanted
# set the number of cores: can change this to match your machine
library(doParallel)
registerDoParallel(cores=8)
boot.reps = 1000
n.sims.run = 1000
for ( j in 1:n.sims.run ) {

  # make initial dataset from which to bootstrap
  # generate under null
  d = data.frame( X1 = rnorm( n = 1000 ), Y1 = rnorm( n = 1000 ) )

  # fit OLS to original data
  mod.orig = lm( Y1 ~ X1, data = d )
  bhat = coef( mod.orig )[["X1"]]
  se = coef( summary(mod.orig) )["X1", 2]
  rej = coef( summary(mod.orig) )["X1", 4] < 0.05

  # run all bootstrap iterates
  parallel.time = system.time( {

    r = foreach( icount( boot.reps ), .combine = rbind ) %dopar% {

      # Algorithm 6.2: Resample entire cases - FAILS
      # residuals of this model are repeated, so not normal?
      ids = sample( 1:nrow(d), replace = TRUE )
      b = d[ ids, ]

      # # Method 2: Resample just the residuals themselves - WORKS
      # b = data.frame( X1 = d$X1, Y1 = sample( mod.orig$residuals, replace = TRUE ) )

      # # Method 3: Subsampling without replacement - WORKS
      # ids = sample( 1:nrow(d), size = 500, replace = FALSE )
      # b = d[ ids, ]

      # save stats from bootstrap sample
      mod = lm( Y1 ~ X1, data = b )
      data.frame( bhat = coef( mod )[["X1"]],
                  se = coef( summary(mod) )["X1", 2],
                  rej = coef( summary(mod) )["X1", 4] < 0.05 )
    }

  } )[3]
  ##### Results for This Simulation Rep #####
  r = data.frame(r)
  names(r) = c( "bhat.bt", "se.bt", "rej.bt" )

  # return results of each bootstrap iterate
  new.rows = data.frame( bt.iterate = 1:boot.reps,
                         bhat.bt = r$bhat.bt,
                         se.bt = r$se.bt,
                         rej.bt = r$rej.bt )

  # along with results from original sample
  new.rows$bhat = bhat
  new.rows$se = se
  new.rows$rej = rej

  # simulation rep counter
  new.rows$sim.rep = j

  # add rows to output file
  if ( j == 1 ) res = new.rows else res = rbind( res, new.rows )
  # res should have boot.reps rows per "j" in the for-loop

} # end loop over j simulation reps
##### Analyze results #####
# dataset with only one row per simulation
s = res[ res$bt.iterate == 1, ]
# prob of rejecting within each resample
# should be 0.05
mean(res$rej.bt); mean(s$rej)
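For reference, here is a minimal, serial sketch of the intercept-only variant from Update #1 (reduced reps for speed; not the exact code used for the timing note above):

# minimal sketch of the intercept-only check: resample cases, then test H0: E(Y) = 0
set.seed(451)
n.sims.small = 200
boot.reps.small = 200
rej.bt = matrix( NA, nrow = n.sims.small, ncol = boot.reps.small )

for ( j in 1:n.sims.small ) {
  d = data.frame( Y1 = rnorm( n = 1000 ) )  # generate under null
  for ( i in 1:boot.reps.small ) {
    # resample cases with replacement, then test the intercept in the resample
    b = d[ sample( 1:nrow(d), replace = TRUE ), , drop = FALSE ]
    rej.bt[j, i] = coef( summary( lm( Y1 ~ 1, data = b ) ) )["(Intercept)", 4] < 0.05
  }
}

mean( rej.bt )  # rejection rate within resamples; inflated above 0.05, as reported in Update #1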
Update #2: The answer
Several possibilities were proposed in the comments and answers, and I ran more simulations to test them empirically. It turns out JWalker is correct: the problem is that we needed to center the bootstrap statistics by the original data's estimate in order to get the correct sampling distribution under $H_0$. I also think whuber's comment about violating the parametric test assumptions is correct, although in this case we do get nominal false positive rates once JWalker's problem is fixed.
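To make the centering concrete: under case resampling, the bootstrap distribution of $\hat{\beta}^* - \hat{\beta}$ approximates the sampling distribution of $\hat{\beta} - \beta$, so the resampled estimates are centered at $\hat{\beta}$ rather than at 0, and running the parametric test of $H_0: \beta = 0$ within each resample rejects too often. Below is a minimal sketch (for a single dataset, not the full simulation above) of a centered bootstrap-$t$ test that uses $t^* = (\hat{\beta}^* - \hat{\beta})/\widehat{se}^*$ as the reference distribution:

set.seed(451)
d = data.frame( X1 = rnorm( n = 1000 ), Y1 = rnorm( n = 1000 ) )  # generated under null
mod.orig = lm( Y1 ~ X1, data = d )
bhat = coef( mod.orig )[["X1"]]
se = coef( summary(mod.orig) )["X1", 2]
t.orig = bhat / se

# bootstrap distribution of the *centered* statistic t* = (bhat* - bhat) / se*
t.star = replicate( 1000, {
  b = d[ sample( 1:nrow(d), replace = TRUE ), ]
  m = lm( Y1 ~ X1, data = b )
  ( coef(m)[["X1"]] - bhat ) / coef( summary(m) )["X1", 2]
} )

# two-sided bootstrap p-value for H0: beta = 0
p.boot = mean( abs(t.star) >= abs(t.orig) )
p.boot

Repeating this over many generated datasets should reject at about the nominal 5% rate, consistent with JWalker's point.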