tl;dr: Starting with a dataset generated under the null, I resampled cases with replacement and conducted a hypothesis test on each resampled dataset. These hypothesis tests reject the null more than 5% of the time.
In the very simple simulation below, I generate datasets with $X \sim N(0,1)$ and $Y \sim N(0,1)$ independent of each other, and I fit a simple OLS model to each. Then, for each dataset, I generate 1000 new datasets by resampling rows of the original dataset with replacement (an algorithm specifically described in Davison & Hinkley's classic text as being appropriate for linear regression). For each of those, I fit the same OLS model. Ultimately, about 16% of the hypothesis tests within the bootstrap samples reject the null, whereas we should get 5% (as we do in the original datasets).
I suspected the repeated observations were causing inflated associations, so for comparison I tried two other approaches in the code below (commented out). In Method 2, I fix $X$, then replace $Y$ with resampled residuals from the OLS model on the original dataset. In Method 3, I draw a random subsample without replacement. Both of these alternatives work, i.e., their hypothesis tests reject the null 5% of the time.
My question: Am I right that repeated observations are the culprit? If so, given that this is a standard approach to bootstrapping, where exactly are we violating standard bootstrap theory?
Update #1: More simulations
I tried an even simpler scenario, an intercept-only regression model for $Y$, and the same problem occurs. (A minimal sketch of that variant appears after the main code below.)
# note: simulation takes 5-10 min on my laptop; can reduce boot.reps
# and n.sims.run if wanted
# set the number of cores: can change this to match your machine
library(doParallel)
registerDoParallel(cores=8)
boot.reps = 1000
n.sims.run = 1000
for ( j in 1:n.sims.run ) {

  # make initial dataset from which to bootstrap
  # generate under null
  d = data.frame( X1 = rnorm( n = 1000 ), Y1 = rnorm( n = 1000 ) )

  # fit OLS to original data
  mod.orig = lm( Y1 ~ X1, data = d )
  bhat = coef( mod.orig )[["X1"]]
  se = coef( summary(mod.orig) )["X1", 2]
  rej = coef( summary(mod.orig) )["X1", 4] < 0.05

  # run all bootstrap iterates
  parallel.time = system.time( {

    r = foreach( icount( boot.reps ), .combine = rbind ) %dopar% {

      # Algorithm 6.2: Resample entire cases - FAILS
      # residuals of this model are repeated, so not normal?
      ids = sample( 1:nrow(d), replace = TRUE )
      b = d[ ids, ]

      # # Method 2: Resample just the residuals themselves - WORKS
      # b = data.frame( X1 = d$X1, Y1 = sample( mod.orig$residuals, replace = TRUE ) )

      # # Method 3: Subsampling without replacement - WORKS
      # ids = sample( 1:nrow(d), size = 500, replace = FALSE )
      # b = d[ ids, ]

      # save stats from bootstrap sample
      mod = lm( Y1 ~ X1, data = b )
      data.frame( bhat = coef( mod )[["X1"]],
                  se = coef( summary(mod) )["X1", 2],
                  rej = coef( summary(mod) )["X1", 4] < 0.05 )
    }

  } )[3]
  ##### Results for This Simulation Rep #####
  r = data.frame(r)
  names(r) = c( "bhat.bt", "se.bt", "rej.bt" )

  # return results of each bootstrap iterate
  new.rows = data.frame( bt.iterate = 1:boot.reps,
                         bhat.bt = r$bhat.bt,
                         se.bt = r$se.bt,
                         rej.bt = r$rej.bt )

  # along with results from original sample
  new.rows$bhat = bhat
  new.rows$se = se
  new.rows$rej = rej

  # simulation rep counter
  new.rows$sim.rep = j

  # add rows to output file
  if ( j == 1 ) res = new.rows else res = rbind( res, new.rows )
  # res should have boot.reps rows per "j" in the for-loop

} # end loop over j simulation reps
##### Analyze results #####
# dataset with only one row per simulation
s = res[ res$bt.iterate == 1, ]
# prob of rejecting within each resample
# should be 0.05
mean(res$rej.bt); mean(s$rej)
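For reference, here is a minimal, serial sketch of the intercept-only variant from Update #1 (reduced reps for speed; not the exact code used for the timing note above):

# minimal sketch of the intercept-only check: resample cases, then test H0: E(Y) = 0
set.seed(451)
n.sims.small = 200
boot.reps.small = 200
rej.bt = matrix( NA, nrow = n.sims.small, ncol = boot.reps.small )

for ( j in 1:n.sims.small ) {
  d = data.frame( Y1 = rnorm( n = 1000 ) )  # generate under null
  for ( i in 1:boot.reps.small ) {
    # resample cases with replacement, then test the intercept in the resample
    b = d[ sample( 1:nrow(d), replace = TRUE ), , drop = FALSE ]
    rej.bt[j, i] = coef( summary( lm( Y1 ~ 1, data = b ) ) )["(Intercept)", 4] < 0.05
  }
}

mean( rej.bt )  # rejection rate within resamples; inflated above 0.05, as reported in Update #1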
Update #2: The answer
Several possibilities were proposed in the comments and answers, and I ran more simulations to test them empirically. It turns out JWalker is correct: the problem is that we needed to center the bootstrap statistics by the original data's estimate in order to get the correct sampling distribution under $H_0$. I also think whuber's comment about violating the parametric test assumptions is correct, although in this case we do get nominal false positive rates once JWalker's problem is fixed.
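To make the centering concrete: under case resampling, the bootstrap distribution of $\hat{\beta}^* - \hat{\beta}$ approximates the sampling distribution of $\hat{\beta} - \beta$, so the resampled estimates are centered at $\hat{\beta}$ rather than at 0, and running the parametric test of $H_0: \beta = 0$ within each resample rejects too often. Below is a minimal sketch (for a single dataset, not the full simulation above) of a centered bootstrap-$t$ test that uses $t^* = (\hat{\beta}^* - \hat{\beta})/\widehat{se}^*$ as the reference distribution:

set.seed(451)
d = data.frame( X1 = rnorm( n = 1000 ), Y1 = rnorm( n = 1000 ) )  # generated under null
mod.orig = lm( Y1 ~ X1, data = d )
bhat = coef( mod.orig )[["X1"]]
se = coef( summary(mod.orig) )["X1", 2]
t.orig = bhat / se

# bootstrap distribution of the *centered* statistic t* = (bhat* - bhat) / se*
t.star = replicate( 1000, {
  b = d[ sample( 1:nrow(d), replace = TRUE ), ]
  m = lm( Y1 ~ X1, data = b )
  ( coef(m)[["X1"]] - bhat ) / coef( summary(m) )["X1", 2]
} )

# two-sided bootstrap p-value for H0: beta = 0
p.boot = mean( abs(t.star) >= abs(t.orig) )
p.boot

Repeating this over many generated datasets should reject at about the nominal 5% rate, consistent with JWalker's point.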