
I am applying a linear model to my data: $$ y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}, \quad\epsilon_{i} \sim N(0,\sigma^{2}). $$

I would like to estimate confidence intervals (CIs) for the coefficients ($\beta_{0}$, $\beta_{1}$) using the bootstrap. There are two ways I can apply the bootstrap method:

  1. Sample paired response-predictor: Randomly resample pairs $(x_{i}, y_{i})$ with replacement, and fit the linear regression to each resample. After $m$ runs, we obtain a collection of estimated coefficients $\{\hat{\beta}_{j}\}, j=1,\ldots,m$. Finally, compute the quantiles of $\{\hat{\beta}_{j}\}$.

  2. Sample residuals: First fit the linear regression on the original observed data; from this fit we obtain the coefficient estimates $\hat{\beta}_{0}, \hat{\beta}_{1}$ and the residuals $\hat{\epsilon}_{i}$. Then randomly resample the residuals with replacement to get $\epsilon^{*}_{i}$ and construct new responses $y^{*}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}+\epsilon^{*}_{i}$, keeping the $x_{i}$ fixed. Apply linear regression once again. After $m$ runs, we obtain a collection of estimated coefficients $\{\hat{\beta}_{j}\}, j=1,\ldots,m$. Finally, compute the quantiles of $\{\hat{\beta}_{j}\}$. (A minimal R sketch of both schemes follows below.)
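
For concreteness, here is a minimal R sketch of both schemes on simulated data (the toy data-generating step and all variable names are illustrative, not part of my actual analysis):

```r
set.seed(1)
n <- 50; m <- 2000
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)                 # toy data following the model above

## Scheme 1: resample (x, y) pairs with replacement
beta_pairs <- t(replicate(m, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))
}))

## Scheme 2: resample residuals, keeping the x_i fixed
fit   <- lm(y ~ x)
res   <- residuals(fit)
y_hat <- fitted(fit)
beta_resid <- t(replicate(m, {
  y_star <- y_hat + sample(res, replace = TRUE)  # new responses from resampled residuals
  coef(lm(y_star ~ x))
}))

## Percentile CIs for the slope under each scheme
quantile(beta_pairs[, 2], c(0.025, 0.975))
quantile(beta_resid[, 2], c(0.025, 0.975))
```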

My questions are:

  • How are these two methods different?
  • Under which assumptions do these two methods give the same result?
tiantianchen
  • I would personally not use either as the default approach but instead would recommend the basic bootstrap confidence interval. See p. 8 of www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf . I've been doing a lot of simulations for the binary logistic model and have seen better confidence interval coverage using the basic bootstrap than using the percentile or BCa bootstrap. – Frank Harrell Jul 19 '13 at 11:26
  • @FrankHarrell to be clear, by "basic" you are referring to the non-parametric bootstrap? – ndoogan Jul 19 '13 at 13:08
  • It is unclear what method (1) is. If you *only* resample $y_i$ or $x_i$, then exactly what set of paired values will you be regressing at each iteration? You can't possibly mean that you take an ordered sample (with replacement) $y_{s(1)}, y_{s(2)}, \ldots, y_{s(n)}$ and regress the pairs $(x_i, y_{s(i)})$, but that's what it reads like. – whuber Jul 19 '13 at 13:49
  • @whuber, thanks for the comments, I have edited my question to make it clear. Resampling either $y_{i}$ or $x_{i}$ is possibly a way to obtain null distributions, providing a way to compute p-values, rather than CI of the coefficients. – tiantianchen Jul 19 '13 at 14:17
  • Thanks. That edit might change the interpretation of @Frank Harrell's comment, because now (1) looks like the "basic bootstrap CI." I would just like to point out that (1) does *not* correspond to your model, though: by resampling the $x$'s as well as the $y$'s, you are treating the $x_i$ as *random*, whereas your model fixes them. – whuber Jul 19 '13 at 14:26
  • (1) is the bootstrap percentile nonparametric confidence interval, not the basic bootstrap. Note that sampling from $(x,y)$ is the unconditional bootstrap, which is more assumption-free than the conditional bootstrap that resamples residuals. – Frank Harrell Jul 19 '13 at 14:59
  • I'm really not an expert, but as far as I understand it, 1) is often called "case-resampling" whereas 2) is called "residual resampling" or "fixed-$x$" resampling. The basic choice of the method doesn't imply the method of how to calculate the confidence intervals after the procedure. I got this info mainly from the [tutorial of John Fox](http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf). As far as I see it, after either bootstrap, you could calculate the basic bootstrap CIs (e.g. with `boot.ci(my.boot, type="basic")` in `R`). Or do I miss anything here? – COOLSerdash Jul 19 '13 at 16:23
  • Thanks for all the discussions. I agree with @COOLSerdash that my question was more related to the differences between "case-resampling" (or unconditional bootstrap) and "residual-resampling" (or conditional bootstrap). It's difficult for me to find any theory about this topic. Thanks @Frank Harrell for the detailed information about computing bootstrap confidence intervals. – tiantianchen Jul 19 '13 at 20:26
  • Since "case-resampling" (or unconditional bootstrap) is more assumption-free, may I apply it in case of more complex error structures, for instance clustered samples or data with correlated errors? I have a feeling that by resampling paired x-y, the error structure is preserved, even though the bootstrap assumes simple random sampling. – tiantianchen Jul 19 '13 at 20:29
  • Yes, you can do cluster bootstrapping. This is implemented in the R `rms` `validate` and `calibrate` functions. – Frank Harrell Jul 19 '13 at 20:38
  • I don't think anyone has pointed out that the first method violates the assumption that X's are deterministic. – Dole May 26 '17 at 22:43

1 Answer


If the response-predictor pairs have been obtained from a population by random sampling, it is safe to use the case/random-x/your-first resampling scheme. If the predictors were controlled, i.e., their values were set by the experimenter, you may consider using the residual/model-based/fixed-x/your-second resampling scheme.

How do the two differ? "An introduction to the bootstrap with applications in R" by Davison and Kuonen has a discussion pertinent to this question (see p. 9). See also the R code in this appendix by John Fox, particularly the functions boot.huber on p. 5 for the random-x scheme and boot.huber.fixed on p. 10 for the fixed-x scheme. While in the lecture notes by Shalizi the two schemes are applied to different datasets/problems, Fox's appendix illustrates how little difference the two schemes may often make.
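
To see the two schemes side by side with the boot package, here is a self-contained sketch on toy data (the data and statistic functions are mine, written in the spirit of, but not identical to, Fox's boot.huber code):

```r
library(boot)

set.seed(1)
dat <- data.frame(x = runif(50))
dat$y <- 2 + 3 * dat$x + rnorm(50)        # illustrative data, not from the question

## Random-x (case) resampling: resample rows of the data frame
stat_cases <- function(data, i) coef(lm(y ~ x, data = data[i, ]))
b_cases <- boot(dat, stat_cases, R = 1999)

## Fixed-x (residual) resampling: x stays as observed, residuals are resampled
fit   <- lm(y ~ x, data = dat)
res   <- residuals(fit)
y_hat <- fitted(fit)
stat_resid <- function(data, i) {
  data$y <- y_hat + res[i]                # rebuild responses from resampled residuals
  coef(lm(y ~ x, data = data))
}
b_resid <- boot(dat, stat_resid, R = 1999)

## Basic and percentile CIs for the slope (index = 2) under each scheme
boot.ci(b_cases, index = 2, type = c("basic", "perc"))
boot.ci(b_resid, index = 2, type = c("basic", "perc"))
```

On well-behaved data like this, the two sets of intervals come out very close, which is the point made in Fox's appendix.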

When can the two be expected to deliver near-identical results? One situation is when the regression model is correctly specified, e.g., when there is no unmodelled nonlinearity and the usual regression assumptions (e.g., iid errors, no outliers) are satisfied. See chapter 21 of Fox's book (to which the aforementioned appendix with the R code belongs), particularly the discussion on page 598 and exercise 21.3, entitled "Random versus fixed resampling in regression". To quote from the book:

By randomly reattaching resampled residuals to fitted values, the [fixed-x/model-based] procedure implicitly assumes that the errors are identically distributed. If, for example, the true errors have non-constant variance, then this property will not be reflected in the resampled residuals. Likewise, the unique impact of a high-leverage outlier will be lost to the resampling.

You will also learn from that discussion why the fixed-x bootstrap implicitly assumes that the functional form of the model is correct (even though no assumption is made about the shape of the error distribution).
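
A quick toy simulation (my own construction, not from the sources cited) illustrates the quoted point about non-constant variance: when the error variance grows with $x$, the fixed-x scheme tends to understate the spread of the slope estimate, while case resampling reflects it.

```r
set.seed(2)
n <- 100; m <- 2000
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 2 * x)     # heteroscedastic errors: sd grows with x

## Case (random-x) resampling
slope_cases <- replicate(m, {
  i <- sample(n, replace = TRUE)
  coef(lm(y[i] ~ x[i]))[2]
})

## Residual (fixed-x) resampling
fit <- lm(y ~ x)
slope_resid <- replicate(m, {
  y_star <- fitted(fit) + sample(residuals(fit), replace = TRUE)
  coef(lm(y_star ~ x))[2]
})

sd(slope_cases)  # reflects the non-constant error variance
sd(slope_resid)  # typically smaller here: resampled residuals are treated as exchangeable
```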

See also slide 12 of this talk to the Society of Actuaries in Ireland by Derek Bain. It also has an illustration of what should be considered "the same result":

The approach of re-sampling cases to generate pseudo data is the more usual form of bootstrapping. The approach is robust in that if an incorrect model is fitted, an appropriate measure of parameter uncertainty is still obtained. However, resampling residuals is more efficient if the correct model has been fitted.

The graphs show both approaches in estimating the variance of a 26 point data sample mean and a 52 point sample mean. In the larger sample the two approaches are equivalent.
Hibernating