
Suppose I have a set of sample data from an unknown or complex distribution, and I want to perform some inference on a statistic $T$ of the data. My default inclination is to just generate a bunch of bootstrap samples with replacement, and calculate my statistic $T$ on each bootstrap sample to create an estimated distribution for $T$.

What are examples where this is a bad idea?

For example, one case where naively performing this bootstrap would fail is if I'm trying to use the bootstrap on time series data (say, to test whether I have significant autocorrelation). The naive bootstrap described above (generating the $i$th datapoint of the $n$th bootstrap sample series by sampling with replacement from my original series) would (I think) be ill-advised, since it ignores the structure in my original time series, and so we get fancier bootstrap techniques like the block bootstrap.
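
(To make the contrast concrete, here's a minimal NumPy sketch of my understanding of the moving-block variant; the helper name `moving_block_bootstrap`, the AR(1) toy series, and the block length of 25 are all just illustrative choices of mine, not anything canonical.)

```python
import numpy as np

rng = np.random.default_rng(0)

def moving_block_bootstrap(x, block_len, rng):
    """One moving-block bootstrap replicate: glue together randomly chosen
    contiguous blocks of the original series until it is n long."""
    n = len(x)
    n_blocks = -(-n // block_len)  # ceil(n / block_len)
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    return np.concatenate([x[s:s + block_len] for s in starts])[:n]

# Toy series: AR(1) with coefficient 0.5, so a real lag-1 autocorrelation.
n = 500
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()

def lag1_acf(x):
    xc = x - x.mean()
    return (xc[:-1] @ xc[1:]) / (xc @ xc)

reps = [lag1_acf(moving_block_bootstrap(x, block_len=25, rng=rng))
        for _ in range(2000)]
print("sample lag-1 autocorrelation:", lag1_acf(x))
print("block-bootstrap 95% interval:", np.percentile(reps, [2.5, 97.5]))
```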

To put it another way, what is there to the bootstrap besides "sampling with replacement"?

cardinal
raegtin
  • If you want to do inference for the mean of i.i.d. data, the bootstrap is a great tool. Everything else is questionable and requires case-by-case proof of weak convergence. – StasK Apr 22 '15 at 13:45

3 Answers


If the quantity of interest, usually a functional of a distribution, is reasonably smooth and your data are i.i.d., you're usually in pretty safe territory. Of course, there are other circumstances when the bootstrap will work as well.

What it means for the bootstrap to "fail"

Broadly speaking, the purpose of the bootstrap is to construct an approximate sampling distribution for the statistic of interest. It's not about actual estimation of the parameter. So, if the statistic of interest (under some rescaling and centering) is $\newcommand{\Xhat}{\hat{X}_n}\Xhat$ and $\Xhat \to X_\infty$ in distribution, we'd like our bootstrap distribution to converge to the distribution of $X_\infty$. If we don't have this, then we can't trust the inferences made.

The canonical example of when the bootstrap can fail, even in an i.i.d. framework, is when trying to approximate the sampling distribution of an extreme order statistic. Below is a brief discussion.

Maximum order statistic of a random sample from a $\mathcal{U}[0,\theta]$ distribution

Let $X_1, X_2, \ldots$ be a sequence of i.i.d. uniform random variables on $[0,\theta]$. Let $\newcommand{\Xmax}{X_{(n)}} \Xmax = \max_{1\leq k \leq n} X_k$. The distribution of $\Xmax$ is $$ \renewcommand{\Pr}{\mathbb{P}}\Pr(\Xmax \leq x) = (x/\theta)^n \>. $$ (Note that by a very simple argument, this actually also shows that $\Xmax \to \theta$ in probability, and even, almost surely, if the random variables are all defined on the same space.)

An elementary calculation yields $$ \Pr( n(\theta - \Xmax) \leq x ) = 1 - \Big(1 - \frac{x}{\theta n}\Big)^n \to 1 - e^{-x/\theta} \>, $$ or, in other words, $n(\theta - \Xmax)$ converges in distribution to an exponential random variable with mean $\theta$.

Now, we form a (naive) bootstrap estimate of the distribution of $n(\theta - \Xmax)$ by resampling $X_1, \ldots, X_n$ with replacement to get $X_1^\star,\ldots,X_n^\star$ and using the distribution of $n(\Xmax - \Xmax^\star)$ conditional on $X_1,\ldots,X_n$.

But, observe that $\Xmax^\star = \Xmax$ with probability $1 - (1-1/n)^n \to 1 - e^{-1}$, and so the bootstrap distribution has a point mass at zero even asymptotically despite the fact that the actual limiting distribution is continuous.

More explicitly, though the true limiting distribution is exponential with mean $\theta$, the limiting bootstrap distribution places a point mass at zero of size $1 - e^{-1} \approx 0.632$, independent of the actual value of $\theta$. By taking $\theta$ sufficiently large, we can make the probability that the true limiting distribution assigns to any fixed interval $[0,\varepsilon)$ arbitrarily small, yet the bootstrap will (still!) report that there is at least probability 0.632 in this interval! From this it should be clear that the bootstrap can behave arbitrarily badly in this setting.
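
A short simulation sketch makes the point mass visible (NumPy; the sample size, replicate count, and seed are arbitrary choices, not part of the argument above):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_boot = 1.0, 1000, 10_000

x = rng.uniform(0, theta, size=n)
x_max = x.max()

# Bootstrap distribution of n * (X_(n) - X*_(n)), conditional on the sample.
boot = np.empty(n_boot)
for b in range(n_boot):
    x_star = rng.choice(x, size=n, replace=True)
    boot[b] = n * (x_max - x_star.max())

# The true limit is Exponential(theta): continuous, with no atom at zero.
# The bootstrap instead puts mass 1 - (1 - 1/n)^n -> 1 - 1/e at exactly zero.
print("bootstrap mass at zero:", np.mean(boot == 0.0))  # ~ 0.632
```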

In summary, the bootstrap fails (miserably) in this case. Things tend to go wrong when dealing with parameters at the edge of the parameter space.

An example from a sample of normal random variables

There are other similar examples of the failure of the bootstrap in surprisingly simple circumstances.

Consider a sample $X_1, X_2, \ldots$ from $\mathcal{N}(\mu,1)$ where the parameter space for $\mu$ is restricted to $[0,\infty)$. The MLE in this case is $\newcommand{\Xbar}{\bar{X}}\Xhat = \max(\Xbar,0)$, and we form the bootstrap estimate $\Xhat^\star = \max(\Xbar^\star, 0)$. Again, it can be shown that the distribution of $\sqrt{n}(\Xhat^\star - \Xhat)$ (conditional on the observed sample) does not converge to the same limiting distribution as $\sqrt{n}(\Xhat - \mu)$; the trouble arises at the boundary point $\mu = 0$.
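
Here is a quick simulation sketch of the boundary case $\mu = 0$ (NumPy; `boot_mass_at_zero` is a made-up helper name, and the sizes are arbitrary). The true limit $\max(Z, 0)$ has mass $1/2$ at zero, while the bootstrap's mass at zero fluctuates from sample to sample rather than converging:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_boot = 2000, 5000

def boot_mass_at_zero(x, rng):
    """P*( sqrt(n) (mu_hat* - mu_hat) = 0 ), conditional on the sample x."""
    mu_hat = max(x.mean(), 0.0)
    t_star = np.empty(n_boot)
    for b in range(n_boot):
        xs = rng.choice(x, size=n, replace=True)
        t_star[b] = np.sqrt(n) * (max(xs.mean(), 0.0) - mu_hat)
    return np.mean(t_star == 0.0)

# At mu = 0, sqrt(n)(mu_hat - mu) -> max(Z, 0), which has mass 1/2 at zero.
# The bootstrap's mass at zero does not settle near 1/2: it fluctuates with
# the sample (near 0 when x.mean() > 0, above 1/2 when x.mean() <= 0).
for _ in range(4):
    x = rng.normal(0.0, 1.0, size=n)  # true mu = 0: the boundary case
    print(boot_mass_at_zero(x, rng))
```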

Exchangeable arrays

Perhaps one of the most dramatic examples is for an exchangeable array. Let $\newcommand{\bm}[1]{\mathbf{#1}}\bm{Y} = (Y_{ij})$ be an array of random variables such that, for every pair of permutation matrices $\bm{P}$ and $\bm{Q}$, the arrays $\bm{Y}$ and $\bm{P} \bm{Y} \bm{Q}$ have the same joint distribution. That is, permuting rows and columns of $\bm{Y}$ keeps the distribution invariant. (You can think of a two-way random effects model with one observation per cell as an example, though the model is much more general.)

Suppose we wish to construct a confidence interval for the mean $\mu = \mathbb{E}(Y_{ij}) = \mathbb{E}(Y_{11})$ (due to the exchangeability assumption described above, the means of all the cells must be the same).

McCullagh (2000) considered two different natural (i.e., naive) ways of bootstrapping such an array. Neither of them gets the asymptotic variance of the sample mean correct. He also considers some examples of a one-way exchangeable array and linear regression.
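
To give a sense of the scale of the failure, here is a small simulation sketch of my own (it uses the most naive scheme of all, resampling the $rc$ cells i.i.d., rather than either of McCullagh's two row/column schemes; all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
r, c = 30, 30                # rows x columns of the array
sa, sb, se = 1.0, 1.0, 1.0   # sd of row, column, and cell-level effects

# Two-way random effects array: Y_ij = a_i + b_j + e_ij (so mu = 0).
a = rng.normal(0, sa, size=(r, 1))
b = rng.normal(0, sb, size=(1, c))
y = (a + b + rng.normal(0, se, size=(r, c))).ravel()

# True variance of the grand mean: sa^2/r + sb^2/c + se^2/(r c) ~ 0.068.
print("true var of grand mean:", sa**2 / r + sb**2 / c + se**2 / (r * c))

# The naive bootstrap treats the r*c cells as i.i.d.; its variance estimate
# is roughly (sa^2 + sb^2 + se^2)/(r c) ~ 0.003, about 20x too small here.
boot_means = np.array([rng.choice(y, size=y.size, replace=True).mean()
                       for _ in range(5000)])
print("naive bootstrap variance:", boot_means.var())
```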

References

Unfortunately, the subject matter is nontrivial, so none of these are particularly easy reads.

P. Bickel and D. Freedman, Some asymptotic theory for the bootstrap. Ann. Stat., vol. 9, no. 6 (1981), 1196–1217.

D. W. K. Andrews, Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space, Econometrica, vol. 68, no. 2 (2000), 399–405.

P. McCullagh, Resampling and exchangeable arrays, Bernoulli, vol. 6, no. 2 (2000), 285–301.

E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, 3rd. ed., Springer (2005). [Chapter 15: General Large Sample Methods]

cardinal
  • The behaviour of the order statistics bootstrap seems reasonable to me, given that the exponential distribution has a similar "point mass" at zero - the mode of an exponential distribution is 0, so it seems reasonable that the probability should be non-zero at the most likely value! The bootstrap would probably be something more like a geometric distribution, which is a discrete analogue of the exponential. I wouldn't take this as a "failure" of the bootstrap here, since the estimated quantity $\theta$ always lies in the appropriate interval $\theta\geq X_{(n)}$ – probabilityislogic Apr 19 '11 at 06:31
  • @cardinal - the asymptotic distribution is not the appropriate benchmark - unless you have an infinite sample. The bootstrap distribution should be compared to the finite sample distribution that it was designed to approximate. What you want to show is that as the number of bootstrap iterations goes to infinity, the bootstrap distribution converges to the *finite sampling distribution*. Letting $n\to\infty$ is an approximate solution, not an exact one. – probabilityislogic Apr 19 '11 at 13:36
  • @cardinal +1, I upvoted the question earlier, but I just want to thank you for a very good answer, examples and links to the articles. – mpiktas Apr 19 '11 at 13:45
  • @probabilityislogic, of course, in general the application of asymptotic theory depends on the convergence rate; if it is slow, then it is not applicable. But you then have to demonstrate that the rate is slow, since I suspect that, for example, with a uniform distribution and sample size 100 you will encounter the problems @cardinal outlined. – mpiktas Apr 19 '11 at 13:48
  • @probabilityislogic, it's easy to see that your convergence statement is **never** true (modulo, perhaps, some trivial scenarios). This is because there are "only" $n^n$ possible bootstrap samples that could be drawn (and that's even considering the ordering; many statistics of interest are permutation invariant). So fixing $n$ and taking the number of bootstrap iterations $B\to\infty$ is not a useful sense of consistency. – cardinal Apr 19 '11 at 13:51
  • @cardinal - using your logic I could reject the use of any continuous distribution because we only have a finite number of observations. Perhaps we would both be helped by you producing a numerical example demonstrating that the bootstrap "fails" in the case of the maximum - I honestly don't think it will. But happy to be proved wrong :) – probabilityislogic Apr 20 '11 at 00:01
  • I'll try to post an example in a day or so. But, I think you're likely misunderstanding. The logic is that the coverage implied by the bootstrap distribution is wrong, **even** asymptotically. That leaves little hope for it to work in finite samples. You might reread the second-to-last paragraph of that section, which should make it clear that the bootstrap can behave **arbitrarily badly** in terms of the approximation of the sampling distribution. – cardinal Apr 20 '11 at 01:56
  • @probabilityislogic, One final note: Hopefully it is clear that this is not *my* logic, but rather that of the collective statistical community with 30+ years of studying the topic. Cheers. – cardinal Apr 20 '11 at 01:57
  • @cardinal - the reason I say the logic is a bit odd is that you first say $X_{(n)}\to\theta$ in probability, but then say that it is "bad" if we asymptotically have a point mass at $X_{(n)}$ in the bootstrap. Doesn't that mean that we effectively have a point mass at the true value (or arbitrarily close to that value)? How is that "bad behavior"? Perhaps we are speaking of different things - you say its bad because the sampling distribution is "wrong" and I say "who cares because we get the correct value of $\theta$ asymptotically". – probabilityislogic Apr 20 '11 at 12:43
  • @cardinal - and by the same argument, in order to make the interval arbitrarily small you need to make the sample arbitrarily large - meaning that the convergence in probability of $X_{(n)}$ will get arbitrarily better – probabilityislogic Apr 20 '11 at 12:45
  • @probabilityislogic, No, the interval $[0,\varepsilon)$ stays fixed. The value of $\theta$ stays fixed, but, is large. The distribution of $n(\theta-X_{(n)})$ is nondegenerate. So, fix $\varepsilon > 0$ and $0 < \delta < 1$. Then, there exists $\theta \equiv \theta(\varepsilon,\delta)$ such that $1-\exp(-\varepsilon/\theta) < \delta$. Consider an i.i.d. sample from $\mathcal{U}[0,\theta]$. Then, the bootstrap distribution has mass $> 1-e^{-1}$ in the interval $[0,\varepsilon)$, but the true limiting distribution has mass $< \delta$. – cardinal Apr 20 '11 at 12:56
  • @probabilityislogic, at first, I only saw the latter of your two most recent comments. To address the former, you can see the first two sentences of the section above with heading "What it means for the bootstrap to 'fail'", where this is addressed explicitly. The bootstrap is not about estimating the parameter. We assume we have a good way to estimate the desired parameter (in this case, $X_{(n)}$ works fine). The bootstrap is about knowing something about the *distribution* of the parameter so that we can do inference. Here, the bootstrap gets the distribution (**very!**) wrong. – cardinal Apr 20 '11 at 13:06
  • ...and that should be "... *distribution* of the parameter *estimate*..." above. Sorry about dropping a word midstream. – cardinal Apr 20 '11 at 13:36
  • @cardinal I think a helpful reference is "When does the bootstrap work" by Mammen. In particular, if the statistic is linear in the sense that $T(X) = \sum_i g_n(X_i)$, the bootstrap works if and only if its asymptotic distribution is Gaussian. A normal limit also seems to be required for smooth statistics but formal arguments are more difficult. – orizon Apr 01 '16 at 06:05
  • @cardinal Is there any way to characterize the more common conditions where the bootstrap fails? For example, is there a characterization that would at least help put the cases you described in your answer into the "bootstrap will fail" category? – max May 20 '16 at 00:39
  • [Related discussion](https://stats.stackexchange.com/questions/96739/what-is-the-632-rule-in-bootstrapping). – ayorgo Aug 11 '19 at 13:31

The following book has a chapter (Ch.9) devoted to "When Bootstrapping Fails Along with Remedies for Failures":

M. R. Chernick, Bootstrap methods: A guide for practitioners and researchers, 2nd ed. Hoboken N.J.: Wiley-Interscience, 2008.

The topics are:

  1. Too Small of a Sample Size
  2. Distributions with Infinite Moments
  3. Estimating Extreme Values
  4. Survey Sampling
  5. Data Sequences that Are M-Dependent
  6. Unstable Autoregressive Processes
  7. Long-Range Dependence
Sadeghd
  • Have you seen [this comment](http://stats.stackexchange.com/questions/9664/what-are-examples-where-a-naive-bootstrap-fails#comment70554_9678) to an answer in this thread? Incidentally, that comment links to an Amazon page for Chernick's book; the reader reviews are enlightening. – whuber Dec 30 '13 at 15:58
  • @whuber Well, I didn't notice that comment. Should I remove my answer? – Sadeghd Dec 31 '13 at 15:21
  • Because your answer is more detailed than the reference in the comment, it potentially has value: but in keeping with SE policies and aims, it would be nice to see it amplified with some explanation of why you are recommending this book or--even better--to include a summary of the information in it. Otherwise it adds little and should be deleted or converted into a comment to the question. – whuber Dec 31 '13 at 15:24

The naive bootstrap depends on the sample size being large, so that the empirical CDF of the data is a good approximation to the "true" CDF. This ensures that sampling from the empirical CDF is very much like sampling from the "true" CDF. The extreme case is when you have only sampled one data point - bootstrapping achieves nothing here. The bootstrap becomes more and more useless as the sample approaches this degenerate case.
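
As a quick illustration (a NumPy sketch; the $\mathcal{U}[0,1]$ choice just makes the true CDF trivial to write down), the Kolmogorov distance between the empirical and true CDFs shrinks as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(3)

# Kolmogorov distance between the ECDF and the true U[0,1] CDF, F(x) = x.
# It shrinks as n grows, which is what makes resampling from the ECDF a
# reasonable stand-in for sampling from the true distribution.
for n in (1, 10, 100, 10_000):
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    d_n = max(np.max(i / n - x), np.max(x - (i - 1) / n))
    print(n, round(float(d_n), 4))
```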

Bootstrapping naively will not necessarily fail in time series analysis (although it may be inefficient) - if you model the series using basis functions of continuous time (such as Legendre polynomials) for a trend component, and sine and cosine functions of continuous time for cyclical components (plus a normal error term), then you just put in whatever times you happen to have sampled into the likelihood function. No disaster for bootstrapping here.
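
Here is a sketch of the kind of scheme I have in mind (NumPy; the particular trend, the cycle period of 50, and the sample size are arbitrary choices): resample whole $(y_t, t)$ rows so each observation keeps its own time, then refit the basis-function regression on every replicate. This is only valid to the extent that the errors really are independent given the basis - which is exactly the point contested in the comments below.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
t = np.arange(n, dtype=float)

# Simulated series: linear trend + one cycle (period 50) + i.i.d. normal noise.
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 1, size=n)

# Basis functions of continuous time: intercept, trend, sine/cosine pair.
X = np.column_stack([np.ones(n), t,
                     np.sin(2 * np.pi * t / 50), np.cos(2 * np.pi * t / 50)])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Pairs bootstrap: resample whole (y_t, t) rows, so every observation keeps
# its own time point, then refit the regression on each replicate.
betas = np.array([ols(X[idx], y[idx])
                  for idx in rng.integers(0, n, size=(2000, n))])
print("trend coefficient:", ols(X, y)[1])
print("bootstrap 95% CI: ", np.percentile(betas[:, 1], [2.5, 97.5]))
```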

Any autocorrelation or ARIMA model has a representation in the format above - this format is just easier to use and, I think, to understand and interpret (it is easy to understand cycles in sine and cosine functions, hard to understand the coefficients of an ARIMA model). For example, the autocorrelation function is the inverse Fourier transform of the power spectrum of a time series.
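
As a numerical check of that last claim (NumPy, with $\rho = 0.6$ chosen arbitrarily), inverting the AR(1) power spectrum recovers the autocorrelations $\rho^k$:

```python
import numpy as np

rho, sigma2, N = 0.6, 1.0, 4096
omega = 2 * np.pi * np.arange(N) / N

# AR(1) power spectrum: f(w) = sigma^2 / (2 pi (1 - 2 rho cos w + rho^2)).
f = sigma2 / (2 * np.pi * (1 - 2 * rho * np.cos(omega) + rho**2))

# Autocovariances are the inverse Fourier transform of the spectrum.
gamma = 2 * np.pi * np.fft.ifft(f).real
acf = gamma / gamma[0]

print(acf[:4])              # numerically ~ [1, rho, rho^2, rho^3]
print(rho ** np.arange(4))  # theoretical AR(1) autocorrelations
```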

mpiktas
probabilityislogic
  • @probabilityislogic -1, I accidentally upvoted the answer earlier (blame Opera mini) so I had to edit it to be able to downvote, I am sorry for using such tactics. I did this only because I did not like the answer at first, but did not downvote because I wanted to prepare my arguments, which I will give in the following comment. – mpiktas Apr 19 '11 at 08:34
  • @probabilityislogic, for time-series processes time plays an important role, so the distribution of the vector $(X_t,X_{t+1})$ is different from that of $(X_{t+1},X_t)$. The resampling done in the naive bootstrap destroys this structure, so for example if you try to fit an AR(1) model, after resampling you might find yourself fitting $Y_{10}$ as $\rho Y_{15}$, which does not seem natural. If you google for "bootstrapping time series" [the second article](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.1402&rep=rep1&type=pdf) gives an example of how the estimate of the variance of a time series has... – mpiktas Apr 19 '11 at 08:51
  • ... the wrong limit when the naive bootstrap is applied (page 221). Furthermore, ARIMA models do not have a representation in trend and cyclical components as you have described. Usually ARIMA models are applied **after** trend and cyclical components are removed. In its standard definition, an ARIMA model has constant mean, hence no trend. Also, if the time series has a deterministic time trend, it is even more evident that the naive bootstrap will fail. Pick $n$ points from a linear trend (i.e. a line) and shuffle them; the end result will definitely not look like the original line. – mpiktas Apr 19 '11 at 09:01
  • @mpiktas - thanks for the comments (at least you give reason for -1). When you re-sample in a bootstrap, surely you would retain the time of the observation, as it contains important information. In time series, the "time" is essentially a covariate. So the situation you speak of with "shuffling" does not apply. ARIMA models can be represented in the frequency domain, which is what the sine and cosine functions do, and they are of continuous time - so "holes" in the time series don't matter. – probabilityislogic Apr 19 '11 at 10:09
  • The reason why ARIMA models can be represented in frequency domain is that the ACF is a sufficient statistic for the model - because with MVN only the covariance matters - and the Fourier transform of the ACF is the equivalent "frequency domain" sufficient statistic. Because Fourier transform is 1-to-1, the ARIMA model has an equivalent representation in terms of sine and cosine waves. – probabilityislogic Apr 19 '11 at 10:13
  • @probabilityislogic, I do not understand. You say that quote: "when you re-sample in a bootstrap, you would retain the time of the observation", how can you do that? If we have the sample $X_1,...,X_n$, the naive bootstrapped sample is $X_1^*,...,X_n^*$, where each $X_i^*$ can be any of $X_1,...,X_n$, surely then the time of observation is lost? Furthermore I cannot agree that "time" is a covariate in time series. When we talk about covariates we speak about the iid sample of two variables, this is not the case for time series. – mpiktas Apr 19 '11 at 10:51
  • @mpiktas - time is a special covariate in a time series model - used in a particular way. And in your data row, you would have $X$ in one column, and then the time $t$ in another column, plus other covariates in other columns. How on earth could the time be lost??? You sample the whole row - just because you have relabelled $X^*_1$ as "point one" does not mean that it comes from time $t_{1}$. It could well be that $X^{*}_{1}=\{X_{7},t_{7}\}$. The notation $X_{i}$ is inappropriate in time series, because it obscures that time is important. $\{x_{i},t_{i}\}$ is much better – probabilityislogic Apr 19 '11 at 11:13
  • what you are describing in time series is analogous in OLS regression to bootstrapping $Y_{i}$ and $X_{i}$ independently. – probabilityislogic Apr 19 '11 at 11:14
  • @probabilityislogic, ok so assume that we have a sample of vectors $(X_1,1),(X_2,2),(X_3,3)$, which comes from the AR(1) model $Y_t=\rho Y_{t-1}+\varepsilon_t$. Suppose we estimate $\rho$ using linear regression without the intercept. For the given sample the estimate will be $(X_2X_1+X_3X_2)/(X_1^2+X_2^2)$. Suppose now we have a bootstrap sample $(X_1,1),(X_1,1),(X_3,3)$; how would one define the estimate? Since the time structure is kept, I should take it into account, but how? I cannot pair $(X_3,3)$ with $(X_1,1)$, since then I will estimate the AR(2) model $Y_t=\alpha Y_{t-2}+u_t$. – mpiktas Apr 19 '11 at 13:38
  • @probabilityislogic, would it be possible for you to demonstrate your idea in your answer for a naive bootstrap estimate of $\rho$ in the AR(1) model $Y_t=\rho Y_{t-1}+u_t$? I do not think that it is possible, hence the basic reason for the downvote. I would be glad to be proven wrong. – mpiktas Apr 19 '11 at 13:41
  • No because the AR(1) model is recursive, so you have $Y_{t}=\rho(\rho Y_{t-2}+\epsilon_{t-1})+\epsilon_{t}$. – probabilityislogic Apr 19 '11 at 13:45
  • @probabilityislogic, and? What will be the estimate of $\rho$ in that case? I am sorry for pestering, but I genuinely do not see how you can show that the naive bootstrap will not fail in this case. – mpiktas Apr 19 '11 at 14:00
  • Seems like in the stationary case, the estimate is $0$. The likelihood is normal with mean $X_{3}-\rho^{2}X_{1}$ and variance $1+\rho^{2}$ - it always has a local mode at zero. But the likelihood is explosive - it diverges outside $\pm 1$. Probably due to integration over $\epsilon_2$. – probabilityislogic Apr 19 '11 at 15:06
  • @mpiktas - the example you are describing should fail here with bootstrap because the sample size is small - just like my answer said. If the time series was longer a more reasonable answer would be found. – probabilityislogic Apr 20 '11 at 00:06
  • My book [here](http://www.amazon.com/Bootstrap-Methods-Practitioners-Researchers-Probability/dp/0471756210/ref=sr_1_2?s=books&ie=UTF8&qid=1346169992&sr=1-2&keywords=bootstrap+methods) has a chapter on when the bootstrap fails and also a chapter on how the bootstrap is applied in time series. For time series, the bootstrap can be applied to residuals from a model in the model-based approach. The other nonparametric time domain approach is the block bootstrap, of which there are many types. – Michael R. Chernick Aug 28 '12 at 16:29