30

I wanted to do a class demonstration where I compare a t-interval to a bootstrap interval and calculate the coverage probability of both. I wanted the data to come from a skewed distribution so I chose to generate the data as exp(rnorm(10, 0, 2)) + 1, a sample of size 10 from a shifted lognormal. I wrote a script to draw 1000 samples and, for each sample, calculate both a 95% t-interval and a 95% bootstrap percentile interval based on 1000 replicates.

When I run the script, both methods give very similar intervals, and both have a coverage probability of around 50-60%. This surprised me, because I thought the bootstrap interval would do better.

My question is, have I

  • made a mistake in the code?
  • made a mistake in calculating the intervals?
  • made a mistake by expecting the bootstrap interval to have better coverage properties?

Also, is there a way to construct a more reliable CI in this situation?

tCI.total <- 0     # coverage counter for the t interval
bootCI.total <- 0  # coverage counter for the bootstrap interval
m <- 10            # sample size
true.mean <- exp(2) + 1

for (i in 1:1000) {
  samp <- exp(rnorm(m, 0, 2)) + 1
  # 95% t interval (qt(0.025, ...) is negative, so c(1,-1) gives lower, upper)
  tCI <- mean(samp) + c(1, -1) * qt(0.025, df = m - 1) * sd(samp) / sqrt(m)

  # 95% bootstrap percentile interval from 1000 resamples
  boot.means <- rep(0, 1000)
  for (j in 1:1000) boot.means[j] <- mean(sample(samp, m, replace = TRUE))
  bootCI <- sort(boot.means)[c(0.025 * length(boot.means), 0.975 * length(boot.means))]

  if (true.mean > min(tCI) & true.mean < max(tCI)) tCI.total <- tCI.total + 1
  if (true.mean > min(bootCI) & true.mean < max(bootCI)) bootCI.total <- bootCI.total + 1
}
tCI.total/1000     # estimate of t interval coverage probability
bootCI.total/1000  # estimate of bootstrap interval coverage probability
Flounderer
  • People often forget another use of the bootstrap: to identify and correct *bias*. I suspect that if you were to include a bias correction within your bootstrapping you might get much better performance out of the CI. – whuber Mar 31 '14 at 00:34
  • @whuber: nice point, +1. As far as I remember, *Bootstrap Methods and Their Applications* by Davison & Hinkley gives a nice and accessible introduction to bias correction and other improvements on the bootstrap. – Stephan Kolassa Mar 31 '14 at 10:36
  • It is worth trying the other bootstrap variants, especially the basic bootstrap. – Frank Harrell Apr 05 '14 at 20:16
  • Bootstrapping is a large-sample procedure. $n = 10$ is not large, [especially for log-normal data](http://stats.stackexchange.com/questions/186957/is-there-a-reliable-nonparametric-confidence-interval-for-the-mean-of-a-skewed-d/188157#188157). – Cliff AB Feb 15 '16 at 21:13

5 Answers

17

*Bootstrap diagnostics and remedies* by Canty, Davison, Hinkley & Ventura (2006) seems to be a logical point of departure. They discuss multiple ways the bootstrap can break down and, more importantly here, offer diagnostics and possible remedies:

  1. Outliers
  2. Incorrect resampling model
  3. Nonpivotality
  4. Inconsistency of the bootstrap method

I don't see a problem with 1, 2 and 4 in this situation. Let's look at 3. As @Ben Ogorek notes (although I agree with @Glen_b that the normality discussion may be a red herring), the validity of the bootstrap depends on the pivotality of the statistic we are interested in.

Section 4 in Canty et al. suggests resampling-within-resamples to get a measure of bias and variance for the parameter estimate within each bootstrap resample. Here is code to replicate the formulas from p. 15 of the article:

library(boot)
m <- 10             # sample size
n.boot <- 1000      # number of simulated samples (outer loop)
inner.boot <- 1000  # bootstrap resamples per sample (inner loop)

set.seed(1)
samp.mean <- bias <- vars <- rep(NA, n.boot)
for ( ii in 1:n.boot ) {
    samp <- exp(rnorm(m, 0, 2)) + 1
    samp.mean[ii] <- mean(samp)
    foo <- boot(samp, statistic = function(xx, index) mean(xx[index]), R = inner.boot)
    bias[ii] <- mean(foo$t[,1]) - foo$t0   # bootstrap estimate of bias
    vars[ii] <- var(foo$t[,1])             # bootstrap estimate of variance
}

opar <- par(mfrow=c(1,2))
    plot(samp.mean,bias,xlab="Sample means",ylab="Bias",
        main="Bias against sample means",pch=19,log="x")
    abline(h=0)
    plot(samp.mean,vars,xlab="Sample means",ylab="Variance",
        main="Variance against sample means",pch=19,log="xy")
par(opar)

[Figure: bootstrap diagnostics - bias (left) and variance (right) of the bootstrap mean estimate, plotted against the sample means]

Note the log scales - without logs, this is even more blatant. We see nicely how the variance of the bootstrap mean estimate goes up with the mean of the underlying sample. This looks to me like enough of a smoking gun to blame nonpivotality for the low confidence interval coverage.

However, I'll happily admit that one could follow up in lots of ways. For instance, we could look at whether a given simulated sample's confidence interval covers the true mean, and how that depends on that sample's mean, as in the sketch below.
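
Here is a minimal sketch of that follow-up, reusing `m`, `n.boot` and `inner.boot` from above and the percentile interval from the question (note this redoes the simulation rather than reusing the estimates computed above):

true.mean <- exp(2) + 1
samp.mean2 <- rep(NA, n.boot)
covered <- logical(n.boot)
set.seed(2)
for (ii in 1:n.boot) {
  samp <- exp(rnorm(m, 0, 2)) + 1
  samp.mean2[ii] <- mean(samp)
  boot.means <- replicate(inner.boot, mean(sample(samp, m, replace = TRUE)))
  ci <- quantile(boot.means, c(0.025, 0.975))
  covered[ii] <- (true.mean > ci[1]) && (true.mean < ci[2])
}
# compare sample means between intervals that missed and ones that covered
boxplot(samp.mean2 ~ covered, log = "y", names = c("missed", "covered"),
        ylab = "Sample mean")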

As for remedies, Canty et al. discuss transformations, and logarithms come to mind here (e.g., bootstrap and build confidence intervals not for the mean, but for the mean of the logged data), but I couldn't really make it work.
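
For what it's worth, here is a minimal sketch of one version of the log idea, assuming the shift of 1 is known and can be subtracted first; it also shows why it is hard to make work:

set.seed(1)
samp <- exp(rnorm(10, 0, 2)) + 1
lsamp <- log(samp - 1)   # exactly normal in this setup
boot.log.means <- replicate(1000, mean(sample(lsamp, replace = TRUE)))
exp(quantile(boot.log.means, c(0.025, 0.975))) + 1

The back-transformed interval targets $e^\mu + 1 = 2$, the (shifted) geometric mean, not the arithmetic mean $e^{\mu + \sigma^2/2} + 1 = 8.39$, so a confidence interval constructed this way covers the wrong parameter.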

Canty et al. continue to discuss how one can reduce both the number of inner bootstraps and the remaining noise by importance sampling and smoothing, as well as how to add confidence bands to the pivot plots.

This might be a fun thesis project for a smart student. I'd appreciate any pointers to where I went wrong, as well as to any other literature. And I'll take the liberty of adding the diagnostic tag to this question.

Stephan Kolassa
13

While I agree with Stephan Kolassa's analysis and his conclusion ($\hat{\mu} - \mu$, with $\hat{\mu}$ the sample mean, is definitely not an approximate pivot), let me make an additional remark. I investigated the use of the $t$-statistic $$\sqrt{m} \frac{\hat{\mu} - \mu}{\hat{\sigma}}$$ together with bootstrapping. The result was a coverage of around 0.8. Not a complete solution, but an improvement.
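
NRH does not show code, but a minimal sketch of such a studentized ("bootstrap-t") interval under the question's setup might look like this (the loop sizes are arbitrary choices):

set.seed(1)
m <- 10; true.mean <- exp(2) + 1
n.sim <- 1000; n.boot <- 1000
covered <- 0
for (i in 1:n.sim) {
  samp <- exp(rnorm(m, 0, 2)) + 1
  # resampled t-statistics, centered at the observed mean
  t.star <- replicate(n.boot, {
    bs <- sample(samp, m, replace = TRUE)
    sqrt(m) * (mean(bs) - mean(samp)) / sd(bs)
  })
  # invert the bootstrap quantiles of t* (note the reversed order)
  q <- quantile(t.star, c(0.975, 0.025))
  ci <- mean(samp) - q * sd(samp) / sqrt(m)
  if (true.mean > min(ci) && true.mean < max(ci)) covered <- covered + 1
}
covered / n.sim   # coverage estimate; around 0.8 in NRH's experiment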

Then I thought a little more about the whole setup. With only 10 observations and an extremely skewed distribution, is it not basically impossible to nonparametrically estimate the mean, let alone construct confidence intervals with the right coverage?

The log-normal distribution considered has mean $e^2 + 1 = 8.39$. Since $P(X \leq 2) = 0.84$ when $X \sim \mathcal{N}(0,4)$, the mean is the $0.84$-quantile of the distribution! This means that the probability that all 10 observations are smaller than the mean is $0.84^{10} = 0.178$. So in a little less than 18% of the cases, the largest observation is smaller than the mean. To get a coverage greater than 0.82, we need a confidence interval for the mean that extends beyond the largest observation. I have a hard time imagining how such a construction can be made (and justified) without prior assumptions that the distribution is extremely skewed. But I welcome any suggestions.
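
These numbers are easy to verify in R:

true.mean <- exp(2) + 1                        # 8.39
plnorm(true.mean - 1, meanlog = 0, sdlog = 2)  # P(X <= mean) = pnorm(1) = 0.841
pnorm(1)^10                                    # P(all 10 obs below the mean) = 0.178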

NRH
  • I agree with you. I really wanted to think about it from the point of view of someone who has got one sample from this distribution. How would I know that it is unsafe to blithely use the bootstrap in this case? The only thing I can think of is that I might well have taken logs before doing the analysis, but one of the other answerers says that this doesn't really help. – Flounderer Apr 05 '14 at 23:15
  • You won't know if it is safe or unsafe from the 10 data points alone. If you suspect skewness or heavy tails, the solution might be to focus on a different parameter than the mean, for instance the log-mean or the median. This won't give you an estimate of (or confidence interval for) the mean unless you make additional assumptions, but it might be a better idea altogether to focus on a parameter that is less sensitive to the tails of the distribution. – NRH Apr 07 '14 at 06:45
6

The calculations were right; I cross-checked them with the well-known boot package. Additionally, I added the BCa interval (Efron's bias-corrected and accelerated version of the percentile bootstrap interval):

library(boot)

m <- 10
true.mean <- exp(2) + 1
tCI.total <- percCI.total <- bcaCI.total <- 0

for (i in 1:1000) {
  samp <- exp(rnorm(m, 0, 2)) + 1

  boot.out <- boot(samp, function(d, i) mean(d[i]), R = 999)
  ci <- boot.ci(boot.out, 0.95, type = c("norm", "perc", "bca"))

  ## tCI <- mean(samp) + c(1,-1)*qt(0.025,df=9)*sd(samp)/sqrt(10)
  tCI    <- ci$normal[2:3]   # normal-approximation interval
  percCI <- ci$percent[4:5]  # percentile interval
  bcaCI  <- ci$bca[4:5]      # BCa interval

  if (true.mean > min(tCI) && true.mean < max(tCI)) tCI.total <- tCI.total + 1
  if (true.mean > min(percCI) && true.mean < max(percCI)) percCI.total <- percCI.total + 1
  if (true.mean > min(bcaCI) && true.mean < max(bcaCI)) bcaCI.total <- bcaCI.total + 1
}

tCI.total/1000     # estimate of normal-interval coverage probability
0.53
percCI.total/1000  # estimate of percentile-interval coverage probability
0.55
bcaCI.total/1000   # estimate of BCa-interval coverage probability
0.61

I assume the intervals would be far better if the original sample size were larger than 10, say 20 or 50.

Furthermore, the bootstrap-t method usually leads to better results for skewed statistics. However, it needs a nested bootstrap loop and therefore 20+ times the computational time; see the sketch below.
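
For reference, here is a hedged sketch of one way to get the studentized ("stud") interval out of boot.ci: the statistic returns both the estimate and a variance estimate from an inner bootstrap. The inner replicate count of 200 is an arbitrary choice, and this inner loop is what drives the extra computational cost:

library(boot)
mean.with.var <- function(d, i) {
  bs <- d[i]
  inner <- replicate(200, mean(sample(bs, replace = TRUE)))
  c(mean(bs), var(inner))  # boot.ci uses the 2nd component as the variance
}
samp <- exp(rnorm(10, 0, 2)) + 1
boot.out <- boot(samp, mean.with.var, R = 999)
boot.ci(boot.out, type = "stud")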

For hypothesis testing, it is also very important that the one-sided coverages are good, so looking only at the two-sided coverage can be misleading.

lambruscoAcido
  • Further to your comment on sample size: Good, in his *Resampling Methods* (3rd ed., 2006, p. 19) notes that the bootstrap may be unstable for sample sizes $n<100$. Unfortunately, I don't have the book at hand, so I can't look up his argumentation or any references. – Stephan Kolassa Apr 05 '14 at 12:54
6

I was confused about this too, and I spent a lot of time on the 1996 DiCiccio and Efron paper *Bootstrap Confidence Intervals*, without much to show for it.

It actually led me to think less of the bootstrap as a general purpose method. I used to think of it as something that would pull you out of a jam when you were really stuck. But I've learned its dirty little secret: bootstrap confidence intervals are all based on normality in some way or another. Allow me to explain.

The bootstrap gives you an estimate of the sampling distribution of the estimator, which is all you could ever hope for, right? But recall that the classical link between the sampling distribution and the confidence interval is based on finding a pivotal quantity. For anyone who's rusty, consider the case where $$x \sim N(\mu, \sigma^2)$$ and $\sigma$ is known. Then the quantity $$z = \frac{x - \mu}{\sigma} \sim N(0,1)$$ is pivotal, i.e., its distribution doesn't depend on $\mu$. Therefore, $\Pr(-1.96 \le \frac{x - \mu}{\sigma} \le 1.96) = 0.95$ and the rest is history.

When you think about what justifies the percentiles of the normal distribution being related to confidence intervals, it is entirely based on this convenient pivotal quantity. For an arbitrary distribution, *there is no theoretical link between the percentiles of the sampling distribution and confidence intervals*, and taking raw proportions of the bootstrap sampling distribution doesn't cut it.

So Efron's BCa (bias-corrected and accelerated) intervals use transformations to get to approximate normality, and bootstrap-t methods rely on the resulting t-statistics being approximately pivotal. Now the bootstrap can estimate the hell out of moments, and you can always assume normality and use the standard +/-2*SE. But considering all the work that went into going non-parametric with the bootstrap, it doesn't seem quite fair, does it?
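
For concreteness, that "standard" interval built from bootstrap moments would be, as a sketch:

samp <- exp(rnorm(10, 0, 2)) + 1
boot.means <- replicate(1000, mean(sample(samp, replace = TRUE)))
mean(samp) + c(-2, 2) * sd(boot.means)  # estimate +/- 2 * bootstrap SE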

Ben Ogorek
  • It's possible I missed something, but the fact that bootstrapping is associated with pivotal or near-pivotal quantities does not of itself imply any association with normality. Pivotal quantities may have all manner of distributions in particular circumstances. I also don't see how the italicized sentence in your second-last paragraph follows. – Glen_b Apr 03 '14 at 04:57
  • You are right that pivotal quantities come in all shapes and forms, but the fact that the bootstrap is fully non-parametric means that you can't take advantage of any of them. – Ben Ogorek Apr 03 '14 at 05:07
  • How then does the assertion relating to normality follow? – Glen_b Apr 03 '14 at 05:08
  • A line from the Efron paper: "most confidence intervals are approximate, with by far the favorite approximation being the standard interval [estimate +/- 2*s.e.]." So you've got me, technically, in that I did not prove that all bootstrap interval methods end up as a normal approximation. I guess the t-interval would be one quick counter-example. But I think I did illustrate the key properties that make the ubiquitous standard interval work, and why they are not automatically satisfied. If you do find a bootstrap interval that's not rooted in normal approximation, please share. – Ben Ogorek Apr 03 '14 at 05:21
  • Can you define what you count as "rooted in a normal approximation", please, so I don't spend time fulfilling a condition other than the one you have in mind? – Glen_b Apr 03 '14 at 05:24
  • Okay, in the paper I linked to, check out the paragraph under equation 2.4 that starts with "The following argument motivates...". You'll see that BCa intervals are motivated by hypothesizing an exact transformation to normality. – Ben Ogorek Apr 03 '14 at 05:33
  • Since every continuous distribution $F$ has an exact transformation to normality ($\Phi^{-1}[F(X)]$ is always standard normal), it looks like you just excluded all continuous distributions as being rooted in a normal approximation. – Glen_b Apr 03 '14 at 05:41
  • Back to the Efron paper (second paragraph of p. 194), "In practice we do not expect [either approximation] to hold exactly, but the broader assumptions [a transformation exists to normality] are likely to be a better approximation to the truth." If the normal approximation is trivial, like you imply, why would Efron and his coauthor say that? – Ben Ogorek Apr 03 '14 at 15:16
  • It's not trivial to identify $F$ if we don't already know it; the point was simply that such transformations clearly exist. Efron is trying to obtain better intervals; just because he goes via transform-to-approx-normal/make-an-interval/transform-back doesn't of itself imply that he's assuming any special connection to normality. – Glen_b Apr 03 '14 at 22:33
  • To add to @Glen_b: the transformation to a normal distribution only needs to exist to prove the method correct. You don't need to find it to use the method. Additionally, if you don't like normal distributions, you could rewrite the whole proof with some other symmetric, continuous distribution. The use of normal distributions is technically useful, but not strictly necessary; it doesn't say anything about the source of the data or the sample mean. – Peter Jul 31 '15 at 17:10
0

Check out Tim Hesterberg's article in *The American Statistician* at http://www.timhesterberg.net/bootstrap#TOC-What-Teachers-Should-Know-about-the-Bootstrap:-Resampling-in-the-Undergraduate-Statistics-Curriculum.

Essentially, the bootstrap percentile interval does not achieve accurate coverage for skewed data unless $n$ is large; the sketch below illustrates the point.
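
A quick sketch of that point under the question's setup (the sample sizes are arbitrary choices, and the loop takes a while to run):

set.seed(1)
true.mean <- exp(2) + 1
for (n in c(10, 50, 250)) {
  cover <- mean(replicate(1000, {
    samp <- exp(rnorm(n, 0, 2)) + 1
    boot.means <- replicate(999, mean(sample(samp, replace = TRUE)))
    ci <- quantile(boot.means, c(0.025, 0.975))
    (true.mean > ci[1]) && (true.mean < ci[2])
  }))
  cat("n =", n, " percentile interval coverage =", cover, "\n")
}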

Guest