When sample size is large, can I get away with 10 bootstrap resamples?

Question

The ggplot2 package in R includes a dataset called diamonds. The dataset can be accessed by loading ggplot2 like this:

library(ggplot2)

I'm using the boot package to calculate a 95% confidence interval for the mean of the table variable. The table variable has 53,940 observations, and when I tried 10,000 bootstrap replicates R crashed:

library(boot)

boot_diamonds_10000 <- boot(diamonds,function(data,indices) mean(data[indices,]$table), R=10000)

I then tried 1000, 100 and 10 bootstrap replicates, as below. The calls with 1000 and 100 replicates are still slow, but the call with 10 replicates is faster:

boot_diamonds_1000 <- boot(diamonds,function(data,indices) mean(data[indices,]$table), R=1000)

boot_diamonds_100 <- boot(diamonds,function(data,indices) mean(data[indices,]$table), R=100)

boot_diamonds_10 <- boot(diamonds,function(data,indices) mean(data[indices,]$table), R=10)

These all give pretty much the same 95% confidence intervals:

quantile(boot_diamonds_1000$t, c(0.025,0.975))

# 2.5%    97.5% 
# 57.43890 57.47682

quantile(boot_diamonds_100$t, c(0.025,0.975))

# 2.5%    97.5% 
# 57.43638 57.47438 

quantile(boot_diamonds_10$t, c(0.025,0.975))

# 2.5%    97.5% 
# 57.44636 57.46841

To avoid crashing R or waiting for slow function calls, is it reasonable to use 10 bootstrap replicates when the sample size (53,940) is so large?

No, 10 bootstrap replications are not enough. With a sample that big you should consider using a more powerful computer (more RAM). The other thing you could do is write a bootstrap function that saves only aggregated data and calculates things on the fly (instead of storing everything) to save RAM. — Tim, Oct 02 '15 at 08:11
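One way to read the suggestion in this comment, as a minimal sketch rather than Tim's own code: resample only the numeric column and keep a single mean per replicate, so nothing larger than the vector of replicate means is ever stored. The helper name `boot_mean_light` and the simulated fallback data are assumptions made for illustration.

```r
# Use the diamonds table column if ggplot2 is installed; otherwise fall back to
# simulated stand-in data of the same length (an assumption for illustration).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  x <- ggplot2::diamonds$table
} else {
  set.seed(1)
  x <- rnorm(53940, mean = 57.46, sd = 2.23)
}

# Hypothetical helper: bootstrap the mean, storing one number per replicate.
boot_mean_light <- function(x, R = 1000) {
  n <- length(x)
  means <- numeric(R)
  for (r in seq_len(R)) {
    # Draw n indices with replacement and keep only the resulting mean.
    means[r] <- mean(x[sample.int(n, n, replace = TRUE)])
  }
  means
}

set.seed(42)
boot_means <- boot_mean_light(x, R = 1000)

# Percentile 95% interval, computed the same way as in the question.
ci_light <- quantile(boot_means, c(0.025, 0.975))
print(ci_light)
```

The same idea should also work with `boot` itself by passing the column rather than the whole data frame, for example `boot(diamonds$table, function(x, i) mean(x[i]), R = 1000)`, which avoids subsetting every column of the data frame on each replicate.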
@Tim the 95% confidence interval is the same regardless of whether I use 1000, 100 or 10 replicates. So why is 10 not enough? — luciano, Oct 02 '15 at 09:55
No, you need a lot more than 10, regardless of sample size. However, with a larger sample size the central limit theorem really kicks in and your standard errors get very small (which is why all your confidence intervals look the same). So I would say instead that when the sample size is large, you may get away without doing a bootstrap at all. Of course, that may not apply as well to more complex estimates. — Zachary Blumenfeld, Oct 02 '15 at 09:58
Let me put it another way. The reason your CIs look so similar is that the standard errors are so incredibly small. "Similar" is a relative term: if you take another look at the differences between the confidence intervals relative to the length of each interval, you may think they are rather large. For example, subtract 57 from each confidence limit and multiply by 1000. — Zachary Blumenfeld, Oct 02 '15 at 10:17
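The rescaling suggested in this comment can be carried out directly on the intervals reported in the question. This is only a small arithmetic sketch; the matrix name `ci` and its row labels are chosen here for illustration.

```r
# Percentile intervals copied from the question.
ci <- rbind(
  R1000 = c(lower = 57.43890, upper = 57.47682),
  R100  = c(lower = 57.43638, upper = 57.47438),
  R10   = c(lower = 57.44636, upper = 57.46841)
)

# Subtract 57 and multiply by 1000 to expose the differences.
scaled <- (ci - 57) * 1000
print(round(scaled, 2))

# Width of each interval, and its ratio to the 1000-replicate width.
width <- ci[, "upper"] - ci[, "lower"]
print(round(width / width["R1000"], 2))
```

On this scale the 10-replicate interval runs from about 446.4 to 468.4, against 438.9 to 476.8 for 1000 replicates, so it is only about 58% as wide.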
What I mean by the whole central limit thing is that the parametric estimate of the standard error of the sample average, $\frac{\hat \sigma}{\sqrt{N}}$ (the square root of the variance $\frac{\hat \sigma^2}{N}$), will approximate really well here because you have so much data. Given that, you can cut down on computation time by not using a bootstrap in the first place. If you don't believe me, plot a density of the 1000 bootstrap means over the theoretical distribution of the sample average, which has variance $\frac{\hat \sigma^2}{N}$; they will probably look very similar. — Zachary Blumenfeld, Oct 02 '15 at 10:28
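A minimal sketch of the bootstrap-free interval described in this comment, assuming a normal approximation and the usual standard error of the mean, `sd(x) / sqrt(n)`. The variable names and the simulated fallback data are assumptions for illustration, and the small bootstrap at the end is there only as a sanity check on the analytic standard error.

```r
# Use the diamonds table column if ggplot2 is installed; otherwise fall back to
# simulated stand-in data of the same length (an assumption for illustration).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  x <- ggplot2::diamonds$table
} else {
  set.seed(1)
  x <- rnorm(53940, mean = 57.46, sd = 2.23)
}

n <- length(x)

# Analytic standard error of the sample mean.
se_analytic <- sd(x) / sqrt(n)

# Normal-approximation 95% confidence interval for the mean.
ci_analytic <- mean(x) + c(-1, 1) * qnorm(0.975) * se_analytic
print(ci_analytic)

# Sanity check: the standard deviation of bootstrap means should be close to
# the analytic standard error when n is this large.
set.seed(42)
boot_means <- replicate(200, mean(sample(x, n, replace = TRUE)))
se_boot <- sd(boot_means)
print(c(analytic = se_analytic, bootstrap = se_boot))
```

If the approximation holds as the comment expects, the analytic interval should land close to the 1000-replicate percentile interval in the question, at a fraction of the computing cost.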
The bootstrap is introduced with sample mean estimation primarily for didactic purposes. In practice, doing so is often inefficient and not worth it when the sample is extremely large. The bootstrap is better applied to more complex estimates, such as non-linear functions of other estimates, quantile effects, etc., where analytical solutions are not available and/or asymptotic estimates are extremely biased. — Zachary Blumenfeld, Oct 02 '15 at 10:33
Sometimes you can save time by doing a simulation-free bootstrap, using saddlepoint methods. — kjetil b halvorsen, Oct 02 '15 at 11:59
