
I build confidence bounds for the kernel density estimate (PDF) of an empirical sample using bootstrapping:

data <- rnorm(1000)
d <- density(data)

## Resample with replacement and re-estimate the density on the grid of d.
## Note that density() re-selects the bandwidth within each replicate.
boot <- replicate(100, {
  x <- sample(data, replace = TRUE)
  density(x, from = min(d$x), to = max(d$x))$y
})

## Pointwise 2.5% and 97.5% quantiles across replicates, one column per grid point.
CI <- apply(boot, 1, quantile, c(0.025, 0.975))

hist(data, freq = FALSE, ylim = c(0, max(CI[2, ])))
lines(d, lwd = 2)             # the density estimate itself
lines(d$x, CI[1, ], lty = 2)  # lower pointwise bound
lines(d$x, CI[2, ], lty = 2)  # upper pointwise bound

I have two questions:

  1. On what basis should I choose the number of bootstrap replications?
  2. How can I use these confidence bounds to determine the minimum required sample size?
Andy
  • If I understand this procedure correctly, it takes default choices of kernel type and width as correct. To my mind, the uncertainty associated with kernel density estimation is not just a property of the data but also very sensitive to whether I am making good choices of procedure. Delegating these choices to a program is also a choice. – Nick Cox Aug 20 '13 at 13:41
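
One way to act on this comment, as a minimal sketch (reusing data and d from the question; passing bw = d$bw is an illustrative choice, not a recommendation): hold the kernel and bandwidth fixed across replicates, so the band reflects sampling variability conditional on those choices rather than also mixing in bandwidth re-selection.

boot_fixed <- replicate(100, {
  x <- sample(data, replace = TRUE)
  ## bw = d$bw pins the bandwidth chosen on the original sample;
  ## the kernel stays at the default "gaussian" in both fits
  density(x, bw = d$bw, from = min(d$x), to = max(d$x))$y
})
CI_fixed <- apply(boot_fixed, 1, quantile, c(0.025, 0.975))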

2 Answers


The bootstrap is not a panacea!

And specifically, it fails for the KDE!

The kernel density estimator is an example of an estimator that converges at a rate different from the default $n^{-1/2}$ everybody is so used to from the Central Limit Theorem. For the KDE with an optimally chosen bandwidth, the approximation error is $O(n^{-2/5})$, i.e., it shrinks more slowly. The bootstrap will happily fake the rate of $O(n^{-1/2})$ instead, so the bootstrap distribution sits too close to the data and does not provide enough variability for proper inference. In particular, the naive bootstrap band is centred on the estimated density, so it captures the variance of the KDE but not its smoothing bias.
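
To see that last point concretely, here is a small sketch of my own (not part of the answer's references): with a fixed bandwidth, the bootstrap expectation of the resampled KDE is exactly the original KDE, so the bootstrap never sees the bias $\hat f - f$.

set.seed(1)
data <- rnorm(1000)                      # true density is dnorm
d <- density(data)                       # KDE with a data-driven bandwidth
boot <- replicate(500, {
  x <- sample(data, replace = TRUE)
  density(x, bw = d$bw, from = min(d$x), to = max(d$x))$y
})
## The bootstrap mean reproduces the KDE (differences are Monte Carlo noise)...
summary(rowMeans(boot) - d$y)
## ...while the KDE's own error contains a bias term the bootstrap never sees.
summary(d$y - dnorm(d$x))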

Canty et al. (2006) touch upon this case of bootstrap inconsistency, but do not offer a fully working solution. This is slightly surprising, as a solution via subsampling is discussed in Politis, Romano and Wolf (1999), building on Romano's earlier work: you need to oversmooth the density, and the degree of oversmoothing has to be chosen appropriately.
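
For what it's worth, a rough sketch of the oversmoothing idea as I read it (the pilot bandwidth $g = 2h$ here is an ad hoc assumption; choosing $g$ properly is exactly the hard part the references discuss): resample from an oversmoothed estimate $\hat f_g$ via the smoothed bootstrap, re-estimate with the original bandwidth $h$, and build the band from the differences $\hat f^*_h - \hat f_g$.

set.seed(1)
data <- rnorm(1000); n <- length(data)
h <- bw.nrd0(data)                       # estimation bandwidth
g <- 2 * h                               # oversmoothed pilot bandwidth (ad hoc!)
d  <- density(data, bw = h)
dg <- density(data, bw = g, from = min(d$x), to = max(d$x))
boot <- replicate(500, {
  ## smoothed bootstrap: a draw from the Gaussian-kernel estimate f_g
  xs <- sample(data, n, replace = TRUE) + rnorm(n, sd = g)
  density(xs, bw = h, from = min(d$x), to = max(d$x))$y - dg$y
})
q <- apply(boot, 1, quantile, c(0.025, 0.975))
lower <- d$y - q[2, ]                    # basic-bootstrap style band
upper <- d$y - q[1, ]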

StasK

Bootstrapping involves a random process, so there is Monte Carlo uncertainty in your estimate of the confidence bounds. With only 100 replications this shows up as rather ragged bounds: the 2.5% lower bound is then based on roughly the 2nd and 3rd smallest values at each grid point, so just a few "weird" resamples can move it around (and similarly for the upper bound). With 20,000 replications the lower bound is based on the 500th smallest value, and thus tends to be a lot more stable.
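
A quick sketch of that effect in isolation (standard normal draws stand in here for any bootstrap distribution): the Monte Carlo spread of an estimated 2.5% quantile shrinks markedly as the number of replications grows.

set.seed(1)
## spread of the estimated 2.5% quantile across 200 repeated runs
sd(replicate(200, quantile(rnorm(100),   0.025)))   # B = 100: noisy
sd(replicate(200, quantile(rnorm(20000), 0.025)))   # B = 20000: far more stable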

Maarten Buis