42

I have to find a 95% C.I. on the median and other percentiles. I don't know how to approach this. I mainly use R as a programming tool.

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
Dominic Comtois
  • 2,047
  • 5
  • 20
  • 25

5 Answers5

32

Here is an illustration on a classical R dataset:

> x       = faithful$waiting
> bootmed = apply(matrix(sample(x, rep=TRUE, 10^4*length(x)), nrow=10^4), 1, median)
> quantile(bootmed, c(.025, 0.975))
2.5% 97.5% 
 73.5    77 

which gives a (73.5, 77) confidence interval on the median.

(Note: Corrected version, thanks to John. I used $10^3$ in the nrow earlier, which led to the confusion!)

Xi'an
  • 90,397
  • 9
  • 157
  • 575
  • 9
    Seems suspiciously narrow to me. Using functions from `library(boot)` appears to confirm this: > boot.ci(boot(x,function(x,i) median(x[i]), R=1000)) Intervals : Level Normal Basic 95% (74.42, 78.22 ) (75.00, 78.49 ) Level Percentile BCa 95% (73.51, 77.00 ) (73.00, 77.00 ) – onestop Jan 15 '12 at 15:13
  • 2
    you're welcome Xi'an... As an aside, I always prefer to set the original N value in the matrix because that's a constant across various bootstrap sizes I might make. So, I would typically have said ncol = length(x). I find there's less chance for error that way. – John Jan 15 '12 at 23:44
  • 7
    This is just an inefficient way to compute the binomial quantiles as in [onestop's answer](http://stats.stackexchange.com/a/21116). – whuber Feb 10 '15 at 16:11
31

Another approach is based on quantiles of the binomial distribution.
e.g.:

> x=faithful$waiting
> sort(x)[qbinom(c(.025,.975), length(x), 0.5)]
[1] 73 77
onestop
  • 16,816
  • 2
  • 53
  • 83
15

Check out bootstrap resampling. Search R help for the boot function. Depending on your data with resampling you can estimate confidence intervals for just about anything.

tharen
  • 441
  • 3
  • 7
  • Agree. This is the best approach. Underused in the biomedical sciences, in my opinion. – pmgjones Jan 15 '12 at 13:41
  • 10
    Consider looking into the smoothed bootstrap for estimating population quantiles as the conventional boostrap seems to have problems in that case - references can be found [in this pdf](http://arxiv.org/pdf/math/0504516). If you were just interested in the theoretical Median, the Hodges-Lehman estimator can be used - as provided by, e.g., R's `wilcox.test(..., conf.int=TRUE)` function. – caracal Jan 15 '12 at 19:50
4

And there are other approaches: One is based on Wilcoxon Rank Sum test applied for one sample with continuity correction. In R this can be supplied as:

wilcox.test(x,conf.level=0.95,alternative="two.sided",correct=TRUE)

And there is the David Olive's CI for median discussed here:

CI for Median

James Stanley
  • 2,376
  • 20
  • 32
Germaniawerks
  • 1,027
  • 1
  • 10
  • 15
2

The result based on the qbinom approach isn't correct for small samples. Suppose that x has 10 components. Then qbinom(c(.025,.975),10,.5) gives 2 and 8. The resulting interval doesn't treat order statistics at the lower tail symmetrically with those from the upper tail; you should get either 2 and 9, or 3 and 8. The right answer is 2 and 9. You can check against proc univariate in SAS. Catch here is you need no more than .025 probability below and above; the lower quantile doesn't do this, as it gives at least .025 at or below. You get saved on the bottom because the count that should be 1 should get mapped to the second order statistic, counting 0, and so the "off by one" cancels. This fortuitous canceling does not happen on top, and so you get the wrong answer here. The code sort(x)[qbinom(c(.025,.975),length(x),.5)+c(0,1)] almost works, and .5 can be replaced by other quantile values to get confidence intervals for other quantiles, but it won't be right when there exists a such that P[X<=a]=.025. See, for ex, Higgins, Nonparametric Statisitcs.