I want to calculate CI for the median in R. I found a number of packages and functions doing that and noticed something interesting. I think the problem can be generalized to any statistical software. I just use R for presentation, but the mathematics behind is software-agnostic (so I decided to post here).
There seem to be 3 methods of calculating the exact CI. A number of packages use the 1st one, another packages the 2nd one and one package can find also the shortest CI, but differs from the 1 and 2. I provide examples, because you may recognize the problem looking at specific package.
I have two questions:
which exact method is better? I can see that about 50/50 of the methods use either kind of algorithm and are consistent - but which "family" is a better approach?
why neither of them (all using the same difference pbinom(A) - pbinom(B) can find the shortest CI, while the 3rd method can? What's so special in the 3 method? Is the last method the best? The code looks very similarly, so I guess it's about the range of searching the "i" and "j" indexes.
A note for non-R users: qbinom returns quantiles of the binomial distribution. pbinom stands for the CDF. Other functions should be more or less self-descriptive
Please note, it's only about the exact CIs, not the normal (asymptotic) approximation.
- Exact method of the form:
sort(x)[qbinom(c(.025, 0.975), length(x), 0.5)]
.
> rcompanion::quantileCI(x)
tau n Quantile Nominal.level Actual.level Lower.ci Upper.ci
0.5 100 -0.0688 0.95 0.954 -0.202 0.132
> EnvStats::eqnpar(x, ci = TRUE, ci.method = "exact")$interval$limits
LCL UCL
-0.2016340 0.1315312
> sort(x)[qbinom(c(.025, 0.975), length(x), 0.5)]
[1] -0.2016340 0.1315312
I also found this code: How to obtain a confidence interval for a percentile?
quantile.CI <- function(n, q, alpha=0.05) {
#
# Search over a small range of upper and lower order statistics for the
# closest coverage to 1-alpha (but not less than it, if possible).
#
u <- qbinom(1-alpha/2, n, q) + (-2:2) + 1
l <- qbinom(alpha/2, n, q) + (-2:2)
u[u > n] <- Inf
l[l < 0] <- -Inf
coverage <- outer(l, u, function(a,b) pbinom(b-1,n,q) - pbinom(a-1,n,q))
if (max(coverage) < 1-alpha) i <- which(coverage==max(coverage)) else
i <- which(coverage == min(coverage[coverage >= 1-alpha]))
i <- i[1]
#
# Return the order statistics and the actual coverage.
#
u <- rep(u, each=5)[i]
l <- rep(l, 5)[i]
return(list(Interval=c(l,u), Coverage=coverage[i]))
}
> sort(x)[quantile.CI(n=100, 0.5)$Interval]
[1] -0.2016340 0.1315312
- Exact method of the form:
sort(x)[qbinom(c(.025, 0.975), length(x), 0.5) + c(0, 1)]
.
> sort(x)[qbinom(c(.025, 0.975), length(x), 0.5) + c(0, 1)]
[1] -0.2016340 0.1788648
> DescTools::MedianCI(x)
median lwr.ci upr.ci
-0.0594199 -0.2016340 0.1788648
attr(,"conf.level")
[1] 0.9647998
> jmuOutlier::quantileCI(x)
lower upper
0.5 -0.201634 0.1788648
> asht::medianTest(x)$conf.int
[1] -0.2016340 0.1788648
I also found this code in a book, that was referenced in the description of the 3rd method, however the code in that book is... the method 2, actually, and not 3.
The screenshot from the book:
Source: Jürgen Hedderich, Lothar Sachs, "Angewandte Statistik: Methodensammlung mit R"
ci.perc <- function( data , p , level ) {
n <- length ( data ) ; plev <- 0
upper <- ceiling ( ( n +1)*p ) ; lower <- floor ( ( n +1)*p )
while ( plev < level ) { upper <- upper +1; lower<-lower-1
plev <- pbinom( upper-1, size =n , prob=p ) - pbinom( lower-1, size =n , prob=p ) }
x <- sort ( data )
cat (100* level , "%-KI zum" , p , "-Quantil aus den Daten : " , x [ lower ] , "-" , x [ upper ] , "\n " )
}
ci.perc(x,0.5, 0.95)
#95 %-KI zum 0.5 -Quantil aus den Daten : -0.201634 - 0.1788648
- The 3rd method, giving more exact CIs, but also able to find the shortest one. And neither of them match either of the above. The code, however, looks quite similar, only the search range is different. But is this better? Worse?
The most important part is as follows:
if(method == 1){ # exact
CI.mat <- matrix(NA, ncol = 2, nrow = n-1)
pcov.vec <- numeric(n-1)
for(i in 1:(n-1)){
for(j in (i+1):n){
pcov <- pbinom(j-1, size = n, prob = prob)-pbinom(i-1, size = n, prob = prob)
if(pcov > conf.level){
pcov.vec[i] <- pcov
CI.mat[i,] <- c(xs[i], xs[j])
break
}
}
}
Full source: https://rdrr.io/cran/MKmisc/src/R/quantileCI.R
And how it works:
> MKmisc::quantileCI(x, prob=0.5, method="exact")
exact confidence interval
95.23684 percent confidence intervals:
lower upper
50 % quantile -0.3598621 0.1298341
50 % quantile -0.1379296 0.2163679
sample estimate:
50 % quantile
-0.0594199
# Show only the shortest one:
> MKmisc::quantileCI(x, prob=0.5, method="exact", minLength = TRUE)
minimum length exact confidence interval
95.23684 percent confidence interval:
lower upper
50 % quantile -0.1379296 0.2163679
sample estimate:
50 % quantile
-0.0594199
The found CI is indeed, the shortest one, but different from the above ones. Is this the best one? Are the methods 1 and 2, so common in packages, textbooks, forums and tutorials, worse than this one? The code looks similarly complex, so I'm wondering why the 3rd method is so rare, and the methods 1 and 2 are so common.
I attach the data for reproducibility:
x <- c(-0.502192350531457, 0.131531165327303, -0.07891708981887, 0.886784809417845,
0.116971270510841, 0.318630087617032, -0.58179068471591, 0.714532710891568,
-0.825259425862769, -0.359862131395465, 0.0898861437775305, 0.0962744602851301,
-0.201633952183354, 0.739840499878431, 0.123379501088869, -0.0293167092293474,
-0.388854246903499, 0.510856257369925, -0.913814185369186, 2.31029682277906,
-0.438089981127317, 0.764060616410438, 0.261961291392879, 0.773404596778447,
-0.814379124875453, -0.438450569074609, -0.720221550216076, 0.230944532252633,
-1.15772946239983, 0.247075992728822, -0.0911135614728243, 1.75737562202805,
-0.137929611731146, -0.111193495282428, -0.690014320569417, -0.221794230019693,
0.182907683594697, 0.417323286226749, 1.06540232704555, 0.970202017340543,
-0.101629238461632, 1.40320348855821, -1.77677563229464, 0.62286739099101,
-0.522283351277008, 1.3222309556847, -0.363440326844656, 1.3190657425688,
0.0437790676442974, -1.87865588200619, -0.447062182019648, -1.7385979472079,
0.178864848627153, 1.89746570014621, -2.2719254860193, 0.980464138734608,
-1.39882561629245, 1.82487242294358, 1.38129872992883, -0.838851875034352,
-0.261995774514031, -0.0688440280170231, -0.378883556530106,
2.58195892772433, 0.129834137345586, -0.713024980216354, 0.637994242919495,
0.201691591615917, -0.0699169482402035, -0.0924898754027451,
0.448903273111376, -1.06435567068737, -1.16241932214223, 1.64852174670445,
-2.06209601933992, 0.0127497209649686, -1.08752834931043, 0.270539493204341,
1.00845187327274, -2.07440475425644, 0.896822271785611, -0.0499957668760446,
-1.34534931044337, -1.93121153434974, 0.709581583477091, -0.157905031916312,
0.216367872708493, 0.817362075775254, 1.72717575450325, -0.103770292463275,
-0.557122290824299, 1.42830142987646, -0.892957402100141, -1.15757124007267,
-0.530296454929057, 2.44568275766683, -0.832495797997412, 0.413519848829901,
-1.17868314072463, -1.17403475846922)
or simply:
set.seed(100)
x <- rnorm(100)