I am interested in the conditional probability that the median $X^*_{(m)}$ of a bootstrap sample $X_1^*,\ldots,X_n^*$, where $n=2m-1$ for an integer $m$, equals the $k$th order statistic $X_{(k)}$ of the original sample $X_1,\ldots,X_n$:

$$ p_k=P_*(X^*_{(m)}=X_{(k)}|X_1,\ldots,X_n) $$

The Jackknife and Bootstrap, Shao and Tu, p. 10, gives the following result:

$$ p_k=\sum_{j=0}^{m-1} {n\choose j}\frac{(k-1)^j(n-k+1)^{n-j}-k^j(n-k)^{n-j}}{n^n} $$

I tried to verify this result using the formula at https://en.wikipedia.org/wiki/Order_statistic#Dealing_with_discrete_variables (which seems sensible to me).

Since the $X_i^*$ are iid draws from the empirical cdf $F_n$ of the $X_i$, the $F(x)$ and $f(x)$ there would seem to correspond to $F_n(X_{(k)})=k/n$ and $f_n(X_{(k)})=1/n$ (assuming no ties among the $X_i$).

I then obtain $$ p_k=\sum_{j=0}^{m-1} {n\choose j}\frac{(n-k)^jk^{n-j}-(n-k+1)^j(k-1)^{n-j}}{n^n} $$
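For reference, here is the substitution behind this expression, as I read the linked Wikipedia section. The notation $p_1,p_2,p_3$ is taken from there, and the absence of ties among the $X_i$ is an assumption:

$$ p_1=P_*(X_i^*<X_{(k)})=\frac{k-1}{n},\qquad p_2=P_*(X_i^*=X_{(k)})=\frac{1}{n},\qquad p_3=P_*(X_i^*>X_{(k)})=\frac{n-k}{n} $$

$$ P_*(X^*_{(m)}\leq X_{(k)})=\sum_{j=0}^{n-m}{n\choose j}p_3^j(p_1+p_2)^{n-j},\qquad P_*(X^*_{(m)}< X_{(k)})=\sum_{j=0}^{n-m}{n\choose j}(p_2+p_3)^jp_1^{n-j} $$

Taking the difference of the two sums and using $n-m=m-1$ for $n=2m-1$ gives the expression above.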

These two expressions differ [but see EDIT below!], as can be illustrated by

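```r
# Shao and Tu's expression
bsprob <- function(m, n, k) {
  j <- 0:(m - 1)
  sum(choose(n, j) * ((k - 1)^j * (n - k + 1)^(n - j) - k^j * (n - k)^(n - j)) / n^n)
}

# my expression, based on the Wikipedia formula
bsprob2 <- function(m, n, k) {
  j <- 0:(m - 1)
  sum(choose(n, j) * ((n - k)^j * k^(n - j) - (n - k + 1)^j * (k - 1)^(n - j)) / n^n)
}

# note that n = 6 is not of the form 2m - 1 here
bsprob(4, 6, 3)
bsprob2(4, 6, 3)
```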

EDIT: They do appear to agree when $n=2m-1$, which is the case they are meant for, so I am currently trying to figure out why:

```r
m <- 6
bsprob(m, 2 * m - 1, 3)
bsprob2(m, 2 * m - 1, 3)
```
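A possible explanation, as far as I can tell: in the Wikipedia formula for the $m$th order statistic, the sum over $j$ (the number of draws above the threshold) runs up to $n-m$, and $n-m=m-1$ only when $n=2m-1$. The sketch below checks this numerically. `bsprob2_upper` and `bsprob_cdf` are names introduced here for illustration; the latter rewrites Shao and Tu's expression as a difference of binomial cdfs via `pbinom`.

```r
# Second expression with an explicit upper summation limit `upper`
# (hypothetical helper; upper = m - 1 reproduces the second formula as coded above)
bsprob2_upper <- function(m, n, k, upper) {
  j <- 0:upper
  sum(choose(n, j) * ((n - k)^j * k^(n - j) - (n - k + 1)^j * (k - 1)^(n - j)) / n^n)
}

# Shao and Tu's expression as a difference of binomial cdfs:
# P(at most m-1 draws <= X_(k-1)) - P(at most m-1 draws <= X_(k))
bsprob_cdf <- function(m, n, k) {
  pbinom(m - 1, n, (k - 1) / n) - pbinom(m - 1, n, k / n)
}

# n = 6, m = 4: the limits m - 1 = 3 and n - m = 2 differ
c(shao_tu         = bsprob_cdf(4, 6, 3),
  upper_m_minus_1 = bsprob2_upper(4, 6, 3, upper = 3),
  upper_n_minus_m = bsprob2_upper(4, 6, 3, upper = 2))
```

If this reading is right, the first and third entries should coincide while the second differs, which would account for the disagreement at $n=6$, $m=4$.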

EDIT 2, in response to @whuber's helpful comment:

The first formula can be motivated as follows, writing $P_*(m=k)$ as shorthand for $P_*(X^*_{(m)}=X_{(k)}|X_1,\ldots,X_n)$:

$$\begin{eqnarray*}
P_*(m=k)&=&P_*(m\leq k)-P_*(m\leq k-1)\\
&=&1-P_*(m\leq k-1)-(1-P_*(m\leq k))\\
&=&P_*(m> k-1)-P_*(m> k)\\
&=&P_*(\text{at most } m-1 \text{ obs.}\leq X_{(k-1)})-P_*(\text{at most } m-1 \text{ obs.}\leq X_{(k)})\\
&=&\sum_{j=0}^{m-1} {n\choose j}\frac{(k-1)^j(n-k+1)^{n-j}-k^j(n-k)^{n-j}}{n^n}
\end{eqnarray*}$$
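As a sanity check on this derivation, here is a minimal Monte Carlo sketch. It assumes no ties in the original sample, so that resampling the ranks $1,\ldots,n$ with replacement is equivalent to resampling the data. The seed, the choice $m=3$ and the number of replications `B` are arbitrary.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
m <- 3
n <- 2 * m - 1         # odd sample size, n = 5

# probabilities from the last line of the derivation, written via binomial cdfs
p_formula <- sapply(1:n, function(k)
  pbinom(m - 1, n, (k - 1) / n) - pbinom(m - 1, n, k / n))

# Monte Carlo: resample ranks with replacement and record the rank of the median
B <- 20000
med_rank <- replicate(B, sort(sample.int(n, n, replace = TRUE))[m])
p_sim <- tabulate(med_rank, nbins = n) / B

round(rbind(p_formula, p_sim), 3)
```

The two rows should agree up to simulation noise, and `p_formula` sums to one because the differences telescope.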

Christoph Hanck
Perhaps reformulating the question will make the formula clearer: you ask for the chance that (simultaneously) (a) at least $m$ of the values in a simple random sample without replacement will fall in the lowest $k/n$ of the population *and* (b) it is not the case that at least $m$ of the values fall in the lowest $(k-1)/n$ of the population. – whuber Jul 06 '21 at 11:39

@whuber, thanks, that indeed helps me understand "my" version of the probability. I tried to write (my understanding of Shao/Tu's version of) your reasoning down in the 2nd edit, hopefully correctly. – Christoph Hanck Jul 06 '21 at 15:22
