The distribution of the largest index in a sample that is smaller than the median of the population

Question

Consider a sorted sample $\vec{x}(n)$ with size $n > 100$ ($n$ is odd) from some rv $X$. Let $m$ denote the median of $X$. Let $R(\vec{x}(n))$ be the largest index $i \in [n]$ s.t. $x_i \leq m$. What is the distribution of $R(\vec{x}(n))$?

I was able to determine empirically by running the experiment multiple times that $R(\vec{x}(n))$ is normally distributed with $\mu = \frac{n}{2}$ and $\sigma^2 = \frac{n}{4}$, however I was not able to prove that. I'm looking for a proof.

Here's the code I used for the experiments:

def r(chosen, median):
    chosen.sort()
    if chosen[0] > median:
        return 0
    index = 1
    for first, second in zip(chosen, chosen[1:]):
        if second > median:
            return index
        index += 1
    return index

data = range(50000)
median = np.median(data)
n = 101

indices = []
for experiment in range(50000):
    chosen = random.choices(data, k=n)
    indices.append(r(chosen, median))

_, _ = plt.subplots(figsize=(16, 6))
sns.histplot(data=indices, bins=40, kde=True)
plt.show()
print('Empiric mean: {}'.format(np.mean(indices)))
print('Formula mean: {}'.format(n/2))
print('Empiric std: {}'.format(np.std(indices)))
print('Formula std: {}'.format(math.sqrt(n)/2))

This result is incorrect unless $X$ has a density in a neighborhood of its median. For correct answers please see https://stats.stackexchange.com/questions/45124. — whuber, Dec 26 '21 at 21:01
The link in https://stats.stackexchange.com/questions/45124 examines the asymptotic distribution of the sample median, and not the actual distribution like this question — svendvn, Dec 27 '21 at 01:25
@svendm That is not correct: my answer in that thread gives the exact distribution for any sample size. — whuber, Dec 27 '21 at 16:26

score 1 · Accepted Answer · answered Dec 26 '21 at 23:05

Assuming the sample comes from a continuous distribution, that is for $P(x\le med)=0.5~~\forall x\in\vec{x_n}$ to hold, i'm gonna prove that $R(\vec{x_n})\sim B(n,p=0.5)$.

Consider the (unsorted) vector of samples $\vec{x_n}$. Every entry has a probability of $0.5$ to be located to the right of the median and $0.5$ probability to be located to the left of it. We consider being smaller equal than the median as a success.
Say we have $k$ successes, that means we have $k$ entries which are smaller equal than $m$ and $n-k$ entries bigger than $m$. We now sort $\vec{x_n}$. As $k$ entries are less equal than $m$, these must be $x_1,x_{2},...,x_k$. Hence $x_{k}$ is less equal to $m$ and it's the largest of all of them (as the list is now sorted). namely: $$R(\vec{x_n})=k$$

Let $Y\sim B(n,0.5)$. We actually got that: $$P\{R(\vec{x_n})=k\}=P\{Y=k\}= {n\choose k}p^k(1-p)^{n-k}=\frac{{n\choose k}}{2^n}$$ Moreover: $$E[R(\vec{x_n})]=np=\frac{n}{2}$$ $$V[R(\vec{x_n})]=np(1-p)=\frac{n}{4}$$ As you suggested.

If the population distribution is not continuous, the result is simply $R(\vec{x_n}) \sim B(n, P(X<=m))$ using the same arguments that you have used in this answer :) — svendvn, Dec 27 '21 at 01:27

The distribution of the largest index in a sample that is smaller than the median of the population

1 Answers1