
This is a technical problem I encountered in my research. Apologies if it seems simple to professional statisticians.

Let $X_1, \cdots, X_n$ be iid standard Cauchy random variables. What estimate or exact formula exists for the median of $\sum_{i=1}^n |X_i|$? For $n=1$, it can be worked out explicitly that the median is $1$: $$ \mathbb{P}\left[|X| \leq t\right] = \frac{2}{\pi}\int_0^t\; \frac{1}{1+x^2}\; dx = \frac{2}{\pi} \left.\arctan x \right|_{0}^t = \frac{2}{\pi} \arctan t, $$ and setting this equal to $1/2$ gives $t = \tan(\pi/4) = 1$.

What about $n>1$? Any suggestions?
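For reference, here is a minimal simulation sketch in R of the quantity in question; the sample size and replication count are arbitrary illustration values, not part of the problem.

# Simulation sketch: estimate the median of sum_{i=1}^n |X_i| for iid
# standard Cauchy X_i. The values of n and n.sim are arbitrary choices.
set.seed(1)
n     <- 100     # number of terms in the sum
n.sim <- 1e4     # number of simulated sums
sums  <- replicate(n.sim, sum(abs(rcauchy(n))))
median(sums)     # simulated median of the sum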

Ju Sun
  • It's pretty hopeless for $n=2$ to get an analytic form $\Pr(X_1+X_2\,\ldots$ – probabilityislogic Feb 05 '12 at 15:24
  • Thanks for the edits. Can you give some more detail on what you are looking for? For example: Are you trying to *prove* something, so, e.g., an upper and/or lower bound would be useful? What size of $n$ are you concerned with? Or do you want something asymptotic, like a growth rate? Sorry for so many questions. I generally agree with @probabilityislogic that it looks unlikely you'll get a nice closed-form expression for general $n$. :) – cardinal Feb 05 '12 at 19:17
  • @cardinal: Thanks for the reply and clarification. I'm mostly concerned with how the median would grow with $n$. My simulations showed it's very roughly $n\log n$. I hope some rigorous analysis on this can be found. – Ju Sun Feb 05 '12 at 21:38

2 Answers


One way to deduce the simulated $n\log(n)$ behavior is through truncation.

Generally, for any distribution $F$ and $0\le \alpha\lt 1$ let $F_\alpha$ be truncated symmetrically in both tails so that for $F^{-1}(\alpha/2)\le x \le F^{-1}(1-\alpha/2),$

$$F_\alpha(x) = \min{\left(1, \frac{F(x) - \alpha/2}{1-\alpha}\right)}.$$

Let $E_\alpha$ be the expectation of $F_\alpha$ defined by

$$E_\alpha = \int_{\mathbb R} x\,\mathrm{d}F_\alpha(x).$$

Because the support of $F_\alpha$ is bounded, it has finite expectation and variance. The Central Limit Theorem tells us that for sufficiently large sample sizes $n,$ the sum of $n$ iid random variables with $F_\alpha$ for their common distribution is approximately Normal with mean $n E_\alpha.$ This approximate Normality implies the median of the sum distribution also is approximately $n E_\alpha.$

Now, the median of a sum of independent variables with distribution $F$ won't be quite the same as this approximate median: it must lie somewhere between the $1/2-\alpha/2$ and $1/2+\alpha/2$ quantiles. In the present case, where $F$ is nonzero in a neighborhood of its median, this means that by taking $\alpha$ sufficiently small (but still nonzero), we can make the median of $F_\alpha$ as close as we like to the median of $F=F_0.$
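As an illustrative check of this step (not part of the argument itself), one can compute $E_\alpha$ numerically for the half-Cauchy distribution from the question and verify by simulation that the median of a sum of $n$ draws from $F_\alpha$ lands near $nE_\alpha$. The values of $\alpha$, $n$, and the replication count below are arbitrary, and the truncation points anticipate the quantile formulas worked out below.

# Illustrative check: the median of a sum of n truncated half-Cauchy draws
# should be near n * E_alpha (the CLT step). alpha, n, n.sim are arbitrary.
set.seed(1)
alpha <- 0.01
n     <- 200
n.sim <- 1e4

x0 <- tan(pi * alpha / 4)          # F(x0) = alpha/2
x1 <- tan(pi/2 - pi * alpha / 4)   # F(x1) = 1 - alpha/2

# E_alpha by numerical integration of x dF_alpha(x)
E.alpha <- integrate(function(x) x * (2/pi) / (1 + x^2) / (1 - alpha),
                     lower = x0, upper = x1)$value

# Draws from F_alpha via the quantile transform of the half-Cauchy
r.trunc <- function(m) tan(pi * runif(m, alpha/2, 1 - alpha/2) / 2)
sums <- replicate(n.sim, sum(r.trunc(n)))

c(simulated.median = median(sums), n.E.alpha = n * E.alpha)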

It remains only to find the mean of $F_\alpha$ in the question, where $F$ is the half-Cauchy distribution determined by

$$F(x) = \frac{2}{\pi}\int_0^x \frac{\mathrm{d}t}{1+t^2} = \frac{2}{\pi}\tan^{-1}(x)$$

for $x\ge 0.$

The ensuing calculations are straightforward. First,

$$E_\alpha = \frac{2}{\pi(1-\alpha)}\int_{x_0}^{x_1} \frac{x\,\mathrm{d}x}{1+x^2} = \frac{1}{\pi(1-\alpha)} \log\frac{1+x_1^2}{1+x_0^2}$$

where $F(x_0) = \alpha/2$ and $F(x_1) = 1-\alpha/2.$ Thus

$$x_0 = \tan\left(\frac{\pi\alpha}{4}\right) \approx \frac{\pi\alpha}{4},\quad x_1 = \tan\left(\frac{\pi}{2}-\frac{\pi\alpha}{4}\right) = \frac{1}{x_0} \approx \frac{4}{\pi\alpha},$$

whence

$$E_\alpha \approx \frac{1}{\pi} \log \frac{1 + \left(\frac{4}{\pi\alpha}\right)^2}{1 + \left(\frac{\pi\alpha}{4}\right)^2} \approx \frac{2}{\pi}\log\left(\frac{4}{\pi\alpha}\right).$$
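A quick numerical sanity check of these expressions (illustrative only; the value of $\alpha$ is an arbitrary small number):

# Compare E_alpha by numerical integration, the exact log formula above,
# and the approximation (2/pi) * log(4/(pi*alpha)). alpha is arbitrary.
alpha <- 0.01
x0 <- tan(pi * alpha / 4)
x1 <- 1 / x0

numerical <- integrate(function(x) x * (2/pi) / (1 + x^2) / (1 - alpha),
                       lower = x0, upper = x1)$value
exact  <- log((1 + x1^2) / (1 + x0^2)) / (pi * (1 - alpha))
approx <- (2/pi) * log(4 / (pi * alpha))

c(numerical = numerical, exact = exact, approx = approx)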

Finally, when taking a sample of size $n$ from $F$ we would like to ensure that most of the time the sample really is a sample of $F_\alpha,$ uncontaminated by any values in its tails. This happens with probability $(1-\alpha)^n.$ To make this chance exceed some threshold $1-\epsilon,$ with a tiny value of $\epsilon,$ we must have

$$\alpha \approx \frac{\epsilon}{n}.$$

Using this value in the foregoing gives

$$E_\alpha \approx \frac{2}{\pi}\log\left(\frac{4}{\pi\epsilon/n}\right) = \frac{2}{\pi}\left(\log\left(\frac{4}{\pi\epsilon}\right) + \log(n)\right).$$

Consequently, the median of the sum of $n$ iid half-Cauchy variables must be close to

$$nE_\alpha \approx \frac{2}{\pi}n\log(n) + \left(\frac{2}{\pi}\log\frac{4}{\pi\epsilon}\right)n.$$

For any desired $\epsilon\gt 0,$ we may choose $n$ so large that the value is dominated by the first term: that is how the $n\log n$ behavior arises. Moreover, now we have the implicit constant $2/\pi$ along with the asymptotic order of the error term, $O(n).$

Let the "relative median" be $nE_\alpha$ divided by $\left(2/\pi\right) n\log(n).$ Here is a plot of such relative medians as observed in 5,000 simulations of samples up to size 5,000. The reference red curve is a multiple of $1/\log(n),$ which this analysis indicates is proportional to the asymptotic relative error. The fit is good.

[Figure: simulated relative medians versus $n$ on log-log axes, with a red reference curve proportional to $1/\log(n)$.]

The R code to produce it can be used for additional simulation studies.

n.sim <- 5e3  # Simulation count
n <- 5e3      # Maximum sample size
# Each column is one simulated sequence of |Cauchy| draws (rt with 1 df is
# Cauchy); column-wise cumulative sums give the running totals, and the
# median across simulations is taken for each sample size 1..n.
X <- apply(apply(matrix(abs(rt(n*n.sim, 1)), n), 2, cumsum), 1, median)

N <- seq_len(n)[-1]        # Drop n==1: can't divide by n*log(n) when log(n)==0
X <- X[-1]
x.0 <- 2/pi * N * log(N)   # Theory
#
# Plot the relative medians.
#
plot(N, X / x.0, xlab="n", log="xy", cex=0.5, pch=19,
     ylab="Value",
     main="Simulated Relative Medians")
#
# Draw a reference curve.
#
fit <- lm(X / x.0 - 1 ~ 0 + I(1/log(N)))
summary(fit)
b <- coefficients(fit)[1]
curve(1 + b/log(n) , add=TRUE, xname="n", lwd=2, col="Red")
whuber
  • Your arguments and mine lead to similar integrals here. My informal statements of $x \sim y$ seem to correspond roughly to your statements of $P[|x-y| < O(n)]$ … – Matt F. Aug 12 '21 at 04:02

Here is one way to estimate the median in the question, using order statistics: \begin{align} \text{median}\left(\sum_{i=1}^n|X_i|\right) &=\text{median}\left(\sum_{i=1}^n|X|_{(i)}\right) && (1)\\ &\sim\sum_{i=1}^n\text{median}\left(|X|_{(i)}\right) && (2)\\ &\sim\sum_{i=1}^n F_{|X|}^{-1}\left(\frac{i}{n+1}\right) && (3)\\ &=\sum_{i=1}^n F_{X}^{-1}\left(\frac{1}{2}+\frac{i}{2n+2}\right) &&\\ &=\sum_{i=1}^n \tan\left(\frac{\pi i}{2n+2}\right) && \\ &\sim \sum_{i=1}^n \int_{x=i-1/2}^{i+1/2} \tan\left(\frac{\pi x}{2n+2}\right)dx && (4)\\ &=\int_{x=1/2}^{n+1/2} \tan\left(\frac{\pi x}{2n+2}\right)dx && \\ &=\frac{2n+2}{\pi}\log\left(\cot\left(\frac{\pi}{4n+4}\right)\right) && (5)\\ \end{align}

  • In (1), we wrote the sum as the sum of order statistics.
  • In (2), we replaced the median of the sum by the sum of the medians, which is the most questionable part of this estimate.
  • In (3), we used the standard estimate for the central value of an order statistic.
  • In (4), we approximated each tangent by its average over the surrounding unit interval $[i-1/2,\, i+1/2]$.
  • In (5), the cotangent of small $x$ is like $1/x$, so we could also write the final result as $O(n \log n)$, as predicted in the comments above. (A quick numerical check of this estimate is sketched below.)
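As an illustrative check (not part of the derivation above), the closed-form estimate in (5) can be compared with a simulated median of the sum; the values of n and the replication count below are arbitrary.

# Compare the order-statistics estimate (2n+2)/pi * log(cot(pi/(4n+4)))
# with a simulated median of sum |X_i|. n and n.sim are arbitrary choices.
set.seed(1)
n     <- 100
n.sim <- 1e4

estimate  <- (2*n + 2) / pi * log(1 / tan(pi / (4*n + 4)))
simulated <- median(replicate(n.sim, sum(abs(rcauchy(n)))))

c(order.stat.estimate = estimate, simulated.median = simulated)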
Matt F.