Let's begin with a simple case with a known answer:
Suppose we have samples $x_1,...,x_N$ drawn i.i.d. from a real-valued distribution $D$. A common first step in analyzing this data might be to plot the empirical CDF of the data.
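For concreteness, here is a minimal sketch of the empirical CDF (the choice of standard-normal $D$ and $N=101$ is just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(101)   # hypothetical i.i.d. sample from D

# The empirical CDF puts mass 1/N on each sample: F_hat(t) = #{x_i <= t} / N.
xs = np.sort(x)
ecdf = np.arange(1, len(xs) + 1) / len(xs)

# Value of F_hat at the sample median (the 51st smallest of 101 points):
print(xs[50], ecdf[50])   # ecdf[50] = 51/101
```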
The empirical CDF is useful because it tells us a little about the true CDF of $D$. For example, if we draw $N=101$ samples, the 51st smallest sample (i.e., the median of the empirical distribution) is probably a reasonable proxy for the actual median of $D$. This intuition can be made precise: as long as $D$ is a continuous distribution, the true quantile (under $D$) of the $k$-th smallest of the $N$ samples is distributed $\mathrm{Beta}(k, N-k+1)$ (cf., e.g., any introduction to order statistics). For the remainder of this question, please take all real-valued distributions to be continuous.
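This order-statistic fact is easy to check by simulation. In the sketch below (parameters are hypothetical), I take $D = \mathrm{Uniform}(0,1)$ so that the CDF is the identity and the true quantile of a sample is the sample itself:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 101, 51            # hypothetical sample size and rank (k-th smallest)
trials = 20_000

# For D = Uniform(0, 1), F(x) = x, so the quantile of the k-th order
# statistic is the order statistic itself.
samples = rng.uniform(size=(trials, N))
kth_quantile = np.sort(samples, axis=1)[:, k - 1]   # F(X_(k)) across trials

# Beta(k, N - k + 1) moments for comparison.
a, b = k, N - k + 1
beta_mean = a / (a + b)                              # = 0.5 here
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))

print(kth_quantile.mean(), beta_mean)
print(kth_quantile.var(), beta_var)
```

With $k=51$ and $N=101$ both moments match those of $\mathrm{Beta}(51, 51)$, i.e., the empirical median concentrates around the true median.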
These beta distributions provide a strong characterization of $D$ under minimal assumptions: the result is useful both in theory and in practice. I'm looking for a similar result when the distribution is hierarchical.
As an example, suppose we choose $N$ people. Each person has a distribution for how many hours they sleep each night, say $D_k$ for person $k$. We collect $S_k$ samples $x_{k,1},x_{k,2},...,x_{k,S_k}$ from person $k$. For instance, we may collect a single data point from person 1, but 100 data points from person 2.
If we weight by samples (i.e., ignore the fact that repeated samples come from the same person), we would recover the situation above. If we weight by people, the empirical distribution is clear: we put probability $1/(NS_k)$ on sample $x_{k,i}$. But how do these values relate to the underlying distribution of quantiles? (I.e., what plays the role of the beta distributions?)
There seem to be (at least) three ways to make this question precise, and I'd very much appreciate any advice on thinking about them. I'll present them in (I believe) increasing order of complexity. Incidentally, to avoid some technicalities, it may be useful to assume that the distributions (in addition to being continuous) all share the same support.
In all cases below, when I ask to estimate an underlying distribution, I mean: for an observation $x_{k,i}$, we wish to compute the probability that $x_{k,i}$ corresponds to any given quantile.
Question #1: Assume that our entire population consists of these $N$ people. We are trying to estimate quantiles for: $$F=(1/N)\sum_{k=1}^N D_k$$
We can answer this question as follows. First, compute the quantile distribution (in $D_k$) for each sampled point using the machinery above. The distribution across quantiles is then the mixture of these beta distributions, with weight $1/(NS_k)$ on each point from person $k$ (matching the empirical weighting above). Note that this also works with a weighted distribution across people: we simply replace $1/N$ with the appropriate weight.
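A minimal sketch of this construction (the per-person sample counts are hypothetical; only the ranks and counts matter, not the observed values): draw a person uniformly, one of that person's samples uniformly, then that sample's quantile from the corresponding beta distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-person sample counts S_k (cf. one point from person 1,
# 100 from person 2); the observed values themselves don't matter here.
sample_counts = [1, 100, 5]
N = len(sample_counts)

def draw_quantiles(n_draws=200_000):
    """Draw from the mixture: person uniform over 1..N, sample uniform over
    that person's points (weight 1/(N*S_k) each), and the sample's quantile
    ~ Beta(r, S_k - r + 1), where r is its rank among the person's points."""
    persons = rng.integers(N, size=n_draws)
    out = np.empty(n_draws)
    for k in range(N):
        mask = persons == k
        S = sample_counts[k]
        ranks = rng.integers(1, S + 1, size=mask.sum())
        out[mask] = rng.beta(ranks, S - ranks + 1)
    return out

q = draw_quantiles()
print(q.mean())   # each person's rank-uniform beta mixture has mean 1/2
```

The mean is $1/2$ regardless of the $S_k$, since $\mathbb{E}[\mathrm{Beta}(r, S-r+1)] = r/(S+1)$ averages to $1/2$ over a uniform rank; the spread, however, depends on how many points each person contributes.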
Question #2: Assume that there is an underlying population of $M\gg N$ people, and we are trying to estimate: $$G=(1/M)\sum_{k=1}^M D_k$$
Question #3: We have a possibly infinite population of people, where person $i$ is associated with some real-valued distribution $D_i$. We sample $i$ (and thus $D_i$) from some distribution $P$ over distributions; we are trying to estimate the distribution $$H=E_{p\sim P}[D_p]$$
I would most like the answer to Question #3, but would settle for Question #2. Any help (or advice on reframing this question, if appropriate) would be greatly appreciated.