Nonparametric Identification from Order Statistics

Question

Suppose a vector of random variables $(X_1,\ldots,X_n,Y_1,\ldots,Y_m)$ is such that the $X_i$ are independently and identically distributed as $F(\cdot)$ and the $Y_j$ as $G(\cdot)$. We observe only the $n+m$ ordered values $(Z_1<Z_2<\cdots<Z_{n+m})$. The question is: can we recover the two CDFs $F(\cdot)$ and $G(\cdot)$ from the $Z$'s?

Edit:

When I said "recover" I meant "identify", which is a different problem from estimation, though a related one. The idea is that if I have "infinitely" many iid observations from $F$, I can identify and then estimate $F$ (using the ECDF and invoking some consistency result). Here I only know $Z_1<\ldots <Z_{n+m}$, and I know that two CDFs generate the $Z$'s. I am sorry if my statement was confusing.

You can't even "recover" $F$ from an iid sample of $F$--you can only estimate it based on assumptions about $F$. Unless you provide additional information about the relationship between $F$ and $G$, then isn't it obvious there is no way to separate the information for $F$ from that for $G$ in this setting? That leads me to suspect you have abstracted away some essential information about the problem you actually have. If that's so, could you edit this question to include enough information to obtain a solution? — whuber, Oct 25 '15 at 18:23

score 1 · Answer 1 · answered Oct 26 '15 at 12:33

While I do not provide a whole solution, I will at least try to provide a starting point for formalizing and answering your question. I hope you or someone else can fill in the details.

So you have distribution functions $F$ and $G$ and a random vector (for $n \in \mathbb N$):

$$ (X,Y) \sim F^n\otimes G^n$$

And you are asking whether $F$ and $G$ are identifiable from the distribution of the order statistics of $(X,Y)$ (after "unpacking" it).

I think that to answer this question it is more instructive to look at the case $n=1$.

$$ (X,Y) \sim F\otimes G$$

Now you are asking: Are $(F,G)$ identifiable from the distribution of $(\min(X,Y), \max(X,Y))$?

Since

$$\Pr[\min(X,Y) \leq t] = F(t) + G(t) - F(t)G(t)$$

and

$$\Pr[\max(X,Y) \leq t] = F(t)G(t)$$

you see that you can identify some of the information about $F$ and $G$. You also have available information about the covariance of $(\min(X,Y), \max(X,Y))$ (more generally: their copula) which I have not analyzed here.
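The two formulas above can be sanity-checked by simulation. The sketch below is only an illustration: the choices $F=\text{Uniform}(0,1)$ and $G=\text{Exponential}(1)$, the function name `check_min_max_cdfs`, and the evaluation point `t=0.5` are all assumptions made for the example, not part of the question.

```python
import math
import random

def check_min_max_cdfs(t=0.5, n_draws=100_000, seed=0):
    """Compare empirical and theoretical cdfs of min(X, Y) and max(X, Y) at t.

    Assumed for illustration: X ~ Uniform(0, 1), Y ~ Exponential(1), independent.
    """
    rng = random.Random(seed)
    F_t = min(max(t, 0.0), 1.0)                 # Uniform(0, 1) cdf at t
    G_t = 1.0 - math.exp(-t) if t > 0 else 0.0  # Exponential(1) cdf at t

    hits_min = 0
    hits_max = 0
    for _ in range(n_draws):
        x = rng.random()
        y = rng.expovariate(1.0)
        hits_min += min(x, y) <= t
        hits_max += max(x, y) <= t

    return {
        "emp_min": hits_min / n_draws,
        "theo_min": F_t + G_t - F_t * G_t,  # Pr[min(X, Y) <= t]
        "emp_max": hits_max / n_draws,
        "theo_max": F_t * G_t,              # Pr[max(X, Y) <= t]
    }

result = check_min_max_cdfs()
print(result)
```

The empirical and theoretical values should agree up to Monte Carlo error, which only confirms the two marginal formulas; it says nothing about the joint behaviour of the pair.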

But is this enough to identify $F$ and $G$? Obviously not, because you cannot distinguish between $F\otimes G$ and $G\otimes F$. Is the pair identifiable up to "naming"? I think it might be under some further conditions, such as strict monotonicity of $F$ and $G$, but I have not thought them through.

I like your approach but my answer disagrees with yours, I'd appreciate any feedback. — Sergio Parreiras, Apr 07 '17 at 20:50

Sergio Parreiras · Answer 2 · 2017-04-08T15:20:51.557

The answer is yes; moreover, you can identify $F$ and $G$ using only two order statistics: $Z_1$ and $Z_{m+n}$.

As in air's answer, let's consider the case $m=n=1$. Let $H(t)=\Pr[Z_1\le t]$ and $I(t)=\Pr[Z_2\le t]$ be the known cdfs of $Z_1=\min(X,Y)$ and $Z_2=\max(X,Y)$. Again, as in air's answer, we have that $H$ and $I$ are related to $F$ and $G$ by the system of equations:

\begin{align*} H&=1-(1-F)(1-G)=F+G-F\,G\\ I&= F\,G \end{align*}

Thus, since $F=I/G$ by the second equation, we obtain from the first equation that $H = \frac{I}{G}+G-I$, or $$ G^2- (H+I) G + I =0.$$

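Solving the quadratic equation and taking the larger root gives us $G=\frac{(H+I)+\sqrt{(H+I)^2-4I}}{2}$, and so $F=\frac{2I}{(H+I)+\sqrt{(H+I)^2-4I}}$, which is the smaller root. (Since the two roots sum to $H+I=F+G$ and multiply to $I=FG$, they are exactly $F(t)$ and $G(t)$. Both lie in $[0,1]$, so assigning the larger one to $G$ is a labeling convention rather than something forced by $1\ge H\ge I \ge 0$.) That is, we identified $F(t)$ and $G(t)$ up to this labeling.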
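As a quick numeric check of the quadratic above, the sketch below starts from hypothetical values $F(t)=0.2$ and $G(t)=0.5$ (chosen purely for illustration), forms $H(t)$ and $I(t)$, and solves $G^2-(H+I)G+I=0$. The helper name `roots_from_min_max` is made up for this example.

```python
import math

def roots_from_min_max(H, I):
    """Return (smaller, larger) roots of x^2 - (H + I) x + I = 0.

    H and I are the cdf values of min(X, Y) and max(X, Y) at a fixed t.
    """
    s = H + I
    disc = max(s * s - 4.0 * I, 0.0)  # clip tiny negative rounding error
    larger = (s + math.sqrt(disc)) / 2.0
    # Same as 2I / ((H + I) + sqrt(...)), the expression given for F above.
    smaller = I / larger if larger > 0 else 0.0
    return smaller, larger

# Hypothetical cdf values at some fixed t (assumptions for the example).
F_t, G_t = 0.2, 0.5
H_t = F_t + G_t - F_t * G_t
I_t = F_t * G_t

smaller, larger = roots_from_min_max(H_t, I_t)
print(smaller, larger)
```

Both roots are returned on purpose: by construction they coincide with the pair $\{F(t), G(t)\}$, and the caller decides which one to call $G$.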

Now consider the general case, with $H(t)=\Pr[Z_1\le t]$ and $I(t)=\Pr[Z_{m+n}\le t]$, so that:

\begin{align*} H&=1-(1-F)^n(1-G)^m\\ I&= F^n\,G^m \end{align*}

Although we might not be able to solve for $F$ and $G$ explicitly, it is still possible to solve numerically for $F(t)$ and $G(t)$ given the values of $H(t)$ and $I(t)$. Thus, you can identify $F$ and $G$ using only two order statistics: $Z_1$ and $Z_{m+n}$.
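One way the numerical solve could look, as a rough sketch rather than a definitive method: use $I=F^nG^m$ to write $G=(I/F^n)^{1/m}$, substitute into the equation for $H$, and search for roots in $F$ on $[I^{1/n},1]$. The function name `solve_general`, the grid size, and the test values $n=2$, $m=1$, $F(t)=0.3$, $G(t)=0.6$ are assumptions for the example. Because the reduced one-dimensional equation need not be monotone in $F$, the sketch scans for every sign change instead of assuming a unique root.

```python
def solve_general(H, I, n, m, grid=2000, tol=1e-12):
    """Find all (F, G) with H = 1 - (1-F)^n (1-G)^m and I = F^n G^m.

    Assumes 0 < I <= H < 1 and positive integers n, m.
    """
    if not (0.0 < I <= H < 1.0):
        raise ValueError("expected 0 < I <= H < 1")

    def g_of(f):
        # G implied by I = F^n G^m for a candidate F.
        return (I / f**n) ** (1.0 / m)

    def phi(f):
        # Residual of the H equation after substituting G = g_of(F).
        return (1.0 - f) ** n * (1.0 - g_of(f)) ** m - (1.0 - H)

    lo = I ** (1.0 / n)  # below this value the implied G would exceed 1
    xs = [lo + (1.0 - lo) * k / grid for k in range(grid + 1)]

    solutions = []
    for a, b in zip(xs, xs[1:]):
        left_negative = phi(a) < 0
        if left_negative == (phi(b) < 0):
            continue  # no sign change on this grid cell
        # Bisection, keeping the sign of phi at the left end fixed.
        while b - a > tol:
            mid = 0.5 * (a + b)
            if (phi(mid) < 0) == left_negative:
                a = mid
            else:
                b = mid
        f = 0.5 * (a + b)
        solutions.append((f, g_of(f)))
    return solutions

# Hypothetical example: n = 2, m = 1, true values F(t) = 0.3, G(t) = 0.6.
n, m, F_t, G_t = 2, 1, 0.3, 0.6
H_t = 1.0 - (1.0 - F_t) ** n * (1.0 - G_t) ** m
I_t = F_t**n * G_t**m
print(solve_general(H_t, I_t, n, m))
```

For these values the scan finds two pairs, roughly $(0.3, 0.6)$ and $(0.5, 0.216)$, both of which satisfy the two equations, so $H(t)$ and $I(t)$ alone may leave more than one candidate at a given $t$.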
