Question on Covariance for sampling without replacement

Question

Suppose I have the numbers $1, 2, \ldots, 10$ and I sample 5 of them at random without replacement, denoting the draws by $X_1, X_2, X_3, X_4, X_5$. What is $Cov(X_i,X_j)$ for $i \ne j$?

So $Cov(X_i,X_j)=E(X_iX_j)-E(X_i)E(X_j)$

Each $X_i$ taken on its own is uniform on $\{1, \ldots, 10\}$, so $E(X_i)=E(X_j)=11/2$.

For $E(X_iX_j)$ I am a bit stuck.

I first considered the joint pmf $f(x_i,x_j)=f(x_i \mid x_j)f(x_j)=\frac{1}{(n-1)n}=\frac{1}{90}$ for $x_i \ne x_j$, but this seems wrong.

@StasK It does not tally with the answer I'm expecting, but I may have made a mistake in finding $E(X_iX_j)$ using this joint pdf. — user164144, Aug 21 '17 at 21:15
Well, then, show your derivations from that point on (edit the question to show more work). Since $X_i$ and $X_j$ are exchangeable, you should eventually get a covariance of zero... that's my intuition, at least. — StasK, Aug 21 '17 at 21:18
$E(X_i) = (1+2+\cdots+10)/10=11/2,$ not $11/5$. @StasK Please check your intuition: the lack of independence has to create a small *negative* correlation: each observation excludes any repetition of itself among all the other observations. — whuber, Aug 21 '17 at 21:21
Yes, my mistake. I corrected $E(X_i)$ already. So, yes I am certain they are not independent $(Cov\not=0)$ since they are sampled without replacement. Any hint on how I can get to $E(X_iX_j)$? — user164144, Aug 21 '17 at 21:25
This is a weird result (as many finite population/sampling results are). But this may be a manifestation of some sort of finite population correction for covariance. — StasK, Aug 22 '17 at 12:59
Aha @whuber, I figured where my intuition failed me. First, I am used to thinking about correlation of the basic random variables in sampling, which are the sampling indicators -- and these are indeed uncorrelated under SRSWOR. Second, conditional on $X_1=x_j$ where $X_1$ is the first random draw, and $x_j \in \mathcal P$ is the $j$-th unit in population, the expected value ${\mathbb E}\,[X_2|X_1]=1/(N-1) (T[x] - x_j) = 1/(N-1) (T[x]-X_1)$ which is clearly negatively correlated with $X_1$. I need to teach sampling again to refresh crap like that in my head :). — StasK, Sep 19 '17 at 14:22

score 9 · Accepted Answer · answered Aug 21 '17 at 22:06

Problems in sampling from finite populations without replacement can usually be solved in terms of the sample inclusion probabilities $\pi(x)$, $\pi(x,y)$, etc.

Let $\pi(x) = \Pr(X_1 = x)$ for any $x$ in the population $\mathcal P$ (with $n=10$ elements) and let $\pi(x,y)=\Pr((X_1,X_2)=(x,y))$ for any $x$ and $y$ in $\mathcal P$. By definition of expectation,

$$E(X_1) = \sum_{x\in\mathcal P} \pi(x)x\tag{1}$$

and

$$E(X_1X_2) = \sum_{(x,y)\in\mathcal{P}^2} \pi(x,y)x y \tag{2}.$$

For this sampling procedure $X_1$ has equal chances of being any of the $n$ elements of $\mathcal P $, whence $$\pi(x)=\frac{1}{n}\tag{3}$$ for all $x$. Because sampling is without replacement, only the pairs $(x,y)$ with $x\ne y$ are possible, but all $n(n-1)$ of those are equally likely. Therefore

$$\pi(x,y) = \left\{\matrix{\frac{1}{n(n-1)} & x\ne y \\ 0 & x=y} \right.\tag{4}$$

That's the general result. For any particular population, you just have to do the arithmetic implied by formulae $(1)$ through $(4)$.
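As a rough illustration of how formulae $(1)$ through $(4)$ turn into arithmetic, here is a minimal R sketch. The function name `srswor_moments` and the example population `c(2, 3, 5, 7, 11)` are made up for illustration and are not part of the original answer.

```r
# Sketch: moments of the first two draws when sampling without replacement,
# computed directly from formulae (1)-(4).
# `srswor_moments` and the example population are hypothetical.
srswor_moments <- function(pop) {
  n <- length(pop)
  pi_x  <- 1 / n              # formula (3)
  pi_xy <- 1 / (n * (n - 1))  # formula (4), pairs of distinct units

  e_x1 <- sum(pi_x * pop)     # formula (1)

  # Formula (2): sum x * y over all ordered pairs of distinct units.
  # outer() gives every product; the diagonal pairs a unit with itself,
  # which has probability zero, so it is subtracted.
  prod_xy <- outer(pop, pop)
  e_x1x2 <- pi_xy * (sum(prod_xy) - sum(diag(prod_xy)))

  c(E_X1 = e_x1, E_X1X2 = e_x1x2, Cov = e_x1x2 - e_x1^2)
}

srswor_moments(c(2, 3, 5, 7, 11))
```

Using `outer()` and discarding the diagonal mirrors the sum over pairs $x \ne y$ in $(2)$ and $(4)$. It builds all $n^2$ products, so it is only sensible for small populations.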

Suppose now that $\mathcal{P} = \{1,2,\ldots, n\}$. Formulae $(1)$ and $(3)$ give

$$E(X_1) = \sum_{i=1}^{n} \frac{1}{n} i = \frac{n+1}{2}$$

while formulae $(2)$ and $(4)$ give

$$\eqalign{E(X_1X_2) &= \sum_{i,j=1;\, i\ne j}^{n} \frac{1}{n(n-1)} i j \\ &= \frac{1}{n(n-1)}\left(\sum_{i=1}^{n}\sum_{j=1}^{n} i j - \sum_{i=1}^{n} i^2\right)\\ &= \frac{1}{n(n-1)}\left(\sum_{i=1}^{n}i\ \sum_{j=1}^{n} j - \sum_{i=1}^{n} i^2\right)\\ &= \frac{1}{n(n-1)}\left(\left(\frac{n(n+1)}{2}\right)^2 - \frac{n(1+n)(1+2n)}{6}\right) \\ &= \frac{3n^2 + 5n + 2}{12}. }$$

Because there is no distinction among any of the $X_i$, these results hold for any $i \ne j$, not just $i=1$ and $j=2$. In particular,

$$\operatorname{Cov}(X_i,X_j) = E(X_iX_j) - E(X_i)E(X_j) = \frac{3n^2 + 5n + 2}{12} - \left(\frac{n+1}{2}\right)^2 = -\frac{n+1}{12}.$$
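The algebra above can be double-checked by brute force. The following sketch enumerates all $n(n-1)$ equally likely ordered pairs for $\mathcal{P} = \{1,2,\ldots,n\}$ and puts the enumerated values next to the closed forms. The helper name `check_closed_form` and the chosen values of $n$ are assumptions for illustration only.

```r
# Sketch: compare enumeration over ordered pairs with the closed forms
# for the population {1, ..., n}. `check_closed_form` is a hypothetical helper.
check_closed_form <- function(n) {
  pairs <- expand.grid(x = seq_len(n), y = seq_len(n))
  pairs <- pairs[pairs$x != pairs$y, ]   # keep the n(n-1) pairs with x != y

  e_x1x2 <- mean(pairs$x * pairs$y)      # each remaining pair is equally likely
  e_x1   <- mean(seq_len(n))

  c(n = n,
    E_X1X2_enum    = e_x1x2,
    E_X1X2_formula = (3 * n^2 + 5 * n + 2) / 12,
    cov_enum       = e_x1x2 - e_x1^2,
    cov_formula    = -(n + 1) / 12)
}

# One row per value of n; the "enum" and "formula" columns should agree.
t(sapply(c(2, 5, 10, 20), check_closed_form))
```

Unlike a simulation, this involves no randomness, so any disagreement between the paired columns beyond floating-point error would point to a mistake in the algebra.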

When $n=10$, the covariance of $X_i$ and $X_j$ is $-11/12 \approx -0.917$. As a check, here is a simulation of a million such samples (using R):

    cov(t(replicate(1e6, sample.int(10, 5))))

The output is the $5\times 5$ covariance matrix of $(X_1, \ldots, X_5)$. Because this is a simulation the output is random; but because it's a largish simulation, it's reasonably stable from one run to the next. In the first simulation I did, the off-diagonal elements of this covariance matrix ranged from $-0.9277$ to $-0.9080$ with a mean of $-0.9169$: narrowly spread around $-11/12$ as one would expect.
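The diagonal of that matrix can be sanity-checked in the same spirit. Assuming the standard variance of a discrete uniform distribution on $\{1,\ldots,n\}$ (a fact not derived in this answer), the variance and the implied correlation would be

$$\operatorname{Var}(X_i) = \frac{n^2-1}{12}, \qquad \operatorname{Corr}(X_i,X_j) = \frac{-(n+1)/12}{(n^2-1)/12} = -\frac{1}{n-1}.$$

For $n=10$ this gives $\operatorname{Var}(X_i) = 99/12 = 8.25$ and a correlation of $-1/9 \approx -0.111$, so the diagonal entries of the simulated matrix should sit near $8.25$.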

Thank you very much! My mistake was not getting $E(X_iX_j)$ right despite figuring out the density of $f(x_1,x_2)$ — user164144, Aug 21 '17 at 23:15
