2

Select $n$ numbers without replacement from the set $\{1,2,...,m\}$, and generate the set $S=\{a_1,a_2,...,a_n\}$. I want to calculate the expectation of the variance for the sampling set $\mathbb{E}[Var(S)]$ and the maximum variance among all samples : $\max{Var(S)}$.

Besides, what's the distribution of the sample variance?

Theta30
  • 625
  • 3
  • 16
Fan Zhang
  • 399
  • 2
  • 8
  • It looks like a homework, so my given answer is not precise, since giving detailed answers to homework problems is not beneficial. – mpiktas Jun 23 '11 at 07:50
  • @mp A glance at @Fan's [previous question](http://stats.stackexchange.com/q/10981) indicates this is not homework. – whuber Jun 23 '11 at 16:12

2 Answers2

2

I will give the hints for the first question. I assume that you are sampling without a replacement from the set $A=\{a_1,...,a_m\}$. We have that

$$P(S=\{a_1,...,a_n\})=\frac{1}{m\choose n}$$

So

$EVar(S)=\sum_{\{i_1,...,i_n\}\subset \{1,...,m\}}\left(\frac{1}{n}\sum_{j=1}^n a_{i_j}^2-\left(\frac{1}{n}\sum_{j=1}^na_{i_j}\right)^2\right)\frac{1}{m\choose n}$

Now

$$\sum_{\{i_1,...,i_n\}\subset \{1,...,m\}}a_{i_1}^2={{m-1}\choose{n-1}}\sum_{i=1}^ma_i^2 $$

and

$$\sum_{\{i_1,...,i_n\}\subset \{1,...,m\}}a_{i_1}a_{i_2}={{m-2}\choose {n-2}}\sum_{i\neq j}^ma_ia_j$$

So the first term in expectation will be

$$\sum_{\{i_1,...,i_n\}\subset \{1,...,m\}}\frac{1}{n}\sum_{j=1}^n a_{i_j}^2\frac{1}{m\choose n}=\sum_{i=1}^na_i^2\frac{1}{m\choose n}{{m-1}\choose {n-1}}=\frac{1}{m}\sum_{i=1}^ma_i^2$$

I'll leave the second term as an exercise. The end result should be that

$$EVar(S)=Var(a_1,...,a_m)$$

maybe with some constants missing. As this is a standard result in survey sampling theory, you can look it up in appropriate book.

As for the second question, I do not think there is a closed formula. The case with $m=3$ and $n=2$ illustrates this. Then there are 3 possible samples and $Var S$ can get three values $(a_1-a_2)^2/4$, $(a_2-a_3)^2/4$ and $(a_1-a_3)^2/4$. The maximum depends on the set $A=\{a_1,a_2,a_3\}$.

mpiktas
  • 33,140
  • 5
  • 82
  • 138
  • @mpiktas I think there is something wrong with your result. The EVar(S) only depends on {a1,...,an} in your result. But I think it should depend on {a1,...,am}. – Fan Zhang Jun 23 '11 at 08:28
  • @Fan, thanks, I fixed it. I usually use $n$ to indicate the size of the sample, hence the mistake. – mpiktas Jun 23 '11 at 08:36
  • @mpiktas Thank you. Do you have any idea about Max Var(S), i think this one is more difficult than the expectation. – Fan Zhang Jun 23 '11 at 09:53
  • @Fan, yes the second one is more difficult. I suggest checking the textbooks on survery sampling to get acquainted with better machinery. Where does this problem come from? – mpiktas Jun 23 '11 at 10:16
  • @mpiktas Actually, this is a problem from myself. I want to know what pattern will generate the maximum variance. Note that the set is fixed {1,2,...,m}, for m=10, n=5, the set {1,2,3,9,10} has the biggest variance after I have enumerated all possible combinations. – Fan Zhang Jun 23 '11 at 11:48
  • @Fan, for the set $\{1,...,m\}$ the maximum variance I suspect be achieved with the sample $\{1,2,...,n/2,m-n/2+1,...,m\}$ with suitable adjustment for odd $n$. The question is how to prove it. – mpiktas Jun 23 '11 at 12:06
  • @FanZhang let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/608/discussion-between-mpiktas-and-fan-zhang) – mpiktas Jun 23 '11 at 12:06
  • @mpiktas I have some problem with loging on the chat room, I haved enumerated other cases, it seems your conjuecture is right. Besides, I have another similar problem which maybe you could help me, can i send it to your email? – Fan Zhang Jun 23 '11 at 13:58
2

We know that

$$\widehat{Var}(\mathbf{a}) = \frac{1}{n-1}\left(\sum_{i=1}^n a_i^2 - \frac{1}{n}\left(\sum_{i=1}^n a_i \right)^2 \right)$$

is an unbiased estimator of the population variance, which is easily computed as $(m+1)m/12$. This, therefore, answers the first question concerning the expected variance.

I will only sketch how to maximize the variance. I claim it is maximized when the $a_i$ are in two contiguous blocks: that is, $\mathbf{a}$ is in the form

$$\mathbf{a} = (1, 2, \ldots, k, m-l+1, m-l+2, \ldots, m).$$

(Evidently $k+l = n$.) To prove this claim, suppose $\mathbf{a}$ is not in this form: then you can find a gap in one of the end sequences and increase the variance by changing one of the components of $\mathbf{a}$ to that gap. It remains only to maximize the variance among these special forms of $\mathbf{a}$; this is done by making the end sequence lengths as balanced as possible; that is, by setting $k=l$ when $n$ is even and otherwise by setting either $k=l+1$ or $l=k+1$. When $n=2k$ is even, the maximum variance equals

$$ n \frac{\left(3 m^2-3 m n+n^2 -1\right)}{12 (n-1)}.$$

When $n=2k+1$ is odd, the maximum variance is

$$(n+1) \frac{\left(3 m^2-3 m n+n^2\right)}{12 n}.$$

whuber
  • 281,159
  • 54
  • 637
  • 1,101
  • i think if n=1, the expectation of variance is 0, otherwise the expectation is (m+1)m/12. – Fan Zhang Jun 25 '11 at 11:13
  • @Fan With the definition I provided on the first line, the variance does not exist when $n=1$. – whuber Jun 25 '11 at 22:17
  • Hi whuber, Can I send you a draft of my idea on the correlation problem to your email, I was trapped at current point. – Fan Zhang Jun 26 '11 at 16:12