Why are samples within a cluster less informative than randomly chosen ones from entire population?

Question

Please give me a mathematical explanation if possible. Also, the book by Kothari (2004) says:

There is also not as much information in ‘n’ observations within a cluster as there happens to be in ‘n’ randomly drawn observations.

Can you also give me a mathematical explanation for this?

My second question is: why is simple random sampling preferable to cluster sampling? I'm interested in a mathematical proof that the former is better in terms of randomness.

I think the question "Is Cluster sampling less precise than cluster sampling?" does not make any sense! — theomega, Apr 30 '12 at 08:32
Here's an ad hoc (i.e. non-mathematical) answer: Imagine everyone in the data set is perfectly correlated. Then you only need to know one observation's relationship to a predictor to figure out the rest. So, you effectively have one observation. Now imagine a less extreme version of that. — Macro, Apr 30 '12 at 12:31

score 6 · Answer 1 · answered Apr 30 '12 at 10:06

Think of a typical clustering situation: a personal-interview household survey where the primary sampling unit is the neighborhood or street, for logistic and cost reasons (a simple random sample of households over a nation or even a city is rarely practicable). Clearly, subjects sampled from the same street do not carry as much information as the same number drawn at random from the whole population, because people in the same street are likely to share a range of socio-economic characteristics (being able to afford the rent or property prices for starters, never mind subtler cultural issues).

Any text on sample design or surveys would have a mathematical demonstration. Sampling in clusters is only done for practical and logistic reasons.

Peter Ellis

You are right. What I am actually interested in is how I can be sure that a random sample of 100 people in a nation is better than 10 randomly picked clusters of 10 people. – spartacus May 01 '12 at 10:13
Well, does my answer help you? If you want more, you'd have to say what you mean by a "cluster" in your case. Is it similar to my example of a neighbourhood? In any event, I'd strongly support looking at a text on sampling. – Peter Ellis May 01 '12 at 19:36
Yes, your answer definitely helps. There was a subtle mistake in the wording of my original question. What I meant was this: suppose we want to pick 100 people from a population of 1000. We could just do that at random. Or, if the people are already organized in clusters of 10, we could pick 10 clusters. Why is the former preferable to the latter? Is there a proof that it is more heterogeneous? – spartacus May 02 '12 at 16:48
It depends on how the "organization" happened. If it was at random then the two processes are equivalent. If the organization was on the basis of something that might be relevant (e.g. 10 clusters of similarly educated people, in a survey that's looking at income levels) then there is definitely less information in the clustered approach. – Peter Ellis May 02 '12 at 21:28

score 3 · Answer 2 · answered Apr 30 '12 at 09:48

Because if you sample within a cluster, you only get information about the units in that cluster. Units within a cluster are more similar to each other than randomly sampled units would be; otherwise they would not have been put in the same cluster.

The assumption is that when you cluster your data, the clusters are driven by one or more covariates (which might or might not be observed). So if your data happens to cluster by Factor A and you only sample within a cluster, you will not get any information about the effect of Factor A because all of your samples in the cluster will have the same level for that factor. This explanation is a little simplified, because it assumes clean clustering and assumes we know what drives it, but it should illustrate the point.
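To make the "more similar within a cluster" point concrete, here is a minimal simulation sketch in R. Everything about the population is an assumption made up for illustration: 1000 people in 100 clusters of 10, where each value is a shared cluster effect plus individual noise, both standard normal. It compares the sampling variance of the sample mean under an SRS of 100 people with that under 10 whole clusters chosen at random.

```r
set.seed(1)

# Hypothetical population: 100 clusters of 10 people each
n_clusters <- 100
M <- 10
cluster_id <- rep(seq_len(n_clusters), each = M)

# Assumed model: shared cluster effect + individual noise (both sd = 1)
y <- rnorm(n_clusters)[cluster_id] + rnorm(n_clusters * M)

# Sample mean from an SRS of 100 people (without replacement)
srs_mean <- function() mean(sample(y, 100))

# Sample mean from 10 randomly chosen whole clusters (also 100 people)
cluster_mean <- function() {
  chosen <- sample(n_clusters, 10)
  mean(y[cluster_id %in% chosen])
}

# Repeat both designs many times and compare the spread of the estimates
reps <- 2000
var_srs <- var(replicate(reps, srs_mean()))
var_cluster <- var(replicate(reps, cluster_mean()))

c(SRS = var_srs, cluster = var_cluster, ratio = var_cluster / var_srs)
```

Under these assumptions the ratio should come out well above 1, i.e. the 100 clustered observations pin down the mean less precisely than 100 randomly drawn ones. Setting the cluster effect to zero (so that cluster membership is unrelated to `y`) should bring the ratio back to about 1.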

score 1 · Answer 3 · answered Apr 30 '12 at 14:37

If you can go through the mathematics of cluster sampling, just follow an explanation of the variance of the total for a cluster survey. See e.g. p. 174 of Lohr's 2nd edition (open the Amazon "Look Inside" preview and search for "icc"; the first hit on p. 174 gives you the ANOVA table for cluster sampling in a balanced situation). The reference formula (5.7), which Amazon does not show, is

$$ \mathbf{V}(\hat t_{\rm cluster}) = N^2\left(1-\frac{n}{N}\right)\frac{M \, ({\rm MSB})}{n} $$
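To connect this formula to the comparison with SRS, here is a short worked sketch for the balanced case. It is stated from standard one-stage cluster sampling theory rather than copied from the pages cited, so check it against the text. Assume $N$ clusters in the population, $n$ of them sampled, $M$ elements in every cluster, $S^2$ the population variance of $y$, and the intraclass correlation defined as ${\rm ICC} = 1 - \frac{M}{M-1}\frac{{\rm SSW}}{{\rm SSTO}}$. An SRS of the same number of elements, $nM$ out of $NM$, has

$$ \mathbf{V}(\hat t_{\rm SRS}) = (NM)^2\left(1-\frac{nM}{NM}\right)\frac{S^2}{nM} = N^2\left(1-\frac{n}{N}\right)\frac{M \, S^2}{n} $$

so the ratio of the two variances is

$$ \frac{\mathbf{V}(\hat t_{\rm cluster})}{\mathbf{V}(\hat t_{\rm SRS})} = \frac{{\rm MSB}}{S^2} = \frac{NM-1}{M(N-1)}\,\bigl[1+(M-1)\,{\rm ICC}\bigr] $$

With many clusters the leading factor is close to 1, so the ratio is roughly $1+(M-1)\,{\rm ICC}$. Whenever elements in the same cluster are positively correlated (${\rm ICC}>0$), the cluster sample has the larger variance; for example, $M=10$ and ${\rm ICC}=0.5$ give a ratio of about 5.5, meaning 100 clustered observations carry roughly as much information as 18 randomly drawn ones. If clusters are formed at random, ${\rm ICC}\approx 0$ and the two designs are about equally precise.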

One can construct artificial examples of populations (or rather their cluster structures) when ICC<0, and hence the cluster sample is more efficient than SRS. For instance, the population clustered as $\{ \{1, 6, 8 \}, \{3, 5, 7\}, \{2, 4, 9 \} \}$ will have this weird property:

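    # Population values, listed cluster by cluster
    y <- c(1, 6, 8, 3, 5, 7, 2, 4, 9)
    # Cluster label for each value: three clusters of three
    i <- rep(1:3, each = 3)
    # One-way ANOVA table; the mean square for as.factor(i) is MSB
    anova(lm(y ~ as.factor(i)))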

So we see that this population (or rather the way it has been clustered) produces ${\rm MSB}=0$, and hence the variance of the estimated total from a cluster sample of $m=1$ cluster will be equal to 0, while the variance of the estimated total from an SRS of the same size, $n=3$ elements, will be non-zero by the SRS formula that you can see on Amazon:

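    # Note: here N and n count elements, not clusters as in formula (5.7)
    N <- length(y)
    n <- 3
    # Variance of the estimated total under SRS without replacement
    V_SRS <- N^2 * (1 - n/N) * var(y) / n
    V_SRS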

The trick is that the mean of each cluster is equal to 5, the population mean (or rather the total of each cluster is equal to 15, as we talk about the variance between cluster totals; it will make a difference in an unbalanced situation), so there indeed is no variability between clusters.

I would suggest that you go through both the derivation of the cluster variance formula, as well as the above computation, step by step, to see how they work, and try to come up with two different cluster structures for the above y so that the cluster sample (i) will be less efficient than SRS (easy), and (ii) have non-zero MSB, unlike my example above, but still be more efficient than SRS (difficult).
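As a starting point for that exercise, here is a small R sketch. The helper names `v_cluster` and `v_srs` are made up for illustration, and the code assumes equal cluster sizes, in which case $M \, ({\rm MSB})$ is simply the variance of the cluster totals. It evaluates both variances for any assignment of the nine values to clusters, and shows one possible structure for part (i); part (ii) is left open.

```r
# Population values from the example above
y <- c(1, 6, 8, 3, 5, 7, 2, 4, 9)

# Variance of the estimated total from a one-stage cluster sample of
# n_cl clusters, using formula (5.7) with M * MSB = var(cluster totals)
# (valid for equal cluster sizes only)
v_cluster <- function(y, cluster, n_cl = 1) {
  totals <- as.vector(tapply(y, cluster, sum))
  N_cl <- length(totals)
  N_cl^2 * (1 - n_cl / N_cl) * var(totals) / n_cl
}

# Variance of the estimated total from an SRS of n elements
v_srs <- function(y, n) {
  N <- length(y)
  N^2 * (1 - n / N) * var(y) / n
}

# Cluster structure from the example: {1,6,8}, {3,5,7}, {2,4,9}
original <- rep(1:3, each = 3)

# One candidate for part (i): group similar values, {1,2,3}, {4,5,6}, {7,8,9}
sorted <- ceiling(y / 3)

v_cluster(y, original)  # 0: all cluster totals equal 15
v_cluster(y, sorted)    # 486: totals are 6, 15, 24
v_srs(y, 3)             # 135
```

Grouping similar values together is the realistic case (neighbours resemble each other), and it makes the cluster sample several times less precise than an SRS of the same three elements.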
