
I've been reading about stratified sampling, two-stage SRS sampling, and ratio estimation in finite populations, and I have a question. When the ratio estimator is introduced, it seems that for it to perform well the population needs to follow this model (which I'll denote $\xi$): $$y_{ij} = \beta x_{ij} + \epsilon_{ij}$$ with the additional requirements that $E(\epsilon_{ij}) = 0$ and $V(\epsilon_{ij}) = \sigma^2 x_{ij}$, where $x$ is the auxiliary information. I'm using subscripts $i, j$ since I'm thinking of this in the context of clusters.
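One way to see why this particular variance structure goes with the ratio estimator: under $V(\epsilon) = \sigma^2 x$, weighted least squares through the origin with weights $1/x$ reduces algebraically to $\sum y / \sum x$. The following is only a minimal sketch on a made-up population; the values of `beta`, `sigma2`, and the distribution of `x` are illustrative assumptions, not anything from a real survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population drawn from y = beta * x + eps with Var(eps) = sigma2 * x.
# beta, sigma2 and the range of x are assumed values for illustration only.
beta, sigma2 = 2.0, 0.5
x = rng.uniform(1.0, 10.0, size=200)
y = beta * x + rng.normal(0.0, np.sqrt(sigma2 * x))

# Weighted least squares through the origin with weights w = 1 / x:
#   beta_hat = sum(w * x * y) / sum(w * x**2)
w = 1.0 / x
beta_wls = np.sum(w * x * y) / np.sum(w * x**2)

# Ratio estimator of the slope.
beta_ratio = y.sum() / x.sum()

print(beta_wls, beta_ratio)  # the two agree up to floating-point error
```

The agreement is exact algebra ($w x y = y$ and $w x^2 = x$), so it holds for any data with positive $x$; the model only matters for whether this estimator is the efficient one.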

Here's my understanding of this issue as a whole: from the design-based perspective (perhaps I should call this the model-assisted design-based perspective?), inference about the ratio point estimate is based on all possible samples under the design (two-stage SRS) and is not based on this model at all, although if the population does follow the model we'll see much better performance. If instead we assume this model (i.e. take the model-based route), then inference is based on it and we actually need it to be true. As I understand it, this leads to the same point estimate but potentially different variance estimates, because we are estimating the variance either over all possible samples (design-based) or over all possible populations realizable under this model with these model parameters (i.e. $\beta$ and $\sigma^2$). We'll also have that the ratio estimate is design-biased and model-unbiased. Please correct me if any of this is wrong.

Here's my actual question. When we move on to cluster sampling, this model does not seem to be mentioned at all. As I understand it, the auxiliary information is the size of the clusters (denoted $M_i$). That the relationship passes through the origin, i.e. the cluster total is $0$ when $M_i = 0$, is clearly built in, but what about the variance relationship? I don't see where variance proportional to cluster size comes in. Yet the ratio estimator seems to be viewed as a very reasonable (simple) choice, so it seems that the conditions for it to perform well must be met. Any clarification would be tremendously appreciated.

jld

1 Answer


If the model for your data is $$ y_{ij} = \mu + \varepsilon_{ij}, \quad {\rm E}_\xi \varepsilon_{ij} = 0, \quad {\rm V}_\xi \varepsilon_{ij} = \sigma^2, $$ with uncorrelated errors, where the subindex $\xi$ stands for the model expectations, then the cluster totals $T_i[y] = \sum_j y_{ij}$ have the model moments $$ {\rm E}_\xi T_i[y] = M_i \mu, \quad {\rm V}_\xi T_i[y] = M_i \sigma^2, $$ so there clearly is proportionality to the cluster size. This is the ratio model at the cluster level, with $x_i = M_i$ and $\beta = \mu$.
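The proportionality of the model moments can be checked with a quick simulation. This is a rough sketch, assuming a common mean $\mu$ and independent normal errors; the values of `mu`, `sigma`, the cluster sizes in `sizes`, and the number of replications are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 3.0, 2.0              # assumed model parameters
sizes = [5, 20, 80]               # assumed cluster sizes M_i
n_rep = 20000                     # model replications per cluster size

def simulate_totals(M, n_rep):
    """Draw n_rep realizations of the cluster total T_i = sum_j y_ij for size M."""
    y = mu + rng.normal(0.0, sigma, size=(n_rep, M))
    return y.sum(axis=1)

results = {}
for M in sizes:
    totals = simulate_totals(M, n_rep)
    # If E(T_i) = M * mu and V(T_i) = M * sigma^2, both ratios are constant in M.
    results[M] = (totals.mean() / M, totals.var(ddof=1) / M)
    print(M, results[M])
```

For every cluster size, the first ratio should sit near `mu` and the second near `sigma**2`, which is the "variance proportional to $M_i$" condition the ratio estimator wants.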

If you want to understand design-based inference, think in totals. These are the only linear statistics; everything else is a ratio or another non-linear statistic whose variance requires the delta method (linearization).

StasK
  • Thank you very much. The model-based variance is clear to me now. Just to make sure that I understand the design-based part, for the design-based totals (assuming SRS within each cluster) we have $V(t_i) = \frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)s_i^2$. Is it the $M_i^2$ in front that makes the variance proportional? – jld Feb 25 '14 at 19:31
  • Yes and no -- this is a different order of magnitude. I would rather say that $M_i^2$ is the scaling parameter for $s_i^2$. Let's say we are measuring $y$ in days. Then $t_i$ is on the scale of years if we have hundreds of observations per cluster, and so $V(t_i)$ must be on the scale of years^2. $s_i^2$ is only on the scale of days^2, as it is a per-observation variance, so $M_i^2$ brings it up to the scale of years^2. Design-based calculations only have the randomization intuition, which may not be particularly relevant here. – StasK Feb 26 '14 at 00:27
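The scaling point in the comments can be made concrete with the within-cluster formula quoted above, $V(t_i) = \frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)s_i^2$. Below is a small sketch, assuming SRS without replacement inside one cluster; the function name `est_total_and_variance`, the sample values, and the cluster size of 400 are all hypothetical.

```python
import numpy as np

def est_total_and_variance(sample, M):
    """Estimated cluster total and its estimated design variance,
    assuming an SRS without replacement of m = len(sample) units out of M."""
    sample = np.asarray(sample, dtype=float)
    m = sample.size
    s2 = sample.var(ddof=1)                 # per-observation sample variance s_i^2
    t_hat = M * sample.mean()               # expansion estimate of the cluster total
    v_hat = (M**2 / m) * (1 - m / M) * s2   # M^2 lifts s_i^2 to the scale of the total
    return t_hat, v_hat

# Hypothetical measurements (in days) for 5 sampled units from a cluster of 400.
sample_days = [3.0, 5.0, 4.0, 6.0, 2.0]
t_hat, v_hat = est_total_and_variance(sample_days, M=400)

# Doubling every observation doubles the total and quadruples the variance,
# which is the units^2 behaviour described in the comment.
t_hat2, v_hat2 = est_total_and_variance([2 * v for v in sample_days], M=400)

print(t_hat, v_hat, t_hat2, v_hat2)
```

Here the $M_i^2$ factor only converts a per-observation variance to the scale of the total; the proportionality to $M_i$ that the ratio estimator relies on is a statement about the model moments, not about this factor.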