I've been reading about stratified sampling, two-stage SRS, and ratio estimation in finite populations, and I have a question. When the ratio estimator is introduced, it seems that for it to perform well the population needs to follow this model (which I'll denote $\xi$): $$y_{ij} = \beta x_{ij} + \epsilon_{ij}$$ with the additional requirements that $E(\epsilon_{ij}) = 0$ and $V(\epsilon_{ij}) = \sigma^2 x_{ij}$, where $x$ is the auxiliary variable. I'm using subscripts $i, j$ because I'm thinking of this in the context of clusters.
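To make the role of the variance assumption concrete: as far as I can tell, the ratio $\sum y / \sum x$ is exactly the weighted least squares slope through the origin when the weights are $1/x$, which is the weighting implied by $V(\epsilon) = \sigma^2 x$. Here is a minimal sketch that checks this numerically. The data are simulated, and the values of `beta`, `sigma`, and the range of `x` are arbitrary choices of mine, not from any textbook example.

```python
import numpy as np

def ratio_estimate(y, x):
    """Ratio estimator of beta: sum of y over sum of x."""
    return y.sum() / x.sum()

def wls_slope_through_origin(y, x, weights):
    """Weighted least squares slope for y = beta * x (no intercept)."""
    return np.sum(weights * x * y) / np.sum(weights * x**2)

# Simulate one population under the model xi (all values are illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
beta, sigma = 2.0, 0.5
# Error standard deviation is sigma * sqrt(x), i.e. variance sigma^2 * x.
y = beta * x + rng.normal(0.0, sigma * np.sqrt(x))

b_ratio = ratio_estimate(y, x)
b_wls = wls_slope_through_origin(y, x, weights=1.0 / x)

print(b_ratio, b_wls)  # the two values agree up to floating point error
```

With weights $1/x$ the WLS formula reduces to $\sum y / \sum x$ algebraically, so the agreement does not depend on the simulated data.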
Here's my understanding of the issue as a whole. From the design-based perspective (perhaps I should call this the model-assisted design-based perspective?), inference about the ratio estimate is based on all possible samples under the design (two-stage SRS) and does not depend on this model at all, although if the population does follow the model we'll see much better performance. If instead we assume the model (i.e. take the model-based route), then inference is based on it and we actually need it to be true. As I understand it, both routes lead to the same point estimate but potentially different variance estimates, because we are estimating the variance either over all possible samples (design-based) or over all possible populations realizable under this model with these model parameters (i.e. $\beta$ and $\sigma^2$). We'll also have that the ratio estimator is design-biased but model-unbiased under $\xi$. Please correct me if any of this is wrong.
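To illustrate what I mean by design bias, here is a small sketch that enumerates every possible sample and averages the ratio estimate over them. It is a simplification: I use single-stage SRS without replacement rather than two-stage sampling, and the four-unit population is a made-up toy example in which $y = 1 + x$, so the line deliberately does not pass through the origin.

```python
from itertools import combinations

# Hypothetical toy population with y = 1 + x (intercept is not zero).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 4.0, 5.0]
n = 2  # sample size

# Population ratio that the estimator targets.
true_ratio = sum(y) / sum(x)

# Under SRS without replacement every size-n subset is equally likely,
# so the design expectation is the plain average over all subsets.
estimates = [
    sum(y[i] for i in idx) / sum(x[i] for i in idx)
    for idx in combinations(range(len(x)), n)
]
design_expectation = sum(estimates) / len(estimates)
design_bias = design_expectation - true_ratio

print(true_ratio, design_expectation, design_bias)
```

Here the average over the six possible samples is not equal to the population ratio, which is the sense in which I'm calling the estimator design-biased.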
Here's my actual question. When we move on to cluster sampling, this model does not seem to be mentioned at all. As I understand it, the auxiliary information is the size of the clusters (denoted $M_i$). That the relationship passes through the origin (a cluster with $M_i = 0$ has a total of $0$) is clearly built in, but what about the variance relationship? I don't see where variance proportional to cluster size comes in. Yet the ratio estimator seems to be viewed as a very reasonable (simple) choice, so it seems that the conditions for it to perform well must be met. Any clarification would be tremendously appreciated.
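For reference, this is the estimator I have in mind, sketched for one-stage cluster sampling where every element of a sampled cluster is observed. The function name, the sampled sizes and totals, and the population size `M_total` are all hypothetical values I made up for illustration.

```python
import numpy as np

def cluster_ratio_mean(cluster_totals, cluster_sizes):
    """Ratio estimate of the mean per element: sum of totals / sum of sizes."""
    t = np.asarray(cluster_totals, dtype=float)
    m = np.asarray(cluster_sizes, dtype=float)
    return t.sum() / m.sum()

# Hypothetical sample of three clusters.
sampled_sizes = [10, 20, 30]          # M_i for the sampled clusters
sampled_totals = [25.0, 45.0, 80.0]   # cluster totals of y

# Assumed known number of elements in the whole population.
M_total = 600

ybar_ratio = cluster_ratio_mean(sampled_totals, sampled_sizes)
total_ratio = M_total * ybar_ratio    # ratio estimate of the population total

print(ybar_ratio, total_ratio)
```

So the cluster size $M_i$ plays the role of $x$ and the cluster total plays the role of $y$, which is why I expected the same variance condition to show up.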