Conceptual definition between randomness, representativeness and bias in sampling

Question

I was wondering if you could help me clarify some concepts (if possible, with references to papers or books), which I will write in the form of ideas rather than questions. Consider this situation:

We use a stratified random sampling method, first randomizing the selection of regions within a country and then randomizing the selection of households within the regions. However, we collect data at the individual level.

Idea 1: If I don't have any information about the population, I can only assume that if the selection was drawn at random (each observation had the same probability of being selected) and the sample size is big enough, it will be "representative" of certain population parameter(s).

Idea 2: If I have information about the population, I can test statistically whether or not the sample is "representative" of a certain population parameter.

Idea 3: If the sample turns out not to be representative, it does not necessarily mean that the selection was not random. Rather, it means that the sample is not representative of a certain population parameter and will produce biased estimator(s). For instance, suppose I randomly select $n$ individuals and collect information about two parameters $X_{n}$ and $Y_{n}$, e.g., age and gender. With information about the population, $N$, $X_{N}$ and $Y_{N}$, we could compute weights $W_{X}=\frac{X_{N} \cdot n}{X_{n} \cdot N}$ and $W_{Y}=\frac{Y_{N} \cdot n}{Y_{n} \cdot N}$ that give us a rough indicator of the representativeness of our sample. It is possible that none, one or both of the estimators $\hat{X}=X_{n}/n$ and $\hat{Y}=Y_{n}/n$ turn out to be biased. I can imagine two scenarios to explain the source of the bias. One is that the sampler purposely decided to bias the sample, which makes the selection process deterministic instead of random. If that is not the case, and the selection was done by chance, I would conclude that the sample was not big enough and that the estimator(s) could be overrepresented or underrepresented. This does not mean that the selection process had no uncertainty (i.e., that it was drawn by choice rather than chance), that the observations were not independent, or that they didn't come from the same uniform distribution. Rather, it means that we need more information in order to get closer to the distribution of the population. I guess what I want to propose is that random sampling can still produce biased estimator(s) if the sample size is small.

Idea 4: The method of weights can only make sense "after sampling", when I know that the sample is not representative according to the information that I have about the population. In other words, it is not a weakness of the stratified sampling method in itself, but rather a method that can be used to correct for bias in the sample.

Do you think my question(s) are not clear or I should post individual questions? — Mario GS, Dec 03 '15 at 14:57
Uhm... you did not have any questions, really. Just expressions of ideas. — StasK, Aug 10 '16 at 15:39

score 3 · Accepted Answer · answered Aug 10 '16 at 15:54

No serious sampling book gives a definition of representativeness. This is a concept that people think they grasp intuitively, but it is an elusive one to really pin down on paper.

To me, as a survey statistician, representativeness is about randomization and reasonably good knowledge of the (combined) probabilities of selection and response. By that definition, self-selected online panels cannot be representative of any population, because you generally know neither the selection mechanism nor the selection probabilities.

For a general sampling design with probabilities of selection of individual units $\pi_i$ and pairs of units $\pi_{ij}$, unbiased inference is obtained through the Horvitz-Thompson estimator of the total

$$ t[y] = \sum_i \frac{y_i}{\pi_i}, \quad \mathbb{V}\{t[y]\} = \frac12 \sum_{i\neq j} \Bigl( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \Bigr)^2 (\pi_i \pi_j - \pi_{ij}) $$

so inferences have nothing to do with either the sample size or the equal probability of selection. (If anything, one particularly popular sampling design with equal probabilities, systematic sampling, does not have an unbiased estimator of the variance.)
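As a rough illustration of the formulas above, here is a minimal Python sketch that enumerates every possible sample of a tiny fixed-size design with unequal probabilities. The population values `y` and the sample probabilities `p` are made up for the example. Exact fractions are used so that the design expectation and design variance can be compared with the formulas without rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of four units with made-up values y_i.
y = [Fraction(v) for v in (10, 20, 30, 40)]
N = len(y)

# A fixed-size (n = 2) design: every pair of units is a possible sample,
# drawn with an illustrative, deliberately unequal probability p(s).
samples = list(combinations(range(N), 2))
p = dict(zip(samples, [Fraction(k, 21) for k in (1, 2, 3, 4, 5, 6)]))
assert sum(p.values()) == 1

# First-order inclusion probabilities: pi_i = sum of p(s) over samples containing i.
pi = [sum(p[s] for s in samples if i in s) for i in range(N)]
# With n = 2, the pairwise inclusion probability pi_ij is simply p({i, j}).
pi2 = p


def ht_total(s):
    """Horvitz-Thompson estimate of the total from sample s."""
    return sum(y[i] / pi[i] for i in s)


true_total = sum(y)

# Design expectation and variance, computed by brute force over all samples.
expected = sum(p[s] * ht_total(s) for s in samples)
design_var = sum(p[s] * (ht_total(s) - true_total) ** 2 for s in samples)

# Variance formula from the text; (1/2) * sum over i != j equals the sum over i < j.
formula_var = sum(
    (y[i] / pi[i] - y[j] / pi[j]) ** 2 * (pi[i] * pi[j] - pi2[(i, j)])
    for i, j in samples
)

print(expected, true_total)     # design expectation vs. true total
print(design_var, formula_var)  # brute-force variance vs. formula
```

In this toy design the expectation equals the true total and the two variances agree, even though the sample has only two units and the selection probabilities are far from equal.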

If you follow this route, and "test" for representativeness, then your rejection will be a type I error, period.
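A small simulation can make this point concrete. The sketch below assumes a hypothetical population of 10,000 people, 52% of whom are women, draws simple random samples of 400, and z-tests the sample proportion against the known population value at the 5% level. All numbers are invented for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical population: 1 = woman, 0 = man (52% women).
N, n = 10_000, 400
population = [1] * 5_200 + [0] * 4_800
P = sum(population) / N

# Standard error of the sample proportion under simple random sampling
# without replacement (includes the finite population correction).
se = math.sqrt(P * (1 - P) / n * (N - n) / (N - 1))

replications = 2_000
rejections = 0
for _ in range(replications):
    sample = random.sample(population, n)  # a genuinely random sample
    p_hat = sum(sample) / n
    z = (p_hat - P) / se
    if abs(z) > 1.96:  # "test for representativeness" at the 5% level
        rejections += 1

rejection_rate = rejections / replications
print(f"share of random samples flagged as 'unrepresentative': {rejection_rate:.3f}")
```

Every sample here is drawn by a proper random mechanism, so each rejection is a false alarm, and the rejection rate should land near the nominal 5%.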

Now, the above is a simplified situation with no nonresponse. If you have nonresponse, then we can start talking about sampling weights and nonresponse-adjusted weights. If you test for "representativeness", whatever that means to you, using the former, you can gauge whether nonresponse is a problem. Typically, all you have for a population are some measures of demographics, and responsible survey statisticians incorporate adjustments for demographics in the weights that they produce and distribute along with the survey data sets.

What you seem to have stumbled upon, apparently intuitively, is the idea of weight calibration. See the cornerstone Deville and Särndal (1992) paper, or maybe the introductory treatment by Lavallée and Beaumont (2015).
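To give a flavor of what calibration does, here is a minimal sketch of raking (iterative proportional fitting), one simple member of the calibration family, applied to the age and gender example from the question. The respondent counts and population margins are invented, and the helper names `weighted_margin` and `rake` are hypothetical, not taken from any survey package.

```python
# Hypothetical respondents, coded as (age_group, gender); the counts are made up.
sample = (
    [("young", "f")] * 10
    + [("young", "m")] * 30
    + [("old", "f")] * 25
    + [("old", "m")] * 35
)

# Known population totals for each margin (assumed to come from a census).
pop_age = {"young": 4_000, "old": 6_000}
pop_gender = {"f": 5_200, "m": 4_800}
N = 10_000

# Start from equal design weights N / n.
base_weights = [N / len(sample)] * len(sample)


def weighted_margin(weights, units, pos):
    """Sum of weights by category of the variable at position pos."""
    totals = {}
    for w, unit in zip(weights, units):
        totals[unit[pos]] = totals.get(unit[pos], 0.0) + w
    return totals


def rake(units, weights, margins, iterations=50):
    """Rescale weights to match each margin in turn, repeating until stable."""
    w = list(weights)
    for _ in range(iterations):
        for pos, target in enumerate(margins):
            current = weighted_margin(w, units, pos)
            w = [wi * target[u[pos]] / current[u[pos]] for wi, u in zip(w, units)]
    return w


calibrated = rake(sample, base_weights, [pop_age, pop_gender])

print(weighted_margin(base_weights, sample, 0))  # age margin before calibration
print(weighted_margin(calibrated, sample, 0))    # age margin after calibration
print(weighted_margin(calibrated, sample, 1))    # gender margin after calibration
```

After raking, the weighted sample reproduces the known age and gender totals. Production survey software typically adds bounds on the weights and convergence checks that this sketch leaves out.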
