How do I estimate error when I know a survey is not representative?

Question

Let's say that I have a survey of $n = 1000$ respondents from a population of $N = 3000$.

Moreover, it is known that the ratio of male to female in the population is $50:50$. However, in the sample, $70\%$ were female.

Given that I know that the sample is not representative in terms of sex, how do I calculate the survey error rate?

Not my area, but here are a couple of thoughts: 1) Perhaps you could first calculate a sex-specific error rate and then average the two, weighting by their representation within the sample. 2) Since you know the ratio in the population, you could subsample your data to get a subset whose characteristics match the population, and then do your calculations with this. You could take this further by repeating the process multiple times (bootstrap). — mkt, Jul 13 '17 at 10:15
You could make your data look representative by giving males a weight of 5 and females a weight of about 2.142857, but the probability that you would select fewer than 301 males out of 1000 draws is something like $8.8 \times 10^{-38}$, so you have non-sampling error, which will be difficult to quantify. — Anthony Damico, Jul 13 '17 at 12:39

score 0 · Accepted Answer · answered Jul 16 '17 at 13:30

I can give some general advice, but much of the answer depends on the design of the survey. I'm going to assume that you drew a probability sample (that is, all units had a known probability of being selected). I'm also going to assume you are estimating a population total; other population quantities are handled analogously.

When we need to calibrate to known subpopulations whose membership cannot be identified until units have responded, we often use post-stratified estimation. In general:

$$ \hat{Y}_{pos} = \sum_{k\text{ post-strata}}\frac{N_k}{\hat{N}_k}\hat{Y}_k $$

That is, to estimate the population total, we estimate the total within each subpopulation and multiply it by an adjustment factor that accounts for the disparity between the known subpopulation size $N_k$ and the estimated subpopulation size $\hat{N}_k$. The estimates of subpopulation total and size are given by:

$$ \begin{aligned} \hat{Y}_k &= \sum_{i\in s_k}\pi_i^{-1}y_i\\ \hat{N}_k &= \sum_{i\in s_k}\pi_i^{-1} \end{aligned} $$

Here $\pi_i$ is the known probability of selecting unit $i$, and $s_k$ is the set of sampled units that fall in post-stratum $k$.
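
As a rough illustration of the estimator above, here is a minimal Python sketch. The function name `poststratified_total`, its argument names, and the yes/no responses are hypothetical; the sample sizes come from the question, and equal inclusion probabilities $\pi_i = n/N$ are an assumption.

```python
from collections import defaultdict


def poststratified_total(y, pi, stratum, pop_counts):
    """Post-stratified estimate of a population total.

    y          : observed values, one per sampled unit
    pi         : inclusion probability of each sampled unit
    stratum    : post-stratum label of each sampled unit
    pop_counts : dict mapping post-stratum label -> known size N_k
    """
    y_hat = defaultdict(float)  # estimated total per post-stratum
    n_hat = defaultdict(float)  # estimated size per post-stratum
    for y_i, pi_i, k in zip(y, pi, stratum):
        y_hat[k] += y_i / pi_i
        n_hat[k] += 1.0 / pi_i
    # Every post-stratum needs at least one respondent,
    # otherwise n_hat[k] is zero and the ratio is undefined.
    return sum(pop_counts[k] / n_hat[k] * y_hat[k] for k in pop_counts)


# Setup from the question: N = 3000, n = 1000, 300 males and 700 females.
N, n = 3000, 1000
pi = [n / N] * n  # assumed: equal inclusion probabilities
stratum = ["M"] * 300 + ["F"] * 700
# Made-up binary responses: 150 of 300 males and 490 of 700 females say yes.
y = [1] * 150 + [0] * 150 + [1] * 490 + [0] * 210

estimate = poststratified_total(y, pi, stratum, {"M": 1500, "F": 1500})
unadjusted = sum(y_i / pi_i for y_i, pi_i in zip(y, pi))
print(estimate, unadjusted)
```

With these made-up responses the post-stratified total comes out at about 1800, against about 1920 for the unadjusted estimate, because the over-represented females are weighted down and the under-represented males are weighted up.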

This estimator of the population total is biased, but in many cases the bias can be ignored. Ignoring the bias gives an approximate variance formula of:

$$ Var(\hat{Y}_{pos})\approx\sum_{k\text{ post-strata }}\left[\sum_{i\in s_k}\sum_{j\in s_k}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}(\pi_{ij}-\pi_i\pi_j)\right] $$

Here $\pi_{ij}$ is the probability that both $i$ and $j$ are selected in the sample; it must be non-zero for all pairs of units in the population. The formula simplifies considerably if the sample is simple random (where all samples of the same size are equally likely).
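
For the simple random case, $\pi_i = n/N$ for every unit, so the post-stratified total reduces to $\sum_k N_k \bar{y}_k$, where $\bar{y}_k$ is the sample mean in post-stratum $k$. A commonly used textbook variance estimator in that setting, conditional on the realised post-stratum sample sizes $n_k$, is $\sum_k N_k^2 (1 - n_k/N_k)\, s_k^2 / n_k$, with $s_k^2$ the sample variance in post-stratum $k$. Note that this is a standard form swapped in for illustration, not a term-by-term simplification of the formula above, and that the function name and the responses below are hypothetical.

```python
import math


def srs_poststratified(samples, pop_counts):
    """Post-stratified total and a conditional variance estimate under
    simple random sampling without replacement (assumed design).

    samples    : dict mapping post-stratum label -> list of observed values
    pop_counts : dict mapping post-stratum label -> known size N_k
    """
    total = 0.0
    variance = 0.0
    for k, values in samples.items():
        n_k = len(values)
        N_k = pop_counts[k]
        mean_k = sum(values) / n_k
        # Sample variance with the usual n_k - 1 denominator.
        s2_k = sum((v - mean_k) ** 2 for v in values) / (n_k - 1)
        total += N_k * mean_k
        # Finite population correction applied within each post-stratum.
        variance += N_k ** 2 * (1 - n_k / N_k) * s2_k / n_k
    return total, variance


# Sample sizes from the question; the yes/no responses are made up.
samples = {
    "M": [1] * 150 + [0] * 150,
    "F": [1] * 490 + [0] * 210,
}
pop_counts = {"M": 1500, "F": 1500}

total, variance = srs_poststratified(samples, pop_counts)
std_error = math.sqrt(variance)
print(total, std_error)
```

The standard error returned here only reflects sampling variability under the assumed design; it does not capture non-response or other non-sampling error of the kind raised in the comments.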
