Is it necessary for a sample to broadly reflect the target population distribution?

Question

Following the question here, the target population is made up of 600 units which are distributed as follows:

white units (17%)
black units (42%)
green units (8%)
yellow units (33%)

A representative sample would be one which broadly reflects this percentage distribution.

However, in this case one would end up with only a small number of green units (because they make up the smallest share of the population), so it is not reliable to analyse the green units on their own. For example, a sample of 100 units would contain about 8 green units.
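The arithmetic behind that figure can be sketched as follows. This is only an illustration: the per-colour counts are derived from the stated percentages of 600, the names `population`, `proportional` and `worst_case_se` are made up for the example, and the standard error shown is the worst case for a proportion (p = 0.5) without a finite population correction, which is a simplifying assumption.

```python
import math

# Population counts implied by the question's percentages of 600 units.
population = {"white": 102, "black": 252, "green": 48, "yellow": 198}
total_population = sum(population.values())  # 600
sample_total = 100

# Proportional allocation: each colour gets its population share of the sample.
proportional = {
    colour: round(sample_total * count / total_population)
    for colour, count in population.items()
}


def worst_case_se(n):
    """Standard error of an estimated proportion at p = 0.5.

    Ignores the finite population correction (a simplifying assumption).
    """
    return math.sqrt(0.25 / n)


for colour, n in proportional.items():
    print(f"{colour:>6}: n = {n:2d}, worst-case SE = {worst_case_se(n):.3f}")
```

Under these assumptions, the 8 green units give a worst-case standard error of roughly 0.18 for a proportion, compared with 0.10 for 25 units, which is why the green units are hard to analyse on their own.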

I want to analyse each colour's units on their own (as well as overall), and I understand that one way to overcome the above issue is to oversample the green units.

Question: What is the risk of oversampling the green units but keeping the number of other units fixed?

For example, in a sample of 100, I can have approximately 25 units of each colour.

This is for practical reasons, such as the cost of data collection.

Can you explain what you mean by "risk" of oversampling? Oversampling is an accepted method for enabling accurate estimates to be made for rarer populations; see, for example, http://www.statcan.gc.ca/pub/12-001-x/2009002/article/11036-eng.pdf. What do you want to do, so you can get advice specific to your situation? — Michelle, Jan 25 '12 at 03:02
I guess "risk" is not the right word here. I mean to ask whether "over-representing the green units while keeping the number of other units constant to accommodate this over-representation" is a good way of sampling. By keeping the number of other units constant to accommodate the over-representation, I would be inadvertently under-representing them. — Adhesh Josh, Jan 25 '12 at 04:11

Michelle · Answer 1 · 2012-01-25T04:56:42.113

To sample from a population, we define a sampling frame. We then use a survey sampling method to draw our sample from that sampling frame, and we use the achieved sample to make inferences about the population. When there is a small subpopulation, using something like a simple random sample runs the risk of omitting most, or even all, members of that subpopulation. So we use a different sampling strategy.

If we have information about where our subpopulation of interest tends to be situated (e.g. geographically), we can use stratified sampling and then oversample in the stratum or strata with the largest proportion of our rare subpopulation (i.e. disproportionate sampling). Survey weights, based on how the sampling was undertaken, are then used to correct for the oversampling. This does require (1) that our subpopulations are not dispersed reasonably equally across all strata and (2) that we know where our small subpopulations lurk, so we can create the strata. If the subpopulations are in fact spread fairly evenly across the strata, then stratified sampling will not be much more efficient than simple random sampling.
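As a rough sketch of how such weights correct for oversampling, consider the simplest possible case, which is an assumption and not the general situation described above: colour is known for every unit in the frame, so each colour is its own stratum, and 25 units are drawn from each. The stratum means below are made-up numbers used purely for illustration.

```python
# Population counts implied by the question's percentages of 600 units.
population = {"white": 102, "black": 252, "green": 48, "yellow": 198}

# Equal allocation from the question: 25 units per colour.
sample_size = {colour: 25 for colour in population}

# Hypothetical sample means of some measured variable (made-up numbers).
sample_mean = {"white": 10.0, "black": 20.0, "green": 40.0, "yellow": 30.0}

# Design weight: how many population units each sampled unit stands for.
weight = {c: population[c] / sample_size[c] for c in population}

# Unweighted mean treats every sampled unit alike, so the oversampled
# green units pull the overall estimate towards their own mean.
unweighted = (
    sum(sample_size[c] * sample_mean[c] for c in population)
    / sum(sample_size.values())
)

# Weighted mean restores each colour to its population share.
weighted = (
    sum(weight[c] * sample_size[c] * sample_mean[c] for c in population)
    / sum(weight[c] * sample_size[c] for c in population)
)

print(f"unweighted overall mean: {unweighted:.2f}")
print(f"weighted overall mean:   {weighted:.2f}")
```

With these illustrative numbers the unweighted mean is 25.0 while the weighted mean is 23.2, so ignoring the weights would overstate the overall figure. Real designs with several stages, nonresponse or strata that only partly separate the subpopulations need more elaborate weights than this.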

This is a very technical area, and I advise the use of a specially trained survey statistician to assist if this is going to be any more technical than pulling coloured balls out of urns. For example, there are a number of people in the various national statistics agencies who have spent many years just in this area.

Thanks, Michelle, for helping me. Wouldn't a simple strategy of working out the percentage contribution of each type of unit be sufficient? For example, if I determine that I need 60 green units (10% of 600), then this would represent about 8% of my sample size. The other 92% of the sample would comprise white, black and yellow units (proportionately calculated). — Adhesh Josh, Jan 25 '12 at 05:14
The weight has to take account of the probability of being sampled. These calculations can become very complex, so I recommend that you seek the assistance of a survey statistician who has experience in this area. You will need to give them much more detail about your situation than is contained in your question. — Michelle, Jan 25 '12 at 05:30
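To make the link between sampling probability and weight concrete, here is a minimal sketch for the 25-per-colour design from the question, again assuming colour is known in the frame so that units can be drawn at random within each colour. It also computes Kish's approximate effective sample size, (sum of weights)^2 / (sum of squared weights), as one way of gauging what unequal weights cost the overall estimate; that approximation assumes roughly similar variability within each colour.

```python
# Population counts implied by the question's percentages of 600 units.
population = {"white": 102, "black": 252, "green": 48, "yellow": 198}

# Equal allocation from the question: 25 units per colour.
sample_size = {colour: 25 for colour in population}

# Probability that any given unit of a colour ends up in the sample.
inclusion_prob = {c: sample_size[c] / population[c] for c in population}

# Design weight is the inverse of the inclusion probability.
weight = {c: 1.0 / inclusion_prob[c] for c in population}

# One weight per sampled unit, then Kish's approximation.
unit_weights = [weight[c] for c in population for _ in range(sample_size[c])]
n_eff = sum(unit_weights) ** 2 / sum(w * w for w in unit_weights)

for c in population:
    print(f"{c:>6}: P(sampled) = {inclusion_prob[c]:.3f}, weight = {weight[c]:.2f}")
print(f"approximate effective sample size: {n_eff:.1f} of {len(unit_weights)}")
```

Under these assumptions about half of the green units are sampled but only about one in ten black units, and the 100 sampled units carry roughly the information of 78 equally weighted units for overall estimates. That loss of precision is the price paid for being able to analyse the green units separately.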
