
What if you take a random sample and can see that it is clearly not representative, as in a recent question? For example, suppose the population distribution is supposed to be symmetric around 0, and the sample you draw at random has unbalanced positive and negative observations, with the imbalance statistically significant. Where does that leave you? What reasonable statements can you make about the population based on a biased sample? What is a reasonable course of action in such a situation? Does it matter at what point in our research we notice this imbalance?

Joel W.
  • If the sample is truly taken at random and the sample size is large you won't have this problem. In small samples this can happen. However random samples should not be thrown out. If possible collect more data drawn at random. – Michael R. Chernick Jul 16 '12 at 13:15
  • Michael, this problem might be expected to occur one time in 20, if we use statistical significance as our metric. Most often we do not know when we have randomly chosen a non-representative sample, because we do not know enough about the population. But when we do know something about the population, and we notice such an anomaly, what do we do? – Joel W. Jul 16 '12 at 13:26
  • Yes, the most correct practice is to obtain a large enough random sample, as @MichaelChernick wrote. However, one of my professors told me that he verified by Monte Carlo simulation that, when a researcher has to increase the sample size, it is not correct to simply add statistical units to the sample; one has to repeat the sampling. Otherwise, the statistics may be biased (once again!). – this.is.not.a.nick Jul 16 '12 at 13:27
  • @Michael, I do not understand why your statement is true. A p-value less than .05 will occur under the null hypothesis 5% of the time *regardless* of sample size. So how can it be possible that larger sample sizes will solve this problem? It seems to me your recommendation implicitly invites readers to confuse the size and power of hypothesis tests. – whuber Jul 16 '12 at 13:34
  • @this.is You make a great point. Your professor is correct. It doesn't take Monte-Carlo simulations to prove this, either: just apply some of the theory of [adaptive sampling](http://en.wikipedia.org/wiki/Sequential_probability_ratio_test) to the situation. – whuber Jul 16 '12 at 13:36
  • I think this is a minor point that I would not worry about. It is a problem only in the sense that instead of choosing a sample of size N you chose a sample of size M. – Michael R. Chernick Jul 16 '12 at 13:39
  • @Michael, what do you mean that we should collect more data at random? Are we to hope that we randomly draw a sample biased in the other direction? In any case, what number of additional cases should we draw? Do you suggest we set a number at the onset or use a stopping rule? If a stopping rule, what might the rule look like? Finally, even if the resulting larger sample has no statistically significant bias, we know it is comprised of two samples, one with bias and one without. What reasonable statements can you make about the population based on such a complex sample? – Joel W. Jul 16 '12 at 13:39
  • @whuber I was not referring to the inference problem. The OP asked what to do if the sample was not "representative" of the population assuming he has some way to know this. I am saying that the nature of a random sample is that it should be representative of the population especially if the sample size is large. The problem he describes does occur with small samples. So all I am saying is that with a larger sample the data should be more representative of the population. – Michael R. Chernick Jul 16 '12 at 13:44
  • Continuation @whuber: So for example if the population is known to be symmetric and the sample is highly skewed, the lack of symmetry will go away with a larger sample. – Michael R. Chernick Jul 16 '12 at 13:45
  • @Michael An alternative conclusion is that a highly significant, highly skewed sample indicates a problem with the sampling procedure. If so, the lack of symmetry will persist in a larger sample. – whuber Jul 16 '12 at 13:47
  • @Joel W. I think you have some misconceptions that show up in your comment to me. Yes, there is a slight issue about the samples being drawn sequentially. But after the fact you cannot construct a stopping rule and claim that you applied it. It seems that such a rule would continue every time you find the sample to be "non-representative". This would be a subjective judgement and could not be formulated in terms of a stopping boundary for a concrete stopping rule. But if you are testing a hypothesis, you could use a p-value adjustment as though you did two tests (Bonferroni, for example). – Michael R. Chernick Jul 16 '12 at 13:52
  • @Joel W. continued: Random samples are not biased. If you have a population of size N and choose a sample of size n, each of the N-choose-n possible samples is equally likely. Now if your sample size is small, you could happen to get a sample that is highly skewed from a population that is not. In that case a much larger sample is much less likely to have this problem. This is not the same as saying the additional data will be skewed the other way. Suppose you toss a coin and get four heads in a row. That does not mean that the fifth toss is more likely to be a tail. – Michael R. Chernick Jul 16 '12 at 14:04
  • @Joel W. Continued: But if I flip it another 16 times my chances of getting all 20 heads is very low and the distribution of heads and tails should be closer to even. As for how many additional samples to take, that is always a difficult question to answer. If you could afford to triple the sample size I would suggest doing something like that. Just make the sample size much larger so that the problem will disappear or be far less severe. – Michael R. Chernick Jul 16 '12 at 14:08
  • @whuber I am taking the OP at his word: (1) the sample is taken at random, (2) the population is known to have a symmetric distribution, and (3) the sample is highly skewed by chance. Under these assumptions the skewness is probably a consequence of a small sample size. Of course there is a possibility that the skewness is due to (1) lack of randomness or (2) the population not being symmetric. In either of those cases the additional data will not help. – Michael R. Chernick Jul 16 '12 at 14:13
  • I feel Joel's (and my) point is not completely understood. I will repost here the mutant violet-paper-eating creature story and propose that anyone giving an answer start from this practical case: Suppose you have a process following a perfect normal distribution centered at 0. A student draws 10000 samples one at a time and writes down the results. However, for reasons known only to her, she likes to write all the positive results on sheets of red paper and all the negatives on sheets of violet paper. If she can write 100 numbers on each sheet, at the end she will have roughly 50 red sheets (...) – user1073012 Jul 16 '12 at 15:06
  • (...) and 50 violet sheets. She takes all the sheets back to her office desk to carry on the analysis, but something terrible happens: a mutant paper-eating creature attacks her and eats 23 of her violet sheets (the creature despises red paper). Now she still has around 50 red sheets, i.e., around 5000 positive numbers, but only around 27 violet sheets, i.e., around 2700 negative numbers. She cannot repeat the experiment. What can she do? Can she choose 23 red sheets at random and throw them away to rebalance the samples? – [user1073012](http://bit.ly/ODJkVH) (end of the comment) – chl Jul 17 '12 at 08:29
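The disagreement in the comments above about sample size versus significance can be checked numerically. The sketch below is my own illustration, not part of the thread: it draws samples from a symmetric N(0, 1) population and counts how often an exact two-sided sign test flags the positive/negative split as significant at the .05 level. The rejection rate stays near 5% whether n is 20 or 2,000 (a larger sample shrinks the *magnitude* of the imbalance, but a "statistically significant" imbalance still occurs about one time in 20).

```python
import math
import random

def critical_count(n, alpha=0.05):
    """Smallest count k such that an exact two-sided sign test rejects
    'P(positive) = 1/2' when a sample of size n has >= k positives
    (or, by symmetry, <= n - k positives)."""
    pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
    upper_tail = 0.0
    for k in range(n, n // 2, -1):
        upper_tail += pmf[k]
        if 2 * upper_tail >= alpha:   # two-sided p-value just crossed alpha
            return k + 1
    return n // 2 + 1

random.seed(1)
rates = {}
for n in (20, 2000):
    k_hi = critical_count(n)
    trials = 1000
    rejections = 0
    for _ in range(trials):
        # draw n observations from a perfectly symmetric population
        n_pos = sum(random.gauss(0, 1) > 0 for _ in range(n))
        if n_pos >= k_hi or n_pos <= n - k_hi:
            rejections += 1
    rates[n] = rejections / trials
    print(f"n = {n}: significant imbalance in {rates[n]:.1%} of samples")
```

Both rates should come out close to the nominal 5% level (slightly below it for n = 20, because the test is discrete and therefore conservative).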

3 Answers


The answer given by MLS (use importance sampling) is only as good as the assumptions you can make about your distributions. The main strength of the finite population sampling paradigm is that it is non-parametric, as it does not make any assumptions about the distribution of the data to make (valid) inferences on the finite population parameters.

An approach to correct for sample imbalances is called post-stratification. You need to break down the sample into non-overlapping classes (post-strata), and then reweight these classes according to the known population figures. If your population is known to have a median of 0, then you can reweight the positive and negative observations so that their weighted proportions become 50-50: if you had an unlucky SRS with 10 negative observations and 20 positive observations, you would give the negative ones the weight of 15/10 = 1.5 and the positive ones, 15/20 = 0.75.
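As a toy numerical version of that arithmetic (my own sketch, with made-up data; only the 10/20 split and the 1.5/0.75 weights come from the paragraph above):

```python
import random

random.seed(42)
# hypothetical unlucky SRS from a population known to be symmetric
# around 0: it happens to contain 10 negatives and 20 positives
sample = [-abs(random.gauss(0, 1)) for _ in range(10)] \
       + [abs(random.gauss(0, 1)) for _ in range(20)]

n = len(sample)                      # 30
n_neg = sum(x < 0 for x in sample)   # 10
n_pos = n - n_neg                    # 20

# post-stratification: reweight each post-stratum to its known
# population share (50/50), i.e. weight = (n/2) / stratum size
weights = [(n / 2) / (n_neg if x < 0 else n_pos) for x in sample]

# negatives get 15/10 = 1.5, positives get 15/20 = 0.75, so each
# stratum's weights now sum to n/2 = 15
weighted_mean = sum(w * x for w, x in zip(weights, sample)) / sum(weights)
print(weights[0], weights[-1], weighted_mean)
```

The weighted mean now treats the two halves of the distribution as equally represented, which is what the known population structure says they should be.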

More subtle forms of sample calibration do exist, in which you calibrate your sample to satisfy more general constraints, such as forcing the mean of a continuous variable to equal a specific value. The symmetry constraint is pretty difficult to work with, although that might be doable, too. Maybe Jean Opsomer has something on this: he has been doing a lot of kernel estimation work for survey data.

StasK
  • How does post-stratification compare, logically or statistically, to simply discarding the unbalanced sample and drawing another sample? (Sometimes drawing the sample is the labor intensive part of the research, but sometimes it is what is done after you have drawn the sample that is labor intensive and drawing the sample involves relatively minor effort, as in much experimental research.) – Joel W. Jul 16 '12 at 18:37
  • I have never been in a situation where discarding the data is the best answer, and I have never seen it discussed in any of the survey statistics books. In most of survey statistics, getting the data is at least five times more expensive than any of the following data processing and analysis (except probably for some cheap web surveys where the data collection is nearly free). If you are in an experimental world, then you should not tag your post "sampling", and rather use "experiment design" instead. – StasK Jul 17 '12 at 17:06
  • Random samples may be used rather than stratified because there are many possible ways to stratify in a real world setting. It can happen that after selecting two random samples for an experiment, you notice some flagrant imbalance. Then you are stuck between a rock and a hard place: live with the imbalance (e.g., all older people in one group, all non-native speakers in one group, all Ph.D.s in one group, etc.), or draw a new sample and weaken the connection between what you have done and the assumptions of all statistical techniques. Post-stratification seems to be of the second type. – Joel W. Oct 29 '15 at 14:37

I'm the Junior Member here, but I'd say that discarding the sample and starting over is always the best answer, if you know that your sample is significantly unrepresentative and if you have an idea of how the unrepresentative sampling arose in the first place and how to avoid it, if possible, the second time around.

What good will it do to sample a second time if you'll probably end up in the same boat?

If doing the data gathering again doesn't make sense or is prohibitively costly, you have to work with what you have, attempting to compensate for the unrepresentativeness via stratification, imputation, fancier modeling, or whatever. You need to clearly note that you compensated in this way, why you think it's necessary, and why you think it worked. Then work the uncertainty that arose from your compensation all the way through your analysis. (It will make your conclusions less certain, right?)

If you can't do that, you need to drop the project entirely.

Wayne
  • What if you do not know why the sample is unrepresentative, are you still justified in discarding it and drawing a new, random sample? If not, why not? Also, let's say you do discard the first sample and draw a second one, are the inferential statistics that you might calculate based on the second sample in any way inappropriate due to the discarded first sample? For example, if you subscribe to discarding unrepresentative samples, are you changing the sampling distribution that your statistical test is based on? If so, are you making it easier or harder to find statistical significance? – Joel W. Jul 16 '12 at 22:16
  • @Wayne Good idea. –  Jul 03 '16 at 12:43

This is a partial answer that assumes we know both the distribution $q$ from which the data were sampled and the true (or desired) distribution $p$. Additionally, I assume that these distributions are different. If the samples were actually obtained through $p$ but merely look wrong, they are still unbiased, and any adaptation (such as removing outliers) will likely add bias.

I assume you want to find some statistic $s_p = E \{ f(X) \mid X \sim p \}$. For instance, $s_p$ might be the mean of the distribution, in which case $f$ is the identity function. If you had samples $\{ x_1, \ldots, x_n \}$ obtained through $p$, you could simply use $$ s_p \approx \frac{1}{n} \sum_{i=1}^n f(x_i) \enspace. $$ However, suppose you only have samples that were obtained (from the same domain) with a sampling distribution $x_i \sim q$. Then we can still get an unbiased estimate of $s_p$ by weighting each of the samples according to the relative probability of it occurring under each distribution: $$ s_p \approx \frac{1}{n} \sum_{i=1}^n \frac{p(x_i)}{q(x_i)} f(x_i) \enspace. $$ The reason this works is that $$ E \left\{ \frac{p(X)}{q(X)} f(X) \,\middle|\, X \sim q \right\} = \int \frac{p(x)}{q(x)} f(x)\, q(x)\, dx = \int p(x) f(x)\, dx = s_p \enspace, $$ as desired. This is called importance sampling.
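A minimal numerical sketch of this estimator (my own illustration, not from the answer; the particular choice of target $p = N(0,1)$, sampling distribution $q = N(1,1)$, and sample size are arbitrary):

```python
import math
import random

def importance_estimate(f, p_pdf, q_pdf, q_sampler, n=100_000, seed=0):
    """Estimate E_p[f(X)] from draws made under q, reweighting each
    draw by the likelihood ratio p(x)/q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)
        total += p_pdf(x) / q_pdf(x) * f(x)
    return total / n

# illustration: the target is p = N(0, 1), but all samples come
# from the "wrong" distribution q = N(1, 1)
norm_pdf = lambda x, mu: math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)
est = importance_estimate(
    f=lambda x: x,                       # estimate the mean under p
    p_pdf=lambda x: norm_pdf(x, 0.0),
    q_pdf=lambda x: norm_pdf(x, 1.0),
    q_sampler=lambda rng: rng.gauss(1.0, 1.0),
)
print(est)  # should land close to 0, the mean under p, not 1
```

Note that the weights $p(x)/q(x)$ can have high variance when $q$ places little mass where $p$ does, so the quality of the estimate depends heavily on how well $q$ covers $p$.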

MLS
  • You say the sample is not biased and any attempt to fix the sample will add bias. I suggest that the process by which the sample was collected is without bias but, in fact, the sample is biased, perhaps seriously biased. Are there ways to try to fix the known large bias that might be expected to introduce relatively little additional bias? – Joel W. Jul 16 '12 at 13:54
  • To disambiguate the terminology a bit: I think of bias as a property of the expectation of a random variable. In other words, if the process that collects the data is unbiased, then so is the sample. However, the sample may still be atypical and lead to unwanted conclusions. Any general way to fix this induces bias, since you are adapting the (unbiased) sampling procedure. Probably the less biased approach is to collect and use new samples. A slightly more biased approach would add these new samples to the old ones, but the result might be less variable since you have more samples in total. – MLS Jul 16 '12 at 14:10
  • @Joel W. What do you mean when you say the sample is biased? Is it the estimate of the mean based on the sample that is biased? Any sample estimate is going to differ from the true mean and some can be far off. When sampling at random this is due to variance, not bias. It is not right to say a sample is biased because the distribution of the sample is known to look a lot different from the distribution for the population. In small samples many can look unrepresentative for one reason or another, but random sampling is not biased sampling. – Michael R. Chernick Jul 16 '12 at 14:19
  • @JoelW. If you cannot get new samples, I do not think you can easily 'fix' the sample set to reduce the bias. However, if you know something about the distribution, this can perhaps be made explicit in a (Bayesian) prior. Then, instead of adapting the sample set, you adapt the inference to reach a better-justified conclusion. The prior can somewhat mitigate the randomness in the samples. Alternatively, you can consider adaptations such as a truncated distribution, and remove outliers to remove skewness. If the number of removed outliers is fairly small, the induced bias can be acceptable. – MLS Jul 16 '12 at 14:20
  • @Michael, by bias I mean a distortion, intended or unintended. Yes, the estimate of the mean based on such a biased sample will be biased, and it seems improper to present the mean of the sample as indicating the mean of the population if we know the sample is distorted in a specific way. – Joel W. Jul 16 '12 at 15:07
  • You only have a vague notion about how the sample is different from the population. I think the terminology you use is poor and not in keeping with the statistical definition of bias: no random sample can be called biased. Now, you may have reason to believe that the population distribution is symmetric, and the fact that the observed sample has a highly skewed distribution may lead you to doubt that your estimate is reliable. Again, this is mainly a small-sample-size issue. – Michael R. Chernick Jul 16 '12 at 15:13
  • But statistically speaking, the sample mean from a random sample is an unbiased estimate of the population mean. – Michael R. Chernick Jul 16 '12 at 15:14
  • @Michael, I agree that we must recognize and live with random variance when we have to. I am asking what we might reasonably do when we detect unintended variance. What if our random sample turns out to include relatively too many young people, or too many blue collar workers, etc., when those categories are relevant to our research? Going even further, should we check our samples to see if they are unbalanced in such ways? And does it matter if we notice this before doing further research with the sample or after we have invested resources in conducting research with the sample? – Joel W. Jul 16 '12 at 15:20
  • Covariate imbalance is very important. If it exists in a sample, a regression model can be used to adjust for it. Vance Berger has written a book on this topic which I have probably cited previously on this website. Here is an Amazon link to a description of the book: http://www.amazon.com/Selection-Covariate-Imbalances-Randomized-Statistics/dp/0470863625/ref=sr_1_1?s=books&ie=UTF8&qid=1342452573&sr=1-1&keywords=Vance+Berger – Michael R. Chernick Jul 16 '12 at 15:30
  • What do you think of discarding the sample and starting over? What risks does that pose? Are those risks greater than the risks of working with an imbalanced sample? – Joel W. Jul 16 '12 at 15:34