
Knowing that a (non-random) sample from a population is biased in terms of its demographics, what are the best practices for correcting this?

That is, let's say that I can attach an array of demographics to the sample, and that I wish to transform this sample so that its demographics resemble those of the population from which it was drawn. Later on, this adjusted sample will be used for mathematical modeling.

As I see it, it is quite straightforward to correct for a single variable: if males are underrepresented by 50%, all males are assigned a weight of 2. But what if one wants to take several variables into account at the same time? Is building an n-dimensional array the way to go? Are there better solutions?
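In R, I imagine the one-variable version would look something like this (toy data and made-up population shares, just to illustrate the logic):

```r
# Toy sketch of the one-variable correction: each unit's weight is its
# population share divided by its observed sample share.
samp <- data.frame(gender = c("M", "M", "F", "F", "F", "F"))  # made-up sample
pop.share  <- c(M = 0.5, F = 0.5)                  # assumed population shares
samp.share <- prop.table(table(samp$gender))       # shares seen in the sample
samp$w <- as.numeric(pop.share[samp$gender] / samp.share[samp$gender])
# Males are 1/3 of the sample but 1/2 of the population, so they get weight
# 1.5 here; at exactly half their population share they would get weight 2.
```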

Are there readily available methods for this? An R package?

Figaro
  • A point of wording, but a central one here. In statistics, "skewed" is a technical term meaning asymmetry of a distribution; it does not mean "biased", which is a technical term whose meaning happens to be close to its informal one. You seem to be talking about bias in sample selection. – Nick Cox Dec 02 '14 at 14:47
  • *Why* did the sample become biased? This is crucial to know, because for many forms of non-randomness there will be no valid cure for the problem. You cannot turn a judgement sample into something with the properties of a random sample merely by reweighting it, for instance. – whuber Dec 02 '14 at 21:20

3 Answers


As Tim pointed out, you should use survey weighting.

In your case, more specifically, if all the auxiliary variables (your demographic variables) you want to use to make your sample match the population are qualitative, you will use one of the following:

  • Post-stratification: if you have the full joint distribution of these variables in the population
  • Raking: if you only have the marginal distributions of these variables in the population

More generally, if you have both qualitative and quantitative auxiliary variables, you can use a calibration approach.

Tim also pointed out the survey package in R. There you can find three functions that implement these methods (a short sketch follows the list):

  • Post-stratification: postStratify
  • Raking: rake
  • Calibration: calibrate
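To make this concrete, here is a minimal sketch of post-stratification and raking with these functions. The sample, variable names, and population counts below are all hypothetical:

```r
library(survey)

# Toy non-random sample in which women are over-represented
set.seed(1)
samp <- data.frame(
  gender = factor(sample(c("M", "F"), 200, replace = TRUE, prob = c(0.3, 0.7))),
  agegrp = factor(sample(c("18-34", "35-54", "55+"), 200, replace = TRUE)),
  y      = rnorm(200)
)
samp$w0 <- 1  # start from equal weights

dsn <- svydesign(ids = ~1, weights = ~w0, data = samp)

# Post-stratification: the full joint population distribution is known
pop.joint <- expand.grid(gender = c("M", "F"),
                         agegrp = c("18-34", "35-54", "55+"))
pop.joint$Freq <- c(800, 850, 900, 950, 750, 750)  # hypothetical cell counts
ps <- postStratify(dsn, strata = ~gender + agegrp, population = pop.joint)

# Raking: only the marginal population distributions are known
pop.gender <- data.frame(gender = c("M", "F"), Freq = c(2450, 2550))
pop.age    <- data.frame(agegrp = c("18-34", "35-54", "55+"),
                         Freq   = c(1650, 1800, 1550))
rk <- rake(dsn, sample.margins = list(~gender, ~agegrp),
           population.margins = list(pop.gender, pop.age))

svymean(~y, rk)    # weighted estimate of the mean of y
head(weights(rk))  # the adjusted weights themselves
```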

There is also the sampling package in R, which contains a function for calibration weighting (a brief sketch follows the item below):

  • Calibration: calib
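For comparison, a brief sketch of `sampling::calib()`, reusing the toy `samp` data from the sketch above; the population totals are again hypothetical. Note that `calib` returns g-weights, which you multiply by the initial weights yourself:

```r
library(sampling)

# Model matrix of the calibration variables for the sample
Xs <- model.matrix(~ gender + agegrp, data = samp)

# Hypothetical population totals matching the columns of Xs:
# N, number of males, and the two non-reference age groups
totals <- c(5000, 2450, 1800, 1550)

d <- rep(1, nrow(samp))                          # initial weights
g <- calib(Xs, d = d, total = totals, method = "raking")
w <- d * g                                       # calibrated weights
```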

It is important to note, though, that these weighting methods were originally developed under a probability-sampling framework, which does not appear to be your case (you referred to your sample as "non-random"). They might still mitigate some of the potential bias in your estimates, as long as the auxiliary variables used in the weighting adjustments are related both to your outcome variables and to the selection mechanism of your sample. See this paper by Little and Vartivarian for a similar discussion in the context of survey nonresponse.

djhurio
  • Hi, thank you for your answer. This solved the problem at hand and truly opened my eyes to the possibilities of the survey package. At this time, I'm a bit unsure how to take my results further: how do I actually use the weights? Let's assume two cases: 1) I want to plot a histogram of the corrected gender distribution, but using `svyhist()` only plots the biased distribution. 2) I want to use my data set to perform a factor analysis, and as I see it, the `factanal()` function does not take weights. – Figaro Dec 03 '14 at 14:24
  • 1) I don't think `svyhist()` will work with a qualitative variable such as gender. Instead, I would use `barplot` in the following way: `barplot(svymean(~gender, design = dsn.obj))`. 2) You can use the function `svyfactanal` from the `survey` package to fit a factor analysis model, or try `lavaan.survey` from the package of the same name, which fits SEMs while also taking into account the features of a complex sample design, such as weighting. – Raphael Nishimura Dec 03 '14 at 15:55
  • Regarding your first question ("how do I actually use the weights?"), I would recommend using functions from the `survey` package for statistical methods that are implemented there, such as `svyglm` for generalized linear models, or looking for other packages, such as `lavaan.survey`, that enable analysis with survey objects. If you want to use a function from another package that has `weights` as an argument, such as `lm` as pointed out by Tim, you can extract the weights from your survey design object using the function `weights` and pass them as that argument to the function you want to use. – Raphael Nishimura Dec 03 '14 at 16:04
  • @djhurio The function `calib` from the `sampling` package is a good alternative for computing the weights of a calibration estimator. However, one advantage of using the `calibrate` function is that it creates an object that allows the other functions in the `survey` package to incorporate the potential gains in precision into the sampling-variance estimates. If the weights created by `calib` are used without any further modification of the code, I don't believe the standard errors will take that into account. – Raphael Nishimura Dec 05 '14 at 21:50
  • @RaphaelNishimura, I agree with you. It depends on how you are going to do variance estimation. The g-weights from `calib` are good enough for the `vardom` function from the `vardpoor` package for variance estimation. – djhurio Dec 06 '14 at 04:59
  • One more question before I close this topic. As @RaphaelNishimura pointed out, there are several modeling functions which take weights as input. However, if I wish to use a method that does not take weights, let's say k-NN, could I just add duplicate rows to my dataset according to their weights? For example, if a weight is 3.4, I round it to 3 and have it appear three times in my dataset. Obviously the rounding introduces an error, but is it an OK approach? – Figaro Dec 28 '14 at 09:59
  • You can do that, although, as you mentioned, you will have some rounding errors. In fact, that's the way statisticians used to do weighting back in the old days :) However, this approach is fine only for computing point estimates. If you need to estimate sampling variability, you will need to rely on appropriate variance-estimation techniques, such as Taylor series expansion or repeated replication (BRR, jackknife or bootstrap). – Raphael Nishimura Dec 29 '14 at 22:32
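To illustrate the two approaches discussed in these comments, here is a short hypothetical sketch: extracting the weights from a survey design object for functions that accept a `weights` argument, and the row-duplication trick for methods that do not. It reuses the toy `rk` design and `samp` data from the answer's sketches above:

```r
# Extract the adjusted weights from the survey design object
w <- weights(rk)

# Pass them to a function with a weights argument
fit <- lm(y ~ gender, data = samp, weights = w)

# Row-duplication for methods without a weights argument (point estimates
# only): round each weight and repeat the corresponding row that many times.
# Note that weights rounding to 0 drop the row entirely.
reps <- round(w)
expanded <- samp[rep(seq_len(nrow(samp)), times = reps), ]
```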

The common thing to do in this kind of situation is to use survey weighting. A clear definition can be found on Wikipedia:

data should usually be weighted if the sample design does not give each individual an equal chance of being selected. For instance, when households have equal selection probabilities but one person is interviewed from within each household, this gives people from large households a smaller chance of being interviewed. This can be accounted for using survey weights. Similarly, households with more than one telephone line have a greater chance of being selected in a random digit dialing sample, and weights can adjust for this.

There is a survey package for R that enables you to use weighting (see also the JSS article describing it). More generally, you can use weights with various functions in R (e.g. lm has a weights argument; see the small example below).
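For instance, a tiny self-contained sketch of `lm`'s weights argument (all data here are simulated):

```r
# Weighted least squares via lm's weights argument (toy data)
set.seed(42)
dat <- data.frame(x = rnorm(100), w = runif(100, 0.5, 3))
dat$y <- 1 + 2 * dat$x + rnorm(100)
fit <- lm(y ~ x, data = dat, weights = w)  # w is taken from dat
summary(fit)
```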

Tim

I follow both Raphael and Tim in their suggestions, especially regarding the use of the R package survey. However, as Raphael noted, these weighting techniques were developed for probability samples, which might not be your case.

If you are familiar with multilevel modeling and have quality auxiliary data to estimate the weights, you may use the R package lme4 (which is flexible and user-friendly) to implement Andrew Gelman's suggestions in these two articles; a rough sketch follows below.

I have not applied this in my own work, but Gelman's results are impressive. I think these papers are, at least, food for thought.
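For the curious, a very rough sketch of that multilevel route (Gelman's multilevel regression and poststratification); the data, model, and population cell counts below are entirely hypothetical:

```r
library(lme4)

# Toy biased sample
set.seed(1)
samp <- data.frame(
  gender = factor(sample(c("M", "F"), 200, replace = TRUE, prob = c(0.3, 0.7))),
  agegrp = factor(sample(c("18-34", "35-54", "55+"), 200, replace = TRUE))
)
samp$y <- rnorm(200, mean = as.numeric(samp$agegrp))

# Multilevel model: fixed gender effect, random age-group intercepts
fit <- lmer(y ~ gender + (1 | agegrp), data = samp)

# Predict each population cell's mean, then weight by known cell sizes
cells <- expand.grid(gender = levels(samp$gender), agegrp = levels(samp$agegrp))
cells$pred <- predict(fit, newdata = cells)
cells$N <- c(800, 850, 900, 950, 750, 750)  # hypothetical population counts
sum(cells$pred * cells$N) / sum(cells$N)    # poststratified estimate of mean y
```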

FabF