Questions tagged [survey-sampling]

Creating samples from a well-specified population (human: all adults; registered voters; individuals with diabetes; students of a university; establishment: all firms; firms with employment of 200 or more in New York City; resource: all land of a country or a state/province) using a probabilistic method, with the purpose of inference to that specific population

Sampling is used to collect data when observing the whole population is impractical or infeasible (e.g., too expensive or conceptually impossible). To draw valid statistical inferences about the population from sampled data, the mechanism by which the samples are drawn must be specified, and must involve randomization (selecting units using random numbers or random events). Randomization is what makes probabilistic statements possible: one can talk about the mean or a tail probability of the sampling distribution of a statistic by looking at the histogram of that statistic, obtained (hypothetically, or by actual exhaustive enumeration) by taking all possible samples from the population and computing the statistic of interest on each one.
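As a small illustration of the "all possible samples" idea, the sketch below enumerates every sample of size 2 from a made-up population of five values and tabulates the sampling distribution of the sample mean. The population values, the sample size, and the tail threshold of 8 are assumptions chosen only to keep the enumeration tiny.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population and sample size (illustrative values only).
population = [2, 4, 6, 8, 10]
n = 2

# Under simple random sampling every subset of size n is equally likely,
# so the sampling distribution is the list of statistics over all subsets.
samples = list(combinations(population, n))
means = [Fraction(sum(s), n) for s in samples]

# Mean of the sampling distribution of the sample mean.
expected_mean = sum(means) / len(means)
pop_mean = Fraction(sum(population), len(population))

# A tail probability: the share of possible samples whose mean is at least 8.
tail_prob = Fraction(sum(1 for m in means if m >= 8), len(means))

print(len(samples), expected_mean, pop_mean, tail_prob)
```

For realistic population sizes the enumeration is infeasible, which is why the sampling distribution is usually treated as a hypothetical object and approximated analytically or by simulation.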

The simplest sampling method is simple random sampling (SRS): for a population of $N$ units, the SRS of size $n$ is a sampling design that assigns to each sample of size $n$ the same probability of selection $1/C_N^n$. This simplest method allows for inference that is nearly equivalent to the textbook "i.i.d." assumption. E.g., the minimum variance unbiased estimate of the population mean is the sample mean $\bar x$, and its estimated variance is $s^2(1-n/N)/n$, where $s^2 = \sum (x_i - \bar x)^2/(n-1)$ and the factor $1-n/N$ is the finite population correction. However, if any other selection method is used to obtain the sample, the analysis methods must be modified to account for the features of that method. For instance, a naive understanding of sampling may suggest that if every unit in the population has the same probability of selection $n/N$, then the "i.i.d." analysis methods apply. This is not so: for a systematic sampling design (all units are arranged in a list, a starting point $k$ is chosen randomly as a number between 1 and $[N/n]$, and the units $k, k+[N/n], k+2[N/n], \ldots$ are taken into the sample), every unit has the same probability of selection, yet the sampling variance cannot be estimated without bias from the sample alone, because only $[N/n]$ distinct samples are possible and many pairs of units can never appear together.
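The following sketch checks both claims numerically on a made-up population of six values. It enumerates all SRS samples of size 3 to confirm that the variance estimate with the finite population correction averages to the true variance of the sample mean, and then applies the same formula to the two possible systematic samples. The population values and the helper name `srs_mean_and_variance` are hypothetical, and the systematic start is 0-based here rather than 1-based.

```python
from fractions import Fraction
from itertools import combinations

def srs_mean_and_variance(sample, N):
    """Sample mean and SRS variance estimate s^2 (1 - n/N) / n."""
    n = len(sample)
    xbar = Fraction(sum(sample), n)
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return xbar, s2 * (1 - Fraction(n, N)) / n

population = [1, 3, 4, 7, 10, 11]  # hypothetical values
N, n = len(population), 3
true_mean = Fraction(sum(population), N)

# SRS: all C(N, n) samples are equally likely.
results = [srs_mean_and_variance(s, N) for s in combinations(population, n)]
true_var = sum((m - true_mean) ** 2 for m, _ in results) / len(results)
avg_var_hat = sum(v for _, v in results) / len(results)

# Systematic sampling: only N // n samples are possible (start k = 0 or 1).
step = N // n
systematic_samples = [population[k::step] for k in range(step)]
sys_results = [srs_mean_and_variance(s, N) for s in systematic_samples]
sys_true_var = sum((m - true_mean) ** 2 for m, _ in sys_results) / step
sys_naive_avg = sum(v for _, v in sys_results) / step

print(true_var, avg_var_hat)        # equal: the SRS estimator is unbiased
print(sys_true_var, sys_naive_avg)  # differ: the SRS formula misleads here
```

In this toy case the SRS formula overstates the variance of the systematic mean; with a differently ordered list it could just as well understate it.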

In samples of human and natural resource populations, the most common departures from simple random sampling are the following, often used in combination:

  1. Stratification: selecting units independently within well-defined groups (e.g., regions or states in geographic samples; industry and size of an enterprise in establishment surveys; type of land use in natural resource surveys; etc.). Typically, although not necessarily, stratification leads to a reduction of sampling variance.
  2. Multistage selection: selecting units within a specific hierarchy (schools within districts, then students within schools in education surveys; counties within states, then city blocks within counties, then households within city blocks in geographic samples; etc.). Multistage samples are also known as cluster samples (clusters of units rather than individual units are sampled at the early stages of selection). Clustering typically increases sampling variances.
  3. Unequal probability of selection, usually associated either with a need to obtain a sufficient number of observations for certain subgroups of the population, or with a need to balance the costs of the survey. Unequal probabilities of selection must be accounted for by specifying (and using in analysis) sampling weights. Unweighted estimates will typically be biased, and hence of no real interest.
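To make the role of sampling weights concrete, here is a minimal sketch with a made-up two-stratum population in which the "large" stratum is sampled at a higher rate than the "small" one. Each sampled unit gets the weight $N_h/n_h$ (the inverse of its selection probability), and all possible stratified samples are enumerated to compare the weighted and unweighted means. The stratum names, values, and sample sizes are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical strata: values in the population and units to draw from each.
strata = {
    "small": {"values": [1, 2, 3, 4, 5, 6], "n": 2},  # sampling rate 1/3
    "large": {"values": [50, 60], "n": 1},            # sampling rate 1/2
}
N = sum(len(s["values"]) for s in strata.values())
pop_mean = Fraction(sum(sum(s["values"]) for s in strata.values()), N)

def weighted_mean(sample):
    """Weighted mean of (value, weight) pairs."""
    return sum(w * y for y, w in sample) / sum(w for _, w in sample)

# All possible within-stratum samples, each unit tagged with weight N_h / n_h.
per_stratum = []
for s in strata.values():
    weight = Fraction(len(s["values"]), s["n"])
    per_stratum.append(
        [[(y, weight) for y in combo] for combo in combinations(s["values"], s["n"])]
    )

# Strata are sampled independently, so every combination is equally likely.
all_samples = [sum(parts, []) for parts in product(*per_stratum)]

weighted = [weighted_mean(s) for s in all_samples]
unweighted = [Fraction(sum(y for y, _ in s), len(s)) for s in all_samples]
avg_weighted = sum(weighted) / len(weighted)
avg_unweighted = sum(unweighted) / len(unweighted)

print(pop_mean, avg_weighted, avg_unweighted)
```

Averaged over all possible samples, the weighted mean reproduces the population mean, while the unweighted mean is pulled toward the oversampled stratum.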

In some disciplines, the term "sample" is intended to mean "an observation", a single record containing data on one particular unit of analysis. More often, the term "sample" is used to denote a collection of units for which observations were made, measurements were taken, responses were obtained, etc. Furthermore, some disciplines use the term "sampling" rather loosely to indicate the process of collecting data on arbitrarily chosen units from the population. However, scientifically rigorous inferences can only be obtained from samples that are random, i.e., when a randomization mechanism is built into the data collection process.

To find out more, visit the Wikipedia page, take a look at the What Is a Survey? booklet of the American Statistical Association, or read introductory textbooks such as Lohr (2009), Kish (1995) or Cochran (1977). A complete and thorough discussion of how survey statistics should be analyzed in R is given in Lumley (2010).

Potentially related tags: survey, sample-size, response-rate, stratification, svy

Other uses of the concept of sampling: see ,

196 questions
17
votes
2 answers

Two worlds collide: Using ML for complex survey data

I am stuck with a seemingly easy problem, but I haven't found a suitable solution for several weeks now. I have quite a lot of poll/survey data (tens of thousands of respondents, say 50k per dataset), coming from something I hope is called complexly…
kotrfa
  • 618
  • 1
  • 6
  • 15
13
votes
8 answers

Surveys: Is 25% of a large user base representative?

My employer is currently running a company wide survey about attitudes towards the office i.e. Sentiment. In the past, they opened the survey to all areas of the business (Let's assume 10 very different departments) and all employees within them…
Colin
  • 239
  • 2
  • 4
9
votes
3 answers

Using post-stratification weights in R survey package

I am analyzing a dataset that has a variable for post-stratification weights. As this is a complex survey, the plan is to use the R survey package. I have been reading its documentation and feel able to set up a survey design correctly. So far, so…
FabF
  • 121
  • 1
  • 8
9
votes
3 answers

Recommend references on survey sample weighting

Let's aim for some at an introductory level, some articles and some textbooks. Applied is more helpful, including R code is great. Thanks!
Michael Bishop
  • 2,171
  • 3
  • 21
  • 31
7
votes
5 answers

Are the differences between sampling clusters and sampling strata, conceptual, methodological, neither or both?

I am fuzzy on the distinctions between sampling strata and sampling clusters. Both seem to aim at designs that produce useful estimates of between/within group (strata, cluster) variation, and in particular, seem to be driven by homogeneity…
Alexis
  • 26,219
  • 5
  • 78
  • 131
7
votes
1 answer

Question on Covariance for sampling without replacement

Suppose I have numbers 1,2... 10 and I sample 5 from them randomly without replacement noted as $X_1, X_2, X_3, X_4, X_5$ What is $Cov(X_i,X_j)$ for $i \not=j$ So $Cov(X_i,X_j)=E(X_iX_j)-E(X_i)E(X_j)$ I consider that any $X_i$ treated on its own is…
user164144
  • 1,077
  • 7
  • 18
7
votes
1 answer

Difference between calculated inclusion probability and what is returned by sampling function?

I have a (small) population from which I wish to sample. I assign probabilities proportional to $y$. I enumerate the possible samples and then determine the probability of each sample occurring based on the product of the probabilities for each…
t-student
  • 720
  • 5
  • 16
6
votes
0 answers

Defining quantiles for complex survey samples

I am looking to accumulate a comprehensive list of definitions for quantiles under complex sampling that have been published or implemented in software. I'm not worrying about the separate problem of uncertainty estimation. Without the complication…
Thomas Lumley
  • 21,784
  • 1
  • 22
  • 73
6
votes
1 answer

What does this sampling weight mean?

The data comes from agricultural market research on farming. The sample was derived based on stratification of farming industries (sheep, beef, grains, etc.) and random sampling within each stratum. We have population estimates (frequencies,…
NonSleeper
  • 617
  • 1
  • 5
  • 13
5
votes
1 answer

Stratified survey calculations by hand and with survey package don't agree. Simulation results

Bounty info: I originally emailed Thomas Lumley at an old email address. He did respond to an email to his new address. Note: Long post (lots of code) I can’t seem to replicate the results of the survey function using very basic by-hand…
abalter
  • 770
  • 6
  • 18
5
votes
1 answer

Really a knife's edge?

Something nice and topical. I just read these two items on the news: Obama is in the lead by 50.4% to 48%, with 61% of votes counted. (Ohio) With 86% of the vote counted, Virginia is still sitting on a knife edge. Romney is hanging on to a lead of…
Darren Cook
  • 1,772
  • 1
  • 12
  • 26
5
votes
1 answer

Multi-stage sampling together with hierarchical/ mixed effects models: which R packages?

Analyzing educational datasets, we have samples of children from samples of classes in samples of schools. We have sampling weights, so I use the survey package, e.g. to fit a linear model. But this kind of design also requires looking at the mixed…
5
votes
1 answer

Understanding svycontrast in R with simple random sampling

I'm taking a class on survey sampling and I have some problem understanding the R implementation of simple random sampling (SRS). Please look at this piece of code: library(survey) data(api) N <- nrow(apipop) srs_design <- svydesign(id=~1,…
nalzok
  • 1,385
  • 12
  • 24
5
votes
1 answer

Proof of the Horvitz-Thompson result

I'm trying to find an elementary derivation (proof was the wrong word) of the Horvitz-Thompson estimator: $$ \hat{Y}=\sum_{i\in s}\frac{y_{i}}{\pi_{i}} $$ where $i \in s$ if and only if unit $y_{i}$ a sampled unit is in a sample of interest, and…
4
votes
0 answers

How should very large sample sizes be used?

At what point can very large samples, as a proportion of the population, be treated as a census? For example, if your sample contains 90% of the population units, can one dispense with statistical inference? What about 80%? And so on .…
Scott Hale
  • 41
  • 1