Questions tagged [sampling]

Creating samples from a well-specified population using a probabilistic method and/or producing random numbers from a specified distribution. As this tag is ambiguous, please consider [survey-sampling] for the former and [monte-carlo] or [simulation] for the latter. For questions regarding creating random samples from known distributions, please consider using the [random-generation] tag.

Sampling is used to collect data when observing the whole population is impractical or infeasible (e.g., too expensive or conceptually impossible). To draw valid statistical inferences from sampled data, the mechanism by which the samples are drawn must be specified and must involve randomization (selecting units using random numbers or random events). Randomization is what makes probabilistic statements possible: the mean or a tail probability of the sampling distribution of a statistic refers to the histogram of that statistic obtained by taking (hypothetically, or by actual exhaustive enumeration) all possible samples from the population and computing the statistic of interest on each one.
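As a minimal sketch of the "all possible samples" idea, assuming a made-up population of five values and samples of size two (both purely illustrative), the sampling distribution of the sample mean can be enumerated exhaustively:

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy population of N = 5 values (illustrative only).
population = [2, 4, 6, 8, 10]
n = 2

# Every possible sample of size n; under SRS each one is equally likely.
all_samples = list(combinations(population, n))
sample_means = [mean(s) for s in all_samples]

# Mean of the sampling distribution of the sample mean.
expected_sample_mean = mean(sample_means)
population_mean = mean(population)

# A tail probability of the sampling distribution: P(sample mean >= 8).
tail_prob = sum(m >= 8 for m in sample_means) / len(sample_means)

print(len(all_samples), expected_sample_mean, population_mean, tail_prob)
```

Here the average of the sample means over all possible samples equals the population mean, which is what unbiasedness of the sample mean under this design means.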

The simplest sampling method is simple random sampling (SRS): for a population of $N$ units, the SRS of size $n$ is a sampling design that assigns to each sample of size $n$ the same probability of selection $1/C_N^n$. This simplest method allows for inference that is nearly equivalent to the textbook "i.i.d." assumption. E.g., the minimum variance unbiased estimate of the population mean is the sample mean $\bar x$, and its estimated variance is $s^2(1-n/N)/n$, where $s^2 = \sum (x_i - \bar x)^2/(n-1)$ and the factor $1-n/N$ is the finite population correction. However, if any other selection method was used to obtain the sample, the analysis methods must be modified to account for the features of that selection method. For instance, a naive understanding of sampling may entail thinking that if every unit in the population has the same probability of selection $n/N$, then the "i.i.d." analysis methods are applicable. This is not so: for a systematic sampling design (all units are arranged in a list, a starting point $k$ is chosen randomly as a number between 1 and $[N/n]$, the integer part of $N/n$, and the units $k, k+[N/n], k+2[N/n], ...$ are taken into the sample), every unit has the same selection probability, yet the sampling variance cannot be estimated without bias from a single sample, because units belonging to different starting points never appear in the same sample.
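The SRS estimate and the systematic selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not a survey-grade implementation: the function names `srs_mean_estimate` and `systematic_sample`, the seed, and the population of the integers 1 to 100 are all hypothetical.

```python
import random
from statistics import mean, variance


def srs_mean_estimate(population, n, rng):
    """Draw an SRS without replacement and estimate the population mean."""
    N = len(population)
    sample = rng.sample(population, n)  # every size-n subset is equally likely
    x_bar = mean(sample)
    s2 = variance(sample)               # divisor n - 1, matching s^2 in the text
    var_hat = s2 * (1 - n / N) / n      # finite population correction 1 - n/N
    return x_bar, var_hat


def systematic_sample(population, n, rng):
    """Take every step-th unit after a random start k in 1..floor(N/n)."""
    step = len(population) // n
    k = rng.randint(1, step)            # 1-based random starting point
    return [population[k - 1 + j * step] for j in range(n)]


rng = random.Random(2024)               # fixed seed so the sketch is repeatable
population = list(range(1, 101))        # hypothetical population, N = 100
x_bar, var_hat = srs_mean_estimate(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
print(x_bar, var_hat, sys_sample)
```

With $n = N$ the correction factor is zero, reflecting that a census has no sampling variance. Note also that `systematic_sample` can produce only $[N/n]$ distinct samples, whereas SRS can produce any of the $C_N^n$ subsets.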

In samples of human and natural resource populations, the most common departures from SRS include the following, often in combination:

  1. Stratification: selecting units independently within well-defined groups (e.g., regions or states in geographic samples; industry and size of an enterprise in establishment surveys; type of land use in natural resource surveys; etc.). Typically, although not necessarily, stratification leads to a reduction in sampling variance.
  2. Multistage selection: selecting units within a specific hierarchy (schools within districts, then students within schools in education surveys; counties within states, then city blocks within counties, then households within city blocks in geographic samples; etc.). Multistage samples are also known as cluster samples (clusters of units rather than individual units are sampled at the early stages of selection). Clustering typically increases sampling variances.
  3. Unequal probability of selection, usually associated either with a need to obtain a sufficient number of observations for certain population subgroups, or with a need to balance the costs of the survey. Unequal probabilities of selection must be accounted for by specifying (and using in analysis) sampling weights. Unweighted estimates will typically be biased, and hence of no real interest.
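To illustrate why weights matter, here is a small sketch with two strata in which the smaller stratum is deliberately oversampled. All numbers, the `strata` layout, and the function names are made up for illustration; the weight of each sampled unit is taken to be the inverse of its selection probability, $N_h/n_h$.

```python
# Hypothetical stratified sample: stratum sizes N and observed values.
strata = {
    "large": {"N": 900, "values": [10, 12, 11, 9, 10, 11, 10, 12, 11]},
    "small": {"N": 100, "values": [50, 52, 48, 51, 49, 50, 53, 47, 50, 50]},
}


def weighted_mean(strata):
    """Weighted estimate of the population mean with weights N_h / n_h."""
    total = 0.0
    weight_sum = 0.0
    for stratum in strata.values():
        n_h = len(stratum["values"])
        w = stratum["N"] / n_h          # inverse selection probability
        total += w * sum(stratum["values"])
        weight_sum += w * n_h
    return total / weight_sum


def unweighted_mean(strata):
    """Naive mean that ignores the unequal selection probabilities."""
    values = [y for s in strata.values() for y in s["values"]]
    return sum(values) / len(values)


print(weighted_mean(strata), unweighted_mean(strata))
```

In this toy example the unweighted mean is pulled strongly toward the oversampled small stratum, while the weighted mean restores each stratum to its share of the population.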

In some disciplines, the term "sample" is intended to mean "an observation", a single record containing data on one particular unit of analysis. More often, the term "sample" is used to denote a collection of units for which observations were made, measurements were taken, responses were obtained, etc. Furthermore, some disciplines use the term "sampling" rather loosely to indicate the process of collecting data on arbitrarily chosen units from the population. However, scientifically rigorous inferences can only be obtained from samples that are random, i.e., samples for which a randomization mechanism is built into the data collection process.

To find out more, visit the Wikipedia page, take a look at the What Is a Survey? booklet of the American Statistical Association, or read introductory textbooks such as Lohr (2009), Kish (1995) or Cochran (1977). A complete and thorough discussion of how survey statistics should be analyzed in R is given in Lumley (2010).

Potentially related tags: survey, sample-size, response-rate, stratification, svy

Another, more algorithmic, meaning of the word "sampling" is to describe the procedures of drawing random numbers that have a specified distribution. Assuming that a (pseudo) random number generator is available that creates (pseudo) random numbers from $U[0,1)$, the simplest method is by inverting the distribution function: $X = F^{-1}(U)$. In more complicated cases, one has to utilize more sophisticated algorithms, such as acceptance-rejection sampling, importance sampling, etc. Understanding sampling methods is crucial in computational Bayesian statistics.
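Both ideas can be sketched with the standard library alone. The targets chosen here are assumptions made for illustration: an exponential distribution for the inversion method, where $F^{-1}(u) = -\ln(1-u)/\lambda$, and the density $f(x) = 6x(1-x)$ on $[0,1]$ (a Beta(2,2)) for acceptance-rejection with a uniform proposal. The function names are hypothetical.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse-transform draw: solve F(x) = 1 - exp(-rate * x) = u for x."""
    u = rng.random()                    # u in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / rate


def sample_beta22(rng):
    """Acceptance-rejection draw from f(x) = 6x(1 - x) on [0, 1]."""
    bound = 1.5                         # maximum of f, attained at x = 0.5
    while True:
        x = rng.random()                # proposal from U[0, 1)
        if rng.random() * bound <= 6.0 * x * (1.0 - x):
            return x                    # accepted with probability f(x) / bound


rng = random.Random(42)                 # fixed seed so the sketch is repeatable
exp_draws = [sample_exponential(2.0, rng) for _ in range(20000)]
beta_draws = [sample_beta22(rng) for _ in range(20000)]
print(sum(exp_draws) / len(exp_draws), sum(beta_draws) / len(beta_draws))
```

Inversion needs a closed-form (or cheaply computable) $F^{-1}$; acceptance-rejection only needs the density up to a known bound, at the cost of discarding some proposals (about one third of them in this example).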

Potentially related tags: Bayesian, MCMC

Other meanings are also discussed at the Wikipedia disambiguation page.

2894 questions
280 votes, 16 answers

Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?

It seems that through various related questions here, there is consensus that the "95%" part of what we call a "95% confidence interval" refers to the fact that if we were to exactly replicate our sampling and CI-computation procedures many times,…
Mike Lawrence
98 votes, 3 answers

Can someone explain Gibbs sampling in very simple words?

I'm doing some reading on topic modeling (with Latent Dirichlet Allocation) which makes use of Gibbs sampling. As a newbie in statistics (well, I know things like binomials, multinomials, priors, etc.), I find it difficult to grasp how Gibbs sampling…
Thea
76 votes, 5 answers

Central limit theorem for sample medians

If I calculate the median of a sufficiently large number of observations drawn from the same distribution, does the central limit theorem state that the distribution of medians will approximate a normal distribution? My understanding is that this is…
55 votes, 8 answers

Is sampling relevant in the time of 'big data'?

Or more so "will it be"? Big Data makes statistics and relevant knowledge all the more important but seems to underplay Sampling Theory. I've seen this hype around 'Big Data' and can't help wonder that "why" would I want to analyze everything?…
PhD
55 votes, 5 answers

Statistical inference when the sample "is" the population

Imagine you have to do reporting on the numbers of candidates who yearly take a given test. It seems rather difficult to infer the observed % of success, for instance, on a wider population due to the specificity of the target population. So you may…
pbneau
53 votes, 4 answers

Is a sample covariance matrix always symmetric and positive definite?

When computing the covariance matrix of a sample, is one then guaranteed to get a symmetric and positive-definite matrix? Currently my problem has a sample of 4600 observation vectors and 24 dimensions.
Morten
44 votes, 5 answers

Why does increasing the sample size lower the (sampling) variance?

Big picture: I'm trying to understand how increasing the sample size increases the power of an experiment. My lecturer's slides explain this with a picture of 2 normal distributions, one for the null-hypothesis and one for the alternative-hypothesis…
user2740
42 votes, 1 answer

Why is the sampling distribution of variance a chi-squared distribution?

The statement: "The sampling distribution of the sample variance is a chi-squared distribution with degrees of freedom equal to $n-1$, where $n$ is the sample size (given that the random variable of interest is normally distributed)." (Source) My…
41 votes, 4 answers

How to sample from a normal distribution with known mean and variance using a conventional programming language?

I've never had a course in statistics, so I hope I'm asking in the right place here. Suppose I have only two data describing a normal distribution: the mean $\mu$ and variance $\sigma^2$. I want to use a computer to randomly sample from this…
Fixee
38 votes, 3 answers

What percentage of a population needs a test in order to estimate prevalence of a disease? Say, COVID-19

A group of us got to discussing what percentage of a population needs to be tested for COVID-19 in order to estimate the true prevalence of the disease. It got complicated, and we ended the night (over zoom) arguing about signal detection and…
37 votes, 3 answers

Explanation of finite population correction factor?

I understand that when sampling from a finite population and our sample size is more than 5% of the population, we need to make a correction on the sample's mean and standard error using this formula: $\hspace{10mm} FPC=\sqrt{\frac{N-n}{N-1}}$ Where…
Sara
36 votes, 1 answer

Bootstrapping vs Bayesian Bootstrapping conceptually?

I'm having trouble understanding what a Bayesian Bootstrapping process is, and how that would differ from your normal bootstrapping. And if someone could offer an intuitive/conceptual review and comparison of both, that would be great. Let's take…
SpicyClubSauce
35 votes, 6 answers

Sampling for Imbalanced Data in Regression

There have been good questions on handling imbalanced data in the classification context, but I am wondering what people do to sample for regression. Say the problem domain is very sensitive to the sign but only somewhat sensitive to the magnitude…
someben
33 votes, 5 answers

Why do political polls have such large sample sizes?

When I watch the news I've noticed that the Gallup polls for things like presidential elections have [I assume random] sample sizes of well over 1,000. What I remember from college statistics was that a sample size of 30 was a "significantly…
samplesize999
33 votes, 2 answers

Drawing from Dirichlet distribution

Let's say we have a Dirichlet distribution with $K$-dimensional vector parameter $\vec\alpha = [\alpha_1, \alpha_2,...,\alpha_K]$. How can I draw a sample (a $K$-dimensional vector) from this distribution? I need a (possibly) simple explanation.
user1315305