Questions tagged [stratification]

A sampling technique in which the population of interest is partitioned into subsets ("strata") based on characteristics known for all units before sampling.

Stratification is a sampling technique in which the population of interest is split into strata based on characteristics available for all units before sampling. When done wisely, it may offer the following advantages:

  1. Efficiency of the estimates may be improved (i.e., variances/standard errors can be made lower).
  2. Data within a stratum or several strata can be analyzed independently of other observations; see the sketch after this list. (In general, subsets of data collected under a complex sampling design require analysis of the complete data set using techniques specifically formulated for domains or subpopulations.)
  3. Different sampling designs can be implemented within different strata.
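
As a minimal sketch of how such a design might be declared in practice (the R `survey` package comes up in several questions under this tag), here is a toy single-stage stratified sample; the stratum sizes, sample sizes, and variable names are made up purely for illustration.

```r
# Sketch only: synthetic sample, hypothetical column names.
library(survey)

set.seed(1)
# Pretend n_h = 5 units were drawn by SRS within each of three strata of size N_h.
samp <- data.frame(
  stratum = rep(c("A", "B", "C"), each = 5),
  N_h     = rep(c(100, 200, 700), each = 5),   # stratum population sizes
  y       = c(rnorm(5, 10), rnorm(5, 20), rnorm(5, 30))
)
samp$weight <- samp$N_h / 5                    # design weights N_h / n_h

# Single-stage stratified design with finite population correction
des <- svydesign(ids = ~1, strata = ~stratum, weights = ~weight,
                 fpc = ~N_h, data = samp)

svymean(~y, des)                               # overall stratified mean and SE
svymean(~y, subset(des, stratum == "A"))       # analysis restricted to one stratum
```

Subsetting the design object (rather than subsetting the raw data) is what keeps the stratum- or domain-level standard errors correct.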

Efficiency gains, if any, come from the expression below for the variance of the stratified estimator of the population mean (the weighted combination of stratum sample means) under single-stage stratified simple random sampling:

$$ V_{\rm str}[\bar y_{\rm str}] = \sum_{h=1}^L \Bigl(1 - \frac{n_h}{N_h} \Bigr) \Bigl( \frac{N_h}{N} \Bigr)^2 \frac{S_h^2}{n_h}, \quad S_h^2 = \frac1{N_h-1} \sum_{i=1}^{N_h} (y_{hi} - \bar y_h)^2 $$

where $N_h$ and $n_h$ are the population and sample sizes in stratum $h$, $N = \sum_h N_h$, and $\bar y_h$ is the population mean of stratum $h$.

Efficiency gains materialize if the within-stratum variances $S_h^2$ are lower (ideally, much lower) than the overall variance; or, in other words, when similar units are put together in the same stratum. In the extreme case, when the population consists of replicates of a small number of distinct units, the ideal stratification strategy can achieve zero sampling variance by putting the identical units together into corresponding strata and taking just one unit from each stratum.
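
As a worked illustration of the formula above, the following base-R sketch builds a synthetic population of $N = 1{,}000$ units in $L = 3$ very homogeneous strata and compares the stratified variance under proportional allocation with the variance of a simple random sample of the same total size; all numbers are invented for the example.

```r
# Illustrative only: synthetic population with very homogeneous strata.
set.seed(42)
N_h <- c(200, 300, 500)                          # stratum sizes, N = 1000
y   <- c(rnorm(200, mean = 10, sd = 1),
         rnorm(300, mean = 20, sd = 1),
         rnorm(500, mean = 30, sd = 1))
h   <- rep(1:3, times = N_h)

n   <- 100
n_h <- round(n * N_h / sum(N_h))                 # proportional allocation

S2_h  <- tapply(y, h, var)                       # within-stratum variances S_h^2
V_str <- sum((1 - n_h / N_h) * (N_h / sum(N_h))^2 * S2_h / n_h)

V_srs <- (1 - n / sum(N_h)) * var(y) / n         # SRS of the same total size n

c(stratified = V_str, srs = V_srs)
```

With within-stratum standard deviations of 1 and stratum means of 10, 20, and 30, the stratified variance comes out to a small fraction of the SRS variance, which is dominated by the between-stratum spread.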

Feasibility and efficiency of stratification crucially depend on the available sampling frames, and the auxiliary information that can be found on these frames (i.e., whether there are any additional data besides the frame identifier and the contact information).

In human population sampling, where most behaviors and outcomes are at least weakly associated with demographics, statistical agencies in European countries that maintain population registers recording age and gender can stratify on these characteristics and draw samples directly from the registers. For such rich frames, the frame identifier is usually the person's tax number; the contact information includes the address and the phone number; and additional information may include all sorts of government records associated with that tax number.

In the U.S., where government collection of such detailed individual data is considered to overstep the limits of personal privacy, stratification by age and gender is not feasible. Often, the only possible stratification of the U.S. general population is by geography (although such stratification can also be employed to target and/or oversample recognizable minorities known to reside compactly in relatively homogeneous enclaves). Hence, large-scale in-person surveys have to be designed from scratch using area samples, enumerating the dwellings, and collecting household rosters within dwellings. In such frames, the sample is selected in several stages. At the first, top-most stage, census tracts may be selected: the frame identifier is the census tract number in the GIS system, the contact information for that unit is the map of the tract, and auxiliary information at the tract level may include detailed summaries of the tract's population based on the most recent Census data.

Geographic stratification is also used in the U.S. survey industry for random digit dialing (RDD) phone surveys, relying on the implementation details of how the telecom industry has assigned phone numbers. Targeting specific geographic areas on the landline frame has coverage limitations (not every household has a home landline phone), while on the cell phone frame the available geographic resolution is never more detailed than about 100K people. On the phone frame, there is often no contact information (e.g., the householder's name) available: the frame identifier and the contact information are simply the phone number itself, with no auxiliary information other than the area code (the first three digits of the number) and the exchange (the next three digits), which point to the geography in which the number is likely to be located.

In establishment surveys, firms are usually put into strata defined by industry and a relevant measure of size (revenues or employment in the past year, often available either in commercial databases or in establishment registers). Larger firms are sampled at a much higher rate, up to 100% (i.e., sampled with certainty), because they account for a disproportionate fraction of the total employment or revenue; a large firm included with certainty contributes zero error to the estimated total, so the sampling error is reduced.
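
A hedged sketch of that strategy on a synthetic frame of firms: everything above an arbitrary revenue cutoff goes into a take-all (certainty) stratum with weight 1, and the remaining firms are sampled at a much lower rate; the cutoff, sample sizes, and size distribution are invented for illustration.

```r
# Sketch only: synthetic establishment frame with a certainty stratum.
set.seed(7)
frame <- data.frame(
  firm_id = 1:5000,
  revenue = rlnorm(5000, meanlog = 10, sdlog = 1.5)   # heavily skewed firm sizes
)

cutoff <- quantile(frame$revenue, 0.99)               # top 1% taken with certainty
frame$stratum <- ifelse(frame$revenue >= cutoff, "certainty", "take_some")

certainty <- frame[frame$stratum == "certainty", ]    # sampled at 100%
rest      <- frame[frame$stratum == "take_some", ]
srs       <- rest[sample(nrow(rest), 200), ]          # SRS of 200 smaller firms

certainty$weight <- 1                                  # certainty units: weight 1
srs$weight       <- nrow(rest) / 200                   # N_h / n_h for the rest
samp <- rbind(certainty, srs)

sum(samp$weight * samp$revenue)   # Horvitz-Thompson estimate of total revenue
sum(frame$revenue)                # true frame total, for comparison
```

Because the certainty firms enter the estimate with weight 1, the only sampling error in the estimated total comes from the take-some stratum, whose firms are individually small.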

Other surveys and their specialized sampling frames may allow for their own idiosyncratic stratification strategies corresponding to the research questions and data collection needs.

233 questions
74 votes, 5 answers

Understanding stratified cross-validation

I read in Wikipedia: In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly…
Amelio Vazquez-Reina
38 votes, 2 answers

Why use stratified cross validation? Why does this not damage variance related benefit?

I've been told that it is beneficial to use stratified cross validation especially when response classes are unbalanced. If one purpose of cross-validation is to help account for the randomness of our original training data sample, surely making each…
James Owers
32 votes, 1 answer

Benefits of stratified vs random sampling for generating training data in classification

I would like to know if there are any/some advantages of using stratified sampling instead of random sampling, when splitting the original dataset into training and testing set for classification. Also, does stratified sampling introduce more bias…
gc5
18 votes, 4 answers

Is leave-one-out cross validation (LOOCV) known to systematically overestimate error?

Let's assume that we want to build a regression model that needs to predict the temperature in a building. We start from a very simple model in which we assume that the temperature only depends on the weekday. Now we want to use k-fold validation to check…
Roman
14 votes, 2 answers

Does fitting Cox-model with strata and strata-covariate interaction differ from fitting two Cox models?

In Regression Modeling Strategies by Harrell (second edition) there is a section (S. 20.1.7) discussing Cox models including an interaction between a covariate whose main effect on survival we want to estimate as well (age in the example below) and…
Vincent
13 votes, 2 answers

Empirical distribution alternative

BOUNTY: The full bounty will be awarded to someone who provides a reference to any published paper which uses or mentions the estimator $\tilde{F}$ below. Motivation: This section is probably not important to you and I suspect it won't help you get…
12 votes, 3 answers

Multilevel model vs. separate models for each level

What are the advantages and disadvantages of running separate models vs. multilevel modeling? More particularly, suppose a study examined patients nested within doctors' practices nested within countries. What are the advantages/disadvantages of…
Peter Flom
12 votes, 2 answers

Sampling with replacement in R randomForest

The randomForest implementation does not allow sampling beyond the number of observations, even when sampling with replacement. Why is this? Works fine: rf <- randomForest(Species ~ ., iris, sampsize=c(1, 1, 1), replace=TRUE) rf <-…
cohoz
12 votes, 1 answer

Stratified classification with random forests (or another classifier)

So, I've got a matrix of about 60 x 1000. I'm looking at it as 60 objects with 1000 features; the 60 objects are grouped into 3 classes (a,b,c). 20 objects in each class, and we know the true classification. I'd like to do supervised learning on…
11 votes, 2 answers

Remove duplicates from training set for classification

Let us say I have a bunch of rows for a classification problem: $$X_1, ... X_N, Y$$ where $X_1, ..., X_N$ are the features/predictors and $Y$ is the class the row's feature combination belongs to. Many feature combinations and their classes are…
10 votes, 1 answer

Stratified sampling with multiple variables?

I don't know much about stats so I'm looking for a starting point here. Any resources or insights would be helpful. I'm conducting an e-learning experiment, in which students watch videos and then complete a survey which measures cognitive load and…
waitinforatrain
9 votes, 3 answers

Using post-stratification weights in R survey package

I am analyzing a dataset that has a variable for post-stratification weights. As this is a complex survey, the plan is to use the R survey package. I have been reading its documentation and feel able to set up a survey design correctly. So far, so…
FabF
9 votes, 1 answer

Machine learning with weighted / complex survey data

I have worked a lot with various nationally representative data. These data sources have a complex survey design, so the analysis requires the specification of stratification and weight variables. Among the data sources that are within my area of…
Brian P
8 votes, 3 answers

Simple post-stratification weights in R

I just got my hands on the ANES (American National Election Studies) 2008 data set, and would like to do some simple analysis in R. However, I've never worked with this complex of a data set before and I've run into an issue. The survey uses…
Wilduck
8 votes, 1 answer

Bootstrapping dataset with imbalanced classes

I am trying to build an ensemble model to classify a dataset with imbalanced data, where some of the classes have just a few samples. Because of this dataset property, when I am doing re-sampling with replacement, some of the classes become "discarded",…