i have a data set that's considered the best survey for a series of questions. therefore, i do not want to re-weight the survey -- and i cannot use any of the techniques described on http://r-survey.r-forge.r-project.org/survey/example-poststrat.html
i have one variable within the survey dataset that is not very good, and i want to adjust it to align better with a more realistic outside source.
the outside source has information on the marginal totals, but not the joint distribution. therefore, i am looking for a raking or calibration technique that only modifies a single variable and not the entire survey dataset -- but can control it based on marginal totals.
here's a simple calibration technique for a single survey variable (mobility
) across a single control (stype
). i am looking for a way to do this across two control variables (perhaps stype
and both
within this data set)
library(survey)
library(zoo)
data(api)
dstrat <- svydesign( id = ~1 , strata = ~stype , weights = ~pw , data = apistrat , fpc = ~fpc )
# some unadjusted results
svyby( ~ mobility , ~ stype , dstrat , svyquantile , seq( 0 , 1 , .1 ) , keep.var = FALSE )
# univariate control totals
stype_adjustment <-
data.frame(
pctile = seq( 0 , 1 , .1 ) ,
stype_E = seq( 300 , 400 , 10 ) ,
stype_M = seq( 400 , 500 , 10 ) ,
stype_H = seq( 500 , 600 , 10 )
)
so right here, notice my stype_adjustment
dataset, which has my target percentiles for this variable. here are the desired 20th, median, and 80th percentiles for my variable mobility
stype_adjustment[ stype_adjustment$pctile %in% c( .2 , .5 , .8 ) , ]
pctile stype_E stype_M stype_H
3 0.2 320 420 520
6 0.5 350 450 550
9 0.8 380 480 580
so those are the targets. that in mind, here's a pretty straightforward algebra-only adjustment:
# merge the eleven known controls onto all 101 percentiles
all_percentiles <-
merge(
data.frame( pctile = seq( 0 , 1 , 0.01 ) ) ,
stype_adjustment , all = TRUE
)
# interpolate the in-between percentiles in the control columns
all_percentiles$stype_E <- na.approx( all_percentiles$stype_E )
all_percentiles$stype_M <- na.approx( all_percentiles$stype_M )
all_percentiles$stype_H <- na.approx( all_percentiles$stype_H )
# calcualte the pre-adjustment results from the data set
pa <- data.frame( t( data.frame( svyby( ~ mobility , ~ stype , dstrat , svyquantile , seq( 0 , 1 , .01 ) , keep.var = FALSE ) ) )[ -1 , ] )
pa[ , ] <- sapply( pa[ , ] , function( w ) as.numeric( as.character( w ) ) )
names( pa ) <- paste0( "unadj_" , names( pa ) )
pa$pctile = seq( 0 , 1 , 0.01 )
# merge on the unadjusted percentiles
all_percentiles <- merge( all_percentiles , pa )
# find the multiplier within each of your strata
all_percentiles$e_multiplier <- all_percentiles$unadj_E / all_percentiles$stype_E - 1
all_percentiles$m_multiplier <- all_percentiles$unadj_M / all_percentiles$stype_M - 1
all_percentiles$h_multiplier <- all_percentiles$unadj_H / all_percentiles$stype_H - 1
# for each observation in the dataset, figure out the nearest percentile -- and save the associated multiplier
e_factor <- all_percentiles$e_multiplier[ unlist( lapply( dstrat$variables$mobility , function( z ) which.min( abs( z - pa$unadj_E ) ) ) ) ]
m_factor <- all_percentiles$m_multiplier[ unlist( lapply( dstrat$variables$mobility , function( z ) which.min( abs( z - pa$unadj_M ) ) ) ) ]
h_factor <- all_percentiles$h_multiplier[ unlist( lapply( dstrat$variables$mobility , function( z ) which.min( abs( z - pa$unadj_H ) ) ) ) ]
# compute the new mobility variable, adjusted based on your univariate controls
dstrat <-
update(
dstrat ,
adjusted_mobility =
ifelse( stype == 'E' , mobility / ( 1 + e_factor ) ,
ifelse( stype == 'M' , mobility / ( 1 + m_factor ) ,
ifelse( stype == 'H' , mobility / ( 1 + h_factor ) , NA ) ) )
)
notice that the weights have not changed -- and the unadjusted mobility
variable gives the same results
svyby( ~ mobility , ~ stype , dstrat , svyquantile , c( 0.2 , 0.5 , 0.8 ) , keep.var = FALSE )
stype statistic1 statistic2 statistic3
E E 12 17 23
H H 6 10 15
M M 8 13 19
so this is not a calibrated data set.. it's only a single calibrated variable. the adjusted mobility variable has been scaled up across every percentile
svyby( ~ adjusted_mobility , ~ stype , dstrat , svyquantile , c( 0.2 , 0.5 , 0.8 ) , keep.var = FALSE )
stype statistic1 statistic2 statistic3
E E 319 350 380
H H 510 544 580
M M 414 448 480
how could this be done if i have the marginal controls for the variables stype
and both
but not for their joint distribution?
> head( apiclus1[ , c( 'stype' , 'both' ) ] )
stype both
1 H Yes
2 E Yes
3 E Yes
4 E No
5 E Yes
6 E Yes
not sure if i need some iterative proportional fitting (for weighted data) algorithm, or what..
thanks!