
I am thinking about how best to construct a data set containing one record for every individual in the entire United States population, starting from something like the American Community Survey or the United States decennial census public use microdata files.

Both of these "starting points" would be very large; they already contain between 1% and 5% of the entire United States population.

So long as this concept is not fundamentally flawed from the start, having this synthetic but complete data set would make it much easier to merge on (cold-deck impute) information from other, smaller data sets.

One could obviously take a simplistic approach and, for every record, create as many duplicate records as that record's weight. This wouldn't be much different from just analyzing the data set with Stata's frequency weights, but it would obviously create problems in smaller geographic areas.
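For concreteness, here is roughly what that naive expansion looks like (a toy Python sketch; `person_weight` and the other column names are made-up stand-ins for whatever the PUMS file actually calls them):

```python
import pandas as pd

# Toy stand-in for a public-use microdata file: each row is a sampled person
# carrying a survey weight (how many people in the population it represents).
pums = pd.DataFrame({
    "age":           [34, 71, 19],
    "race":          ["white", "black", "asian"],
    "income":        [52000, 18000, 9000],
    "person_weight": [98, 102, 97],
})

# Naive expansion: replicate every record as many times as its (rounded) weight.
# This reproduces frequency-weighted tabulations, but it only copies the same
# few observations over and over -- no new information is created.
synthetic = pums.loc[pums.index.repeat(pums["person_weight"].round().astype(int))]
synthetic = synthetic.drop(columns="person_weight").reset_index(drop=True)

print(len(synthetic))                   # ~297 synthetic people from 3 sampled records
print(synthetic["age"].value_counts())  # only three distinct ages: the "lumpiness" problem
```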

Say you've got a small county that you know for a fact has 10,000 residents, but only ten records in your sample. Obviously you cannot simply expand those ten records into 10,000 residents on their own, since you would end up with a very lumpy (and wrong) age, race, and income distribution within the county. However, if you took a bit of information from nearby areas and projected the probabilities of age, race, income, etc. onto each of the 10,000 records you're creating from scratch, you could semi-randomly create 10,000 records that would look more reasonable for that specific county.
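A rough sketch of that borrowing idea, assuming the sampled records for the small county and its neighbors have already been pooled into one donor table (all column names and values below are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Pooled donor records: the small county's own ten records plus records from
# nearby counties, each carrying a survey weight. In practice these would come
# from the ACS/census PUMS for the surrounding area.
donors = pd.DataFrame({
    "age":    rng.integers(0, 90, size=500),
    "race":   rng.choice(["white", "black", "hispanic", "asian", "other"], size=500),
    "income": rng.lognormal(mean=10.5, sigma=0.8, size=500).round(-2),
    "weight": rng.uniform(50, 150, size=500),
})

KNOWN_COUNTY_POPULATION = 10_000

# Semi-random creation: draw 10,000 synthetic residents from the pooled donors,
# with probability proportional to each donor's survey weight, sampling with
# replacement. The result has a smoother age/race/income mix than replicating
# the county's own ten records, at the cost of assuming the county looks like
# its neighbors.
synthetic_county = donors.sample(
    n=KNOWN_COUNTY_POPULATION,
    replace=True,
    weights="weight",
    random_state=1,
).drop(columns="weight").reset_index(drop=True)

print(synthetic_county.describe(include="all"))
```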

I am unsure whether this sort of thing hasn't been done because it's a terrible idea for some reason, or whether statisticians and demographers simply haven't had the computing power to deal quickly with 300 million records until very recently.

Anthony Damico
  • I am not sure what you mean by “merge information from other, smaller data sets”, but generally speaking this seems like some variant of the surprisingly common question “Why not just duplicate records to reach some desirable sample size?” This just doesn't make sense; you can't magically create information that isn't there. – Gala Aug 05 '13 at 15:16
  • If you started with the 1% census microdata and had to guess the age/sex/race/income of every individual person inside a 10,000-person district somewhere within the United States, you could do a pretty good job of creating a fake data set with a realistic spread of people. Yes, very few people would be exactly right, and you'd want some measure of how uncertain you are, but I'm not asking for wizardry. :) – Anthony Damico Aug 05 '13 at 15:37
  • It doesn't make sense to me (statistically it may be possible!). Let's take your example: if one county has 10 individuals in the sample and three neighboring counties have 15, 20, and 30, how can I expand the sample of 10 individuals in the first county to the population of 10,000 individuals (unless you have population characteristics of the neighboring counties)? – Metrics Aug 05 '13 at 18:08
  • So if you know a county has 10,000 residents, maybe you could take the age/sex/race/income distribution in the nearest X counties (or at least nearby counties that have other characteristics in common) until you have 50,000 records. Maybe you can then just use a randomly sampled 10,000 individuals from those 50,000 records as your "synthetic" population for that county, and repeat the same trick over and over again for every county in the United States? Or maybe use those 50,000 records to inform some imputation of the 10,000 county residents? Is this making any sense? :) – Anthony Damico Aug 05 '13 at 18:40
  • Maybe this notion of creating “records” is a distraction? What you can do is get a better idea of the mean of some variable for the county than is afforded by the small number of observations you have, “borrowing strength” from the larger sample. That's a common theme in Andrew Gelman's work with multi-level models. – Gala Aug 05 '13 at 21:41
  • Thanks. The point is to create a starting population-wide data set that quickly allows a _best guess_ for every small geography (even if it's uncertain), with the ultimate goal of cold-decking on a bunch of additional variables that aren't already in the original [survey] data set. Obviously, falsely increasing your sample 300x means you've got to somehow broaden the confidence intervals, but once that's done, I don't see why this data structure would have to be measurably inferior to non-synthetic data? And it's much easier to work with and explain. – Anthony Damico Aug 05 '13 at 21:56
  • Well, for one thing, that means that everything you do down the line has to take that into account; it sounds very dangerous, and you can't rely on standard methods or software. I am still not entirely clear on what you want to do, but I would first look into more principled solutions instead of focusing on this idea of creating records out of thin air. Running standard routines on a fictional data set and somehow correcting the results after the fact, instead of using an appropriate model, seems a backward way of approaching the problem. Those additional observations just aren't there. – Gala Aug 05 '13 at 22:24
  • (Also: Could you try to use proper case? I edited your original post to fix the problem but it makes your messages harder to read.) – Gala Aug 05 '13 at 22:25
  • You can't rely on standard methods, but perhaps there's a simple design-effect adjustment that could be used when calculating confidence intervals (a rough sketch of such an adjustment follows these comments), and the structure of this ideal data would be so simple that it would be easy to implement everywhere. Another advantage: creating records _out of thin air_ would make it far easier to construct [pseudo-panel data](http://arno.uvt.nl/show.cgi?fid=26669) from two cross-sectional years of something like the American Community Survey. :) – Anthony Damico Aug 05 '13 at 23:32
  • Still, all this makes very little sense to me and doesn't sound simple at all. In any case, I am not aware of anything like this and I don't quite understand why you seem so committed to this idea and uninterested in actual solutions to actual problems like the ones I mentioned before. – Gala Aug 06 '13 at 07:38
  • It's an integrated approach to modeling almost every aspect of the American population. It's a data set that non-statisticians can quickly understand. It would allow researchers to look at synthetic individuals _over ten years_ and _at the census tract level_, and use any variable you were able to cold-deck from any available survey data set. I get that it's not straightforward. If nobody's done it, hey, more fun for me. – Anthony Damico Aug 06 '13 at 10:32
  • I haven't seen anything quite like this, but it has similarities to [dasymetric](http://stats.stackexchange.com/q/15784/1036) mapping, and it is kind of the opposite of when people want to [obfuscate data for privacy reasons](http://gis.stackexchange.com/a/25854/751). I wouldn't say it is a terrible idea; it is just very difficult (to impossible) to project the individual-level data while maintaining the appropriate relationships between variables (i.e., the ecological fallacy prevents you from knowing the individual-level correlations). – Andy W Aug 06 '13 at 13:50
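As a rough illustration of the design-effect-style adjustment floated in the comments above (a toy Python sketch only; a real survey would also carry clustering and stratification design effects on top of the expansion factor):

```python
import numpy as np

rng = np.random.default_rng(11)

# Original survey sample and the synthetic file built by expanding it ~300x.
n_sample = 1_000
expansion = 300
income = rng.lognormal(mean=10.5, sigma=0.8, size=n_sample)
synthetic_income = np.repeat(income, expansion)          # 300,000 synthetic records

# Naive standard error computed on the synthetic file -- far too small, because
# the 300,000 records only contain 1,000 independent observations' worth of data.
naive_se = synthetic_income.std(ddof=1) / np.sqrt(len(synthetic_income))

# Crude correction: inflate by the square root of the expansion factor, which
# roughly recovers the standard error the original sample would have given
# under simple random sampling.
corrected_se = naive_se * np.sqrt(expansion)
true_srs_se = income.std(ddof=1) / np.sqrt(n_sample)

print(f"naive SE on synthetic file: {naive_se:.2f}")
print(f"corrected SE:               {corrected_se:.2f}")
print(f"SE from original sample:    {true_srs_se:.2f}")
```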

1 Answer


Two projects come pretty close to what you want to do.

First, the Synthetic Population Viewer from RTI International uses 2007–2011 ACS data and constructs "synthetic" households so that they sum to the 2010 census tract estimates.

You can find a methods explanation here:

Wheaton, W.D., J.C. Cajka, B.M. Chasteen, D.K. Wagener, P.C. Cooley, L. Ganapathi, D.J. Roberts, and J.L. Allpress. 2009. "Synthesized population databases: A U.S. geospatial database for agent-based models." RTI Press.
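Very loosely, the core step in that kind of synthesis is to adjust household weights so they reproduce published control totals for each tract and then draw households in proportion to the adjusted weights. Here is a stripped-down sketch of that general idea with a single control variable (illustrative data and names only, not RTI's actual procedure):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy ACS-style household file: household size and a survey weight.
households = pd.DataFrame({
    "hh_size": rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.28, 0.34, 0.16, 0.14, 0.08]),
    "weight":  rng.uniform(20, 60, size=1000),
})

# Published census-tract control totals for the same variable (household counts
# by size). These numbers are invented for illustration.
tract_controls = {1: 310, 2: 390, 3: 170, 4: 150, 5: 80}

# Post-stratification: scale weights within each household-size cell so the
# weighted totals match the tract controls exactly.
cell_totals = households.groupby("hh_size")["weight"].transform("sum")
target = households["hh_size"].map(tract_controls)
households["adj_weight"] = households["weight"] * target / cell_totals

# Draw a synthetic set of tract households in proportion to the adjusted weights.
n_tract_households = sum(tract_controls.values())
synthetic_tract = households.sample(
    n=n_tract_households, replace=True, weights="adj_weight", random_state=7
)

# The synthetic tract now reproduces the control distribution (in expectation).
print(synthetic_tract["hh_size"].value_counts().sort_index())
```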

Second, as Andy W mentioned, this is similar to dasymetric mapping, where ancillary information is combined with survey data to come up with small-area estimates. A good example of this method is the work by Nagle and colleagues:

Nagle, Nicholas N., et al. 2014. "Dasymetric Modeling and Uncertainty." Annals of the Association of American Geographers 104(1): 80–95.

Leyk, S., B.P. Buttenfield, and N.N. Nagle. 2013. "Modeling ambiguity in census microdata allocations to improve demographic small area estimates." Transactions in Geographic Information Science.

Proper caution is still needed when using the output of either of these two methods, but I think you could use either approach as a baseline for "cold-deck imputation" at the census tract level. Keep in mind that cold-deck imputation should only be used under the heroic assumption that the data are missing completely at random.
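For what it's worth, the cold-deck step itself might look roughly like the following: stratify the synthetic population and a donor survey on characteristics they share, then copy donor values within each stratum. All names and data below are invented, and the missing-completely-at-random caveat above applies:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic population built earlier: has demographics but not, say, smoking status.
synthetic = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35-64", "65+"], size=10_000),
    "sex":       rng.choice(["F", "M"], size=10_000),
})

# Donor survey (a different, smaller data set) that does measure smoking status.
donor = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35-64", "65+"], size=2_000),
    "sex":       rng.choice(["F", "M"], size=2_000),
    "smoker":    rng.choice([0, 1], size=2_000, p=[0.8, 0.2]),
})

# Cold-deck imputation: within each age_group x sex stratum, fill in the missing
# variable by drawing values from the donor survey's records in the same stratum.
# This is only defensible if smoking status is missing completely at random given
# the stratifiers -- the "heroic" assumption noted above.
def impute_stratum(group):
    pool = donor[(donor["age_group"] == group.name[0]) & (donor["sex"] == group.name[1])]
    group = group.copy()
    group["smoker"] = rng.choice(pool["smoker"].to_numpy(), size=len(group))
    return group

synthetic = (
    synthetic.groupby(["age_group", "sex"], group_keys=False).apply(impute_stratum)
)
print(synthetic.groupby(["age_group", "sex"])["smoker"].mean())
```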

Matt Moehr