
Context

In transportation planning, agent-based microsimulation is one way to deal with the complexity of the problem. Instead of computing aggregate flows (as in the classical four-step model), agents (persons, households, businesses, ...) are simulated individually, and results are obtained by aggregating after the simulation.

For the US, data sources for the simulation include the PUMS (Public Use Microdata Sample), an anonymized 5% sample of the full population census, and aggregated control totals (also broken down spatially). The population census contains mostly demographic data. Similar data sources exist for other countries. These data cannot be fed into a microsimulation model as they are, because the model requires

  • Ideally, a full sample with demographic variables

  • Spatial information for each agent

  • Adherence to control totals

This problem has received some attention in the recent literature, starting with Beckman et al.'s 1996 paper (reference below). However, my feeling is that this is often seen as an engineering rather than a statistical analysis task.
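
As far as I understand, Beckman et al. fit a cross-tabulation of the sample to the control totals by iterative proportional fitting (IPF). For concreteness, here is a minimal sketch of that fitting step on a toy two-variable table; the seed values, control totals, and the function name ipf_2d are made up for illustration:

    import numpy as np

    def ipf_2d(seed, row_targets, col_targets, max_iter=1000, tol=1e-9):
        """Fit a 2-D seed table to given row/column control totals by IPF."""
        table = np.asarray(seed, dtype=float).copy()
        for _ in range(max_iter):
            # scale each row so that row sums match the row control totals
            table *= (row_targets / table.sum(axis=1))[:, None]
            # scale each column so that column sums match the column control totals
            table *= (col_targets / table.sum(axis=0))[None, :]
            if np.allclose(table.sum(axis=1), row_targets, atol=tol):
                break
        return table

    # made-up seed cross-tabulation from the sample (e.g. household size x income class)
    seed = np.array([[10.0,  5.0],
                     [ 3.0, 12.0]])
    row_targets = np.array([120.0, 80.0])   # control totals for variable 1
    col_targets = np.array([90.0, 110.0])   # control totals for variable 2
    print(ipf_2d(seed, row_targets, col_targets))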

I am aware of this related question, but the answer doesn't seem to fit my needs.

Questions

  1. While inferring properties of a population from a sample is a common statistical task, inferring a likely full population from a sample doesn't seem to be that common. In which other scientific fields could this (or a similar) problem be relevant?

  2. Which methodology/techniques does one need to know in order to perform a maximum-likelihood estimation of a population given a sample? I have peeked into entropy maximization, bootstrapping and Gibbs sampling/MCMC, but I'm missing a feeling for which technique would work best, and I don't have a deep statistics background. (Actually, near-maximum likelihood would be fine, too. Eventually, fractional weights will have to become integers during the process, and this is computationally expensive if it has to be done exactly; the first sketch after this list shows the kind of integerization step I have in mind.)

  3. Which approach is best suited to validating the synthetic population, other than running the microsimulation model? Should I reserve a validation data set, or would some sort of cross-validation be suitable in this case? Again, I am missing some background here, so I'd appreciate any hints.

  4. Do I really need to carry out statistical equivalence testing for the validation (as described by Wellek), or "will it do" to fail to detect a difference? I am aware that the latter is incorrect from a statistical point of view, but I was also unable to apply Wellek's methods to my data; the second sketch below shows the kind of equivalence test I have in mind.
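
For illustration of the integerization step mentioned in question 2, the kind of operation I have in mind is something like stochastic rounding of the fitted fractional weights, which is cheap and unbiased in expectation but not exact; the function name and the example weights below are made up:

    import numpy as np

    def stochastic_round(weights, rng):
        """Round fractional weights to integers; unbiased in expectation."""
        floor = np.floor(weights)
        frac = weights - floor
        # round up with probability equal to the fractional part, down otherwise
        return (floor + (rng.random(len(weights)) < frac)).astype(int)

    weights = np.array([2.3, 0.7, 1.5, 4.05])   # made-up fitted weights
    print(stochastic_round(weights, np.random.default_rng(0)))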
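
Regarding question 4, my understanding of an equivalence test is a TOST-style procedure (two one-sided tests) roughly along the following lines; the equivalence margin and the data are placeholders, and this is not meant to reproduce Wellek's exact methods:

    import numpy as np
    from scipy import stats

    def tost_mean_difference(x, y, margin):
        """TOST for the difference of two means with a symmetric margin.

        Equivalence is declared at level alpha if both one-sided tests
        reject, i.e. if the returned p-value is below alpha.
        """
        nx, ny = len(x), len(y)
        diff = np.mean(x) - np.mean(y)
        # pooled variance and standard error of the mean difference
        sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        se = np.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
        df = nx + ny - 2
        p_lower = stats.t.sf((diff + margin) / se, df)   # H0: diff <= -margin
        p_upper = stats.t.cdf((diff - margin) / se, df)  # H0: diff >= +margin
        return max(p_lower, p_upper)

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 200)    # e.g. a variable in the synthetic population
    y = rng.normal(0.05, 1.0, 200)   # the same variable in the reference data
    print(tost_mean_difference(x, y, margin=0.2))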

I appreciate any hints, remarks, and literature references.


Beckman, R. J., Baggerly, K. A., & McKay, M. D. (1996). Creating synthetic baseline populations. Transportation Research Part A, 30(6), 415–429.

Wellek, S. (2010). Testing Statistical Hypotheses of Equivalence. Chapman and Hall/CRC.
