
Constructing a Synthetic Population

Process

The default synthetic population is based on the U.S. Census Bureau’s Public Use Microdata Samples (PUMS) and aggregated data from the 2005-2009 American Community Survey (ACS) 5-year sample. As a static data set (a "snapshot" of the population at a given time), the synthetic population comprises a spatially accurate model of all households, schools, workplaces, and group quarters (e.g., prisons, college dorms, military bases, and nursing homes) in the United States. Individual agents are defined and assigned to each household, school, and workplace in the database so that the result closely matches the census-based spatial distributions of households and population sizes at the census block group level, as well as commuting patterns across census tract boundaries. For agent-based models (ABMs) of specific geographic regions in the U.S., this synthetic population provides a source of spatially accurate population information with which to initialize agents in the simulation.

Validation

Validation processes allow us to demonstrate that a model is sufficiently realistic to address its intended use cases. In cases where we use model components such as birth, death, and household formation to carry our synthetic population forward in time during a simulation, we validate by comparing to reference data from the U.S. Census Bureau. This allows us to confirm that the resulting synthetic population matches the reference data.

When comparing against the reference data, we examine the following metrics:

  • Total population size
  • The distribution of age, overall and by sex
  • The distribution of race and ethnicity
  • The distribution of marital status
  • The distribution of household size
  • Employment status

Using the above metrics, we compare our modeling outcomes against reference data in two ways. First, we compare them to aggregate population characteristics (such as age, sex, and race/ethnicity) and to household totals and characteristics (such as household size and composition), which are reported at the census tract level in the American Community Survey (ACS) five-year estimates. Second, we compare them to year-by-year population estimates and characteristics at the county level.
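As an illustration only, the comparison for a single metric might look like the following Python sketch. It is not the Epistemix validation code: the function names (`shares`, `compare_distribution`, `relative_error`) and all numbers are hypothetical, chosen to show how a simulated distribution and a simulated total could be checked against reference counts.

```python
from collections import Counter


def shares(counts):
    """Convert category counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute shares of an empty population")
    return {category: n / total for category, n in counts.items()}


def compare_distribution(synthetic_values, reference_counts):
    """Return, per category, the synthetic share minus the reference share."""
    synthetic = shares(Counter(synthetic_values))
    reference = shares(reference_counts)
    categories = sorted(set(synthetic) | set(reference))
    return {c: synthetic.get(c, 0.0) - reference.get(c, 0.0) for c in categories}


def relative_error(simulated_total, reference_total):
    """Signed relative error of a simulated total against a reference total."""
    return (simulated_total - reference_total) / reference_total


# Hypothetical data: household sizes of eight simulated households, and
# made-up reference counts of households by size for the same area.
synthetic_household_sizes = [1, 1, 2, 2, 2, 3, 4, 4]
reference_household_counts = {1: 250, 2: 350, 3: 200, 4: 200}

share_diffs = compare_distribution(synthetic_household_sizes, reference_household_counts)
population_error = relative_error(simulated_total=10_150, reference_total=10_000)
```

In this sketch, a value of `share_diffs` close to zero for every category, together with a small `population_error`, would indicate agreement with the reference data for that metric. What counts as "close enough" is a judgment that depends on the intended use case.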

Origin

The core of our default synthetic population was developed by RTI International. In short, RTI used the iterative proportional fitting method described in Beckman et al. (1996) to generate an agent population from the U.S. Census Bureau’s Public Use Microdata Sample (PUMS) files and aggregated Census data. See Wheaton et al. (2009) for a detailed description. As a result, each agent has a set of socio-demographic characteristics and daily behaviors, including age, sex, race, household location, and household membership.
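To give a sense of the iterative proportional fitting idea mentioned above, here is a minimal two-dimensional sketch in Python. It is a generic textbook version, not RTI's implementation: the function name `ipf`, the seed table, and the target margins are all hypothetical. The seed stands in for a cross-tabulation taken from a microdata sample, and the targets stand in for aggregate totals published for a small area.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [cell * factor for cell in table[i]]
        # Column step: scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns match exactly after the column step, so only rows are checked.
        max_row_error = max(
            abs(sum(table[i]) - target) for i, target in enumerate(row_targets)
        )
        if max_row_error < tol:
            break
    return table


# Hypothetical example: rows are two age groups, columns are two household sizes.
seed = [[10.0, 20.0], [30.0, 40.0]]  # sample cross-tabulation (made-up counts)
row_targets = [40.0, 60.0]           # aggregate totals by age group (made up)
col_targets = [45.0, 55.0]           # aggregate totals by household size (made up)

fitted = ipf(seed, row_targets, col_targets)
```

The fitted table keeps the association structure of the seed while matching both sets of margins. The row and column targets must sum to the same total for the procedure to converge.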

As described on the RTI web site:

“Unlike typical sociodemographic data, the RTI U.S. Synthetic Household Population represents households and persons as dots on a map—matching high-resolution population distributions with the correct mix of households in each census block group.”

The RTI synthetic population has been transformed into an Epistemix synthetic population, so that it can be used as a core component of data science and modeling projects with the Epistemix platform.