Skip to content

Including External Data in a FRED model

In this tutorial, we demonstrate two important ways in which a FRED model can be enhanced by incorporating data from an external source:

  1. The Epistemix Platform Synthetic Population can be augmented with custom agent attributes relevant to a specific modeling problem.
  2. Custom places in FRED that our agents will visit during simulation runs can be created.

We will explore these two use cases by building on a basic respiritory disease model. First, agent attributes called my_prob_overweight and my_prob_obese_given_overweight are used to categorize agents as being according to their Body Mass Index in a realistic way. Agents who are categorized as having a "healthy weight" (BMI < 25) according to the Body Mass Index (BMI), retain the default probability of becoming symptomatic when they are infected with influenza (66%). Agents who are categorized as "overweight" (25 <= BMI < 30) have their probability of becoming symptomatic increase to 75%. Agents who are categorized as "obese" (BMI >= 30) have their probability of becoming symptomatic increase to 90%.

Then, we introduce a data-dependent mitigation strategy in the form of weight loss clinics whose locations are specified by an external data set. Agents who are categorized as "overweight" or "obese" are assumed to visit a weight loss clinic if there is one within a 1.2 km radius (approximately a 15-minute walk) of their household. Visiting a weight loss clinic is assumed to be associated with some chance of transitioning to a different BMI category: Every 30 days, agents who are categorized as "obese" have a 20% chance of transitioning to the "overweight" category, and agents who are categorized as "overweight" have a 10% chance of transitioning to the "healthy weight" category.

In this tutorial, the external data set that describes the locations of the weight loss clinics is randomly generated. This enables the notebook to be self-contained. However, the method that is demonstrated here, can just as easily be used to represent real places with known geographical locations in a FRED model.

import numpy as np

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

import matplotlib.pyplot as plt
import plotly.express as px

from util import get_n_points_in_polygon, agent_df_to_fac_df, plot_factor_summary

1. Generating External Data Sets

In this section, we generate two files containing data that will be incorporated into the FRED model that is described in Section 2. These are:

  1. data/prob_overweight_obese.csv, which specifies the conditional probabilities that an agent is "overweight" or "obese" according to the BMI, given their age, sex, and race.
  2. data/weight_loss_clinics.csv, which specifies the geographic locations of weight loss clinics.

1.1 First Data Set: Demographic-Dependent BMI Category Probabilities

The data/prob_overweight_obese.csv file will contain the six columns. The first four are used to partition the agent population into classes based on their demographic characterstics. The last two are class-specific (conditional) probabilities.

The specific columns are as follows:

  1. sex_id: sex of agents in class (0 = Female, 1 = Male).
  2. race_id: race of agents in class (see numerical codes below).
  3. age_lower_bound: minimum age of agents in class in years (inclusive).
  4. age_upper_bound: maximum age of agents in class in years (inclusive).
  5. prob_overweight: conditional probability of an individual being "overweight" given their class.
  6. prob_obese_given_overweight: conditional probability of an individual being "obese" given their class and that they are "overweight".

The numerical values representing different races in FRED are listed below (see also documentation in the FRED Language Reference:

  • -1 = UNKNOWN
  • 1 = WHITE
  • 2 = AFRICAN_AMERICAN
  • 3 = AMERICAN_INDIAN
  • 4 = ALASKA_NATIVE
  • 5 = TRIBAL
  • 6 = ASIAN
  • 7 = HAWAIIAN_NATIVE
  • 8 = OTHER_RACE
  • 9 = MULTIPLE_RACE

To make the probabilities of agents being categorized as "overweight" or "obese" in the model realistic, we use results of empirical research into the proportions of US adults who are categorized as "overweight" or "obese" based on their age, sex, and race (see References, below: Flegal et al., 2012).

The file data/proc/flegal_2012_table2_extract.csv contains a manually extracted copy of the relevant data from Table 2 of Flegal et al., 2012. We can read this file to load the data into the notebook as a DataFrame:

prob_df = (
    pd.read_csv("data/raw/flegal_2012_table2_extract.csv")
    .assign(prob_overweight=lambda df: df["pct_overweight"] / 100)
    .assign(prob_obese=lambda df: df["pct_obese"] / 100)
    .drop(columns=["pct_overweight", "pct_obese"])
)

prob_df.head()

Simplifying assumptions:

  • Everyone under the age of 20 is assumed to have no chance to be categorized as "overweight" or "obese," because data for that age group isn't reported in Flegal et al., 2012
  • The probabilities for races other than white and African American are assumed to be equal to the probabilities in the "other" category from the data.
  • White and African American agents are assumed to be non-hispanic, as hispanic agents are not explicitly represented in the Synthetic Population.

Assign FRED-readable codes to sex and race attributes

sex_map = pd.Series(
    {
        "male": 1,
        "female": 0,
    },
    name="sex_id",
).rename_axis("sex")

race_map = pd.Series(
    {
        "white": 1,
        "african_american": 2,
    },
    name="race_id",
).rename_axis("race")

The race other in prob_df is handled separately. The values corresponding to other are used for all other race codes available in FRED.

other_fred_race_codes = [
    x for x in range(-1, 10) if x not in race_map.values and x != 0
]

other_prob_df = pd.concat(
    [
        prob_df.pipe(lambda df: df[df["race"] == "other"]).assign(race_id=race_id)
        for race_id in other_fred_race_codes
    ]
).drop(columns="race")

white_aa_prob_df = (
    prob_df.pipe(lambda df: df[df["race"] != "other"])
    .merge(race_map.to_frame(), on="race", how="left")
    .drop(columns="race")
)

prob_df = (
    pd.concat([white_aa_prob_df, other_prob_df])
    .merge(sex_map.to_frame(), how="left", on="sex")
    .drop(columns="sex")
    .loc[
        :,
        [
            "sex_id",
            "race_id",
            "age_lower_bound",
            "age_upper_bound",
            "prob_overweight",
            "prob_obese",
        ],
    ]
    .sort_values(by=["sex_id", "race_id", "age_lower_bound"])
)
prob_df.head()

Calculate prob_obese_given_overweight

Applying the definition of conditional probability, we know that

\[ P(B|O) = \frac{P(B,O)}{P(O)} \]

where \(P(B|O)\) is the probability of an individual being categorized as "obese" given they are categorized as "overweight" and \(P(B,O)\) is the probability that an individual is categorized as both "obese" and "overweight". Since the categories are overlapping, individuals who categorized as "obese" are also categorized as "overweight." That is, \(P(B,O) = P(B)\). Therefore:

\[ P(B|O) = \frac{P(B)}{P(O)} \]

Applying this calculation to prob_df, we have:

prob_df = prob_df.assign(
    prob_obese_given_overweight=lambda df: (
        (df["prob_obese"] / df["prob_overweight"]).fillna(0)
    )
).drop(columns=["prob_obese"])
prob_df.head()

Now, we are ready to save this table as a CSV file that can be accessed during the FRED simulation.

prob_df.to_csv("data/prob_overweight_obese.csv", index=False)

1.2 Second Data Set: Weight Loss Clinic Locations

The data/weight_loss_clinics.csv file will contain the following columns:

  • ID: identifier for the weight loss clinic (set to 0 for all clinics, so that FRED will automatically generate IDs for us).
  • LAT: Geographic latitude of the weight loss clinic location.
  • LON: Geographic longitude of the weight loss clinic location.
  • ELEV: Elevation of weight loss clinic location (set to 0 for all clinics, since elevation is not relevant to this model).

Each row in the file represents an individual weight loss clinic location. Below, we will create this file by randomly generating a set of points to produce a realistic overall spatial density.

Note that, if we wanted to associate additional information about places specified by external data for use in the model (e.g., max_patients), we could do so by including additional columns in this file. See the FRED Language Guide for more information.

To generate the weight loss clinic location file, we will first download US County geographic boundary data in shapefile format from the US Census website and read it into a GeoDataFrame.

us_counties_shapefile_url = (
    "https://www2.census.gov/geo/tiger/GENZ2021/shp/cb_2021_us_county_20m.zip"
)
all_us_county_boundaries_gdf = gpd.read_file(us_counties_shapefile_url)

Then, we can extract the geographic boundary of Erie County, PA (FIPS code 42049):

erie_county_fips = "42049"
erie_county_boundary_gdf = all_us_county_boundaries_gdf.pipe(
    lambda df: df[df["GEOID"] == erie_county_fips]
)
erie_county_boundary_gdf

The boundary coordinates in erie_county_boundary_gdf are expressed in geodetic coordinates (where the units are degrees). We project these coordinates into the Albers coordinate reference system (CRS), which uses coordinates expressed in meters. The Albers CRS has the property - called "equal area" - that it preserves the area of a projected region compared to the region described by the corresponding geodetic coordinates. This is desirable, since our objective is to randomly generate a specific number of points per unit of area. The Albers projection has the EPSG code 9822.

erie_county_boundary_proj_gdf = erie_county_boundary_gdf.to_crs(9822)

As a parameter of our model, let's assume there are is one weight loss clinic for every 5000 people.

clinic_density_per_capita = 1.0 / 5000

The population of Erie County county in our synthetic population is 266,460.

population_size = 266460

Given these variables, we are ready to randomly generate clinic locations:

np.random.seed(12345) # for reproducibility
clinic_locs = get_n_points_in_polygon(
    erie_county_boundary_proj_gdf.iloc[0]["geometry"],
    int(clinic_density_per_capita * population_size),
)

f, ax = plt.subplots()
erie_county_boundary_proj_gdf["geometry"].plot(ax=ax, color="b")
clinic_locs.plot(ax=ax, color="k");

In the following section, we will use the FRED function read_place_file to read these locations into our simulation model.

First, though, we need to create a correctly-formatted table in the correct format that is accepted by read_place_file and save that table in a CSV file in the data directory.

(
    clinic_locs.set_crs(9822)
    .to_crs(4326)
    .to_frame()
    .assign(ID=0)  # Causes FRED to randomly generate place IDs for us
    .assign(LAT=lambda df: df["geometry"].y)
    .assign(LON=lambda df: df["geometry"].x)
    .assign(ELEV=0)
    .drop(columns="geometry")
).to_csv("data/weight_loss_clinics.csv", index=False)

2. Examining the FRED model

The FRED model that will use the external data files that we have generated (data/prob_overweight_obese.csv and data/weight_loss_clinics.csv) is contained in the main.fred file (which references several additional files in the model directory). Much of the FRED code contained in this model will be familiar from the lessons in Parts 1-10 of the Quickstart Guide, so here we focus on the FRED features that we use to load in external data.

main.fred

In main.fred, we use the following blocks to create the my_prob_overweight and my_prob_obese_given_overweight attributes for agents using the values from data/prob_overweight_obese.csv:

use model/Read_Attribute {
    input_file = data/prob_overweight_obese.csv
    attribute = my_prob_overweight
    column = 4
    min_age_column = 2
    max_age_column = 3
    sex_column = 0
    race_column = 1
}

use model/Read_Attribute {
    input_file = data/prob_overweight_obese.csv
    attribute = my_prob_obese_given_overweight
    column = 5
    min_age_column = 2
    max_age_column = 3
    sex_column = 0
    race_column = 1
}
These blocks reference a FRED module called Read_Attribute. Modules - which are contained in .fredmod files to distinguish them from ordinary .fred files - are a feature of FRED that make it easy to reuse code multiple times - either in the same model or even across different models. They accept input parameters (such as those specified in the use block above) that customize the generic code written in the module for use in the specific model invoking it. In this example, our objective is to assign the values of two agent variables - my_prob_overweight and my_prob_obese_given_overweight - based on data that is read data from a .csv file. This is a natural use case for modules, because the FRED code needed to load data for each variable is essentially the same, except for the name of the variable and the column number in the .csv file containing the corresponding data. We can specify these values as module parameters.

For more information about modules, including details of their specialized syntax, see the chapter on Modularity in the FRED Language Guide.

Each row in data/prob_overweight_obese.csv (see sample below) corresponds to a group of agents identified by their age range, sex, and race with the same probabilities of being "overweight" and "obese" according to their BMI.

sex_id,race_id,age_lower_bound,age_upper_bound,prob_overweight,prob_obese_given_overweight
0,-1,0,19,0.0,0.0
0,-1,20,39,0.5579999999999999,0.5716845878136202
0,-1,40,59,0.66,0.5454545454545454

In the code blocks following the use statements above, we specify: 1. The name of the FRED agent variable that will be created to store the data values (attribute).
2.The column number in the input data file that contains the values that will be stored in the given attribute for each class of agent (column). 3. The column numbers in the input data file that identify the demographic characteristics that define our agent classes (min_age_column, max_age_column, sex_column, and race_column).

Note that the column numbers in all cases are 0-indexed. We call use model/Read_Attribute once for each attribute we want to instantiate for the agents (my_prob_overweight and my_prob_obese_given_overweight).

Recall that age, sex, and race are agent attributes that are built into the Epistemix Platform Synthetic Population. As a result, they are automatically available in every FRED model that uses the provided Synthetic Population. Because of this, they can be thought of as "scaffolding" attributes, which can be used as the basis for new agent attributes whose values are drawn from model-specific external data in a way that depends on the values of these existing attributes, as we have demonstrated here.

model/weight_loss_clinic_mitigation.fred

The FRED code used to implement the effect of weight loss clinics - i.e., that living within 1.2 km of a weight loss clinic causes agents to change their BMI category with some probability - is in the file weight_loss_clinic_mitigation.fred.

Here, we declare the existence of a type of FRED Place called a WeightLossClinic:

place WeightLossClinic {
    has_group_agent = 1
}
We specify that WeightLossClinic places have Group Agents (has_group_agent = 1), so that we can ask these agents for information about the location of the weight loss clinics in the model's conditions. We specify that Household places also have group agents for the same reason. Households are included in FRED models by default, with locations specified in the Synthetic Population, so the effect of the following block is only to associate group agents with these places.
place Household {
    has_group_agent = 1
}
We use three conditions in weight_loss_clinic_mitigation.fred to:

  1. Read the locations of weight loss clinics from data/weight_loss_clinics.csv (READ_WEIGHT_LOSS_CLINICS).
  2. Identify agent households within 1.2 km of a weight loss clinic (NEAR_WEIGHT_LOSS_CLINIC).
  3. Determine whether agents who are classified as "overweight" or "obese" will transition to a different BMI category on a 30-day schedule (WEIGHT_LOSS).

See documentation of these conditions in weight_loss_clinic_mitigation.fred for further details of the implementation.

3. Model behavior verification

In this section, we run the FRED model and analyze the outputs to verify that the model is behaving as expected. This is a typical part of the modeling process and is important to ensure that the model implementation matches the specification.

from epxexec.fred_job import fred_job

Note that the following command that runs the test simulation will take approximately 3min 30s to run. The model will run quicker using a smaller population than that of Erie County, PA. We have chosen Erie County to provide a reasonably large sample of each of the agent age, sex, and race groups used in the BMI status model. This will allow us to demonstrate that it is possible to recover the distributions of BMI statuses specified in data/prob_overweight_obese.csv from the agents in the synthetic population.

test_weight_loss_job = fred_job("main.fred", job_key="test_weightloss", force=True)

3.1 BMI status attributes correctly applied to agents

Here, we verify that our external obesity probability data is being correctly read into FRED and applied to the agents in the appropriate age, sex, and race categories. We can do this examining the agent attribute data that was output by FRED, as specified by the code in the report_agent_attributes.fred file, and comparing it to the prob_df input data.

agent_df = (
    pd.concat(
        [
            test_weight_loss_job.runs[1]
            .get_table_variable(variable=x[0])
            .astype(x[1])
            .rename_axis("agent_id")
            for x in [
                # Specify name of variable and the Python type to store the data as
                ("agent_age", int),
                ("agent_race", int),
                ("agent_sex", int),
                ("agent_prob_symptomatic", float),
            ]
        ],
        axis=1,
    )
    .rename(columns=lambda x: x.replace("agent_", ""))
    .rename(columns={"race": "race_id", "sex": "sex_id"})
)
agent_df.head()

The next cell associates agents' ages with the age groups used in prob_df:

all_ages_with_bounds_df = (
    prob_df[["age_lower_bound", "age_upper_bound"]]
    .groupby(level=0, as_index=False)
    .apply(
        lambda df: (
            pd.Series(
                np.arange(
                    df.iloc[0]["age_lower_bound"], df.iloc[0]["age_upper_bound"] + 1
                )
            )
            .rename("age")
            .to_frame()
            .assign(age_lower_bound=df.iloc[0]["age_lower_bound"])
            .assign(age_upper_bound=df.iloc[0]["age_upper_bound"])
        )
    )
    .reset_index(drop=True)
)
agent_df = agent_df.merge(all_ages_with_bounds_df, on="age", how="left")
agent_df.head()

In the BMI_STATUS condition in main.fred, we specified that agents in the "healthy weight" category have my_prob_symptomatic = 0.66, agents in the "overweight" category have my_prob_symptomatic = 0.75, and agents in the "obese" category have my_prob_symptomatic = 0.90.

We can use these values to infer each agent's BMI status:

bmi_status_map = (
    pd.Series(
        {
            "healthy_weight": 0.66,
            "overweight": 0.75,
            "obese": 0.90,
        }
    )
    .rename("prob_symptomatic")
    .rename_axis("bmi_status")
)
agent_df = agent_df.merge(
    bmi_status_map.to_frame().reset_index(), on="prob_symptomatic", how="left"
)

Then, we can calculate probability of each BMI status for each combination of agent age range, sex, and race:

agent_bmi_status_prob_df = (
    agent_df.groupby(["race_id", "sex_id", "age_lower_bound", "age_upper_bound"])[
        "bmi_status"
    ]
    .apply(
        lambda bmi_status_s: (
            (bmi_status_s.value_counts() / len(bmi_status_s.index))
            .reindex(pd.Index(["healthy_weight", "overweight", "obese"], name="bmi_status"))
            .fillna(0)
        )
    )
    .rename("prob_bmi_status")
    .to_frame()
    .unstack()["prob_bmi_status"]
    .rename(columns=lambda x: f"agent_prob_{x}")
    .reset_index()
    .rename_axis(None, axis=1)
    .drop(columns="agent_prob_healthy_weight")
)
agent_bmi_status_prob_df.head()

Now, let's create a DataFrame that allows us to compare the resulting agent health status probabilities to the input probabilities:

bmi_status_prob_comp_df = (
    prob_df.assign(
        prob_obese=lambda df: (
            df["prob_overweight"] * df["prob_obese_given_overweight"]
        )
    )
    .drop(columns="prob_obese_given_overweight")
    .merge(
        agent_bmi_status_prob_df,
        on=["sex_id", "race_id", "age_lower_bound", "age_upper_bound"],
        how="right",
    )
    .assign(
        agent_prob_overweight_or_obese=lambda df: (
            df["agent_prob_overweight"] + df["agent_prob_obese"]
        )
    )
    .drop(columns="agent_prob_overweight")
    .rename(columns={"prob_overweight": "prob_overweight_or_obese"})
    .assign(
        prob_overweight_or_obese_error=lambda df: (
            df["agent_prob_overweight_or_obese"] - df["prob_overweight_or_obese"]
        )
    )
    .assign(prob_obese_error=lambda df: (df["agent_prob_obese"] - df["prob_obese"]))
)

We can plot the distribution of differences between the proportion of agents classified as overweight or obese in the FRED simulation and the expected values across all age group, sex, and race combinations from the initial data. We refer to this difference as probability error.

error_cols = ["prob_overweight_or_obese_error", "prob_obese_error"]
plot_df = (
    bmi_status_prob_comp_df[error_cols]
    .stack()
    .rename("Probability error")
    .rename_axis((None, "Error Value"))
    .to_frame()
    .reset_index(level=1)
    .reset_index(drop=True)
)
fig = px.histogram(plot_df, x="Probability error", color="Error Value")
fig.show()

Based on this comparison, we can see that the vast majority of agent classes have probabilities of being categorized as "overweight" or "obese" that are within 10% of their expected values. The remaining outliers can be explained by the fact that the sample size of agents within those categories in the synthetic population is small.

Because agents are randomly assigned health statuses depending on their age, sex, and race, the output in the following cell will vary slightly each time the model is run (if the random seed is different). We have found that typically the agent categories with absolute probability errors > 10% correspond to agents with FRED race codes 3 (AMERICAN_INDIAN), 5 (TRIBAL), and 7 (HAWAIIAN_NATIVE). It is unsurprising that the relative numbers of people in these groups are small in Erie County, PA.

(
    bmi_status_prob_comp_df.pipe(
        lambda df: df[(df[error_cols[0]].abs() > 0.1) | (df[error_cols[1]].abs() > 0.1)]
    )
)

We can also study the number of agents assigned to each BMI status in more detail by summarizing the absolute number of agents in the simulation with each BMI status in each age, sex, and race category. These can be compared directly to the expected number of agents in each category implied by the probabilities provided by Flegal et al., 2012.

fac_df = agent_df_to_fac_df(agent_df, prob_df)
plot_factor_summary(fac_df, "age")
plot_factor_summary(fac_df, "sex")
plot_factor_summary(fac_df, "race")

We see that the number of agents in the simulation with each BMI status matches very well with the expected number based on the input probabilities when summarized by age, sex, and race. We can therefore conclude that the process we have used in Section 2 to assign health statuses to agents is working as expected.

3.2 Proximity to weight loss clinics causing agents to lose weight

Here, we check that agents are transitioning between Obese -> Overweight -> Healthy Weight states in the BMI_STATUS condition during the course of the model run as expected.

We can create a time series of the count of agents in each of the above states from the results associated with the test_weight_loss_job FREDJob object that we created at the beginning of this section by running the model.

bmi_statuses = ["Healthy_Weight", "Overweight", "Obese"]
bmi_status_names = ["Healthy Weight", "Overweight", "Obese"]
agent_bmi_status_df = pd.concat(
    [pd.Series(test_weight_loss_job.runs[1].sim_dates).rename("Date")]
    + [
        test_weight_loss_job.runs[1].get_state("BMI_STATUS", x, "count").rename(y)
        for x, y in zip(bmi_statuses, bmi_status_names)
    ],
    axis=1,
).melt(
    id_vars="Date",
    value_vars=bmi_status_names,
    var_name="BMI status",
    value_name="Agent count",
)

Then, we can plot the time series of the aggregate number of agents in each state.

fig = px.line(agent_bmi_status_df, x="Date", y="Agent count", color="BMI status")
fig.show()

We see that the number of "healthy weight" agents increases and the number of "obese" agents decreases stepwise every 30 days as expected. The number of "overweight" agents increases, because the probability of transitioning from the "obese" category to the "overweight" category is 0.2, whereas the probability of transitioning from the "overweight" category to the "healthy weight" category is 0.1 (see model/parameters.fred). This implies that a net positive increase in the number of "overweight" agents over time is an expected outcome of the model.

4. Conclusions and next steps

In this notebook, we have generated a data set that contains agent attributes that are relevant to a modeling problem, but which aren't present in the default Synthetic Population (prob_overweight_obese.csv) and demonstrated how to incorporate that data into a FRED model. Additionally, we created places that agents interact with in a FRED model using an external data set. We ran the model and verified that the agents are using the imported data in the expected ways.

The model is now ready to start exploring the effect of a weight loss clinic intervention on population health, especially on relationship between being categorized as "overweight" according to the BMI and respiratory morbidity. For example, readers might:

  • Invesigate how increasing the number of weight loss clinics per capita, clinic_density_per_capita (see Section 1.2), affects agents' BMI statuses.
  • Use empirical data to fine tune the rates at which agents transition between BMI status categories (see obese_to_overweight_prob and overweight_to_healthy_prob in model/parameters.fred).

References

Flegal KM, Carroll MD, Kit BK, Ogden CL. Prevalence of Obesity and Trends in the Distribution of Body Mass Index Among US Adults, 1999-2010. JAMA. 2012;307(5):491–497. doi:10.1001/jama.2012.39