Including External Data in a FRED model¶
In this tutorial, we demonstrate two important ways in which a FRED model can be enhanced by incorporating data from an external source:
- The Epistemix Synthetic Population can be augmented with custom agent attributes relevant to a specific modeling problem.
- Custom places that our agents will visit during simulation runs can be created.
We will explore these two use cases by building on a basic respiratory disease model. First, agent attributes called my_prob_overweight
and my_prob_obese_given_overweight
are used to categorize agents according to their Body Mass Index in a realistic way. Agents who are categorized as having a "healthy weight" (BMI < 25) according to the Body Mass Index (BMI), retain the default probability of becoming symptomatic when they are infected with influenza (66%). Agents who are categorized as "overweight" (25 <= BMI < 30) have their probability of becoming symptomatic increase to 75%. Agents who are categorized as "obese" (BMI >= 30) have their probability of becoming symptomatic increase to 90%.
Then, we introduce a data-dependent mitigation strategy in the form of weight loss clinics whose locations are specified by an external data set. Agents who are categorized as "overweight" or "obese" are assumed to visit a weight loss clinic if there is one within a 1.2 km radius (approximately a 15-minute walk) of their household. Visiting a weight loss clinic is assumed to be associated with some chance of transitioning to a different BMI category: Every 30 days, agents who are categorized as "obese" have a 20% chance of transitioning to the "overweight" category, and agents who are categorized as "overweight" have a 10% chance of transitioning to the "healthy weight" category.
In this tutorial, the external data set that describes the locations of the weight loss clinics is randomly generated. This enables the notebook to be self-contained. However, the method that is demonstrated here, can just as easily be used to represent real places with known geographical locations in a FRED model.
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon
import matplotlib.pyplot as plt
import plotly.express as px
from util import get_n_points_in_polygon, agent_df_to_fac_df, plot_factor_summary
1. Generating External Data Sets¶
In this section, we generate two files containing data that will be incorporated into the FRED model that is described in Section 2. These are:
data/prob_overweight_obese.csv
, which specifies the conditional probabilities that an agent is "overweight" or "obese" according to the BMI, given their age, sex, and race.data/weight_loss_clinics.csv
, which specifies the geographic locations of weight loss clinics.
1.1 First Data Set: Demographic-Dependent BMI Category Probabilities¶
The data/prob_overweight_obese.csv
file will contain the six columns. The first four are used to partition the agent population into classes based on their demographic characterstics. The last two are class-specific (conditional) probabilities.
The specific columns are as follows:
sex_id
: sex of agents in class (0 = Female, 1 = Male).race_id
: race of agents in class (see numerical codes below).age_lower_bound
: minimum age of agents in class in years (inclusive).age_upper_bound
: maximum age of agents in class in years (inclusive).prob_overweight
: conditional probability of an individual being "overweight" given their class.prob_obese_given_overweight
: conditional probability of an individual being "obese" given their class and that they are "overweight".
The numerical values representing different races in the FRED modeling language are listed below (see also documentation in the FRED Language Reference:
- -1 = UNKNOWN
- 1 = WHITE
- 2 = AFRICAN_AMERICAN
- 3 = AMERICAN_INDIAN
- 4 = ALASKA_NATIVE
- 5 = TRIBAL
- 6 = ASIAN
- 7 = HAWAIIAN_NATIVE
- 8 = OTHER_RACE
- 9 = MULTIPLE_RACE
To make the probabilities of agents being categorized as "overweight" or "obese" in the model realistic, we use results of empirical research into the proportions of US adults who are categorized as "overweight" or "obese" based on their age, sex, and race (see References, below: Flegal et al., 2012).
The file data/proc/flegal_2012_table2_extract.csv
contains a manually extracted copy of the relevant data from Table 2 of Flegal et al., 2012.
We can read this file to load the data into the notebook as a DataFrame:
prob_df = (
pd.read_csv("data/raw/flegal_2012_table2_extract.csv")
.assign(prob_overweight=lambda df: df["pct_overweight"] / 100)
.assign(prob_obese=lambda df: df["pct_obese"] / 100)
.drop(columns=["pct_overweight", "pct_obese"])
)
prob_df.head()
Simplifying assumptions:¶
- Everyone under the age of 20 is assumed to have no chance to be categorized as "overweight" or "obese," because data for that age group isn't reported in Flegal et al., 2012
- The probabilities for races other than white and African American are assumed to be equal to the probabilities in the "other" category from the data.
- White and African American agents are assumed to be non-Hispanic, as ethnicity is not explicitly represented in the Synthetic Population.
Assign FRED-readable codes to sex
and race
attributes¶
sex_map = pd.Series(
{
"male": 1,
"female": 0,
},
name="sex_id",
).rename_axis("sex")
race_map = pd.Series(
{
"white": 1,
"african_american": 2,
},
name="race_id",
).rename_axis("race")
The race other
in prob_df
is handled separately. The values corresponding to other
are used for all other race codes.
other_fred_race_codes = [
x for x in range(-1, 10) if x not in race_map.values and x != 0
]
other_prob_df = pd.concat(
[
prob_df.pipe(lambda df: df[df["race"] == "other"]).assign(race_id=race_id)
for race_id in other_fred_race_codes
]
).drop(columns="race")
white_aa_prob_df = (
prob_df.pipe(lambda df: df[df["race"] != "other"])
.merge(race_map.to_frame(), on="race", how="left")
.drop(columns="race")
)
prob_df = (
pd.concat([white_aa_prob_df, other_prob_df])
.merge(sex_map.to_frame(), how="left", on="sex")
.drop(columns="sex")
.loc[
:,
[
"sex_id",
"race_id",
"age_lower_bound",
"age_upper_bound",
"prob_overweight",
"prob_obese",
],
]
.sort_values(by=["sex_id", "race_id", "age_lower_bound"])
)
prob_df.head()
Calculate prob_obese_given_overweight
¶
Applying the definition of conditional probability, we know that
where \(P(B|O)\) is the probability of an individual being categorized as "obese" given they are categorized as "overweight" and \(P(B,O)\) is the probability that an individual is categorized as both "obese" and "overweight". Since the categories are overlapping, individuals who categorized as "obese" are also categorized as "overweight." That is, \(P(B,O) = P(B)\). Therefore:
Applying this calculation to prob_df
, we have:
prob_df = prob_df.assign(
prob_obese_given_overweight=lambda df: (
(df["prob_obese"] / df["prob_overweight"]).fillna(0)
)
).drop(columns=["prob_obese"])
prob_df.head()
Now, we are ready to save this table as a CSV file that can be accessed during the FRED simulation.
1.2 Second Data Set: Weight Loss Clinic Locations¶
The data/weight_loss_clinics.csv
file will contain the following columns:
ID
: identifier for the weight loss clinic (set to 0 for all clinics, so that FRED will automatically generate IDs for us).LAT
: Geographic latitude of the weight loss clinic location.LON
: Geographic longitude of the weight loss clinic location.ELEV
: Elevation of weight loss clinic location (set to 0 for all clinics, since elevation is not relevant to this model).
Each row in the file represents an individual weight loss clinic location. Below, we will create this file by randomly generating a set of points to produce a realistic overall spatial density.
Note that, if we wanted to associate additional information about places specified by external data for use in the model (e.g., max_patients
), we could do so by including additional columns in this file.
To generate the weight loss clinic location file, we will first download US County geographic boundary data in shapefile format from the US Census website and read it into a GeoDataFrame.
us_counties_shapefile_url = (
"https://www2.census.gov/geo/tiger/GENZ2021/shp/cb_2021_us_county_20m.zip"
)
all_us_county_boundaries_gdf = gpd.read_file(us_counties_shapefile_url)
Then, we can extract the geographic boundary of Erie County, PA (FIPS code 42049
):
erie_county_fips = "42049"
erie_county_boundary_gdf = all_us_county_boundaries_gdf.pipe(
lambda df: df[df["GEOID"] == erie_county_fips]
)
erie_county_boundary_gdf
The boundary coordinates in erie_county_boundary_gdf
are expressed in geodetic coordinates (where the units are degrees). We project these coordinates into the Albers coordinate reference system (CRS), which uses coordinates expressed in meters. The Albers CRS has the property - called "equal area" - that it preserves the area of a projected region compared to the region described by the corresponding geodetic coordinates. This is desirable, since our objective is to randomly generate a specific number of points per unit of area. The Albers projection has the EPSG code 9822
.
As a parameter of our model, let's assume there are is one weight loss clinic for every 5000 people.
The population of Erie County county in our synthetic population is 278,027.
Given these variables, we are ready to randomly generate clinic locations:
np.random.seed(12345) # for reproducibility
clinic_locs = get_n_points_in_polygon(
erie_county_boundary_proj_gdf.iloc[0]["geometry"],
int(clinic_density_per_capita * population_size),
)
f, ax = plt.subplots()
erie_county_boundary_proj_gdf["geometry"].plot(ax=ax, color="b")
clinic_locs.plot(ax=ax, color="k");
In the following section, we will use the read_place_file
function to read these locations into our simulation model.
First, though, we need to create a table in the correct format that is accepted by read_place_file
and save that table in a CSV file in the data
directory.
(
clinic_locs.set_crs(9822)
.to_crs(4326)
.to_frame()
.assign(ID=0) # Causes FRED to randomly generate place IDs for us
.assign(LAT=lambda df: df["geometry"].y)
.assign(LON=lambda df: df["geometry"].x)
.assign(ELEV=0)
.drop(columns="geometry")
).to_csv("data/weight_loss_clinics.csv", index=False)
2. Examining the FRED model¶
The FRED model that will use the external data files that we have generated (data/prob_overweight_obese.csv
and data/weight_loss_clinics.csv
) is contained in the main.fred
file (which references several additional files in the model
directory). Much of the FRED modeling language code contained in this model will be familiar from the lessons in Parts 1-10 of the Quickstart Guide, so here we focus on the language features that we use to load in external data.
main.fred
¶
In main.fred
, we use the following blocks to create the my_prob_overweight
and my_prob_obese_given_overweight
attributes for agents using the values from data/prob_overweight_obese.csv
:
use model/Read_Attribute {
input_file = data/prob_overweight_obese.csv
attribute = my_prob_overweight
column = 4
min_age_column = 2
max_age_column = 3
sex_column = 0
race_column = 1
}
use model/Read_Attribute {
input_file = data/prob_overweight_obese.csv
attribute = my_prob_obese_given_overweight
column = 5
min_age_column = 2
max_age_column = 3
sex_column = 0
race_column = 1
}
Read_Attribute
. Modules - which are contained in .fredmod
files to distinguish them from ordinary .fred
files - are a feature of the FRED modeling langauge that make it easy to reuse code multiple times - either in the same model or even across different models. They accept input parameters (such as those specified in the use
block above) that customize the generic code written in the module for use in the specific model invoking it.
In this example, our objective is to assign the values of two agent variables - my_prob_overweight
and my_prob_obese_given_overweight
- based on data that is read data from a CSV file. This is a natural use case for modules, because the FRED code needed to load data for each variable is essentially the same, except for the name of the variable and the column number in the CSV file containing the corresponding data. We can specify these values as module parameters.
Each row in data/prob_overweight_obese.csv
(see sample below) corresponds to a group of agents identified by their age range, sex, and race with the same probabilities of being "overweight" and "obese" according to their BMI.
sex_id,race_id,age_lower_bound,age_upper_bound,prob_overweight,prob_obese_given_overweight
0,-1,0,19,0.0,0.0
0,-1,20,39,0.5579999999999999,0.5716845878136202
0,-1,40,59,0.66,0.5454545454545454
In the code blocks following the use
statements above, we specify:
1. The name of the agent variable that will be created to store the data values (attribute
).
2.The column number in the input data file that contains the values that will be stored in the given attribute for each class of agent (column
).
3. The column numbers in the input data file that identify the demographic characteristics that define our agent classes (min_age_column
, max_age_column
, sex_column
, and race_column
).
Note that the column numbers in all cases are 0-indexed. We call use model/Read_Attribute
once for each attribute we want to instantiate for the agents (my_prob_overweight
and my_prob_obese_given_overweight
).
Recall that age, sex, and race are agent attributes that are built into the Epistemix Synthetic Population. As a result, they are automatically available in every FRED model that uses the provided Synthetic Population. Because of this, they can be thought of as "scaffolding" attributes, which can be used as the basis for new agent attributes whose values are drawn from model-specific external data in a way that depends on the values of these existing attributes, as we have demonstrated here.
model/weight_loss_clinic_mitigation.fred
¶
The FRED code used to implement the effect of weight loss clinics - i.e., that living within 1.2 km of a weight loss clinic causes agents to change their BMI category with some probability - is in the file weight_loss_clinic_mitigation.fred
.
Here, we declare the existence of a type of Place called a WeightLossClinic
:
WeightLossClinic
places have group agents (has_group_agent = 1
), so that we can ask
these agents for information about the location of the weight loss clinics in the model's conditions. We specify that Household
places also have group agents for the same reason. Households are included in FRED models by default, with locations specified in the Synthetic Population, so the effect of the following block is only to associate group agents with these places.
We use three conditions in weight_loss_clinic_mitigation.fred
to:
- Read the locations of weight loss clinics from
data/weight_loss_clinics.csv
(READ_WEIGHT_LOSS_CLINICS
). - Identify agent households within 1.2 km of a weight loss clinic (
NEAR_WEIGHT_LOSS_CLINIC
). - Determine whether agents who are classified as "overweight" or "obese" will transition to a different BMI category on a 30-day schedule (
WEIGHT_LOSS
).
See documentation of these conditions in weight_loss_clinic_mitigation.fred
for further details of the implementation.
3. Model behavior verification¶
In this section, we run the model and analyze the outputs to verify that the model is behaving as expected. This is a typical part of the modeling process and is important to ensure that the model implementation matches the specification.
import time
# Epistemix package for managing simulations and simulation results in Python
from epx import Job, ModelConfig, SynthPop
Note that the following command that runs the test simulation will take approximately 3min 30s to run. The model will run quicker using a smaller population than that of Erie County, PA. We have chosen Erie County to provide a reasonably large sample of each of the agent age, sex, and race groups used in the BMI status model. This will allow us to demonstrate that it is possible to recover the distributions of BMI statuses specified in data/prob_overweight_obese.csv
from the agents in the synthetic population.
# create a ModelConfig object
weightloss_config = ModelConfig(
synth_pop = SynthPop("US_2010.v5", ["Erie_County_PA"]),
start_date = "2023-01-01",
end_date = "2023-03-31",
model_params = {
"include_weight_loss_clinics" : 1,
"obese_to_overweight_prob" : 0.2,
"overweight_to_healthy_prob" : 0.1
}
)
# create a Job object using the ModelConfig
weightloss_job = Job(
"main.fred",
config=[weightloss_config],
key="weightloss_job",
fred_version="11.0.1",
results_dir="/home/epx/qsg-results"
)
# call the `Job.execute()` method
weightloss_job.execute()
# the following loop idles while we wait for the simulation job to finish
start = time.time()
timeout = 300 # timeout in seconds
idle_time = 3 # time to wait (in seconds) before checking status again
while str(weightloss_job.status) != 'DONE':
if time.time() > start + timeout:
msg = f"Job did not finish within {timeout / 60} minutes."
raise RuntimeError(msg)
time.sleep(idle_time)
str(weightloss_job.status)
3.1 BMI status attributes correctly applied to agents¶
Here, we verify that our external obesity probability data is being correctly read into the simulation and applied to the agents in the appropriate age, sex, and race categories. We can do this examining the agent attribute data that was output by the simulation engine, as specified by the code in the report_agent_attributes.fred
file, and comparing it to the prob_df
input data.
agent_df = (
pd.concat(
[
weightloss_job.results
.table_var(x[0])
.loc[weightloss_job.results
.table_var(x[0]).sim_day.ne(0)]
.drop(columns = ["run_id", "sim_day"])
.rename(columns={"key": "agent_id", "value": x[0]})
.set_index("agent_id")
.astype(x[1])
for x in [
# Specify name of variable and the Python type to store the data as
("agent_age", int),
("agent_race", int),
("agent_sex", int),
("agent_prob_symptomatic", float),
]
],
axis=1,
)
.rename(columns=lambda x: x.replace("agent_", ""))
.rename(columns={"race": "race_id", "sex": "sex_id"})
)
agent_df.head()
The next cell associates agents' ages with the age groups used in prob_df
:
all_ages_with_bounds_df = (
prob_df[["age_lower_bound", "age_upper_bound"]]
.groupby(level=0, as_index=False)
.apply(
lambda df: (
pd.Series(
np.arange(
df.iloc[0]["age_lower_bound"], df.iloc[0]["age_upper_bound"] + 1
)
)
.rename("age")
.to_frame()
.assign(age_lower_bound=df.iloc[0]["age_lower_bound"])
.assign(age_upper_bound=df.iloc[0]["age_upper_bound"])
)
)
.reset_index(drop=True)
)
agent_df = agent_df.merge(all_ages_with_bounds_df, on="age", how="left")
agent_df.head()
In the BMI_STATUS
condition in main.fred
, we specified that agents in the "healthy weight" category have my_prob_symptomatic = 0.66
, agents in the "overweight" category have my_prob_symptomatic = 0.75
, and agents in the "obese" category have my_prob_symptomatic = 0.90
.
We can use these values to infer each agent's BMI status:
bmi_status_map = (
pd.Series(
{
"healthy_weight": 0.66,
"overweight": 0.75,
"obese": 0.90,
}
)
.rename("prob_symptomatic")
.rename_axis("bmi_status")
)
agent_df = agent_df.merge(
bmi_status_map.to_frame().reset_index(), on="prob_symptomatic", how="left"
)
Then, we can calculate probability of each BMI status for each combination of agent age range, sex, and race:
agent_bmi_status_prob_df = (
agent_df.groupby(["race_id", "sex_id", "age_lower_bound", "age_upper_bound"])[
"bmi_status"
]
.apply(
lambda bmi_status_s: (
(bmi_status_s.value_counts() / len(bmi_status_s.index))
.reindex(pd.Index(["healthy_weight", "overweight", "obese"], name="bmi_status"))
.fillna(0)
)
)
.rename("prob_bmi_status")
.to_frame()
.unstack()["prob_bmi_status"]
.rename(columns=lambda x: f"agent_prob_{x}")
.reset_index()
.rename_axis(None, axis=1)
.drop(columns="agent_prob_healthy_weight")
)
agent_bmi_status_prob_df.head()
Now, let's create a DataFrame that allows us to compare the resulting agent health status probabilities to the input probabilities:
bmi_status_prob_comp_df = (
prob_df.assign(
prob_obese=lambda df: (
df["prob_overweight"] * df["prob_obese_given_overweight"]
)
)
.drop(columns="prob_obese_given_overweight")
.merge(
agent_bmi_status_prob_df,
on=["sex_id", "race_id", "age_lower_bound", "age_upper_bound"],
how="right",
)
.assign(
agent_prob_overweight_or_obese=lambda df: (
df["agent_prob_overweight"] + df["agent_prob_obese"]
)
)
.drop(columns="agent_prob_overweight")
.rename(columns={"prob_overweight": "prob_overweight_or_obese"})
.assign(
prob_overweight_or_obese_error=lambda df: (
df["agent_prob_overweight_or_obese"] - df["prob_overweight_or_obese"]
)
)
.assign(prob_obese_error=lambda df: (df["agent_prob_obese"] - df["prob_obese"]))
)
We can plot the distribution of differences between the proportion of agents classified as overweight or obese in the simulation and the expected values across all age group, sex, and race combinations from the initial data. We refer to this difference as probability error.
error_cols = ["prob_overweight_or_obese_error", "prob_obese_error"]
plot_df = (
bmi_status_prob_comp_df[error_cols]
.stack()
.rename("Probability error")
.rename_axis((None, "Error Value"))
.to_frame()
.reset_index(level=1)
.reset_index(drop=True)
)
fig = px.histogram(plot_df, x="Probability error", color="Error Value")
fig.show()
Based on this comparison, we can see that the vast majority of agent classes have probabilities of being categorized as "overweight" or "obese" that are within 10% of their expected values. The remaining outliers can be explained by the fact that the sample size of agents within those categories in the synthetic population is small.
Because agents are randomly assigned health statuses depending on their age, sex, and race, the output in the following cell will vary slightly each time the model is run (if the random seed is different). We have found that typically the agent categories with absolute probability errors > 10% correspond to agents with FRED race codes 3 (AMERICAN_INDIAN), 5 (TRIBAL), and 7 (HAWAIIAN_NATIVE). It is unsurprising that the relative numbers of people in these groups are small in Erie County, PA.
(
bmi_status_prob_comp_df.pipe(
lambda df: df[(df[error_cols[0]].abs() > 0.1) | (df[error_cols[1]].abs() > 0.1)]
)
)
We can also study the number of agents assigned to each BMI status in more detail by summarizing the absolute number of agents in the simulation with each BMI status in each age, sex, and race category. These can be compared directly to the expected number of agents in each category implied by the probabilities provided by Flegal et al., 2012.
We see that the number of agents in the simulation with each BMI status matches very well with the expected number based on the input probabilities when summarized by age, sex, and race. We can therefore conclude that the process we have used in Section 2 to assign health statuses to agents is working as expected.
3.2 Proximity to weight loss clinics causing agents to lose weight¶
Here, we check that agents are transitioning between Obese
-> Overweight
-> Healthy Weight
states in the BMI_STATUS
condition during the course of the model run as expected.
We can create a time series of the count
of agents in each of the above states from the job results associated with the weightloss_job
object that we created at the beginning of this section.
bmi_statuses = ["Healthy_Weight", "Overweight", "Obese"]
bmi_status_names = ["Healthy Weight", "Overweight", "Obese"]
agent_bmi_status_df = pd.concat(
[pd.Series(weightloss_job.results.dates().sim_date).rename("Date")]
+ [
weightloss_job.results.state("BMI_STATUS", x, "count")["count"].rename(y)
for x, y in zip(bmi_statuses, bmi_status_names)
],
axis=1,
).melt(
id_vars="Date",
value_vars=bmi_status_names,
var_name="BMI status",
value_name="Agent count",
)
Then, we can plot the time series of the aggregate number of agents in each state.
We see that the number of "healthy weight" agents increases and the number of "obese" agents decreases stepwise every 30 days as expected. The number of "overweight" agents increases, because the probability of transitioning from the "obese" category to the "overweight" category is 0.2
, whereas the probability of transitioning from the "overweight" category to the "healthy weight" category is 0.1
(see model/parameters.fred
). This implies that a net positive increase in the number of "overweight" agents over time is an expected outcome of the model.
4. Conclusions and next steps¶
In this notebook, we have generated a data set that contains agent attributes that are relevant to a modeling problem, but which aren't present in the default Epistemix Synthetic Population (prob_overweight_obese.csv
) and demonstrated how to incorporate that data into a simulation model. Additionally, we created places that agents interact within a simulation using an external data set. We ran the model and verified that the agents are using the imported data in the expected ways.
The model is now ready to start exploring the effect of a weight loss clinic intervention on population health, especially on relationship between being categorized as "overweight" according to the BMI and respiratory morbidity. For example, readers might:
- Invesigate how increasing the number of weight loss clinics per capita,
clinic_density_per_capita
(see Section 1.2), affects agents' BMI statuses. - Use empirical data to fine tune the rates at which agents transition between BMI status categories (see
obese_to_overweight_prob
andoverweight_to_healthy_prob
inmodel/parameters.fred
).
References¶
Flegal KM, Carroll MD, Kit BK, Ogden CL. Prevalence of Obesity and Trends in the Distribution of Body Mass Index Among US Adults, 1999-2010. JAMA. 2012;307(5):491–497. doi:10.1001/jama.2012.39