Skip to main content

A package for finding the number of people residing near environmental hazards

Project description

Pop_Exp: functions to assess the number of people exposed to an environmental hazard

Pop_Exp is an open-source Python package designed to help environmental epidemiologists reproducibly assess population-level exposure to environmental hazards based on residential proximity. Functions in the package are designed to be fast, memory-efficient, and easy to use.

I. Overview

Pop_Exp identifies the number of people living near an environmental hazard or set of environmental hazards by overlaying buffered environmental hazard geospatial data with gridded population data, in order to count the number of people that live within the affected area.

Pop_Exp can estimate either (a) the number of people living within the buffer distance of each hazard (e.g., the number of people living within 10 km of each individual wildfire disaster burned area in 2018 in California) or (b) the number of people living within the buffer distance of any of the cumulative set of hazards (e.g., the number of people living within 10 km of any wildfire disaster burned area in 2018 in California). These estimates can be broken down by additional spatial units such as census tracts, counties, or ZCTAs. For example, Pop_Exp can find the number of people living within 10 km of any wildfire disaster burned area in 2018 by ZCTA, and calculate spatial unit denominators such as the number of residents in each ZCTA. Pop_Exp could also be used to assess exposure based on residential proximity to hurricanes, gas wells, emissions sources, or other environmental hazards mapped in geospatial data.

A tutorial on how to use all the functions in Pop_Exp is available on GitHub, at https://github.com/heathermcb/Pop_Exp/demo. Please see the tutorial for a detailed explanation on how to use the package functions for exposure assessment.

II. Available functions

There are three functions available in Pop_Exp:

1. find_num_people_affected: Estimate the number of people living within a buffer distance of an environmental hazard or set of environmental hazards.

Inputs:

  • Environmental hazard geospatial data: Path to a geospatial data file (GeoJSON or GeoParquet file) with geometries describing environmental hazards (e.g., wildfire disaster boundaries in the US 2015-2020, active oil and gas well coordinates in Texas in 2018, or global data of all tropical cyclone paths in 2020). Hazards can be any kind of geometry except a geometry collection. Columns must include:
    • ID_climate_hazard: Unique identifier for each hazard.
    • geometry: Geometry of the hazard.
    • buffer_dist: Buffer distance to be applied to each hazard. This function buffers each hazard with the corresponding buffer distance passed by the user. If all values in buffer_dist are the same, the same buffer distance applied to all hazards. The user can populate the column buffer_dist with different values for each hazards, based on attributes of the hazards or hazard size. If all buffer_dist values are 0, no buffer will be applied to the hazards.
  • Population raster data: Gridded population dataset as a raster file (e.g., GHSL population dataset which provides coverage for the US from 1990-2020 available for download here: https://human-settlement.emergency.copernicus.eu/ghs_pop.php).
  • Argument 'by_unique_hazard: This additional argument determines how find_num_people_affected will count people living within buffered hazard boundaries.
    • by_unique_hazard (required, True/False): When this parameter is set to True, find_num_people_affected will compute the population affected by each hazard separately. When True, if hazards overlap with each other, the same people may be counted as exposed to two or more distinct hazards (e.g., double counted or more). When this parameter is set to False, find_num_people_affected will compute the total number of people exposed to any hazard in the set of hazards passed to the function. See Key Features for more information. There is no default.

Outputs:

  • Dataframe containing:
    • ID_climate_hazard: If by_unique_hazard was True, this will be a column containing the unique identifiers for each hazard. If by_unique_hazard was False, this will be a list of unique IDs for hazards that did not overlap with other hazards, and concatenated strings of the IDs of any groups of overlapping hazards.
    • num_people_affected: Number of people living within the buffer distance of each hazard or group of hazards, where each row is a single hazard ID or a concatenated list of hazard IDs. Again, if by_unique_hazard was True, the output data will contain one row for every ID_climate_hazard. This means that the rows will be mutually non-exclusive and people may be double counted if they are in the buffered area of two or more different hazards. If by_unique_hazard was False, the output will contain concatenated ID_climate_hazards wherever hazards or hazard buffers overlapped and people will be counted once if they were in the area of the group of two or more overlapping buffered hazards. See Key Features for a detailed explanation.

Key Features:

  • For overlapping hazard geometries or buffered hazard geometries, the user can choose from two options using the argument by_unique_hazard. When this parameter is set to True, find_num_people_affected estimates the population affected by each hazard separately; if hazards overlap with each other, the same people may be counted as exposed to two or more distinct hazards (i.e., double counted or more). When it is set to False, the population affected by overlapping hazards will be combined and people will be counted once if they were in the area of two or more overlapping hazards. The IDs of any overlapping hazards will be concatenated in the output, and the number of people living within the union of those buffered hazards will be returned.
  • The user can select the gridded population dataset they want to use based on the population and time period of interest.
  • find_num_people_affected uses the buffer distances created and passed by the user to buffer the hazard geometries in the best Universal Transverse Mercator projection for each environmental hazard, based on the hazard centroid latitude and longitude, ensuring the most accurate buffered area is created and minimizing distortion from wild map projections.
  • find_num_people_affected masks the gridded residential population raster with the buffered hazard geometries using partial pixel masking, using the package exactextract. This means if the hazard geometry overlaps with half of a given pixel, when all pixel values within the buffered hazard geometry are summed, half of the given pixel value is added to the sum. This produces the most accurate count of people affected, in contrast to centroid masking or including the entire value of any pixels touched by the hazard geometry.
  • Raster masking is done sequentially in exactextract for each individual geometry or set of overlapping geometries in the set of hazards to minimize working memory use and maximize computation speed.

2. find_num_people_affected_by_geo: Estimate the number of people living within a buffer distance of an environmental hazard or set of environmental hazards by additional geographies (e.g., census tract, ZCTA).

This function is very similar to find_num_people_affected, but returns output by an additional geography (e.g., ZCTAs, counties, census tracts). It provides the number of people living near a buffered hazard or set of buffered hazards in each ZCTA, county, etc. It requires an additional input of a geospatial dataset of additional spatial unit (eg. ZCTA, census tract) geometries.

Inputs:

  • Environmental hazard geospatial data: Path to a geospatial data file (GeoJSON or GeoParquet file) with geometries describing environmental hazards (e.g., wildfire disaster boundaries in the US 2015-2020, active oil and gas well coordinates in Texas in 2018, or global data of all tropical cyclone paths in 2020). Hazards can be any kind of geometry except a geometry collection. Columns must include:
    • ID_climate_hazard: Unique identifier for each hazard.
    • geometry: Geometry of the hazard.
    • buffer_dist: Buffer distance to be applied to each hazard. This function buffers each hazard with the corresponding buffer distance passed by the user. If all values in buffer_dist are the same, the same buffer distance applied to all hazards. The user can populate the column buffer_dist with different values for each hazards, based on attributes of the hazards or hazard size. If all buffer_dist values are 0, no buffer will be applied to the hazards.
  • Population raster data: Gridded population dataset as a raster file (e.g., GHSL population dataset which provides coverage for the US from 1990-2020 available for download here: https://human-settlement.emergency.copernicus.eu/ghs_pop.php).
  • Geographic boundaries: Path to a geospatial data file containing boundaries an additional spatial unit (e.g., ZCTAs, counties, census tracts). Columns must include:
    • ID_spatial_unit: Unique identifier for each geography
    • geometry: Geometry of the spatial units
  • Argument 'by_unique_hazard: This additional argument determines how find_num_people_affected will count people living within buffered hazard boundaries.
    • by_unique_hazard (required, True/False): When this parameter is set to True, find_num_people_affected will compute the population affected by each hazard separately. When True, if hazards overlap with each other, the same people may be counted as exposed to two or more distinct hazards (e.g., double counted or more). When this parameter is set to False, find_num_people_affected will compute the total number of people exposed to any hazard in the set of hazards passed to the function. See Key Features for more information. There is no default.

Outputs:

  • Dataframe containing:
    • ID_climate_hazard: Unique identifier for each hazard
    • ID_spatial_unit: Unique identifier for each additional spatial unit (e.g., ZCTA, county, census tract)
    • num_people_affected: Number of people living within the buffer distance of each hazard or group of hazards, where each row is a single hazard ID or a concatenated list of hazard IDs. Again, if by_unique_hazard was True, the output data will contain one row for every ID_climate_hazard. This means that the rows will be mutually non-exclusive and people may be double counted if they are in the buffered area of two or more different hazards. If by_unique_hazard was False, the output will contain concatenated ID_climate_hazards wherever hazards or hazard buffers overlapped and people will be counted once if they were in the area of the group of two or more overlapping buffered hazards. See Key Features for a detailed explanation.

Key Features:

This function returns the count of people for each ID_climate_hazard - ID_spatial_unit combination. For example, a given row may contain the number of people affected by ID_climate_hazard 123 in ZCTA 98107. by_unique_hazard works the same way here as above in find_num_people_affected, as does hazard buffering and partial pixel masking.

3. find_number_of_people_residing_by_geo: Estimate the number of people living within additional spatial units such as ZCTAs, counties, or census tracts.

This function is designed to be used with find_num_people_affected_by_geo to produce denominators for spatial units.

Inputs:

  • Geographic boundaries: Path to a geospatial data file containing boundaries an additional spatial unit (e.g., ZCTAs, counties, census tracts). Columns must include:
    • ID_spatial_unit: Unique identifier for each geography
    • geometry: Geometry of the spatial units
  • Population raster data: Gridded population dataset as a raster file (e.g., GHSL population dataset which provides coverage for the US from 1990-2020 available for download here: https://human-settlement.emergency.copernicus.eu/ghs_pop.php).

Outputs:

  • Dataframe containing:
    • ID_spatial_unit: Unique identifier for each additional spatial unit (e.g., ZCTA, county, census tract)
    • num_people_residing: The number of people living within the boundaries of each spatial unit according to the population raster passed to the function.

III. Requirements

  1. Python If you do not already have Python, you can install Python at https://www.python.org/downloads/. We recommend programming in Python with VS Code. We've provided a virtual environment containing the requirements of Pop_Exp at https://github.com/heathermcb/Pop_Exp.

  2. Inputs You need:

    • A path to a hazard geospatial data file (e.g., oil wells, wildfires, floods, etc.) with specific columns.

      columns: ID_climate_hazard, geometry, buffer_dist filetype: GeoParquet or GeoJSON

    • A path to a gridded population dataset

      format: raster, must contain a CRS

    • A path to additional geographies geospatial data file (optional)

      columns: ID_spatial_unit, geometry filetype: GeoParquet or GeoJSON

IV. How to run

Please see the tutorial for a detailed explanation of how to use all functions in Pop_Exp, at https://github.com/heathermcb/Pop_Exp/tree/main/demo/demo_code.

You can run the function in Python by calling the function with the appropriate arguments. Note that below hazard_gdf, pop_raster, and geo_gdf are the paths to the respective files. Example calls:

python find_num_people_affected(hazard_gdf, pop_raster, buffer_dist=1000)
python find_num_people_affected_by_geo(hazard_gdf, pop_raster, geo_gdf, buffer_dist=1000)

V. Additional must-reads on how Pop_Exp works

Written in plainer language!

Temporality: Hazard data is for a specific time period. Maybe you have fracking-related quakes for 2010, or wildfires for 2019. Pop_Exp requires you to pick the gridded population raster that you want to use to calculate how many people live near those hazards yourself, so pick one that corresponds to the correct time period. For example, if you have hazard data from 2009-2021, you might not want to use the same population dataset for all of your environmental hazards. You might want to call the function several times on subsets of your data. Maybe you want to call it for each year between 2009 and 2015 using the GHSL population raster from 2010, and then again for each year between 2016 and 2021 with the population raster for 2020. This is up to you to handle.

Overlapping hazards: Depending on your dataset, you might have some overlapping hazards. Maybe you are looking at oil wells and you want to know how many people live within 1 km of oil wells in the US. Because there are often multiple wells next to each other, there may be people who live within 1 km of multiple wells. The parameter by_unique_hazard allows you to specify how you want to count people. If by_unique_hazard=False, the function counts people ONCE if they are within the buffer distance of any hazard, and returns output with overlapping hazards grouped together. It doesn't tell you if people are within the buffer distance of multiple hazards. If by_unique_hazard=True, it tells you how many people are within the buffer of each hazard, and double-counts people who are within the buffer of two or more hazards.

This means if you have multiple years or months of data, even if you're using the same population dataset, you might want to do separate function runs for separate years or months. For example, if you have wildfire data from 2015-2020, and you want to know how many people were affected by fires by ZCTA by year, if you throw all the data in this function at once with by_unique_hazard=False, if a fire burned ZCTA 10032 in 2015 and in 2020, all you will know from the function output is the count of people who were within the buffer distance of EITHER fire perimeter in that ZCTA. That may not be what you want. The results will not be broken down by year. So you could instead run the function once for each year 2015-2020, to determine how many people were affected by any fire by year.

VI. Additional details on how these functions work if you're interested

Here are a list of helpers and main functions in the source code of this package. Whether these helpers are all called by each main function and in which order changes based on what's being calculated.

1: prep_geographies This function reads in a climate hazard geospatial data file and spatial unit geospatial data file (counties, zcta, etc.) if applicable, in GeoParquet format or GeoJSON format. This file must contains a string column called ID_climate_hazard, a numeric column called 'buffer_dist', and a geometry column, and nothing else. This function makes geometries valid, reprojects to WGS84 projection, and if the data is hazard data, adds a column indicating the best UTM projection to the data frame.

  • Input: dataframe with 3 columns 1.ID_climate_hazard or ID_spatial_unit 2. buffer_dist 3. geometry
  • Output: dataframe with 4 columns
    1. ID column: ID_climate_hazard or ID_spatial_unit
    2. original hazard geometry (geometry)
    3. buffer distance (buffer_dist)
    4. column with best UTM projection for each hazard (not present for spatial unit dataframes) (utm_projection)

2: add_buffered_geom_col This function reprojects each hazard into the best UTM zone based on the centroid location and then buffers the provided geometries. The buffer distance is provided in the buffer_dist column. If the buffer distance is 0, the function does not buffer the geometry. After the buffer distance is added, the function reprojects the geometry back to WGS84.

  • Input: dataframe with 3 columns
    1. ID_climate_hazard
    2. buffer_dist
    3. geometry
    4. utm_projection
  • Output: dataframe with 5 columns
    1. ID_climate_hazard
    2. buffer_dist
    3. geometry
    4. utm_projection
    5. buffered hazard geometry: buffered_hazard

3: combine_overlapping_geometries This function combines any overlapping hazard geometries into a single geometry, using the GeoPandas function unary_union. This is necessary when calling find_num_people_affected or find_num_people_affected_by_geo with by_unique_hazard=False. This is when the user is aiming to count people once if they are living within more than one buffered hazard boundary. This function retains the hazard ID column, but concatenates any IDs for hazards that were overlapping.

  • Input: dataframe with at least 2 columns:
    1. ID_climate_hazard
    2. geometry
  • Output: dataframe with 2 columns
    1. concatenated climate hazard IDs: ID_climate_hazard
    2. combined geoms: geometry
  1. mask_raster_patial_pixel This function mutates a dataframe to add a column for the population of each buffered hazard area. This function opens the population raster and masks each buffered hazard geometry or group of geometries, and sums the raster values to find the residential population of the buffered hazard area. It adds this sum to the dataframe as a new column called num_people_affected.
  • Input: dataframe with at least 2 columns:
    1. ID_climate_hazard
    2. geometry
  • Output: dataframe with 2 columns
    1. concatenated climate hazard IDs: ID_climate_hazard
    2. combined geoms: geometry
    3. population within each buffered hazard area: num_people_affected

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pop_exp-0.1.8.tar.gz (14.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pop_exp-0.1.8-py3-none-any.whl (11.3 kB view details)

Uploaded Python 3

File details

Details for the file pop_exp-0.1.8.tar.gz.

File metadata

  • Download URL: pop_exp-0.1.8.tar.gz
  • Upload date:
  • Size: 14.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.12.6 Darwin/22.6.0

File hashes

Hashes for pop_exp-0.1.8.tar.gz
Algorithm Hash digest
SHA256 472e8667c25d28179440f31eb6eaaf4a5969fa3a70a9747b0328b6c90052a919
MD5 c88b557568fe12ced6f6fa25350f51b3
BLAKE2b-256 7b42c0d5dc5a5802d15dd207600f214207439c5a316260e6438dc82674f0f80f

See more details on using hashes here.

File details

Details for the file pop_exp-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: pop_exp-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 11.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.12.6 Darwin/22.6.0

File hashes

Hashes for pop_exp-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 926666eebb93db30d7ea71ef6da48c6d73c6d08524701fa6425450f250350e76
MD5 2551039fc1f1c711d3db544c0eedbf6b
BLAKE2b-256 82bcf93e7cebad55f8c6b531829075107745c740552fd468c2a72bf43827acfd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page