
Project description

gdptools

Read the documentation at https://gdptools.readthedocs.io/

gdptools is a Python package for calculating area-weighted statistics and spatial interpolations between gridded datasets and vector geometries. It provides efficient tools for grid-to-polygon, grid-to-line, and polygon-to-polygon interpolations with support for multiple data catalogs and custom datasets.

Figure: Example grid-to-polygon interpolation. A) HUC12 basins for the Delaware River Watershed. B) Gridded monthly water evaporation amount (mm) from the TerraClimate dataset. C) Area-weighted-average interpolation of gridded TerraClimate data to HUC12 polygons.

🚀 Key Features

  • Multiple Interpolation Methods: Grid-to-polygon, grid-to-line, and polygon-to-polygon area-weighted statistics
  • Catalog Integration: Built-in support for the NHGF STAC catalog, the ClimateR catalog, and custom user-defined metadata
  • Flexible Data Sources: Works with any xarray-compatible gridded data and geopandas vector data
  • Scalable Processing: Serial, parallel, and Dask-based computation methods
  • Multiple Output Formats: NetCDF, CSV, and in-memory results
  • Extensive vs Intensive Variables: Proper handling of different variable types in polygon-to-polygon operations
  • Intelligent Spatial Processing: Automatic reprojection to equal-area coordinate systems and efficient spatial subsetting

🌍 Spatial Processing & Performance

gdptools automatically handles complex geospatial transformations to ensure accurate and efficient calculations:

Automatic Reprojection

  • Equal-Area Projections: Both the source gridded data and the target geometries are automatically reprojected to a common equal-area coordinate reference system (default: EPSG:6931, a Lambert azimuthal equal-area projection)
  • Accurate Area Calculations: Equal-area projections ensure that area-weighted statistics are calculated correctly, regardless of the original coordinate systems
  • Flexible CRS Options: Users can specify alternative projection systems via the weight_gen_crs parameter

Efficient Spatial Subsetting

  • Bounding Box Optimization: Gridded datasets are automatically subset to the bounding box of the target geometries plus a buffer
  • Smart Buffering: Buffer size is calculated as twice the maximum grid resolution to ensure complete coverage
  • Memory Efficiency: Only the necessary spatial extent is loaded into memory, dramatically reducing processing time and memory usage for large datasets
# Example: custom equal-area projection and parallel weight generation
from gdptools import WeightGen

weight_gen = WeightGen(
    user_data=my_data,    # any gdptools data class instance (e.g., UserCatData)
    weight_gen_crs=6931,  # equal-area CRS (package default)
    method="parallel"     # leverage spatial subsetting and parallel intersection
)
weights = weight_gen.calculate_weights()

📦 Installation

Via pip

pip install gdptools

Via conda

conda install -c conda-forge gdptools

Development installation

# Clone the repository
git clone https://code.usgs.gov/wma/nhgf/toolsteam/gdptools.git
cd gdptools

# Install uv if not already installed
pip install uv

# Create virtual environment and install dependencies with uv
uv sync --all-extras

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Set up pre-commit hooks
pre-commit install --install-hooks

🔧 Core Components

Data Classes

  • ClimRCatData: Interface with ClimateR catalog datasets
  • NHGFStacData: Interface with NHGF STAC catalog datasets
  • UserCatData: Custom user-defined gridded datasets
  • UserTiffData: GeoTIFF/raster data interface

Processing Classes

  • WeightGen: Calculate spatial intersection weights
  • AggGen: Perform area-weighted aggregations
  • InterpGen: Grid-to-line interpolation along vector paths

🎯 Quick Start

Grid-to-Polygon Aggregation

import geopandas as gpd
import xarray as xr
from gdptools import UserCatData, WeightGen, AggGen

# Load your data
gridded_data = xr.open_dataset("your_gridded_data.nc")
polygons = gpd.read_file("your_polygons.shp")

# Setup data interface
user_data = UserCatData(
    source_ds=gridded_data,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=["temperature", "precipitation"],
    target_gdf=polygons,
    target_crs="EPSG:4326",
    target_id="polygon_id",
    source_time_period=["2020-01-01", "2020-12-31"]
)

# Calculate intersection weights
weight_gen = WeightGen(user_data=user_data, method="parallel")
weights = weight_gen.calculate_weights()

# Perform aggregation
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="parallel",
    agg_writer="netcdf",
    weights=weights
)
result_gdf, result_dataset = agg_gen.calculate_agg()

Using NHGF-STAC Catalogs

from gdptools import NHGFStacData
import pystac

# Access NHGF STAC catalog
catalog = pystac.read_file("https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/")
collection = catalog.get_child("conus404-daily")

user_data = NHGFStacData(
    source_stac_item=collection,
    source_var=["PWAT"],
    target_gdf=watersheds,
    target_id="huc12",
    source_time_period=["1999-01-01", "1999-01-07"]
)

Using ClimateR Catalog

from gdptools import ClimRCatData
import pandas as pd

# Query ClimateR catalog
catalog = pd.read_parquet("https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet")
terraclimate = catalog.query("id == 'terraclim' & variable == 'aet'")

user_data = ClimRCatData(
    source_cat_dict={"aet": terraclimate.to_dict("records")[0]},
    target_gdf=basins,
    target_id="basin_id",
    source_time_period=["1980-01-01", "1980-12-31"]
)

📊 Use Cases & Examples

1. Climate Data Aggregation

  • TerraClimate monthly evapotranspiration to HUC12 basins
  • GridMET daily temperature/precipitation to administrative boundaries
  • CONUS404 high-resolution climate data to custom polygons
  • MERRA-2 reanalysis data to watershed polygons

2. Hydrologic Applications

  • Stream network analysis: Extract elevation profiles along river reaches using 3DEP data
  • Watershed statistics: Calculate basin-averaged climate variables
  • Flow routing: Grid-to-line interpolation for stream network analysis

3. Environmental Monitoring

  • Air quality: Aggregate gridded pollution data to census tracts
  • Land cover: Calculate fractional land use within administrative units
  • Biodiversity: Combine species habitat models with management areas

⚡ Performance Options

Processing Methods

  • "serial": Single-threaded processing (default, reliable)
  • "parallel": Multi-threaded processing (faster for large datasets)
  • "dask": Distributed processing (requires Dask cluster)

Memory Management

  • Chunked processing: Handle large datasets that don't fit in memory (see the example after this list)
  • Caching: Cache intermediate results for repeated operations
  • Efficient data structures: Optimized spatial indexing and intersection algorithms
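
For chunked processing, a simple entry point is to open the source grid lazily with Dask-backed chunks before passing it to a gdptools data class; the file name and chunk sizes below are illustrative only:

import xarray as xr

# Lazy, Dask-backed read: data are only loaded when gdptools subsets and
# aggregates them. Chunk sizes should be tuned to your grid.
data = xr.open_dataset(
    "large_climate_data.nc",
    chunks={"time": 365, "lat": 512, "lon": 512},
)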

Large-scale heuristics

Target polygons      Recommended engine   Notes
< 5k                 "serial"             Fits comfortably in RAM; best for debugging
5k–50k               "parallel"           Run with jobs=-1 and monitor memory usage
> 50k / nationwide   "dask"               Use a Dask cluster and consider 2,500–10,000 polygon batches
  • Persist the gridded dataset once, then iterate through polygon batches to keep memory flat.
  • Write each batch of weights to Parquet/CSV immediately and append at the end, instead of keeping all intersections in memory (see the batching sketch after this list).
  • Avoid intersections=True unless you need the geometries; it multiplies memory requirements.
  • See docs/weight_gen_classes.md ⇢ "Scaling to Nationwide Datasets" for an end-to-end chunking example.
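
A minimal batching sketch along these lines, assuming hypothetical file and column names and that calculate_weights() returns a pandas DataFrame (as the Parquet/CSV note above implies):

from pathlib import Path

import geopandas as gpd
import pandas as pd
import xarray as xr
from gdptools import UserCatData, WeightGen

# Hypothetical inputs: a nationwide polygon layer and one gridded source.
polygons = gpd.read_file("nationwide_basins.gpkg")
gridded = xr.open_dataset("climate.nc", chunks={"time": 365})

batch_size = 5_000
for start in range(0, len(polygons), batch_size):
    batch = polygons.iloc[start:start + batch_size]
    batch_data = UserCatData(
        source_ds=gridded,
        source_crs="EPSG:4326",
        source_x_coord="lon",
        source_y_coord="lat",
        source_t_coord="time",
        source_var=["precipitation"],
        target_gdf=batch,
        target_crs="EPSG:4326",
        target_id="basin_id",
        source_time_period=["2020-01-01", "2020-12-31"],
    )
    weights = WeightGen(user_data=batch_data, method="parallel").calculate_weights()
    # Flush each batch to disk instead of accumulating intersections in memory.
    weights.to_parquet(f"weights_{start:06d}.parquet")

# Recombine after all batches are written.
all_weights = pd.concat(
    pd.read_parquet(p) for p in sorted(Path(".").glob("weights_*.parquet"))
)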

📈 Statistical Methods

Available Statistics

  • "masked_mean": Area-weighted mean (most common)
  • "masked_sum": Area-weighted sum
  • "masked_median": Area-weighted median
  • "masked_std": Area-weighted standard deviation

Variable Types for Polygon-to-Polygon

  • Extensive: Variables that scale with area (e.g., total precipitation, population)
  • Intensive: Variables that don't scale with area (e.g., temperature, concentration)

🔧 Advanced Features

Custom Coordinate Reference Systems

# Use custom projection for accurate area calculations
weight_gen = WeightGen(
    user_data=user_data,
    weight_gen_crs=6931  # equal-area projection used for weight generation (6931 is the package default)
)

Intersection Analysis

# Save detailed intersection geometries for validation
weights = weight_gen.calculate_weights(intersections=True)
intersection_gdf = weight_gen.intersections

Output Formats

# Multiple output options
agg_gen = AggGen(
    user_data=user_data,
    agg_writer="netcdf",      # or "csv", "none"
    out_path="./results/",
    file_prefix="climate_analysis"
)

📚 Documentation & Examples

  • Full Documentation: https://gdptools.readthedocs.io/
  • Example Notebooks: Comprehensive Jupyter notebooks in docs/Examples/
    • STAC catalog integration (CONUS404 example)
    • ClimateR catalog workflows (TerraClimate example)
    • Custom dataset processing (User-defined data)
    • Grid-to-line interpolation (Stream analysis)
    • Polygon-to-polygon aggregation (Administrative boundaries)

Sample Catalog Datasets

gdptools integrates with multiple climate and environmental data catalogs through two primary interfaces:

ClimateR-Catalog

See the complete catalog datasets reference for a comprehensive list of supported datasets including:

  • Climate Data: TerraClimate, GridMET, Daymet, PRISM, MACA, CHIRPS
  • Topographic Data: 3DEP elevation models
  • Land Cover: LCMAP, LCMAP-derived products
  • Reanalysis: GLDAS, NLDAS, MERRA-2
  • Downscaled Projections: BCCA, BCSD, LOCA

NHGF STAC Catalog

See the NHGF STAC datasets reference for cloud-optimized access to:

  • High-Resolution Models: CONUS404 (4km daily meteorology)
  • Observational Data: GridMET, PRISM, Stage IV precipitation
  • Climate Projections: LOCA2, MACA, BCCA/BCSD downscaled scenarios
  • Regional Datasets: Alaska, Hawaii, Puerto Rico, Western US
  • Specialized Products: SSEBop ET, permafrost, sea level rise

User Defined XArray Datasets

For datasets not available through catalogs, gdptools provides UserCatData to work with any xarray-compatible gridded dataset. This is ideal for custom datasets, local files, or specialized data sources.

Basic Usage

import xarray as xr
import geopandas as gpd
from gdptools import UserCatData, WeightGen, AggGen

# Load your custom gridded dataset
custom_data = xr.open_dataset("my_custom_data.nc")
polygons = gpd.read_file("my_polygons.shp")

# Configure UserCatData for your dataset
user_data = UserCatData(
    source_ds=custom_data,           # Your xarray Dataset
    source_crs="EPSG:4326",          # CRS of the gridded data
    source_x_coord="longitude",      # Name of x-coordinate variable
    source_y_coord="latitude",       # Name of y-coordinate variable
    source_t_coord="time",           # Name of time coordinate variable
    source_var=["temperature", "precipitation"],  # Variables to process
    target_gdf=polygons,             # Target polygon GeoDataFrame
    target_crs="EPSG:4326",          # CRS of target polygons
    target_id="polygon_id",          # Column name for polygon identifiers
    source_time_period=["2020-01-01", "2020-12-31"]  # Time range to process
)

Working with Different Data Formats

NetCDF Files

# Single NetCDF file
data = xr.open_dataset("weather_data.nc")

# Multiple NetCDF files
data = xr.open_mfdataset("weather_*.nc", combine='by_coords')

user_data = UserCatData(
    source_ds=data,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=["temp", "precip"],
    target_gdf=watersheds,
    target_crs="EPSG:4326",
    target_id="watershed_id"
)

Zarr Archives

# Cloud-optimized Zarr store
data = xr.open_zarr("s3://bucket/climate_data.zarr")

user_data = UserCatData(
    source_ds=data,
    source_crs="EPSG:3857",  # Web Mercator projection
    source_x_coord="x",
    source_y_coord="y",
    source_t_coord="time",
    source_var=["surface_temp", "soil_moisture"],
    target_gdf=counties,
    target_crs="EPSG:4269",  # NAD83
    target_id="county_fips"
)

Custom Coordinate Systems

# Dataset with non-standard coordinate names
data = xr.open_dataset("model_output.nc")

user_data = UserCatData(
    source_ds=data,
    source_crs="EPSG:32612",         # UTM Zone 12N
    source_x_coord="easting",        # Custom x-coordinate name
    source_y_coord="northing",       # Custom y-coordinate name
    source_t_coord="model_time",     # Custom time coordinate name
    source_var=["wind_speed", "wind_direction"],
    target_gdf=grid_cells,
    target_crs="EPSG:32612",
    target_id="cell_id",
    source_time_period=["2021-06-01", "2021-08-31"]
)

Advanced Configuration

Subset by Geographic Area

# Pre-subset data to region of interest for efficiency
bbox = [-120, 35, -115, 40]  # [west, south, east, north]
regional_data = data.sel(
    longitude=slice(bbox[0], bbox[2]),
    latitude=slice(bbox[1], bbox[3])
)

user_data = UserCatData(
    source_ds=regional_data,
    source_crs="EPSG:4326",
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["evapotranspiration"],
    target_gdf=california_basins,
    target_crs="EPSG:4326",
    target_id="basin_id"
)

Multiple Variables with Different Units

# Handle datasets with multiple variables
user_data = UserCatData(
    source_ds=climate_data,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=[
        "air_temperature",      # Kelvin
        "precipitation_flux",   # kg/m²/s
        "relative_humidity",    # %
        "wind_speed"           # m/s
    ],
    target_gdf=study_sites,
    target_crs="EPSG:4326",
    target_id="site_name",
    source_time_period=["2019-01-01", "2019-12-31"]
)

Processing Workflow

# Complete workflow with UserCatData
user_data = UserCatData(
    source_ds=my_dataset,
    source_crs="EPSG:4326",
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["surface_temperature"],
    target_gdf=administrative_boundaries,
    target_crs="EPSG:4326",
    target_id="admin_code"
)

# Generate intersection weights
weight_gen = WeightGen(
    user_data=user_data,
    method="parallel",           # Use parallel processing
    weight_gen_crs=6931         # Use equal-area projection for accurate weights
)
weights = weight_gen.calculate_weights()

# Perform area-weighted aggregation
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",   # Calculate area-weighted mean
    agg_engine="parallel",
    agg_writer="netcdf",         # Save results as NetCDF
    weights=weights,
    out_path="./results/",
    file_prefix="temperature_analysis"
)

result_gdf, result_dataset = agg_gen.calculate_agg()

Data Requirements

Your xarray Dataset must include the following (a minimal example is sketched after this list):

  • Spatial coordinates: Regularly gridded x and y coordinates
  • Temporal coordinate: Time dimension (if processing time series)
  • Data variables: The variables you want to interpolate
  • CRS information: Coordinate reference system (can be specified manually)
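
A minimal synthetic dataset that satisfies these requirements (variable, coordinate, and dimension names here are placeholders):

import numpy as np
import pandas as pd
import xarray as xr

# Regular 0.25-degree grid, daily time axis, one data variable.
lon = np.arange(-120.0, -110.0, 0.25)
lat = np.arange(35.0, 45.0, 0.25)
time = pd.date_range("2020-01-01", "2020-01-31", freq="D")

ds = xr.Dataset(
    data_vars={
        "temperature": (
            ("time", "lat", "lon"),
            15 + 8 * np.random.rand(time.size, lat.size, lon.size),
        )
    },
    coords={"time": time, "lat": lat, "lon": lon},
)
ds["temperature"].attrs["units"] = "degC"

# The CRS need not be stored in the file; it is supplied via source_crs
# when constructing UserCatData (here it would be "EPSG:4326").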

Common Use Cases

  • Research datasets: Custom model outputs, field measurements
  • Local weather stations: Interpolated station data
  • Satellite products: Processed remote sensing data
  • Reanalysis subsets: Regional extracts from global datasets
  • Ensemble models: Multi-model climate projections

Requirements

Data Formats

  • Gridded Data: Any dataset readable by xarray with spatial coordinates and a defined CRS
  • Vector Data: Any format readable by geopandas
  • Projections: Any CRS readable by pyproj.CRS (a quick input check is sketched after this list)
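
A quick way to confirm inputs meet these requirements before building a data class (file names are placeholders):

import geopandas as gpd
import xarray as xr
from pyproj import CRS

ds = xr.open_dataset("my_grid.nc")        # any xarray-readable gridded dataset
gdf = gpd.read_file("my_polygons.gpkg")   # any geopandas-readable vector layer

print(list(ds.data_vars))                 # candidate values for source_var
print(list(ds.coords))                    # names for source_x/y/t_coord
print(gdf.crs)                            # CRS to pass as target_crs
print(CRS.from_user_input("EPSG:4326"))   # any CRS pyproj can parse is accepted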

Dependencies

  • Python 3.11+
  • xarray (gridded data handling)
  • geopandas (vector data handling)
  • pandas (data manipulation)
  • numpy (numerical operations)
  • shapely (geometric operations)
  • pyproj (coordinate transformations)

🤝 Contributing

We welcome contributions! Please see our development documentation for details on:

  • Development environment setup
  • Testing procedures
  • Code style guidelines
  • Issue reporting

📄 License

This project is in the public domain. See LICENSE for details.

🙏 Acknowledgments

gdptools integrates with several excellent open-source projects:

  • xarray: Multi-dimensional array processing
  • geopandas: Geospatial data manipulation
  • HyRiver: Hydrologic data access (pynhd, pygeohydro)
  • STAC: Spatiotemporal asset catalogs
  • ClimateR: Climate data catalogs

History

See the project changelog for a full history of changes.

Credits

This project was generated from @hillc-usgs's Pygeoapi Plugin Cookiecutter template.


Questions? Open an issue on our GitLab repository or check the documentation for detailed examples and API reference.

Release history

This version: 0.3.6

Download files

Download the file for your platform.

Source Distribution

gdptools-0.3.6.tar.gz (15.1 MB)

Built Distribution

gdptools-0.3.6-py3-none-any.whl (101.7 kB)

File details

Details for the file gdptools-0.3.6.tar.gz.

File metadata

  • Download URL: gdptools-0.3.6.tar.gz
  • Size: 15.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.19 {"installer":{"name":"uv","version":"0.9.19","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for gdptools-0.3.6.tar.gz

  • SHA256: d4ac671df627430c3e781df037d4e89c3381d071937bb70c7f12752b86916bdb
  • MD5: 5182477210b1b057d0ee485dc45d9a61
  • BLAKE2b-256: f7d0a5d33444bbef69ecfbcfa52ea5dacbab347c525402459230acc2a6035fff


File details

Details for the file gdptools-0.3.6-py3-none-any.whl.

File metadata

  • Download URL: gdptools-0.3.6-py3-none-any.whl
  • Size: 101.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.19 {"installer":{"name":"uv","version":"0.9.19","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for gdptools-0.3.6-py3-none-any.whl

  • SHA256: 2d34c11bbe27a31da652c8981377df0c07b879702af333588c361df22000cb54
  • MD5: 289c599022b44da19e8d9dcd8df86888
  • BLAKE2b-256: a79b66f766bf66cda3f4c6421a590f7ee6ef9a9c56d25d8dd98195ed3fa36e01

