gdptools
gdptools is a Python package for calculating area-weighted statistics and spatial interpolations between gridded datasets and vector geometries. It provides efficient tools for grid-to-polygon, grid-to-line, and polygon-to-polygon interpolations with support for multiple data catalogs and custom datasets.
Figure: Example grid-to-polygon interpolation. A) HUC12 basins for Delaware River Watershed. B) Gridded monthly water evaporation amount (mm) from TerraClimate dataset. C) Area-weighted-average interpolation of gridded TerraClimate data to HUC12 polygons.
🚀 Key Features
- Multiple Interpolation Methods: Grid-to-polygon, grid-to-line, and polygon-to-polygon area-weighted statistics
- Catalog Integration: Built-in support for STAC catalogs (NHGF, ClimateR) and custom metadata
- Flexible Data Sources: Works with any xarray-compatible gridded data and geopandas vector data
- Scalable Processing: Serial, parallel, and Dask-based computation methods
- Multiple Output Formats: NetCDF, CSV, and in-memory results
- Extensive vs Intensive Variables: Proper handling of different variable types in polygon-to-polygon operations
- Intelligent Spatial Processing: Automatic reprojection to equal-area coordinate systems and efficient spatial subsetting
🌍 Spatial Processing & Performance
gdptools automatically handles complex geospatial transformations to ensure accurate and efficient calculations:
Automatic Reprojection
- Equal-Area Projections: Both source gridded data and target geometries are automatically reprojected to a common equal-area coordinate reference system (default: EPSG:6931)
- Accurate Area Calculations: Equal-area projections ensure that area-weighted statistics are calculated correctly, regardless of the original coordinate systems
- Flexible CRS Options: Users can specify alternative projection systems via the weight_gen_crs parameter
Efficient Spatial Subsetting
- Bounding Box Optimization: Gridded datasets are automatically subset to the bounding box of the target geometries plus a buffer
- Smart Buffering: Buffer size is calculated as twice the maximum grid resolution to ensure complete coverage, as sketched below
- Memory Efficiency: Only the necessary spatial extent is loaded into memory, dramatically reducing processing time and memory usage for large datasets
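The buffered subsetting can be pictured with a short sketch. This is an illustration of the idea only, not the gdptools implementation; it assumes the geometries are already in the grid's CRS and that both coordinates are stored in ascending order.
# Illustration of buffered bounding-box subsetting (not gdptools internals)
import numpy as np

def buffered_subset(ds, gdf, x="lon", y="lat"):
    dx = float(np.abs(np.diff(ds[x].values)).max())
    dy = float(np.abs(np.diff(ds[y].values)).max())
    buf = 2.0 * max(dx, dy)  # twice the coarsest grid resolution
    minx, miny, maxx, maxy = gdf.total_bounds
    return ds.sel({x: slice(minx - buf, maxx + buf), y: slice(miny - buf, maxy + buf)})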
# Example: custom equal-area projection and parallel processing
from gdptools import WeightGen, AggGen
weight_gen = WeightGen(
    user_data=my_data,
    method="parallel",      # leverage spatial optimizations
    weight_gen_crs=6931,    # default equal-area CRS
)
weights = weight_gen.calculate_weights()
agg = AggGen(
    user_data=my_data,
    stat_method="masked_mean",
    agg_engine="parallel",
    agg_writer="none",
    weights=weights,
)
results_gdf, results = agg.calculate_agg()
📦 Installation
Via pip
pip install gdptools
Via conda
conda install -c conda-forge gdptools
Development installation
# Clone the repository
git clone https://code.usgs.gov/wma/nhgf/toolsteam/gdptools.git
cd gdptools
# Install uv if not already installed
pip install uv
# Create virtual environment and install dependencies with uv
uv sync --all-extras
# Activate the virtual environment
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Set up pre-commit hooks
pre-commit install --install-hooks
🔧 Core Components
Data Classes
- ClimRCatData: Interface with ClimateR catalog datasets
- NHGFStacData: Interface with NHGF STAC catalog datasets
- UserCatData: Custom user-defined gridded datasets
- UserTiffData: GeoTIFF/raster data interface
Processing Classes
- WeightGen: Calculate spatial intersection weights
- AggGen: Perform area-weighted aggregations
- InterpGen: Grid-to-line interpolation along vector paths
🎯 Quick Start
Grid-to-Polygon Aggregation
import geopandas as gpd
import xarray as xr
from gdptools import UserCatData, WeightGen, AggGen
# Load your data
gridded_data = xr.open_dataset("your_gridded_data.nc")
polygons = gpd.read_file("your_polygons.shp")
# Setup data interface
user_data = UserCatData(
    source_ds=gridded_data,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=["temperature", "precipitation"],
    target_gdf=polygons,
    target_crs="EPSG:4326",
    target_id="polygon_id",
    source_time_period=["2020-01-01", "2020-12-31"]
)
# Calculate intersection weights
weight_gen = WeightGen(user_data=user_data, method="parallel")
weights = weight_gen.calculate_weights()
# Perform aggregation
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="parallel",
    agg_writer="netcdf",
    weights=weights
)
result_gdf, result_dataset = agg_gen.calculate_agg()
Using NHGF-STAC Catalogs
from gdptools import NHGFStacData
import pystac
# Access NHGF STAC catalog
catalog = pystac.read_file("https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/")
collection = catalog.get_child("conus404-daily")
user_data = NHGFStacData(
    source_stac_item=collection,
    source_var=["PWAT"],
    target_gdf=watersheds,
    target_id="huc12",
    source_time_period=["1999-01-01", "1999-01-07"]
)
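From here the workflow follows the same pattern as the Quick Start above: pass the NHGFStacData object to WeightGen and AggGen. A minimal continuation, reusing only calls already shown in this README:
weights = WeightGen(user_data=user_data, method="serial").calculate_weights()
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="serial",
    agg_writer="none",
    weights=weights,
)
result_gdf, result_dataset = agg_gen.calculate_agg()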
Using ClimateR Catalog
from gdptools import ClimRCatData
import pandas as pd
# Query ClimateR catalog
catalog = pd.read_parquet("https://github.com/mikejohnson51/climateR-catalogs/releases/download/June-2024/catalog.parquet")
terraclimate = catalog.query("id == 'terraclim' & variable == 'aet'")
user_data = ClimRCatData(
    source_cat_dict={"aet": terraclimate.to_dict("records")[0]},
    target_gdf=basins,
    target_id="basin_id",
    source_time_period=["1980-01-01", "1980-12-31"]
)
📊 Use Cases & Examples
1. Climate Data Aggregation
- TerraClimate monthly evapotranspiration to HUC12 basins
- GridMET daily temperature/precipitation to administrative boundaries
- CONUS404 high-resolution climate data to custom polygons
- MERRA-2 reanalysis data to watershed polygons
2. Hydrologic Applications
- Stream network analysis: Extract elevation profiles along river reaches using 3DEP data
- Watershed statistics: Calculate basin-averaged climate variables
- Flow routing: Grid-to-line interpolation for stream network analysis
3. Environmental Monitoring
- Air quality: Aggregate gridded pollution data to census tracts
- Land cover: Calculate fractional land use within administrative units
- Biodiversity: Combine species habitat models with management areas
⚡ Performance Options
Processing Methods
"serial": Single-threaded processing (default, reliable)"parallel": Multi-threaded processing (faster for large datasets)"dask": Distributed processing (requires Dask cluster)
Memory Management
- Chunked processing: Handle large datasets that don't fit in memory
- Caching: Cache intermediate results for repeated operations
- Efficient data structures: Optimized spatial indexing and intersection algorithms
Large-scale heuristics
| Target polygons | Recommended engine | Notes |
|---|---|---|
| < 5k | "serial" | Fits comfortably in RAM; best for debugging |
| 5k–50k | "parallel" | Run with jobs=-1 and monitor memory usage |
| > 50k / nationwide | "dask" | Use a Dask cluster and consider 2,500–10,000 polygon batches |

- Persist the gridded dataset once, then iterate through polygon batches to keep memory flat (see the sketch below).
- Write each batch of weights to Parquet/CSV immediately; append at the end instead of keeping all intersections in memory.
- Avoid intersections=True unless you need the geometries; it multiplies memory requirements.
- See docs/weight_gen_classes.md ⇢ "Scaling to Nationwide Datasets" for an end-to-end chunking example.
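A minimal sketch of the batching pattern described above, assuming the UserCatData/WeightGen interfaces shown elsewhere in this README and that calculate_weights() returns a pandas-compatible table (as the Parquet/CSV advice suggests); gridded_data and the coordinate, variable, and ID names are placeholders for your own.
# Sketch: generate weights for nationwide polygons in fixed-size batches
import geopandas as gpd
from gdptools import UserCatData, WeightGen

polygons = gpd.read_file("nationwide_polygons.shp")
batch_size = 5_000
for start in range(0, len(polygons), batch_size):
    batch = polygons.iloc[start:start + batch_size]
    user_data = UserCatData(
        source_ds=gridded_data,   # opened once, reused for every batch (placeholder dataset)
        source_crs="EPSG:4326",
        source_x_coord="lon",
        source_y_coord="lat",
        source_t_coord="time",
        source_var=["temperature"],
        target_gdf=batch,
        target_crs="EPSG:4326",
        target_id="polygon_id",
        source_time_period=["2020-01-01", "2020-12-31"],
    )
    weights = WeightGen(user_data=user_data, method="parallel").calculate_weights()
    weights.to_parquet(f"weights_{start:07d}.parquet")  # persist per batch, combine later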
📈 Statistical Methods
Available Statistics
"masked_mean": Area-weighted mean (most common)"masked_sum": Area-weighted sum"masked_median": Area-weighted median"masked_std": Area-weighted standard deviation
Variable Types for Polygon-to-Polygon
- Extensive: Variables that scale with area (e.g., total precipitation, population)
- Intensive: Variables that don't scale with area (e.g., temperature, concentration); the numeric sketch below illustrates how the two are treated differently
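The distinction matters because the two types are combined differently. The following is generic areal-interpolation arithmetic, not gdptools' internal code: intensive values are averaged with weights normalized by the target polygon's area, while extensive values are apportioned by the fraction of each source polygon that overlaps the target.
# Two source polygons mapped onto two target polygons
import numpy as np

inter_area = np.array([[2.0, 1.0],    # overlap areas: rows = source, cols = target
                       [0.0, 3.0]])
source_area = inter_area.sum(axis=1)  # assumes sources are fully covered by targets
target_area = inter_area.sum(axis=0)  # assumes targets are fully covered by sources
values = np.array([10.0, 20.0])       # one value per source polygon

# Intensive (e.g., temperature): area-weighted mean over each target polygon
intensive = (inter_area * values[:, None]).sum(axis=0) / target_area

# Extensive (e.g., population): apportion each source total by overlap fraction
extensive = ((values / source_area)[:, None] * inter_area).sum(axis=0)
In this toy example the extensive results still sum to the source total (30), while the intensive results stay within the range of the source values.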
🔧 Advanced Features
Custom Coordinate Reference Systems
# Use custom projection for accurate area calculations
weight_gen = WeightGen(
    user_data=user_data,
    weight_gen_crs=6931  # equal-area CRS used for weight generation (EPSG:6931 is the default)
)
Intersection Analysis
# Save detailed intersection geometries for validation
weights = weight_gen.calculate_weights(intersections=True)
intersection_gdf = weight_gen.intersections
Output Formats
# Multiple output options
agg_gen = AggGen(
    user_data=user_data,
    agg_writer="netcdf",  # or "csv", "none"
    out_path="./results/",
    file_prefix="climate_analysis"
)
📚 Documentation & Examples
- Full Documentation: https://gdptools.readthedocs.io/
- Example Notebooks: Comprehensive Jupyter notebooks in docs/Examples/:
  - STAC catalog integration (CONUS404 example)
  - ClimateR catalog workflows (TerraClimate example)
  - Custom dataset processing (User-defined data)
  - Grid-to-line interpolation (Stream analysis)
  - Polygon-to-polygon aggregation (Administrative boundaries)
Sample Catalog Datasets
gdptools integrates with multiple climate and environmental data catalogs through two primary interfaces:
ClimateR-Catalog
See the complete catalog datasets reference for a comprehensive list of supported datasets including:
- Climate Data: TerraClimate, GridMET, Daymet, PRISM, MACA, CHIRPS
- Topographic Data: 3DEP elevation models
- Land Cover: LCMAP, LCMAP-derived products
- Reanalysis: GLDAS, NLDAS, MERRA-2
- Downscaled Projections: BCCA, BCSD, LOCA
NHGF STAC Catalog
See the NHGF STAC datasets reference for cloud-optimized access to:
- High-Resolution Models: CONUS404 (4km daily meteorology)
- Observational Data: GridMET, PRISM, Stage IV precipitation
- Climate Projections: LOCA2, MACA, BCCA/BCSD downscaled scenarios
- Regional Datasets: Alaska, Hawaii, Puerto Rico, Western US
- Specialized Products: SSEBop ET, permafrost, sea level rise
User Defined XArray Datasets
For datasets not available through catalogs, gdptools provides UserCatData to work with any xarray-compatible gridded dataset. This is ideal for custom datasets, local files, or specialized data sources.
Basic Usage
import xarray as xr
import geopandas as gpd
from gdptools import UserCatData, WeightGen, AggGen
# Load your custom gridded dataset
custom_data = xr.open_dataset("my_custom_data.nc")
polygons = gpd.read_file("my_polygons.shp")
# Configure UserCatData for your dataset
user_data = UserCatData(
    source_ds=custom_data,           # Your xarray Dataset
    source_crs="EPSG:4326",          # CRS of the gridded data
    source_x_coord="longitude",      # Name of x-coordinate variable
    source_y_coord="latitude",       # Name of y-coordinate variable
    source_t_coord="time",           # Name of time coordinate variable
    source_var=["temperature", "precipitation"],  # Variables to process
    target_gdf=polygons,             # Target polygon GeoDataFrame
    target_crs="EPSG:4326",          # CRS of target polygons
    target_id="polygon_id",          # Column name for polygon identifiers
    source_time_period=["2020-01-01", "2020-12-31"]  # Time range to process
)
Working with Different Data Formats
NetCDF Files
# Single NetCDF file
data = xr.open_dataset("weather_data.nc")
# Multiple NetCDF files
data = xr.open_mfdataset("weather_*.nc", combine='by_coords')
user_data = UserCatData(
    source_ds=data,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=["temp", "precip"],
    target_gdf=watersheds,
    target_crs="EPSG:4326",
    target_id="watershed_id"
)
Zarr Archives
# Cloud-optimized Zarr store
data = xr.open_zarr("s3://bucket/climate_data.zarr")
user_data = UserCatData(
    source_ds=data,
    source_crs="EPSG:3857",  # Web Mercator projection
    source_x_coord="x",
    source_y_coord="y",
    source_t_coord="time",
    source_var=["surface_temp", "soil_moisture"],
    target_gdf=counties,
    target_crs="EPSG:4269",  # NAD83
    target_id="county_fips"
)
Custom Coordinate Systems
# Dataset with non-standard coordinate names
data = xr.open_dataset("model_output.nc")
user_data = UserCatData(
    source_ds=data,
    source_crs="EPSG:32612",      # UTM Zone 12N
    source_x_coord="easting",     # Custom x-coordinate name
    source_y_coord="northing",    # Custom y-coordinate name
    source_t_coord="model_time",  # Custom time coordinate name
    source_var=["wind_speed", "wind_direction"],
    target_gdf=grid_cells,
    target_crs="EPSG:32612",
    target_id="cell_id",
    source_time_period=["2021-06-01", "2021-08-31"]
)
Advanced Configuration
Subset by Geographic Area
# Pre-subset data to region of interest for efficiency
bbox = [-120, 35, -115, 40] # [west, south, east, north]
regional_data = data.sel(
    longitude=slice(bbox[0], bbox[2]),
    latitude=slice(bbox[1], bbox[3])  # reverse these bounds if latitude is stored in descending order
)
user_data = UserCatData(
    source_ds=regional_data,
    source_crs="EPSG:4326",
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["evapotranspiration"],
    target_gdf=california_basins,
    target_crs="EPSG:4326",
    target_id="basin_id"
)
Multiple Variables with Different Units
# Handle datasets with multiple variables
user_data = UserCatData(
    source_ds=climate_data,
    source_crs="EPSG:4326",
    source_x_coord="lon",
    source_y_coord="lat",
    source_t_coord="time",
    source_var=[
        "air_temperature",     # Kelvin
        "precipitation_flux",  # kg/m²/s
        "relative_humidity",   # %
        "wind_speed"           # m/s
    ],
    target_gdf=study_sites,
    target_crs="EPSG:4326",
    target_id="site_name",
    source_time_period=["2019-01-01", "2019-12-31"]
)
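Area-weighted statistics do not change units, and the aggregation itself performs no unit conversion, so if you need a common unit system, convert in xarray before (or after) processing. An illustrative conversion on the dataset above:
# Convert air temperature from Kelvin to degrees Celsius before aggregating
climate_data["air_temperature"] = climate_data["air_temperature"] - 273.15
climate_data["air_temperature"].attrs["units"] = "degC"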
Processing Workflow
# Complete workflow with UserCatData
user_data = UserCatData(
    source_ds=my_dataset,
    source_crs="EPSG:4326",
    source_x_coord="longitude",
    source_y_coord="latitude",
    source_t_coord="time",
    source_var=["surface_temperature"],
    target_gdf=administrative_boundaries,
    target_crs="EPSG:4326",
    target_id="admin_code"
)
# Generate intersection weights
weight_gen = WeightGen(
    user_data=user_data,
    method="parallel",   # Use parallel processing
    weight_gen_crs=6931  # Use equal-area projection for accurate weights
)
weights = weight_gen.calculate_weights()
# Perform area-weighted aggregation
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",  # Calculate area-weighted mean
    agg_engine="parallel",
    agg_writer="netcdf",        # Save results as NetCDF
    weights=weights,
    out_path="./results/",
    file_prefix="temperature_analysis"
)
result_gdf, result_dataset = agg_gen.calculate_agg()
Data Requirements
Your xarray Dataset must include the following (a quick sanity check is sketched after this list):
- Spatial coordinates: Regularly gridded x and y coordinates
- Temporal coordinate: Time dimension (if processing time series)
- Data variables: The variables you want to interpolate
- CRS information: Coordinate reference system (can be specified manually)
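A quick, illustrative check against these requirements; the file name plus coordinate and variable names are placeholders for your own.
import numpy as np
import xarray as xr

ds = xr.open_dataset("my_custom_data.nc")
required_coords = ["longitude", "latitude", "time"]  # match your dataset's names
required_vars = ["surface_temperature"]

missing = [c for c in required_coords if c not in ds.coords]
missing += [v for v in required_vars if v not in ds.data_vars]
if missing:
    raise ValueError(f"Dataset is missing required coordinates/variables: {missing}")

# Check that the x coordinate is regularly spaced
dx = np.diff(ds["longitude"].values)
print("regular x spacing:", np.allclose(dx, dx[0]))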
Common Use Cases
- Research datasets: Custom model outputs, field measurements
- Local weather stations: Interpolated station data
- Satellite products: Processed remote sensing data
- Reanalysis subsets: Regional extracts from global datasets
- Ensemble models: Multi-model climate projections
Requirements
Data Formats
- Gridded Data: Any dataset readable by xarray with spatial coordinates and a defined CRS
- Vector Data: Any format readable by geopandas
- Projections: Any CRS readable by pyproj.CRS
Dependencies
- Python 3.11+
- xarray (gridded data handling)
- geopandas (vector data handling)
- pandas (data manipulation)
- numpy (numerical operations)
- shapely (geometric operations)
- pyproj (coordinate transformations)
🤝 Contributing
We welcome contributions! Please see our development documentation for details on:
- Development environment setup
- Testing procedures
- Code style guidelines
- Issue reporting
📄 License
This project is in the public domain. See LICENSE for details.
🙏 Acknowledgments
gdptools integrates with several excellent open-source projects:
- xarray: Multi-dimensional array processing
- geopandas: Geospatial data manipulation
- HyRiver: Hydrologic data access (pynhd, pygeohydro)
- STAC: Spatiotemporal asset catalogs
- ClimateR: Climate data catalogs
History
A history of changes can be found in the changelog.
Credits
This project was generated from @hillc-usgs's Pygeoapi Plugin Cookiecutter template.
Questions? Open an issue on our GitLab repository or check the documentation for detailed examples and API reference.