
esgpull-plus


API and processing extension to esgf-download: YAML-based download config, fast downloads, CDO regridding, and surface/seafloor subsetting.


Contents

  1. Installation and set-up
  2. File structure
  3. Dependencies
  4. Keeping up with upstream
  5. Git configuration
  6. Searching for data
  7. CDO regridding pipeline
  8. Works in progress
  9. License

Installation and set-up

1. Install the package (in a conda env if you need CDO regridding):

pip install esgpull-plus

2. Optional – CDO regridding (conda recommended):

conda install -c conda-forge python-cdo

3. Base esgpull:

esgpull self install

See esgf-download installation.


File structure

esgf-download/
├── esgpull/              # Original esgpull
│   └── esgpullplus/      # Extensions (regrid, API, etc.)
├── update-from-upstream.sh

Dependencies

  • Base: from pyproject.toml (httpx, click, rich, sqlalchemy, pydantic, etc.).
  • esgpullplus: pandas, numpy, requests, watchdog, xarray; geospatial via xesmf and python-cdo (conda).

Keeping up with upstream

Recommended:

./update-from-upstream.sh

Manual:

git fetch upstream && git merge upstream/main
# Then reinstall the extras (conda-aware): conda install -c conda-forge pandas xarray numpy; pip install xesmf cdo watchdog orjson

Git configuration

git remote -v
# origin    https://github.com/orlando-code/esgpull-plus/ (fetch/push)
# upstream  https://github.com/ESGF/esgf-download.git (fetch/push)

If upstream is missing: git remote add upstream https://github.com/ESGF/esgf-download.git


Searching for data

Main search

Populate the search.yaml file (in the repo root) with your ESGF facets and meta options:

search_criteria:
  project: CMIP6
  table_id: Omon
  experiment_id: historical,ssp585
  variable: uo,vo
  filter:
    top_n: 3        # top N datasets to keep
    limit: 10       # max results per sub-search

meta_criteria:
  data_dir: /path/to/data
  max_workers: 4

Run the search + download pipeline (uses search.yaml automatically):

python -m esgpull.esgpullplus.api
python -m esgpull.esgpullplus.api --symmetrical  # only download sources with both historical + SSP experiments

  • Symmetry: in --symmetrical mode the tool first analyses all experiments, then downloads only datasets from sources that provide both historical and SSP-style experiments (e.g. ssp*), so historical and SSP runs can be paired.
  • Sorting by resolution: search results are converted to a DataFrame and sorted by parsed nominal horizontal resolution, then by dataset_id, so you always get a consistent “highest resolution first” ordering.
  • Stable IDs: multi-value facets like variable: uo,vo are normalised (split, trimmed, sorted), so the order you write them in search.yaml does not affect the generated search IDs or caching.
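
To make the stable-ID behaviour concrete, here is a minimal sketch of that normalisation (the same idea, not the package's actual code):

def normalise_facet(value: str) -> str:
    # Split a multi-value facet, strip whitespace, sort, and rejoin,
    # so "vo, uo" and "uo,vo" yield the same canonical string.
    return ",".join(sorted(part.strip() for part in value.split(",")))

assert normalise_facet("vo, uo") == normalise_facet("uo,vo") == "uo,vo"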

Inputs (YAML keys):

Key                           Description
search_criteria.*             ESGF facets (project, table_id, experiment_id, variable/variable_id, frequency, etc.).
search_criteria.filter.top_n  Number of top grouped datasets to keep.
search_criteria.filter.limit  Maximum number of results per sub-search (useful for debugging).
meta_criteria.data_dir        Base directory for downloaded data and cached search results.
meta_criteria.max_workers     Worker count used for any post-download regridding.
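
The config is plain YAML, so it is easy to inspect or generate programmatically; a quick sketch with pyyaml (key names as in the table above):

import yaml

with open("search.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["search_criteria"]["variable"])   # -> "uo,vo"
print(cfg["meta_criteria"]["max_workers"])  # -> 4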

Search analysis script

run_search_analysis runs an ESGF search from search.yaml, analyses source availability (which sources have both historical and SSP experiments, resolutions, ensemble counts), and optionally writes an analysis_df.csv plus PNG plots. It ignores filter.top_n and filter.limit, so the analysis uses all matching results.

Run:

python run_search_analysis.py [OPTIONS]

Option                    Default      Description
--config / --config-path  search.yaml  Path to the search config YAML.
--output-dir              plots/       Directory (repo-relative) for analysis_df.csv and plot PNGs.
--save-plots              True         Save plot images (source availability heatmap, ensemble counts, resolution distribution, summary table).
--show-plots              True         Display plots interactively; pass --no-show-plots to disable.
--require-both            True         Only include sources that have both historical and SSP experiments.

Outputs: analysis_df.csv plus, when --save-plots is on, source_availability_heatmap.png, ensemble_counts.png, resolution_distribution.png, source_summary_table.png in the output directory. Requires matplotlib and seaborn for plotting.
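
For example, to run the analysis against a different config and write the CSV and plots to a custom directory (documented options only):

python run_search_analysis.py --config my_search.yaml --output-dir analysis/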


CDO regridding pipeline

A single pipeline in esgpull.esgpullplus.cdo_regrid handles regridding with regrid-weight reuse, time chunking, and parallel processing. It supports surface (top-level) and seafloor extraction: each writes a file next to the original (*_top_level.nc, *_seafloor.nc), which is then regridded like any other file. Alternatively, skip extraction and regrid the full field.
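
For intuition, seafloor extraction amounts to taking the deepest valid (non-NaN) level in each water column. A minimal xarray sketch of the idea (the variable and dimension names, uo and lev, are assumptions, and this is not the pipeline's internal code):

import xarray as xr

ds = xr.open_dataset("file.nc")
da = ds["uo"]                                # hypothetical ocean variable with a "lev" depth dimension
valid = da.notnull()                         # True where the cell is water
deepest = valid.cumsum("lev").argmax("lev")  # index of the last non-NaN level in each column
seafloor = da.isel(lev=deepest)              # seafloor field, one value per column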

Command line

# Directory: surface only
python -m esgpull.esgpullplus.cdo_regrid /path/to/dir -o /path/to/out -r 1.0 1.0 --extract-surface

# Directory: seafloor only
python -m esgpull.esgpullplus.cdo_regrid /path/to/dir -o /path/to/out --extract-seafloor --max-workers 2

# Both surface and seafloor per file
python -m esgpull.esgpullplus.cdo_regrid /path/to/dir --extreme-levels

# Single file
python -m esgpull.esgpullplus.cdo_regrid /path/to/file.nc -o /path/to/out.nc --extract-seafloor

Options:

Option                      Default            Description
input (positional)          required           Input file or directory.
-o, --output                same as input dir  Output file or directory; if omitted, writes next to the input.
-r, --resolution LON LAT    1.0 1.0            Target output resolution (lon_res, lat_res).
-p, --pattern               "*.nc"             File pattern when input is a directory.
--include-subdirectories    True               Include subdirectories when walking a directory.
--extract-surface           False              Extract and regrid only the top level (surface).
--extract-seafloor          False              Extract and regrid only seafloor values.
--extreme-levels            False              Regrid both surface and seafloor for each file.
--no-regrid-cache           False              Disable reuse of CDO weight files.
--no-seafloor-cache         False              Disable reuse of the seafloor depth-index cache.
-w, --max-workers           4                  Maximum parallel workers.
--chunk-size-gb             2.0                Maximum time-chunk size in GB.
--max-memory-gb             8.0                Soft cap for memory-aware chunking.
--no-parallel               False              Process files sequentially.
--no-chunking               False              Disable time chunking (process each file in one go).
-v, --verbose               True               Verbose progress UI.
--verbose-max               False              Extra diagnostics (grid type, size, large-file messages).
--quiet                     False              Disable verbose output.
--use-ui                    True               Use the rich progress UI.
--unlink-unprocessed        False              Remove any files that could not be processed.
--overwrite                 False              Overwrite existing output files.

N.B. if --output is not specified, new files will be written to the same directory as the inputs.
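
For example, to keep memory use down on a small machine, shrink the chunks and the worker count:

python -m esgpull.esgpullplus.cdo_regrid /path/to/dir --chunk-size-gb 1.0 --max-memory-gb 4.0 --max-workers 2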

File watcher regridding

Continuously watch a directory for new NetCDF files and regrid them as they arrive, using the same CDO pipeline. This is useful when you are downloading files and want them processed as soon as they land:

python -m esgpull.esgpullplus.file_watcher /path/to/watch \
  -r 1.0 1.0 \
  --extract-surface \
  --use-regrid-cache \
  --process-existing    # also process files that are already present

Options:

Option                           Default   Description
watch_dir (positional)           required  Directory to watch for new NetCDF files.
-r, --target-resolution LON LAT  1.0 1.0   Target output resolution (lon_res, lat_res).
--target-grid                    "lonlat"  CDO target grid type.
--weight-cache-dir               None      Directory to store/reuse CDO weight files.
--max-workers                    4         Maximum parallel workers.
--batch-size                     10        Maximum files to accumulate before triggering a batch regrid.
--batch-timeout                  30.0      Maximum seconds to wait before processing a partial batch.
--extract-surface                False     Extract and regrid only the top level (surface).
--extract-seafloor               False     Extract and regrid only seafloor values.
--use-regrid-cache               False     Enable reuse of CDO weight files.
--use-seafloor-cache             False     Enable reuse of the seafloor depth-index cache.
--file-settle-seconds            10.0      Wait time to ensure a file is no longer being written before processing it.
--validate-can-open              True      Validate that files can be opened before scheduling regridding.
--overwrite                      False     Overwrite existing regridded outputs.
--delete-original                False     Delete original files after successful regridding.
--process-existing               True      Process files already present in watch_dir on startup.
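
Conceptually, the watcher pairs a filesystem observer with the same regrid entry points exposed in the Python API below. A self-contained sketch of the idea using the watchdog library (this mirrors the behaviour, not the module's actual implementation; the settle wait is simplified to a sleep):

import time
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from esgpull.esgpullplus.cdo_regrid import regrid_single_file

class NetCDFHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith(".nc"):
            return
        time.sleep(10)  # crude "file settle" wait so the file is fully written
        regrid_single_file(
            Path(event.src_path),
            output_dir=Path("data/output"),
            target_resolution=(1.0, 1.0),
        )

observer = Observer()
observer.schedule(NetCDFHandler(), "/path/to/watch", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()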

Python API

from pathlib import Path
from esgpull.esgpullplus.cdo_regrid import regrid_directory, regrid_single_file, CDORegridPipeline

# Directory
results = regrid_directory(
    Path("data/input"),
    output_dir=Path("data/output"),
    target_resolution=(1.0, 1.0),
    extract_surface=True,
    extract_seafloor=False,
    max_workers=4,
)
# results["successful"], results["failed"], results["skipped"]

# Single file
ok = regrid_single_file(
    Path("data/file.nc"),
    output_dir=Path("data/output"),
    target_resolution=(1.0, 1.0),
    extract_seafloor=True,
)

Features

  • Surface/seafloor: Writes *_top_level.nc or *_seafloor.nc beside the original, then regrids that file (same CDO path).
  • Weight reuse: Weights cached per directory (e.g. cdo_weights/); shared when grids match (see the sketch after this list).
  • Chunking: Large files split by time; optional --chunk-size-gb, --max-memory-gb.
  • Parallel: Per-file locking; --max-workers; --no-parallel to disable.
  • Grids: Structured, curvilinear, unstructured (e.g. ncells); multi-level and time series.
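
A rough illustration of the weight-reuse idea using the cdo Python bindings (generic CDO operators; not the pipeline's internal calls, and file names are placeholders):

from cdo import Cdo  # python-cdo bindings

cdo = Cdo()
target = "r360x180"  # 1.0 x 1.0 degree global lon-lat grid

# Generate bilinear remap weights once for a given source grid...
cdo.genbil(target, input="first_file.nc", output="weights.nc")

# ...then reuse them for every file that shares that grid.
for path in ["first_file.nc", "second_file.nc"]:
    cdo.remap(f"{target},weights.nc", input=path, output=path.replace(".nc", "_regrid.nc"))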

Works in progress

  1. There's a fair bit of functionality here! Time to get a proper documentation site in order...
  2. Merge as much of this functionality as is welcome/useful into the original esgpull repository.

I am more than happy to take suggestions/contributions from anyone. Just get in touch via email: rt582@cam.ac.uk


License

Same license terms as the esgpull project.
