# esgpull-plus

An extension of the original ESGF data discovery and download tool: an API and processing layer on top of [esgf-download](https://github.com/ESGF/esgf-download), adding YAML-based download config, fast downloads, CDO regridding, and surface/seafloor subsetting.
## Contents

- Installation and set-up
- File structure
- Dependencies
- Keeping up with upstream
- Git configuration
- Searching for data
- CDO regridding pipeline
- Works in progress
- License
## Installation and set-up

1. Install the package (in a conda environment if you need CDO regridding):

   ```shell
   pip install esgpull-plus
   ```

2. Optional: CDO regridding (conda recommended):

   ```shell
   conda install -c conda-forge python-cdo
   ```

3. Set up the base `esgpull` installation:

   ```shell
   esgpull self install
   ```

See the [esgf-download](https://github.com/ESGF/esgf-download) installation docs for details.
## File structure

```
esgf-download/
├── esgpull/                  # Original esgpull
│   └── esgpullplus/          # Extensions (regrid, API, etc.)
├── update-from-upstream.sh
```
## Dependencies

- Base: from `pyproject.toml` (httpx, click, rich, sqlalchemy, pydantic, etc.).
- `esgpullplus`: pandas, numpy, requests, watchdog, xarray; geospatial support via `xesmf` and `python-cdo` (conda).
## Keeping up with upstream

Recommended:

```shell
./update-from-upstream.sh
```

Manual:

```shell
git fetch upstream && git merge upstream/main
# Then reinstall (conda-aware):
conda install -c conda-forge pandas xarray numpy
pip install xesmf cdo-python watchdog orjson
```
## Git configuration

```shell
git remote -v
# origin    https://github.com/orlando-code/esgpull-plus/ (fetch/push)
# upstream  https://github.com/ESGF/esgf-download.git (fetch/push)
```

If `upstream` is missing:

```shell
git remote add upstream https://github.com/ESGF/esgf-download.git
```
## Searching for data

### Main search

Populate the `search.yaml` file (in the repo root) with your ESGF facets and meta options:

```yaml
search_criteria:
  project: CMIP6
  table_id: Omon
  experiment_id: historical,ssp585
  variable: uo,vo
  filter:
    top_n: 3    # top N datasets to keep
    limit: 10   # max results per sub-search
meta_criteria:
  data_dir: /path/to/data
  max_workers: 4
```

Run the search + download pipeline (uses `search.yaml` automatically):

```shell
python -m esgpull.esgpullplus.api
python -m esgpull.esgpullplus.api --symmetrical  # only download sources with both historical + SSP experiments
```
- Symmetry: in `--symmetrical` mode the tool first analyses all experiments and then only downloads datasets from sources that have both historical and SSP-style experiments (e.g. `ssp*`), so historical/SSP runs are matched.
- Sorting by resolution: search results are converted to a DataFrame and sorted by parsed nominal horizontal resolution, then by `dataset_id`, so you always get a consistent "highest resolution first" ordering.
- Stable IDs: multi-value facets like `variable: uo,vo` are normalised (split, trimmed, sorted) so the order you write them in `search.yaml` does not affect the generated search IDs or caching.
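The stable-ID behaviour can be sketched as follows. This is an illustrative helper only: `normalise_facets` and `search_id` are hypothetical names, not the package's actual API.

```python
import hashlib

def normalise_facets(criteria: dict) -> dict:
    """Split comma-separated facet values, trim whitespace, and sort,
    so that 'uo,vo' and 'vo, uo' produce the same canonical form."""
    out = {}
    for key, value in criteria.items():
        if isinstance(value, str) and "," in value:
            parts = sorted(p.strip() for p in value.split(","))
            out[key] = ",".join(parts)
        else:
            out[key] = value
    return out

def search_id(criteria: dict) -> str:
    """Derive a stable short ID from the canonical facet form."""
    canonical = "|".join(
        f"{k}={v}" for k, v in sorted(normalise_facets(criteria).items())
    )
    return hashlib.sha1(canonical.encode()).hexdigest()[:12]

# Facet order and spacing do not change the ID:
assert search_id({"variable": "uo,vo"}) == search_id({"variable": "vo, uo"})
```

Because the ID depends only on the canonical form, cached search results survive cosmetic edits to `search.yaml`.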
Inputs (YAML keys):

| Key | Description |
|---|---|
| `search_criteria.*` | ESGF facets (`project`, `table_id`, `experiment_id`, `variable`/`variable_id`, `frequency`, etc.). |
| `search_criteria.filter.top_n` | Number of top grouped datasets to keep. |
| `search_criteria.filter.limit` | Maximum number of results per sub-search (useful for debugging). |
| `meta_criteria.data_dir` | Base directory for downloaded data and cached search results. |
| `meta_criteria.max_workers` | Worker count used for any post-download regridding. |
### Search analysis script

`run_search_analysis` runs an ESGF search from `search.yaml`, analyses source availability (which sources have both historical and SSP experiments, resolutions, ensemble counts), and optionally writes an `analysis_df.csv` plus PNG plots. It ignores `filter.top_n` and `filter.limit` so the analysis uses all matching results.

Run:

```shell
python run_search_analysis.py [OPTIONS]
```
| Option | Default | Description |
|---|---|---|
| `--config` / `--config-path` | `search.yaml` | Path to the search config YAML. |
| `--output-dir` | `plots/` (repo) | Directory for `analysis_df.csv` and plot PNGs. |
| `--save-plots` | `True` | Save plot images (source availability heatmap, ensemble counts, resolution distribution, summary table). |
| `--show-plots` | `True` | Display plots interactively. |
| `--require-both` | `True` | Only include sources that have both historical and SSP experiments. |

Outputs: `analysis_df.csv` plus, when `--save-plots` is on, `source_availability_heatmap.png`, `ensemble_counts.png`, `resolution_distribution.png`, and `source_summary_table.png` in the output directory. Plotting requires matplotlib and seaborn.
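As a rough sketch of what the `--require-both` filter amounts to (shown here in plain Python for clarity; the row layout and column names are assumptions, so check the actual `analysis_df.csv` header):

```python
# Hypothetical rows; the real analysis_df.csv schema may differ.
rows = [
    {"source_id": "ACCESS-CM2", "experiment_id": "historical"},
    {"source_id": "ACCESS-CM2", "experiment_id": "ssp585"},
    {"source_id": "CESM2", "experiment_id": "historical"},
]

# Group experiments by source, then keep sources that have both
# a historical run and at least one SSP-style run.
by_source: dict = {}
for row in rows:
    by_source.setdefault(row["source_id"], set()).add(row["experiment_id"])

symmetric = sorted(
    src for src, exps in by_source.items()
    if "historical" in exps and any(e.startswith("ssp") for e in exps)
)
print(symmetric)  # ['ACCESS-CM2'] — CESM2 lacks an SSP run
```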
## CDO regridding pipeline

A single pipeline in `esgpull.esgpullplus.cdo_regrid`: regridding with regrid-weight reuse, plus chunked and parallel processing. It supports surface (top level) and seafloor extraction: each writes a file next to the original (`*_top_level.nc`, `*_seafloor.nc`), and that file is then regridded like any other. Alternatively, the full multi-level field can be regridded unchanged.
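The seafloor-extraction idea is, roughly: for each grid column, take the deepest vertical level that still holds valid data (ocean cells below the bathymetry are NaN). A minimal numpy sketch of that idea (illustrative only; the actual pipeline handles named depth dimensions, caching, and CDO integration):

```python
import numpy as np

def extract_seafloor(cube: np.ndarray) -> np.ndarray:
    """cube: (lev, y, x) array with NaN below the bathymetry.
    Returns a (y, x) field holding each column's deepest valid value."""
    valid = ~np.isnan(cube)                                   # (lev, y, x)
    nlev = cube.shape[0]
    # Last valid level index per column; -1 marks all-NaN columns.
    idx = np.where(valid, np.arange(nlev)[:, None, None], -1).max(axis=0)
    idx = np.clip(idx, 0, None)                               # all-NaN columns select level 0 (still NaN)
    yy, xx = np.indices(idx.shape)
    return cube[idx, yy, xx]

# Two columns: one with 2 valid levels, one with only the surface valid.
cube = np.array([[[1.0, 5.0]],
                 [[2.0, np.nan]],
                 [[np.nan, np.nan]]])
print(extract_seafloor(cube))  # [[2. 5.]]
```

The surface case is simpler still: take level 0 directly.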
### Command line

```shell
# Directory: surface only
python -m esgpull.esgpullplus.cdo_regrid /path/to/dir -o /path/to/out -r 1.0 1.0 --extract-surface

# Directory: seafloor only
python -m esgpull.esgpullplus.cdo_regrid /path/to/dir -o /path/to/out --extract-seafloor --max-workers 2

# Both surface and seafloor per file
python -m esgpull.esgpullplus.cdo_regrid /path/to/dir --extreme-levels

# Single file
python -m esgpull.esgpullplus.cdo_regrid /path/to/file.nc -o /path/to/out.nc --extract-seafloor
```
Options:

| Option | Default | Description |
|---|---|---|
| `input` (positional) | required | Input file or directory. |
| `-o, --output` | same as input dir | Output file or directory; if omitted, writes next to the input. |
| `-r, --resolution lon lat` | `1.0 1.0` | Target output resolution (`lon_res`, `lat_res`). |
| `-p, --pattern` | `"*.nc"` | File pattern when input is a directory. |
| `--include-subdirectories` | `True` | Include subdirectories when walking a directory. |
| `--extract-surface` | `False` | Extract and regrid only the top level (surface). |
| `--extract-seafloor` | `False` | Extract and regrid only seafloor values. |
| `--extreme-levels` | `False` | Regrid both surface and seafloor for each file. |
| `--no-regrid-cache` | `False` | Disable reuse of CDO weight files. |
| `--no-seafloor-cache` | `False` | Disable reuse of the seafloor depth-index cache. |
| `-w, --max-workers` | `4` | Maximum parallel workers. |
| `--chunk-size-gb` | `2.0` | Maximum time-chunk size in GB. |
| `--max-memory-gb` | `8.0` | Soft cap for memory-aware chunking. |
| `--no-parallel` | `False` | Process files sequentially. |
| `--no-chunking` | `False` | Disable time chunking (process each file in one go). |
| `-v, --verbose` | `True` | Verbose progress UI. |
| `--verbose-max` | `False` | Extra diagnostics (grid type, size, large-file messages). |
| `--quiet` | `False` | Disable verbose output. |
| `--use-ui` | `True` | Use the rich progress UI. |
| `--unlink-unprocessed` | `False` | Remove any files that could not be processed. |
| `--overwrite` | `False` | Overwrite existing output files. |
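To make the chunking options concrete, here is one plausible way `--chunk-size-gb` could translate into time-axis slices. This is a hypothetical sketch, not the pipeline's actual logic; `plan_time_chunks` is an illustrative name.

```python
import math

def plan_time_chunks(n_timesteps: int, file_size_gb: float,
                     chunk_size_gb: float = 2.0):
    """Split the time axis into contiguous (start, stop) index ranges so
    that each chunk's share of the file stays under chunk_size_gb."""
    n_chunks = max(1, math.ceil(file_size_gb / chunk_size_gb))
    per_chunk = math.ceil(n_timesteps / n_chunks)
    return [
        (start, min(start + per_chunk, n_timesteps))
        for start in range(0, n_timesteps, per_chunk)
    ]

# A 5 GB file with 120 timesteps and a 2 GB chunk cap -> 3 chunks of 40:
print(plan_time_chunks(120, 5.0, 2.0))  # [(0, 40), (40, 80), (80, 120)]
```

Each chunk would then be regridded independently (reusing the same weights) and the results concatenated along time.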
### File watcher regridding

Continuously watch a directory for new NetCDF files and regrid them as they arrive, using the same CDO pipeline. This is helpful when you want downloaded files to be processed immediately:

```shell
python -m esgpull.esgpullplus.file_watcher /path/to/watch \
    -r 1.0 1.0 \
    --extract-surface \
    --use-regrid-cache \
    --process-existing  # also process files that are already present
```
Options:

| Option | Default | Description |
|---|---|---|
| `watch_dir` (positional) | required | Directory to watch for new NetCDF files. |
| `-r, --target-resolution lon lat` | `1.0 1.0` | Target output resolution (`lon_res`, `lat_res`). |
| `--target-grid` | `"lonlat"` | CDO target grid type. |
| `--weight-cache-dir` | `None` | Directory to store/reuse CDO weight files. |
| `--max-workers` | `4` | Maximum parallel workers. |
| `--batch-size` | `10` | Maximum files to accumulate before triggering a batch regrid. |
| `--batch-timeout` | `30.0` | Maximum seconds to wait before processing a partial batch. |
| `--extract-surface` | `False` | Extract and regrid only the top level (surface). |
| `--extract-seafloor` | `False` | Extract and regrid only seafloor values. |
| `--use-regrid-cache` | `False` | Enable reuse of CDO weight files. |
| `--use-seafloor-cache` | `False` | Enable reuse of the seafloor depth-index cache. |
| `--file-settle-seconds` | `10.0` | Wait time to ensure files are no longer being written before processing. |
| `--validate-can-open` | `True` | Validate that files can be opened before scheduling regridding. |
| `--overwrite` | `False` | Overwrite existing regridded outputs. |
| `--delete-original` | `False` | Delete original files after successful regridding. |
| `--process-existing` | `True` | Process files already present in `watch_dir` on startup. |
### Python API

```python
from pathlib import Path

from esgpull.esgpullplus.cdo_regrid import (
    CDORegridPipeline,
    regrid_directory,
    regrid_single_file,
)

# Directory
results = regrid_directory(
    Path("data/input"),
    output_dir=Path("data/output"),
    target_resolution=(1.0, 1.0),
    extract_surface=True,
    extract_seafloor=False,
    max_workers=4,
)
# results["successful"], results["failed"], results["skipped"]

# Single file
ok = regrid_single_file(
    Path("data/file.nc"),
    output_dir=Path("data/output"),
    target_resolution=(1.0, 1.0),
    extract_seafloor=True,
)
```
### Features

- Surface/seafloor: writes `*_top_level.nc` or `*_seafloor.nc` beside the original, then regrids that file (same CDO path).
- Weight reuse: weights cached per directory (e.g. `cdo_weights/`); shared when grids match.
- Chunking: large files are split by time; tunable via `--chunk-size-gb` and `--max-memory-gb`.
- Parallel: per-file locking; `--max-workers`; `--no-parallel` to disable.
- Grids: structured, curvilinear, unstructured (e.g. `ncells`); multi-level and time series.
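The "shared when grids match" point boils down to a cache key: two files can reuse the same weights whenever their source grid and the target grid are identical. A hypothetical sketch of such a key (`weight_cache_key` is an illustrative name, not the package's API):

```python
import hashlib

def weight_cache_key(grid_desc: str, target_resolution=(1.0, 1.0)) -> str:
    """Key the weight cache on the source grid description and the target
    grid, since remapping weights depend on both and nothing else."""
    payload = f"{grid_desc}|lonlat|{target_resolution[0]}x{target_resolution[1]}"
    return hashlib.md5(payload.encode()).hexdigest()[:16]

# Same grid + same target -> same key -> weights reused.
k1 = weight_cache_key("curvilinear:360x300")
k2 = weight_cache_key("curvilinear:360x300")
assert k1 == k2
```

A different target resolution (or a different source grid) yields a different key, so weights are regenerated only when genuinely needed.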
## Works in progress

- There's a fair bit of functionality here! Time to get a proper documentation site in order...
- Merge as much of this functionality as is welcome/useful into the original `esgpull` repository.

I'm more than happy to take suggestions/contributions from anyone. Just get in touch via email: rt582@cam.ac.uk
## License

Same license terms as the [esgpull](https://github.com/ESGF/esgf-download) project.