Pipeline to Aggregate Data for Optimised Cloud Capabilities

Project description

PADOCC Package

PADOCC (Pipeline to Aggregate Data for Optimal Cloud Capabilities) is a data aggregation pipeline for creating Kerchunk (or alternative) files that represent datasets stored in a variety of original formats. The pipeline currently supports writing JSON/Parquet Kerchunk files for input NetCDF/HDF files. Further developments will allow GeoTIFF, GRIB and possibly Met Office (.pp) files to be represented, as well as using the Pangeo Rechunker tool to create Zarr stores for Kerchunk-incompatible datasets.

Example Notebooks at this link

Documentation hosted at this link

Kerchunk Pipeline

Release 1.4.4

Release date: 22nd January 2026

See the release notes for details.

This package acknowledges contributions by Matt Brown as a pre-release tester.

Installation

To install this package, clone the repository using git clone, then follow the steps below to install the package with the necessary dependencies.

python -m venv .venv
source .venv/bin/activate
pip install poetry
poetry install

Alternatively, install from PyPI with:

pip install padocc==1.4.4

Basic Usage Example

Once installed, set a working directory environment variable. This location will be used to create all files within the PADOCC pipeline.

export WORKDIR=path/to/my/area

Note: You may also want to set the LOTUS_CFG environment variable, which must point to a LOTUS config file for use in parallel job deployment. See https://cedadev.github.io/padocc/detailed/parallel.html#lotus-2-configurations for more details.
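
The environment setup above can be sketched as a small pre-flight check. The directory location and the check_padocc_env helper are hypothetical names chosen for illustration, and creating the directory up front is a precaution rather than a documented requirement.

```shell
# Hypothetical working area; substitute your own path.
export WORKDIR="${TMPDIR:-/tmp}/padocc_workdir"

# check_padocc_env: illustrative helper, not part of PADOCC.
check_padocc_env() {
    if [ -z "${WORKDIR:-}" ]; then
        echo "WORKDIR is not set" >&2
        return 1
    fi
    # Create the working directory if it is missing (a precaution only).
    mkdir -p "$WORKDIR" || return 1
    # LOTUS_CFG is optional; warn if it is set but does not point to a file.
    if [ -n "${LOTUS_CFG:-}" ] && [ ! -f "$LOTUS_CFG" ]; then
        echo "warning: LOTUS_CFG does not point to a file: $LOTUS_CFG" >&2
    fi
    echo "WORKDIR ready: $WORKDIR"
}

check_padocc_env
```

Running the check before padocc init may catch a missing or mistyped path early.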

  1. Assemble the initialisation files.

For each dataset, you will need a text file listing the paths to all the files you wish to aggregate. (For files that each contain a single variable, you will need one text file per variable.) Alternatively, if all the files can be described by a simple wildcard pattern, e.g. path/to/files/*.nc, you may use the pattern instead. Each dataset is then entered as one row of a CSV file, formatted as follows:

name_of_dataset,<path_to_text_file_OR_pattern>,,

Add a new row for each dataset/variable described by a set of input files.
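
As an illustration of the initialisation files described above, the sketch below builds a file list and a two-row CSV in a temporary area. Every path and dataset name here is hypothetical, and the empty .nc files are stand-ins so the snippet can run anywhere.

```shell
# Hypothetical demo area; replace with your real data locations.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/data"
# Stand-in input files (empty placeholders, not real NetCDF).
touch "$demo_dir/data/a_2000.nc" "$demo_dir/data/a_2001.nc"

# One text file per dataset, listing every input path, one per line.
filelist="$demo_dir/my_dataset.txt"
find "$demo_dir/data" -name '*.nc' | sort > "$filelist"

# One CSV row per dataset: name, text file (or pattern), then two empty fields.
csv="$demo_dir/my_group.csv"
printf '%s,%s,,\n' "my_dataset" "$filelist" > "$csv"
# A wildcard pattern can take the place of the text file.
printf '%s,%s,,\n' "my_other_dataset" "$demo_dir/data/*.nc" >> "$csv"

cat "$csv"
```

The resulting CSV path is what would be passed to padocc init with -i.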

Run the following commands in order (if you have more than 5 datasets in your CSV group, you may want to look into parallelisation).

padocc init -G <group_name> -i <path_to_csv_file> -v
padocc scan -G <group_name> -v
padocc compute -G <group_name> -v
padocc validate -G <group_name> -v
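
The four phases above can be chained in a small wrapper that stops at the first failing phase. This is a minimal sketch: run_pipeline is a hypothetical helper, and it assumes each padocc phase exits with a non-zero status on failure, which is not confirmed here.

```shell
# run_pipeline: illustrative wrapper, not part of PADOCC.
# Assumes each padocc phase exits non-zero on failure (unconfirmed).
run_pipeline() {
    group="$1"
    csv="$2"
    padocc init -G "$group" -i "$csv" -v || {
        echo "init failed for group $group" >&2
        return 1
    }
    # Remaining phases share the same arguments, so loop over them in order.
    for phase in scan compute validate; do
        padocc "$phase" -G "$group" -v || {
            echo "$phase failed for group $group" >&2
            return 1
        }
    done
}

# Example call with hypothetical names:
# run_pipeline my_group path/to/my_group.csv
```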

If there are problems in the scan/compute phase, please refer to the list of known errors (https://cedadev.github.io/padocc/detailed/features.html#custom-pipeline-errors). If the validate phase ends with Fatal errors, you may need to recompute with an alternative aggregator (V or K). Please try all the combinations to see whether any aggregation works (--aggregator V or --aggregator K in compute, with -n to increment the version number).
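
The retry described above might be scripted roughly as follows. retry_aggregators is a hypothetical helper, and the sketch assumes that -n is a bare flag taking no value. It runs both aggregators unconditionally, so the validation outcome still needs to be inspected by hand after each pass.

```shell
# retry_aggregators: illustrative helper, not part of PADOCC.
# Assumes -n is a bare flag that takes no value (unconfirmed).
retry_aggregators() {
    group="$1"
    for agg in V K; do
        echo "Recomputing group $group with aggregator $agg"
        padocc compute -G "$group" --aggregator "$agg" -n -v
        padocc validate -G "$group" -v
        # Show the group status so the result of this pass can be reviewed.
        padocc status -G "$group"
    done
}

# Example call with a hypothetical group name:
# retry_aggregators my_group
```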

Validations that result in Success or Warnings are OK and can proceed to completion. The report generated in validation is saved to the completion directory by default.

Note: Only run the completion step below once all groups have finished validation. Check this with padocc status -G <group_name>

padocc complete -G <group_name> --completion_dir path/to/outputs

If the data is NOT in the CEDA archive, you will need to add custom --sub and --replace to change the local filepaths of your input files to remote paths (wherever they are downloadable).

For all other queries, please contact Daniel Westwood (daniel.westwood@stfc.ac.uk).

Download files

Source Distribution

padocc-1.4.6.tar.gz (9.9 MB view details)

Uploaded Source

Built Distribution

padocc-1.4.6-py3-none-any.whl (10.0 MB view details)

Uploaded Python 3

File details

Details for the file padocc-1.4.6.tar.gz.

File metadata

  • Download URL: padocc-1.4.6.tar.gz
  • Upload date:
  • Size: 9.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.11.9 Linux/5.14.0-611.27.1.el9_7.x86_64

File hashes

Hashes for padocc-1.4.6.tar.gz
Algorithm Hash digest
SHA256 5154f27a7c4b5f98b3b290db8ec35de10f1f7af0d1c0d2789e7758091c48a51e
MD5 25f813e0a8be5700fbdfcb944da3c6ea
BLAKE2b-256 eec873dc6476b40757050e8176c2c3a744a2bc55662888857d91525a8ada9170

File details

Details for the file padocc-1.4.6-py3-none-any.whl.

File metadata

  • Download URL: padocc-1.4.6-py3-none-any.whl
  • Upload date:
  • Size: 10.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.11.9 Linux/5.14.0-611.27.1.el9_7.x86_64

File hashes

Hashes for padocc-1.4.6-py3-none-any.whl
Algorithm Hash digest
SHA256 9d428772e5693abd95740dc07e6a4048dea2d438c9442c78c30747a48aaf1169
MD5 4e857a4f1a7ba765ecf778332b3738fe
BLAKE2b-256 5cc1d4d94c1fa870b121ade33beeacfbd778b14455ed7fa1c609acf832d82454
