Package for efficiently parallelising zarr write operations based on awareness of source chunks

Zarr Parallel Cacher

This package was developed as part of the NERC EDS FRAME-FM AI project. It has been separated into its own module so that it can be reused across multiple projects. Some AI-specific steps may be included in the package, but these may be disabled by default.

Basic Usage

from zarr_parallel.assembler import ZarrParallelAssembler

# uri, preprocessors and chunks must be defined before this point.
zp = ZarrParallelAssembler(
    data_uri=uri,
    preprocessors=preprocessors,
    chunks=chunks,
    engine='kerchunk',
    variables={'d2m': {}},
    cache_label='_v1')

# Write the zarr cache in parallel, with at most 4 workers running at once.
zp.cache(
    cache_dir='/gws/ssde/j25b/eds_ai/frame-fm/data/zarr_cache',
    deploy_mode='dask_distributed',
    simultaneous_worker_limit=4)

The snippet above demonstrates the use of this package. The data_uri and engine parameters correspond to the arguments of the xarray open_dataset method used to access the source object. chunks is required: it specifies the output chunking of the zarr cache and is also used to organise the parallel jobs. variables is optional and allows transforms (such as renaming) to be applied individually to specific data arrays.
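
To illustrate how output chunks can drive the organisation of parallel jobs, the sketch below enumerates the chunk-aligned regions of an array. It is not the package's implementation: the helper name chunk_regions and the dimension sizes are assumptions made purely for illustration.

```python
from itertools import product


def chunk_regions(shape, chunks):
    """Yield one dict of slices per output chunk (hypothetical helper).

    shape  -- mapping of dimension name to total length
    chunks -- mapping of dimension name to chunk length
    """
    per_dim = []
    for dim, size in shape.items():
        step = chunks[dim]
        # The final slice along a dimension may be shorter than `step`.
        per_dim.append([(dim, slice(start, min(start + step, size)))
                        for start in range(0, size, step)])
    # Every combination of per-dimension slices is one output chunk.
    for combo in product(*per_dim):
        yield dict(combo)


# Illustrative sizes only; not taken from any real dataset.
regions = list(chunk_regions({'time': 10, 'lat': 4}, {'time': 4, 'lat': 2}))
print(len(regions))  # 3 chunks along time x 2 along lat = 6
print(regions[0])
```

Each region of this kind could in principle be written independently, for example with the region argument of xarray's Dataset.to_zarr, which is why the output chunking needs to be known before jobs are divided up.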

The preprocessors list defines the set of preprocessing transforms (including selection) to apply to the dataset at the point of caching. It should include every transform that needs to be applied to the dataset before it is written to the zarr cache.

The num_jobs and simultaneous_worker_limit parameters configure parallel deployment. If num_jobs is not provided, the assembler will calculate the optimal number of jobs for your memory limit (recommended). The default memory limit is 2GB and the default timeout is 30 minutes, although this only applies to SLURM deployments at present.
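
As a rough illustration of how a job count might follow from a memory limit, the sketch below divides the output chunks among jobs so that the chunks handled by one job fit within the limit. This is an assumption about the general approach, not the assembler's actual calculation. The function estimate_num_jobs, the array shape, and the reading of 2GB as 2 * 1024**3 bytes are all hypothetical.

```python
import math


def estimate_num_jobs(shape, chunks, itemsize, memory_limit_bytes):
    """Estimate a job count so each job's chunks fit in memory (hypothetical)."""
    # Number of output chunks along each dimension, rounded up.
    n_chunks = math.prod(math.ceil(shape[d] / chunks[d]) for d in shape)
    # Uncompressed size of a single full chunk in bytes.
    chunk_bytes = itemsize * math.prod(chunks[d] for d in shape)
    if chunk_bytes > memory_limit_bytes:
        raise ValueError("a single chunk exceeds the memory limit")
    # How many whole chunks one job can hold at once.
    chunks_per_job = memory_limit_bytes // chunk_bytes
    return math.ceil(n_chunks / chunks_per_job)


# Illustrative values only: float32 data and a 2GB limit read as 2 * 1024**3 bytes.
shape = {'time': 8760, 'lat': 721, 'lon': 1440}
chunks = {'time': 24, 'lat': 721, 'lon': 1440}
num_jobs = estimate_num_jobs(shape, chunks, itemsize=4,
                             memory_limit_bytes=2 * 1024**3)
print(num_jobs)
```

A real calculation would also need headroom for source chunks, intermediate copies made by transforms, and worker overhead, so an estimate of this kind is best treated as a lower bound on the number of jobs.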

Transforms/Preprocessors

Transformations to the data may be specified via the preprocessors option passed in the above example. Xarray-native transformations are supported, as are transforms from the FRAME-FM package if it is installed.

Selection Recommendations

The assembler will halt to recommend alternative data selections based on the underlying chunk structure. Proceeding without following the recommendations is not advised: when the borders of a region do not line up with the borders of the source chunks, the same chunk may be requested more than once, which can significantly increase memory requirements per worker.
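
The cost of a misaligned selection can be shown with a little index arithmetic. The sketch below is not the assembler's recommendation logic: the helpers align_selection and chunks_touched are hypothetical, and expanding the selection outward to chunk borders is only one possible kind of recommendation.

```python
def chunks_touched(start, stop, chunk_size):
    """Return the indices of source chunks overlapped by [start, stop)."""
    # -(-stop // chunk_size) is ceiling division using integers only.
    return list(range(start // chunk_size, -(-stop // chunk_size)))


def align_selection(start, stop, chunk_size, dim_size):
    """Expand [start, stop) outward to the nearest source chunk borders."""
    aligned_start = (start // chunk_size) * chunk_size
    # Round the stop index up to the next border, capped at the dimension end.
    aligned_stop = min(-(-stop // chunk_size) * chunk_size, dim_size)
    return aligned_start, aligned_stop


# Illustrative numbers: source chunks of length 10 along a dimension of length 100.
# Two adjacent regions whose shared border (15) falls inside source chunk 1:
first = chunks_touched(5, 15, 10)    # [0, 1]
second = chunks_touched(15, 25, 10)  # [1, 2]
print(sorted(set(first) & set(second)))  # [1]: chunk 1 is requested by both

# Snapping the overall selection [5, 25) to chunk borders:
print(align_selection(5, 25, 10, 100))  # (0, 30)
```

In this example source chunk 1 is loaded by both regions, whereas regions cut at multiples of 10 would each read a disjoint set of source chunks.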
