Package for efficiently parallelising zarr write operations based on awareness of source chunks

Zarr Parallel Cacher

This package was developed as part of the NERC EDS FRAME-FM AI project. It has been separated into its own module so that it can be reused across multiple projects. AI-specific steps may form part of the package, but these may be disabled by default.

Basic Usage

from zarr_parallel.assembler import ZarrParallelAssembler

# uri, preprocessors and chunks are defined by the caller
zp = ZarrParallelAssembler(
    data_uri=uri,
    preprocessors=preprocessors,
    chunks=chunks,
    engine='kerchunk',
    variables={'d2m': {}},
    cache_label='_v1',
)

zp.cache(
    cache_dir='/gws/ssde/j25b/eds_ai/frame-fm/data/zarr_cache',
    deploy_mode='dask_distributed',
    simultaneous_worker_limit=4,
)

The code snippet above demonstrates the use of this package. The data_uri and engine parameters refer to the xarray open_dataset method used to access the source object. chunks is required: it specifies the output chunking of the zarr cache and is also used to organise the parallel jobs. variables is optional, and allows transforms (such as renaming) to be applied individually to specific data arrays.
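To illustrate why the output chunking matters for organising parallel jobs, the sketch below splits a dataset's index space into regions that each line up with one output chunk. It is an illustration only and not the package's implementation: the helper name chunk_regions and the dimension names and sizes are hypothetical.

```python
from itertools import product


def chunk_regions(shape, chunks):
    """Yield one dict of slices per output chunk.

    shape and chunks both map dimension name -> size.
    Hypothetical helper, not part of zarr_parallel.
    """
    per_dim = []
    for dim, size in shape.items():
        step = chunks[dim]
        # The last slice along a dimension may be shorter than the chunk size.
        per_dim.append([(dim, slice(start, min(start + step, size)))
                        for start in range(0, size, step)])
    # Every combination of per-dimension slices is one chunk-aligned region.
    for combo in product(*per_dim):
        yield dict(combo)


# Hypothetical dataset shape and output chunking.
shape = {'time': 48, 'latitude': 10}
chunks = {'time': 24, 'latitude': 4}
regions = list(chunk_regions(shape, chunks))
```

Because every region covers exactly one output chunk, jobs built from whole regions never write to the same chunk as one another.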

The preprocessors list defines the set of preprocessing transforms (including selection) to apply to the dataset at the point of caching. It should include every transform that needs to be applied to the dataset before it is written to the zarr cache.
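The format the package expects for each preprocessor is not described here. As a rough sketch, assuming each preprocessor is a callable that takes a dataset and returns a transformed dataset, applying the list in order could look like the following. The function names are hypothetical, and a plain dict stands in for an xarray Dataset so that the example has no dependencies.

```python
from functools import reduce


def apply_preprocessors(dataset, preprocessors):
    """Apply each callable in order, feeding its result to the next one."""
    return reduce(lambda ds, step: step(ds), preprocessors, dataset)


def select_first_two(ds):
    """Stand-in for a selection: keep the first two values of each array."""
    return {name: values[:2] for name, values in ds.items()}


def rename_d2m(ds):
    """Stand-in for a rename: 'd2m' becomes 'dewpoint'."""
    return {('dewpoint' if name == 'd2m' else name): values
            for name, values in ds.items()}


# Order matters: the selection runs first, then the rename.
preprocessors = [select_first_two, rename_d2m]
result = apply_preprocessors({'d2m': [1.0, 2.0, 3.0]}, preprocessors)
```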

The num_jobs and simultaneous_worker_limit parameters configure the parallel deployment. If num_jobs is not provided, the assembler calculates the optimal number of jobs for your memory limit (recommended). The default memory limit is 2GB and the default timeout is 30 minutes, although this currently applies only to SLURM deployments.
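How the assembler arrives at its job count is not documented here. One simple way to reason about it, shown purely as an assumption, is to divide the total data volume by the memory each worker can safely use. The function name estimate_num_jobs and the safety_factor parameter are hypothetical.

```python
import math


def estimate_num_jobs(total_bytes, memory_limit_bytes=2_000_000_000,
                      safety_factor=0.5):
    """Smallest job count that keeps each job within the usable memory.

    safety_factor is a hypothetical margin for intermediate copies made
    while reading and transforming the data.
    """
    usable = memory_limit_bytes * safety_factor
    return max(1, math.ceil(total_bytes / usable))


# A 10 GB selection with the default 2 GB limit and half of it usable.
num_jobs = estimate_num_jobs(10_000_000_000)
```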

Transforms/Preprocessors

Transformations to the data may be specified via the preprocessors option passed in the example above. Xarray-native transformations are supported, as are transforms from the FRAME-FM package if it is installed.

Selection Recommendations

The assembler will halt to recommend alternative data selections based on the underlying chunk structure. Proceeding without following the recommendations is not advised: where the borders of a selection do not match the source chunk borders, the same chunk may be requested more than once, which can significantly increase the memory required per worker.
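The effect of mismatched borders can be shown with a little arithmetic along a single dimension. The sketch below is not the assembler's recommendation logic. It only assumes that a job has to read every source chunk its selection overlaps, and the helper names are hypothetical.

```python
def align_to_chunks(start, stop, chunk_size, dim_size):
    """Expand [start, stop) outward to the nearest source chunk borders."""
    aligned_start = (start // chunk_size) * chunk_size
    # -(-a // b) is ceiling division on integers.
    aligned_stop = min(-(-stop // chunk_size) * chunk_size, dim_size)
    return aligned_start, aligned_stop


def chunks_touched(start, stop, chunk_size):
    """Number of source chunks that a [start, stop) selection overlaps."""
    return -(-stop // chunk_size) - start // chunk_size


chunk_size = 100

# Three jobs whose borders fall in the middle of source chunks.
misaligned_jobs = [(150, 250), (250, 350), (350, 450)]
misaligned_requests = sum(chunks_touched(a, b, chunk_size)
                          for a, b in misaligned_jobs)

# The same range snapped to chunk borders and split on those borders.
aligned_start, aligned_stop = align_to_chunks(150, 450, chunk_size, 1000)
aligned_jobs = [(a, a + chunk_size)
                for a in range(aligned_start, aligned_stop, chunk_size)]
aligned_requests = sum(chunks_touched(a, b, chunk_size)
                       for a, b in aligned_jobs)
```

In this example the misaligned jobs issue six chunk requests for four distinct chunks, while the aligned jobs request each chunk exactly once.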

Version 0.3 Changes

Added heartbeats between jobs in the dask workers.
dask distributed info messages can now be switched off.
Added the ability to add attributes.

Version 0.4 Changes

Job parallelisation is now distributed to workers for efficiency.
- Small parallel writes were found to be inefficient, so writes are now parallelised over the largest possible selection that stays within the memory and timeout limits.
Tiling parallelisation is now available. Caveats:
- Tiling requires rechunking to a single chunk per tile. The tile size may therefore need to be smaller than expected to stay within the memory limit of an individual worker, particularly where the source chunking scheme inflates the size of the data initially retrieved. An error is raised if the estimated memory requirement per tile exceeds the memory limit of the worker.
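The tiling caveat can be made concrete with a rough estimate. The sketch below is not the package's estimator. It assumes that building a tile means reading every overlapping source chunk in full and that tiles start on the source chunk grid. The function names, shapes and dtype size are hypothetical.

```python
import math


def estimate_tile_read_bytes(tile_shape, source_chunks, itemsize):
    """Upper bound on the bytes fetched to build one tile.

    Each dimension is rounded up to a whole number of source chunks,
    because a chunk has to be read in full even if only part is needed.
    """
    n_elements = 1
    for tile_len, chunk_len in zip(tile_shape, source_chunks):
        n_elements *= math.ceil(tile_len / chunk_len) * chunk_len
    return n_elements * itemsize


def check_tile_fits(tile_shape, source_chunks, itemsize, memory_limit_bytes):
    """Return the estimate, or raise if it exceeds the worker's limit."""
    required = estimate_tile_read_bytes(tile_shape, source_chunks, itemsize)
    if required > memory_limit_bytes:
        raise MemoryError(
            f"tile needs about {required} bytes, limit is {memory_limit_bytes}")
    return required


# A float32 tile of one time step, read from a source chunked as
# (1000, 64, 64): the full 1000-step chunk depth is fetched each time.
source_chunks = (1000, 64, 64)
required = check_tile_fits((1, 256, 256), source_chunks, 4, 2_000_000_000)
```

Here the tile itself holds only 262,144 bytes, but the estimated read is 262,144,000 bytes because of the source chunk depth. A (1, 1024, 1024) tile from the same source would exceed a 2 GB limit and raise the error.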

Release history

0.5.0: Mar 19, 2026
0.4.1: Mar 18, 2026
0.4.0: Mar 17, 2026 (this version)
0.3.2: Mar 11, 2026
0.3.1: Mar 11, 2026
0.3.0: Mar 11, 2026 (yanked; reason: bug with cache directory name)
0.2.0: Mar 10, 2026
0.1.0: Mar 9, 2026

Download files

Source Distribution

zarr_parallel-0.4.0.tar.gz (20.0 kB), uploaded Mar 17, 2026, Source

Built Distribution

zarr_parallel-0.4.0-py3-none-any.whl (24.6 kB), uploaded Mar 17, 2026, Python 3

File details

Details for the file zarr_parallel-0.4.0.tar.gz.

File metadata

Download URL: zarr_parallel-0.4.0.tar.gz
Upload date: Mar 17, 2026
Size: 20.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.2 CPython/3.12.2 Linux/5.14.0-611.27.1.el9_7.x86_64

File hashes

Hashes for zarr_parallel-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`9a5a4e571d3ba893245a7cc31dffe2dc4a8027c6f555c944956d2e1f4b15b54e`
MD5	`cb16f61714c7570edb72189197e7333c`
BLAKE2b-256	`940e2d2d699c2f74beef39f5603be834caf9acc915ef26dea07ac65c69f0cb0c`

File details

Details for the file zarr_parallel-0.4.0-py3-none-any.whl.

File metadata

Download URL: zarr_parallel-0.4.0-py3-none-any.whl
Upload date: Mar 17, 2026
Size: 24.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.2 CPython/3.12.2 Linux/5.14.0-611.27.1.el9_7.x86_64

File hashes

Hashes for zarr_parallel-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9683c0bae157581788ce5d214039d12360830dd161913aa1de9c91ba3f7daec3`
MD5	`0f434c8e5e627451ce759ed5df5960d3`
BLAKE2b-256	`bef631ea5c2d6b372678af46fec945d429c666d2e0a2b64f3de1dd62c95e2995`
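
The digests listed above can be checked against a downloaded file using Python's standard hashlib module. This is a minimal sketch: the helper names are hypothetical, and the path passed in is wherever the file was saved locally.

```python
import hashlib


def sha256_of_file(path, block_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the wheel was saved in the current directory:
# matches_expected('zarr_parallel-0.4.0-py3-none-any.whl',
#                  '9683c0bae157581788ce5d214039d12360830dd161913aa1de9c91ba3f7daec3')
```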

zarr-parallel 0.4.0
