Package for efficiently parallelising zarr write operations based on awareness of source chunks
Project description
Zarr Parallel Cacher
This package has been developed as part of the FRAME-FM AI project. It has been separated into its own module for ease of reusability across multiple projects. AI-specific steps may form part of the package, but may also be disabled by default.
Basic Usage
```python
from zarr_parallel.assembler import ZarrParallelAssembler

zp = ZarrParallelAssembler(
    data_uri=uri,
    preprocessors=preprocessors,
    chunks=chunks,
    engine='kerchunk',
    variables={'d2m': {}},
    cache_label='_v1',
)
zp.cache(
    cache_dir='/gws/ssde/j25b/eds_ai/frame-fm/data/zarr_cache',
    deploy_mode='dask_distributed',
    simultaneous_worker_limit=4,
)
```
The above code snippet demonstrates typical use of the package. The data_uri and engine parameters are passed to xarray's open_dataset method to access the source object. chunks is required: it specifies the output chunking of the zarr cache and is also used to organise the parallel jobs. variables is optional and allows transforms (such as renaming) to be applied to specific data arrays individually.
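The `variables` mapping in the snippet above is deliberately minimal (`{'d2m': {}}`). As a rough illustration of the kind of per-variable transform it can carry, the sketch below performs an xarray-level rename directly; the toy dataset and the idea that a rename would be expressed this way are assumptions, not the package's documented schema.

```python
import numpy as np
import xarray as xr

# A toy dataset standing in for the source object (hypothetical values).
ds = xr.Dataset(
    {'d2m': (('time', 'lat'), np.zeros((4, 3)))},
    coords={'time': np.arange(4), 'lat': np.arange(3)},
)

# Output chunking for the zarr cache: one chunk per pair of time steps.
chunks = {'time': 2, 'lat': 3}

# A per-variable rename, the kind of transform a `variables` entry
# could apply to one data array individually.
renamed = ds.rename({'d2m': 'dewpoint_2m'})
```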
The preprocessors list defines the set of preprocessing transforms (including selection) to apply to the dataset at the point of caching. It should include every transform that must be applied to the dataset before writing to the zarr cache.
The num_jobs and simultaneous_worker_limit parameters configure the parallel deployment. If num_jobs is not provided, the assembler calculates the optimal number of jobs for your memory limit (recommended).
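The package does not document how the optimal job count is derived, but the shape of such a calculation is easy to sketch: the smallest number of jobs for which each job's share of the source data fits within the per-worker memory limit. The helper below is hypothetical, not the assembler's actual logic.

```python
import math

def estimate_num_jobs(total_bytes: int, memory_limit_bytes: int) -> int:
    """Hypothetical sketch: smallest job count such that each job's
    share of the source data fits the per-worker memory limit."""
    return max(1, math.ceil(total_bytes / memory_limit_bytes))

# e.g. a 64 GiB source with a 4 GiB per-worker limit needs 16 jobs
jobs = estimate_num_jobs(64 * 2**30, 4 * 2**30)
```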
Transforms/Preprocessors
Transformations to the data may be specified via the preprocessors option passed in the above example. Xarray-native transformations are supported, as well as transforms from the FRAME-FM package if it is installed.
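As a sketch of what xarray-native preprocessors might look like, the example below assumes each preprocessor is a callable taking and returning an `xarray.Dataset` (the exact signature the package expects is not documented here, so this is an assumption).

```python
import numpy as np
import xarray as xr

# Assumed preprocessor shape: a callable Dataset -> Dataset.
def select_region(ds: xr.Dataset) -> xr.Dataset:
    # Xarray-native label-based selection, applied at the point of caching.
    return ds.sel(lat=slice(0, 1))

def to_celsius(ds: xr.Dataset) -> xr.Dataset:
    # Xarray-native arithmetic transform.
    return ds.assign(d2m=ds['d2m'] - 273.15)

preprocessors = [select_region, to_celsius]

# Toy dataset with hypothetical values, standing in for the source object.
ds = xr.Dataset(
    {'d2m': (('time', 'lat'), np.full((2, 4), 273.15))},
    coords={'time': [0, 1], 'lat': [0, 1, 2, 3]},
)
for fn in preprocessors:
    ds = fn(ds)
```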
Selection Recommendations
The assembler will halt to recommend alternative data selections based on the underlying chunk structure. Proceeding against these recommendations is not advised: selection borders that do not align with chunk boundaries may cause the same source chunks to be requested by multiple workers, significantly increasing the memory required per worker.
Download files
File details
Details for the file zarr_parallel-0.1.0.tar.gz.
File metadata
- Download URL: zarr_parallel-0.1.0.tar.gz
- Upload date:
- Size: 14.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.12.2 Linux/5.14.0-611.27.1.el9_7.x86_64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 99adb39a53b3c4692f99ca120021168bc4726fac6c5226c1e9cdd15fa05a6c4a |
| MD5 | f26c3d4f6958f3d9ab87844b6dbf0996 |
| BLAKE2b-256 | 3250a92cfaf6e938c5b7dc321dd37017c1647b3494044d048fefc3e7c3e1e12c |
File details
Details for the file zarr_parallel-0.1.0-py3-none-any.whl.
File metadata
- Download URL: zarr_parallel-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.12.2 Linux/5.14.0-611.27.1.el9_7.x86_64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 913d10bb077bc4d2a4d4b236626b159ed851f4439335cf6a67c827740028c28d |
| MD5 | 66b26ba624e3841cbdce9eef7f93cb14 |
| BLAKE2b-256 | 2c2b548b6cb9b073b01dc8257bfa38bb97eea25aae3b9d0f058e99d7da87923c |