simplezarr
A simple, elegant, and efficient Zarr implementation
The core of simplezarr implements the Zarr 3 spec in straightforward
Python, without extra fuss. This makes the code easy to follow and gives
predictable performance. Extra functionality is provided as functions and
classes in simplezarr.utils.
Since simplezarr is nice and simple, it's easy to adopt in various
use cases. It supports parallel I/O, but does not force the use of asyncio.
Status
- Stores are implemented (except remote stores, which are not available yet).
- Codecs are implemented (all except for sharding).
- Main API can (asynchronously) read and write chunks.
What is not yet supported:
- Writing Zarr files.
- Indexing (wip).
- Sharding.
Motivation
Zarr 3 is a great file format for large datasets. It's nice and elegant. The
simplezarr lib is what happened when we took the Zarr 3 spec, and implemented
it as directly as possible.
Parallelism is achieved using a thread pool and concurrent.futures.Future
objects, and in exactly one place: the code that reads a chunk (ZarrArray.get_chunk_future()).
We don't force asyncio. In fact, simplezarr does not even import
asyncio (except in code paths that represent a utility specific to asyncio
users).
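As a rough illustration of that design, the sketch below keeps the chunk read synchronous and adds a single method that submits it to a thread pool. The class SketchArray, its in-memory dict of chunks, and the max_workers parameter are hypothetical stand-ins chosen for this example; this is not simplezarr's actual code.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class SketchArray:
    """Hypothetical stand-in for an array class (not simplezarr's implementation)."""

    def __init__(self, chunks: dict, max_workers: int = 4):
        self._chunks = chunks  # maps chunk coordinates to raw bytes
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def get_chunk(self, coords: tuple) -> bytes:
        # Plain synchronous read. A real store would do blocking I/O
        # and decoding here; this sketch just looks up a dict.
        return self._chunks[coords]

    def get_chunk_future(self, coords: tuple) -> Future:
        # The only place where threading is involved: submit the
        # synchronous read to the pool and hand back the Future.
        return self._pool.submit(self.get_chunk, coords)


arr = SketchArray({(0, 0): b"aa", (0, 1): b"bb"})

# Request several chunks concurrently, then wait for all of them.
futures = [arr.get_chunk_future(c) for c in [(0, 0), (0, 1)]]
results = [f.result() for f in futures]
```

Because the result is an ordinary concurrent.futures.Future, callers can block on it with result(), attach callbacks, or bridge it into an event loop of their choice.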
Comparison with zarr-python
Why not use zarr-python? We ran into performance issues, and upon
investigating what happens under the hood, we found it hard to follow the path
that the code takes, especially regarding threading and asyncio. Granted, part
of that complexity is because it must support older Zarr versions as well.
Another reason is that zarr-python does not seem to have a way to read individual blocks
asynchronously (AsyncArray.get_block_selection() does not exist), which was a
requirement for our use-case.
What zarr-python does
- The store loads data using asyncio.to_thread(). This runs the I/O-bound reading of bytes in a separate thread (from the loop's default ThreadPoolExecutor).
- It uses asyncio.gather() to parallelize concurrent reads/writes.
- When using zarr.Array (not AsyncArray), indexing is synchronous. To do this:
  - It uses a dedicated asyncio loop that runs continuously in a dedicated thread.
  - A dedicated ThreadPoolExecutor is set on that loop (which is used to perform the store I/O).
  - It then calls asyncio.run_coroutine_threadsafe(the_asyncio_coroutine, dedicated_loop) to turn the asyncio code into a concurrent.futures.Future.
  - It then sync-waits on that future.
It looks like this complexity is one of the reasons why the performance of ome-zarr is hard to get right. The ome-zarr library wraps zarr-python with Dask, which uses thread pools too, which results in a lot of threads being spawned.
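To make the layering in the list above concrete, here is a minimal standard-library sketch of the same general pattern. It is our own reconstruction, not code taken from zarr-python; the functions read_bytes, read_many, and read_many_sync are hypothetical names, and the "store read" is just an in-memory encode.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


async def read_bytes(key: str) -> bytes:
    # Hypothetical store read: the blocking part is pushed to the
    # loop's default executor via asyncio.to_thread().
    return await asyncio.to_thread(key.encode)


async def read_many(keys: list) -> list:
    # Run all reads concurrently and collect results in order.
    return await asyncio.gather(*(read_bytes(k) for k in keys))


# A dedicated loop that runs continuously in its own thread,
# with a dedicated thread pool as its default executor.
dedicated_loop = asyncio.new_event_loop()
dedicated_loop.set_default_executor(ThreadPoolExecutor(max_workers=4))
loop_thread = threading.Thread(target=dedicated_loop.run_forever, daemon=True)
loop_thread.start()


def read_many_sync(keys: list) -> list:
    # Turn the coroutine into a concurrent.futures.Future, then block on it.
    future = asyncio.run_coroutine_threadsafe(read_many(keys), dedicated_loop)
    return future.result()


data = read_many_sync(["a", "b", "c"])

# Stop the dedicated loop once it is no longer needed.
dedicated_loop.call_soon_threadsafe(dedicated_loop.stop)
loop_thread.join()
```

Even in this small form, a single synchronous call crosses the calling thread, the loop thread, and the executor threads, which gives a sense of why the real code path is hard to trace.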
What simplezarr does
- Stores are synchronous.
- simplezarr.Array.get_chunk() is synchronous (no threading or async).
- simplezarr.Array.get_chunk_future() uses a ThreadPoolExecutor. It returns a concurrent.futures.Future.
- This is enough to support concurrent reads.
- No asyncio anywhere.
- But it can be used in asyncio (and other frameworks) using await asyncio.wrap_future(f) or f.add_done_callback(call_soon_threadsafe).
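The two bridging options in the last bullet can be sketched with the standard library alone. The module-level get_chunk() and get_chunk_future() functions below are hypothetical stand-ins for the array methods (they return dummy bytes), so this shows only how a concurrent.futures.Future is consumed from asyncio, not how simplezarr itself is called.

```python
import asyncio
from concurrent.futures import Future, ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)


def get_chunk(index: int) -> bytes:
    # Hypothetical synchronous chunk read returning dummy data.
    return bytes([index]) * 4


def get_chunk_future(index: int) -> Future:
    # Hypothetical counterpart of a future-returning chunk read.
    return pool.submit(get_chunk, index)


async def main():
    loop = asyncio.get_running_loop()

    # Option 1: wrap each concurrent.futures.Future so it can be awaited.
    wrapped = [asyncio.wrap_future(get_chunk_future(i)) for i in range(3)]
    chunks = await asyncio.gather(*wrapped)

    # Option 2: a done-callback that hands the result back to the loop thread.
    received = []
    done = asyncio.Event()

    def on_done(fut: Future) -> None:
        # May run in a worker thread, so only touch the loop through
        # call_soon_threadsafe(). Callbacks run in scheduling order.
        loop.call_soon_threadsafe(received.append, fut.result())
        loop.call_soon_threadsafe(done.set)

    get_chunk_future(7).add_done_callback(on_done)
    await done.wait()
    return chunks, received


chunks, received = asyncio.run(main())
```

The first option is usually the more convenient one inside coroutines; the callback form is closer to what other event-loop frameworks would need.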