Tools for converting OME-Zarr data within the ome2024-ngff-challenge (see https://forum.image.sc/t/ome2024-ngff-challenge/97363)
ome2024-ngff-challenge
Project planning and material repository for the 2024 challenge to generate 1 PB of OME-Zarr data
Challenge overview
The high-level goal of the challenge is to generate OME-Zarr data according to a development version of the specification to drive forward the implementation work and establish a baseline for the conversion costs that members of the community can expect to incur.
Data generated within the challenge will have:
- all v2 arrays converted to v3, optionally sharding the data
- all `.zattrs` metadata migrated to `zarr.json["attributes"]["ome"]`
- a top-level `ro-crate-metadata.json` file with minimal metadata (specimen and imaging modality)
You can examine the contents of a sample dataset by using the MinIO client:
$ mc config host add uk1anon https://uk1s3.embassy.ebi.ac.uk "" ""
Added `uk1anon` successfully.
$ mc ls -r uk1anon/idr/share/ome2024-ngff-challenge/0.0.5/6001240.zarr/
[2024-08-01 14:24:35 CEST] 24MiB STANDARD 0/c/0/0/0/0
[2024-08-01 14:24:28 CEST] 598B STANDARD 0/zarr.json
[2024-08-01 14:24:32 CEST] 6.0MiB STANDARD 1/c/0/0/0/0
[2024-08-01 14:24:28 CEST] 598B STANDARD 1/zarr.json
[2024-08-01 14:24:29 CEST] 1.6MiB STANDARD 2/c/0/0/0/0
[2024-08-01 14:24:28 CEST] 592B STANDARD 2/zarr.json
[2024-08-01 14:24:28 CEST] 1.2KiB STANDARD ro-crate-metadata.json
[2024-08-01 14:24:28 CEST] 2.7KiB STANDARD zarr.json
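To look at the migrated metadata directly, you can stream the top-level `zarr.json` through the same alias. The example below assumes `jq` is installed (piping to `python -m json.tool` works just as well); per the layout described above, the OME metadata lives under `attributes.ome`:
mc cat uk1anon/idr/share/ome2024-ngff-challenge/0.0.5/6001240.zarr/zarr.json | jq '.attributes.ome'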
The dataset can be inspected using a development version of the OME-NGFF Validator available at https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://uk1s3.embassy.ebi.ac.uk/idr/share/ome2024-ngff-challenge/0.0.4/6001240.zarr
Converting your data
Getting started
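The converter is published on PyPI (this page describes version 0.0.5), so assuming a standard Python environment with pip available, a typical installation is:
pip install ome2024-ngff-challenge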
The `ome2024-ngff-challenge` script can be used to convert an OME-Zarr 0.4 dataset that is based on Zarr v2:
ome2024-ngff-challenge input.zarr output.zarr
If you would like to re-run the script with different parameters, you can additionally set `--output-overwrite` to ignore a previous conversion:
ome2024-ngff-challenge input.zarr output.zarr --output-overwrite
Reading/writing remotely
If you would like to avoid downloading and/or uploading the Zarr datasets, you can set S3 parameters on the command line, which will then treat the input and/or output datasets as a prefix within an S3 bucket:
ome2024-ngff-challenge \
--input-bucket=BUCKET \
--input-endpoint=HOST \
--input-anon \
input.zarr \
output.zarr
A small example you can try yourself:
ome2024-ngff-challenge \
--input-bucket=idr \
--input-endpoint=https://uk1s3.embassy.ebi.ac.uk \
--input-anon \
zarr/v0.4/idr0062A/6001240.zarr \
/tmp/6001240.zarr
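Writing the output to S3 works analogously. The sketch below assumes the output-side flags mirror the input-side ones (`--output-bucket`, `--output-endpoint`); consult the script's `--help` for the exact flag names and for how write credentials are picked up:
ome2024-ngff-challenge \
--output-bucket=BUCKET \
--output-endpoint=HOST \
input.zarr \
output.zarr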
Reading/writing via a script
Another R/W option is to have `resave.py` generate scripts for deferred execution: if you pass `--output-script`, then rather than generating the arrays immediately, a `convert.sh` file will be created which you can execute later.
For example, running:
ome2024-ngff-challenge dev2/input.zarr /tmp/scripts.zarr --output-script
produces a dataset with one `zarr.json` file and 3 `convert.sh` scripts:
/tmp/scripts.zarr/0/convert.sh
/tmp/scripts.zarr/1/convert.sh
/tmp/scripts.zarr/2/convert.sh
Each of the scripts contains a statement of the form:
zarrs_reencode --chunk-shape 1,1,275,271 --shard-shape 2,236,275,271 --dimension-names c,z,y,x --validate dev2/input.zarr /tmp/scripts.zarr
Running these scripts requires `zarrs_tools`, which can be installed with:
cargo install zarrs_tools
export PATH=$PATH:$HOME/.cargo/bin
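With `zarrs_tools` on your PATH, one way to execute all of the generated scripts is a small loop (a sketch; adjust the path to match your own output):
for script in /tmp/scripts.zarr/*/convert.sh; do
  bash "$script"
done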
Optimizing chunks and shards
Finally, there is not yet a single heuristic for determining chunk and shard sizes that works for all data. Pass the `--output-chunks` and `--output-shards` flags to set the chunk and shard sizes for all resolutions:
ome2024-ngff-challenge input.zarr output.zarr --output-chunks=1,1,1,256,256 --output-shards=1,1,1,2048,2048
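With the values above, each 2048×2048 shard holds 8×8 = 64 chunks of 256×256; shard dimensions should be whole multiples of the chunk dimensions so that chunks tile each shard exactly.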
Alternatively, you can use a JSON file to review and manually optimize the chunking and sharding parameters on a per-resolution basis:
ome2024-ngff-challenge input.zarr parameters.json --output-write-details
This will write a JSON file of the form:
[{"shape": [...], "chunks": [...], "shards": [...]}, ...
where the order of the dictionaries matches the order of the "datasets" field in the "multiscales" metadata. Edits to this file can be read back in using the `--output-read-details` flag:
ome2024-ngff-challenge input.zarr output.zarr --output-read-details=parameters.json
Note: Changes to the shape are ignored.
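Putting the two flags together, a full round trip might look like this (the `jq` command is only a hypothetical way to edit the file; any editor will do):
ome2024-ngff-challenge input.zarr parameters.json --output-write-details
# hypothetical edit: shrink the chunks of the first resolution
jq '.[0].chunks = [1, 1, 1, 256, 256]' parameters.json > edited.json
ome2024-ngff-challenge input.zarr output.zarr --output-read-details=edited.json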
Related work
The following additional PRs and issues are required to work with the data created by the scripts in this repository:
- https://github.com/ome/ome-ngff-validator/pull/36
- https://github.com/ome/ome-zarr-py/pull/383
- https://github.com/hms-dbmi/vizarr/pull/172
- https://github.com/LDeakin/zarrs_tools/issues/8
Slightly less related but important at the moment: