Tools for converting OME-Zarr data within the ome2024-ngff-challenge (see https://forum.image.sc/t/ome2024-ngff-challenge/97363)
ome2024-ngff-challenge
Project planning and material repository for the 2024 challenge to generate 1 PB of OME-Zarr data
Challenge overview
The high-level goal of the challenge is to generate OME-Zarr data according to a development version of the specification to drive forward the implementation work and establish a baseline for the conversion costs that members of the community can expect to incur.
Data generated within the challenge will have:
- all v2 arrays converted to v3, optionally sharding the data
- all .zattrs metadata migrated to zarr.json["attributes"]["ome"]
- a top-level ro-crate-metadata.json file with minimal metadata (specimen and imaging modality)
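The migrated metadata layout can be sketched in Python as follows; the nested structure is described above, but the version string and the multiscales content shown here are illustrative assumptions, not values from a real dataset:

```python
import json

# Sketch of a Zarr v3 zarr.json after migration: the former .zattrs
# content now lives under "attributes" -> "ome" (illustrative values).
zarr_json = {
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {
        "ome": {
            "version": "0.5",  # assumed development version string
            "multiscales": [{"axes": [], "datasets": []}],  # placeholder
        }
    },
}
print(json.dumps(zarr_json, indent=2))
```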
Converting your data
Getting started
The ome2024-ngff-challenge script can be used to convert an OME-Zarr 0.4 dataset that is based on Zarr v2:
ome2024-ngff-challenge input.zarr output.zarr
If you would like to re-run the script with different parameters, you can additionally set --output-overwrite to ignore a previous conversion:
ome2024-ngff-challenge input.zarr output.zarr --output-overwrite
Reading/writing remotely
If you would like to avoid downloading and/or uploading the Zarr datasets, you can set S3 parameters on the command line, which will then treat the input and/or output dataset as a prefix within an S3 bucket:
ome2024-ngff-challenge \
--input-bucket=BUCKET \
--input-endpoint=HOST \
--input-anon \
input.zarr \
output.zarr
A small example you can try yourself:
ome2024-ngff-challenge \
--input-bucket=idr \
--input-endpoint=https://uk1s3.embassy.ebi.ac.uk \
--input-anon \
zarr/v0.4/idr0062A/6001240.zarr \
/tmp/6001240.zarr
Reading/writing via a script
Another R/W option is to have resave.py generate scripts which you can execute later. If you pass --output-script, then rather than generating the arrays immediately, a file named convert.sh will be created for each array.
For example, running:
ome2024-ngff-challenge dev2/input.zarr /tmp/scripts.zarr --output-script
produces a dataset with one zarr.json file and 3 convert.sh scripts:
/tmp/scripts.zarr/0/convert.sh
/tmp/scripts.zarr/1/convert.sh
/tmp/scripts.zarr/2/convert.sh
Each of the scripts contains a statement of the form:
zarrs_reencode --chunk-shape 1,1,275,271 --shard-shape 2,236,275,271 --dimension-names c,z,y,x --validate dev2/input.zarr /tmp/scripts.zarr
Running these scripts requires zarrs_tools, which can be installed with:
cargo install zarrs_tools
export PATH=$PATH:$HOME/.cargo/bin
Optimizing chunks and shards
Finally, there is not yet a single heuristic for determining the chunk and shard sizes that will work for all data. Pass the --output-chunks and --output-shards flags to set the size of chunks and shards for all resolutions:
ome2024-ngff-challenge input.zarr output.zarr --output-chunks=1,1,1,256,256 --output-shards=1,1,1,2048,2048
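When choosing these values, each shard dimension must be an integer multiple of the corresponding chunk dimension so that whole chunks pack into shards. A small sanity check, sketched in Python (the helper name is ours, not part of the tool):

```python
def chunks_per_shard(chunk_shape, shard_shape):
    """Verify that each shard dimension is an integer multiple of the
    chunk dimension and return the number of chunks packed per shard."""
    count = 1
    for c, s in zip(chunk_shape, shard_shape):
        if s % c != 0:
            raise ValueError(f"shard dim {s} is not a multiple of chunk dim {c}")
        count *= s // c
    return count

# The values from the command above: 8 x 8 = 64 chunks per shard.
print(chunks_per_shard((1, 1, 1, 256, 256), (1, 1, 1, 2048, 2048)))
```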
Alternatively, you can use a JSON file to review and manually optimize the chunking and sharding parameters on a per-resolution basis:
ome2024-ngff-challenge input.zarr parameters.json --output-write-details
This will write a JSON file of the form:
[{"shape": [...], "chunks": [...], "shards": [...]}, ...]
where the order of the dictionaries matches the order of the "datasets" field in the "multiscales". Edits to this file can be read back in using the --output-read-details flag:
ome2024-ngff-challenge input.zarr output.zarr --output-read-details=parameters.json
Note: Changes to the shape are ignored.
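A small sketch of the review-and-edit round trip in Python; the shapes below mirror the example dataset's (c, z, y, x) dimensions but are illustrative, and in practice you would load the list from the parameters.json written by --output-write-details:

```python
import json

# Illustrative stand-in for json.load(open("parameters.json")).
details = [
    {"shape": [2, 236, 275, 271],
     "chunks": [1, 1, 275, 271],
     "shards": [2, 236, 275, 271]},
]

for entry in details:
    # Example tweak: chunk 4 z-planes together (236 is divisible by 4,
    # so shards remain whole multiples of chunks).
    entry["chunks"][1] = 4

# Write the edited parameters back for --output-read-details.
with open("parameters.json", "w") as f:
    json.dump(details, f, indent=2)
```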
Related work
The following additional PRs are required to work with the data created by the scripts in this repository:
- https://github.com/ome/ome-ngff-validator/pull/36
- https://github.com/ome/ome-zarr-py/pull/383
- https://github.com/hms-dbmi/vizarr/pull/172
- https://github.com/LDeakin/zarrs_tools/issues/8
Slightly less related but important at the moment:
Hashes for ome2024_ngff_challenge-0.0.4.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 7d4ad9bf4591a8128eddbd37453bac78a6d47551deca026b25c6c94bcb20bf04
MD5 | 86f2f137cb79f850390597e1d86f3e5d
BLAKE2b-256 | 28d0baba901265ec9eba47235745d88e1095141c05418e6c702827f02d727519

Hashes for ome2024_ngff_challenge-0.0.4-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 343ad9ea761c58d9176de47b182572dc2ef85bd8e67e9e1b7e3cef4a1ddcdb97
MD5 | ca7a93d9f4b9e1eb55a4bc78c6017483
BLAKE2b-256 | 26e0a1134c0e5d0c9104af8de34be02894978f3bb7833b1c502860e9cc1ff58c