Tools for converting OME-Zarr data within the ome2024-ngff-challenge (see https://forum.image.sc/t/ome2024-ngff-challenge/97363)

Project description

ome2024-ngff-challenge

Project planning and material repository for the 2024 challenge to generate 1 PB of OME-Zarr data

Challenge overview

The high-level goal of the challenge is to generate OME-Zarr data according to a development version of the specification to drive forward the implementation work and establish a baseline for the conversion costs that members of the community can expect to incur.

Data generated within the challenge will have:

  • all v2 arrays converted to v3, optionally sharding the data
  • all .zattrs metadata migrated to zarr.json["attributes"]["ome"]
  • a top-level ro-crate-metadata.json file with minimal metadata (specimen and imaging modality)
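
As a rough illustration of the metadata migration listed above, the sketch below nests the content of a v2 .zattrs file under the "ome" key of a Zarr v3 group document. The function name and the trimmed example metadata are hypothetical; the actual tool may adjust further fields, so treat this as a picture of the target layout rather than the tool's implementation.

```python
import json


def migrate_group_metadata(zattrs: dict) -> dict:
    """Build a Zarr v3 group document with the v2 .zattrs content under "ome".

    Assumption: only the nesting is shown here; any other changes the
    real conversion makes to the metadata are not reproduced.
    """
    return {
        "zarr_format": 3,
        "node_type": "group",
        "attributes": {"ome": dict(zattrs)},
    }


# Hypothetical, heavily trimmed .zattrs content, for illustration only.
example_zattrs = {"multiscales": [{"datasets": [{"path": "0"}, {"path": "1"}]}]}

zarr_json = migrate_group_metadata(example_zattrs)
print(json.dumps(zarr_json, indent=2))
```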

You can examine the contents of a sample dataset by using the minio client:

$ mc config host add uk1anon https://uk1s3.embassy.ebi.ac.uk "" ""
Added `uk1anon` successfully.
$ mc ls -r uk1anon/idr/share/ome2024-ngff-challenge/0.0.5/6001240.zarr/
[2024-08-01 14:24:35 CEST]  24MiB STANDARD 0/c/0/0/0/0
[2024-08-01 14:24:28 CEST]   598B STANDARD 0/zarr.json
[2024-08-01 14:24:32 CEST] 6.0MiB STANDARD 1/c/0/0/0/0
[2024-08-01 14:24:28 CEST]   598B STANDARD 1/zarr.json
[2024-08-01 14:24:29 CEST] 1.6MiB STANDARD 2/c/0/0/0/0
[2024-08-01 14:24:28 CEST]   592B STANDARD 2/zarr.json
[2024-08-01 14:24:28 CEST] 1.2KiB STANDARD ro-crate-metadata.json
[2024-08-01 14:24:28 CEST] 2.7KiB STANDARD zarr.json

The dataset (from idr0062) can be inspected using a development version of the OME-NGFF Validator available at https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://uk1s3.embassy.ebi.ac.uk/idr/share/ome2024-ngff-challenge/0.0.5/6001240.zarr
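
The same sample can also be inspected from Python using only the standard library. This is a minimal sketch, assuming the bucket is readable anonymously over HTTPS with path-style URLs (endpoint/bucket/key), as the validator link above suggests. The helper names are hypothetical, and the fetch itself is left commented out because it needs network access.

```python
import json
from urllib.request import urlopen

ENDPOINT = "https://uk1s3.embassy.ebi.ac.uk"
BUCKET = "idr"
PREFIX = "share/ome2024-ngff-challenge/0.0.5/6001240.zarr"


def object_url(endpoint: str, bucket: str, key: str) -> str:
    """Join endpoint, bucket and key into a path-style object URL."""
    return f"{endpoint.rstrip('/')}/{bucket}/{key.lstrip('/')}"


def read_ome_attributes(endpoint: str, bucket: str, prefix: str) -> dict:
    """Fetch <prefix>/zarr.json and return attributes["ome"] (needs network)."""
    with urlopen(object_url(endpoint, bucket, f"{prefix}/zarr.json")) as response:
        document = json.load(response)
    return document["attributes"]["ome"]


print(object_url(ENDPOINT, BUCKET, f"{PREFIX}/zarr.json"))
# To fetch for real: ome = read_ome_attributes(ENDPOINT, BUCKET, PREFIX)
```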

Other samples:

Details on the creation of these samples:

4496763.zarr was created with ome2024-ngff-challenge commit 0e1809bf3b.

First the config details were generated with:

$ ome2024-ngff-challenge --input-bucket=idr --input-endpoint=https://uk1s3.embassy.ebi.ac.uk --input-anon zarr/v0.4/idr0047A/4496763.zarr params_4496763.json --output-write-details

The params_4496763.json file was then edited to set "shards" to [4, 1, sizeY, sizeX] for each pyramid resolution, creating a single shard for each Z section.

# params_4496763.json
[{"shape": [4, 25, 2048, 2048], "chunks": [1, 1, 2048, 2048], "shards": [4, 1, 2048, 2048]},
 {"shape": [4, 25, 1024, 1024], "chunks": [1, 1, 1024, 1024], "shards": [4, 1, 1024, 1024]},
 {"shape": [4, 25, 512, 512], "chunks": [1, 1, 512, 512], "shards": [4, 1, 512, 512]},
 {"shape": [4, 25, 256, 256], "chunks": [1, 1, 256, 256], "shards": [4, 1, 256, 256]},
 {"shape": [4, 25, 128, 128], "chunks": [1, 1, 128, 128], "shards": [4, 1, 128, 128]},
 {"shape": [4, 25, 64, 64], "chunks": [1, 1, 64, 64], "shards": [4, 1, 64, 64]}]
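
The manual edit described above could also be scripted. The following sketch assumes a 4D (C, Z, Y, X) axis order as in this sample and sets each level's "shards" to all channels of a single Z section. The function name is hypothetical, and the initial "shards" values in the example are placeholders, not what the tool actually writes.

```python
import json


def one_shard_per_z(levels: list) -> list:
    """Return a copy of levels with "shards" set to [C, 1, sizeY, sizeX].

    Assumption: every "shape" is ordered (C, Z, Y, X).
    """
    updated = []
    for level in levels:
        channels, _size_z, size_y, size_x = level["shape"]
        updated.append({**level, "shards": [channels, 1, size_y, size_x]})
    return updated


# Two of the six levels; the starting "shards" values are placeholders.
levels = [
    {"shape": [4, 25, 2048, 2048], "chunks": [1, 1, 2048, 2048], "shards": [1, 1, 2048, 2048]},
    {"shape": [4, 25, 64, 64], "chunks": [1, 1, 64, 64], "shards": [1, 1, 64, 64]},
]

edited = one_shard_per_z(levels)
print(json.dumps(edited))
```

To apply this to a real file, load it with json.load, pass the list through the function, and write the result back with json.dump before running the conversion.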

This was then used to run the conversion:

ome2024-ngff-challenge --input-bucket=idr --input-endpoint=https://uk1s3.embassy.ebi.ac.uk --input-anon zarr/v0.4/idr0047A/4496763.zarr 4496763.zarr --output-read-details params_4496763.json

9822152.zarr was created with ome2024-ngff-challenge commit f17a6de963.

The chunk and shard shapes are specified to be the same for all resolution levels. This is required because the smaller resolution levels of the source image at https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0083A/9822152.zarr have chunks that match the shape of the resolution level, e.g. 1,1,1,91,141, and these fail to convert with a shard shape of 1,1,1,4096,4096.
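
A likely reason for the failure is that Zarr v3 sharding requires the shard shape to be a whole multiple of the chunk shape along every axis. The check below is a minimal sketch of that rule; the function name is hypothetical and this is not the validation code used by the tool.

```python
def shard_fits_chunks(chunks: list, shards: list) -> bool:
    """True if every shard dimension is a whole multiple of the chunk dimension."""
    return len(chunks) == len(shards) and all(
        shard % chunk == 0 for chunk, shard in zip(chunks, shards)
    )


# Chunks inherited from a small resolution level do not divide the shard.
print(shard_fits_chunks([1, 1, 1, 91, 141], [1, 1, 1, 4096, 4096]))  # False

# The explicit chunk shape used in the command below does.
print(shard_fits_chunks([1, 1, 1, 1024, 1024], [1, 1, 1, 4096, 4096]))  # True
```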

The conversion took 34 minutes with this command:

$ ome2024-ngff-challenge --input-bucket=idr --input-endpoint=https://uk1s3.embassy.ebi.ac.uk --input-anon zarr/v0.4/idr0083A/9822152.zarr 9822152.zarr --output-shards=1,1,1,4096,4096 --output-chunks=1,1,1,1024,1024 --log debug

This conversion took 9 hours:

$ ome2024-ngff-challenge 9846151.zarr/0 will/9846151_2D_chunks_3.zarr --output-shards=1,1,1,4096,4096 --output-chunks=1,1,1,1024,1024 --log debug

A plate conversion took 19 minutes, using a shard shape large enough to contain a whole image. The image shape is 1,3,1,1024,1280.

$ ome2024-ngff-challenge --input-bucket=bia-integrator-data --input-endpoint=https://uk1s3.embassy.ebi.ac.uk --input-anon S-BIAD847/0762bf96-4f01-454d-9b13-5c8438ea384f/0762bf96-4f01-454d-9b13-5c8438ea384f.zarr /data/will/idr0035/Week9_090907.zarr --output-shards=1,3,1,1024,2048 --output-chunks=1,1,1,1024,1024 --log debug
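
The shard shape in the plate command can be derived by rounding each image dimension up to the next multiple of the chunk size. This is a sketch of that arithmetic with a hypothetical helper name, not a heuristic built into the tool.

```python
def covering_shard(image_shape: list, chunks: list) -> list:
    """Smallest shard shape that is a multiple of chunks and covers the image."""
    # -(-a // b) is integer ceiling division.
    return [-(-size // chunk) * chunk for size, chunk in zip(image_shape, chunks)]


print(covering_shard([1, 3, 1, 1024, 1280], [1, 1, 1, 1024, 1024]))
# [1, 3, 1, 1024, 2048]
```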

Converting your data

The ome2024-ngff-challenge tool can be used to convert an OME-Zarr 0.4 dataset that is based on Zarr v2. The input data will not be modified in any way and a full copy of the data will be created at the chosen location.

Getting started

ome2024-ngff-challenge input.zarr output.zarr

is the most basic invocation of the tool. If you would like to re-run the script with different parameters, you can additionally set --output-overwrite to ignore a previous conversion:

ome2024-ngff-challenge input.zarr output.zarr --output-overwrite

Writing in parallel

By default, 16 chunks of data will be processed simultaneously in order to bound memory usage. You can increase this number based on your local resources:

ome2024-ngff-challenge input.zarr output.zarr --output-threads=128
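
When choosing a thread count, a back-of-the-envelope estimate of the uncompressed data held in flight can help. The sketch below simply multiplies the number of simultaneous chunks by the chunk size. The chunk shape and the 2-byte (uint16) pixel type are assumptions for illustration, and the tool's real memory use may be higher because of buffering and compression.

```python
from math import prod


def in_flight_bytes(chunk_shape: list, bytes_per_value: int, threads: int = 16) -> int:
    """Rough lower bound: uncompressed bytes of `threads` chunks held at once."""
    return threads * prod(chunk_shape) * bytes_per_value


chunk_shape = [1, 1, 1, 1024, 1024]  # hypothetical chunk shape
default_mib = in_flight_bytes(chunk_shape, 2) / 2**20
raised_mib = in_flight_bytes(chunk_shape, 2, threads=128) / 2**20
print(f"16 threads: {default_mib:.0f} MiB, 128 threads: {raised_mib:.0f} MiB")
# 16 threads: 32 MiB, 128 threads: 256 MiB
```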

Reading/writing remotely

If you would like to avoid downloading and/or uploading the Zarr datasets, you can set S3 parameters on the command line, which will then treat the input and/or output datasets as a prefix within an S3 bucket:

ome2024-ngff-challenge \
        --input-bucket=BUCKET \
        --input-endpoint=HOST \
        --input-anon \
        input.zarr \
        output.zarr

A small example you can try yourself:

ome2024-ngff-challenge \
        --input-bucket=idr \
        --input-endpoint=https://uk1s3.embassy.ebi.ac.uk \
        --input-anon \
        zarr/v0.4/idr0062A/6001240.zarr \
        /tmp/6001240.zarr

Reading/writing via a script

Another R/W option is to have resave.py generate a script instead of converting immediately. If you pass --output-script, the arrays are not written; a file named convert.sh is created which you can execute later.

For example, running:

ome2024-ngff-challenge dev2/input.zarr /tmp/scripts.zarr --output-script

produces a dataset with one zarr.json file and 3 convert.sh scripts:

/tmp/scripts.zarr/0/convert.sh
/tmp/scripts.zarr/1/convert.sh
/tmp/scripts.zarr/2/convert.sh

Each of the scripts contains a statement of the form:

zarrs_reencode --chunk-shape 1,1,275,271 --shard-shape 2,236,275,271 --dimension-names c,z,y,x --validate dev2/input.zarr /tmp/scripts.zarr

Running this script will require having installed zarrs_tools with:

cargo install zarrs_tools
export PATH=$PATH:$HOME/.cargo/bin
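
Once zarrs_tools is installed, the generated scripts can be run one after another. The helper below is a sketch, assuming each convert.sh can be executed with sh and that zarrs_reencode is on the PATH; both function names are hypothetical. Because the scripts may contain relative paths such as dev2/input.zarr, run it from the directory where the tool was originally invoked.

```python
import subprocess
from pathlib import Path


def find_convert_scripts(output_root: str) -> list:
    """Return all convert.sh files below output_root, in sorted order."""
    return sorted(Path(output_root).rglob("convert.sh"))


def run_convert_scripts(output_root: str) -> None:
    """Run each script in turn, stopping at the first failure."""
    for script in find_convert_scripts(output_root):
        print(f"running {script}")
        subprocess.run(["sh", str(script)], check=True)


# Example: run_convert_scripts("/tmp/scripts.zarr")
```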

Optimizing chunks and shards

Finally, there is not yet a single heuristic for determining the chunk and shard sizes that will work for all data. Pass the --output-chunks and --output-shards flags in order to set the size of chunks and shards for all resolutions:

ome2024-ngff-challenge input.zarr output.zarr --output-chunks=1,1,1,256,256 --output-shards=1,1,1,2048,2048
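
To get a feel for a given combination of flags, it can help to count how many chunks end up inside each shard. The sketch below parses the comma-separated values as they are passed on the command line and multiplies the per-axis ratios; the function names are hypothetical.

```python
from math import prod


def parse_shape(value: str) -> list:
    """Turn a comma-separated flag value such as "1,1,1,256,256" into a list of ints."""
    return [int(part) for part in value.split(",")]


def chunks_per_shard(chunks: list, shards: list) -> int:
    """Number of chunks stored in one full shard."""
    if len(chunks) != len(shards) or any(s % c for c, s in zip(chunks, shards)):
        raise ValueError("shards must be a whole multiple of chunks on every axis")
    return prod(s // c for c, s in zip(chunks, shards))


count = chunks_per_shard(parse_shape("1,1,1,256,256"), parse_shape("1,1,1,2048,2048"))
print(count)  # 64
```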

Alternatively, you can use a JSON file to review and manually optimize the chunking and sharding parameters on a per-resolution basis:

ome2024-ngff-challenge input.zarr parameters.json --output-write-details

This will write a JSON file of the form:

[{"shape": [...], "chunks": [...], "shards": [...]}, ...

where the order of the dictionaries matches the order of the "datasets" field in the "multiscales". Edits to this file can be read back in using the --output-read-details flag:

ome2024-ngff-challenge input.zarr output.zarr --output-read-details=parameters.json

Note: Changes to the shape are ignored.
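
Because the entries are matched to resolution levels purely by position, it can be useful to label each one with its dataset path while reviewing the file. This sketch assumes the first "multiscales" entry is the one being converted; the function name and the trimmed example data are hypothetical.

```python
def label_levels(ome_attributes: dict, params: list) -> dict:
    """Map each dataset path to its chunk and shard settings, by position."""
    datasets = ome_attributes["multiscales"][0]["datasets"]
    if len(datasets) != len(params):
        raise ValueError("parameters file and multiscales datasets differ in length")
    return {
        dataset["path"]: {"chunks": entry["chunks"], "shards": entry["shards"]}
        for dataset, entry in zip(datasets, params)
    }


# Hypothetical, trimmed metadata and parameters for illustration only.
ome = {"multiscales": [{"datasets": [{"path": "0"}, {"path": "1"}]}]}
params = [
    {"shape": [1, 512, 512], "chunks": [1, 256, 256], "shards": [1, 512, 512]},
    {"shape": [1, 256, 256], "chunks": [1, 256, 256], "shards": [1, 256, 256]},
]

labelled = label_levels(ome, params)
print(labelled)
```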

Related work

The following additional PRs are required to work with the data created by the scripts in this repository:

Slightly less related but important at the moment:
