Zarr-based dataset for PyTorch training pipelines. Written and maintained by the Research IT team at The Jackson Laboratory.

ZarrDataset

A class for handling large-volume datasets stored in OME-NGFF Zarr format. It is intended primarily for use with PyTorch's DataLoader in machine learning training workflows.

Usage

import zarrdataset as zds

# Open a set of Zarr files stored locally or in an S3 bucket. You must specify
# the group/component where the arrays are stored within the Zarr file, and
# the order of the axes of the dataset.
my_dataset = zds.ZarrDataset(
  dict(
    modality="images",
    filenames=["https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/9836839.zarr"],
    source_axes="TCZYX",
    data_group="0"
  ),
)
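
To make the role of source_axes more concrete, the sketch below shows how an axis-order string such as "TCZYX" can be read as a mapping from axis names to array dimensions, and how a permutation to a different axis order can be derived from it. This is a plain-Python illustration only: the helpers axis_positions and axes_permutation are hypothetical names that are not part of zarrdataset, and the array shape is made up.

```python
def axis_positions(source_axes):
    """Map each axis name to its dimension index, e.g. 'T' -> 0 for 'TCZYX'."""
    if len(set(source_axes)) != len(source_axes):
        raise ValueError(f"repeated axis name in {source_axes!r}")
    return {axis: index for index, axis in enumerate(source_axes)}


def axes_permutation(source_axes, target_axes):
    """Return the dimension order that rearranges source_axes into target_axes."""
    if sorted(source_axes) != sorted(target_axes):
        raise ValueError("source and target must contain the same axes")
    positions = axis_positions(source_axes)
    return [positions[axis] for axis in target_axes]


# Hypothetical shape of a five-dimensional image stored in TCZYX order.
shape = (1, 3, 10, 512, 512)

# Move the channel axis to the end (TCZYX -> TZYXC).
perm = axes_permutation("TCZYX", "TZYXC")
permuted_shape = tuple(shape[i] for i in perm)

print(perm)            # [0, 2, 3, 4, 1]
print(permuted_shape)  # (1, 10, 512, 512, 3)
```

A permutation in this form is what numpy.transpose accepts as its axes argument, which is one common way to reorder the dimensions of an array once it has been loaded.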

Integration

The ZarrDataset class is derived from PyTorch's IterableDataset class, and can be used with a DataLoader object to generate batches of inputs for machine learning training workflows.

from torch.utils.data import DataLoader
import zarrdataset as zds

my_dataset = zds.ZarrDataset(...)

# Generate batches of 16 images, using four worker processes.
# Pass the worker initialization function from zarrdataset to the DataLoader
my_dataloader = DataLoader(my_dataset,
                           batch_size=16,
                           num_workers=4,
                           worker_init_fn=zds.zarrdataset_worker_init_fn)

for x, t in my_dataloader:
    # The training loop
    ...
    output = model(x)
    loss = criterion(output, t)
    ...

Multi-worker data loading

Loading data with multiple DataLoader workers (num_workers greater than zero) requires passing the zarrdataset_worker_init_fn function provided in this package as the worker_init_fn argument. This lets each worker load only a fraction of the dataset instead of the full dataset.
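
The reason a per-worker split is needed is that each DataLoader worker receives its own copy of an iterable-style dataset, so without one every worker would yield the same samples. The sketch below illustrates the general idea with a round-robin split of a file list. It is not the implementation of zarrdataset_worker_init_fn: the helper shard_for_worker and the file names are hypothetical, and the package may split the data at a different granularity. In PyTorch, the worker id and the number of workers are available inside a worker through torch.utils.data.get_worker_info().

```python
def shard_for_worker(items, worker_id, num_workers):
    """Return the subset of items handled by one worker (round-robin split)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    # Worker k takes items k, k + num_workers, k + 2 * num_workers, ...
    return items[worker_id::num_workers]


# Hypothetical list of Zarr files to distribute among four workers.
filenames = [f"image_{i}.zarr" for i in range(10)]
num_workers = 4

shards = [shard_for_worker(filenames, w, num_workers)
          for w in range(num_workers)]

for worker_id, shard in enumerate(shards):
    print(worker_id, shard)
```

With this kind of split, the shards are disjoint and together cover the whole list, so each sample is produced once per epoch regardless of the number of workers.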

Patch sampling

By default, ZarrDataset retrieves the whole array contained in data_group. To retrieve patches from that array instead, use either of the two samplers provided with this package, or implement a custom one derived from the PatchSampler class.

The two existing samplers are PatchSampler and BlueNoisePatchSampler. PatchSampler retrieves patches from an evenly spaced grid of non-overlapping square patches of side patch_size. BlueNoisePatchSampler retrieves patches of side patch_size from random locations following blue-noise sampling. A patch sampler can be integrated into a ZarrDataset object as follows.

import zarrdataset as zds

# Retrieve patches of size patch_size in an evenly spaced grid from the image.
my_patch_sampler = zds.PatchSampler(patch_size)

my_dataset = zds.ZarrDataset(...,
                             patch_sampler=my_patch_sampler)
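
To illustrate the difference between the two sampling strategies, the following plain-Python sketch computes patch origins (top-left corners) for a two-dimensional image. It is a conceptual approximation under stated assumptions, not the package's code: the functions grid_patch_origins and blue_noise_patch_origins are hypothetical, the grid version simply drops partial patches at the image border, and the blue-noise version uses basic dart throwing with a minimum distance of patch_size between origins, which may differ from what BlueNoisePatchSampler actually does.

```python
import random


def grid_patch_origins(height, width, patch_size):
    """Top-left corners of non-overlapping square patches on a regular grid."""
    return [(y, x)
            for y in range(0, height - patch_size + 1, patch_size)
            for x in range(0, width - patch_size + 1, patch_size)]


def blue_noise_patch_origins(height, width, patch_size,
                             max_tries=1000, seed=0):
    """Random top-left corners kept at least patch_size apart (dart throwing)."""
    if patch_size > min(height, width):
        return []
    rng = random.Random(seed)
    origins = []
    for _ in range(max_tries):
        y = rng.randint(0, height - patch_size)
        x = rng.randint(0, width - patch_size)
        # Accept the candidate only if its Euclidean distance to every
        # previously accepted origin is at least patch_size.
        if all((y - oy) ** 2 + (x - ox) ** 2 >= patch_size ** 2
               for oy, ox in origins):
            origins.append((y, x))
    return origins


# A 4 x 6 image split into 2 x 2 patches gives a 2 x 3 grid of origins.
print(grid_patch_origins(4, 6, 2))

# Randomly placed origins on a 64 x 64 image with patches of side 16.
print(blue_noise_patch_origins(64, 64, 16))
```

The grid version visits every location exactly once in a fixed order, while the dart-throwing version gives randomly placed origins that stay spread out instead of clustering.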

Examples of integrating the ZarrDataset class with PyTorch's DataLoader can be found in the documentation.

Installation

This package can be installed from PyPI with the following command:

pip install zarrdataset
