Zarr-based dataset for PyTorch training pipelines. Written and maintained by the Research IT team at The Jackson Laboratory.
ZarrDataset
A class for handling large-volume datasets stored in OME-NGFF Zarr format. It is intended primarily for use with PyTorch's DataLoader in machine learning training workflows.
Usage
import zarrdataset as zds

# Open a set of Zarr files stored locally or in an S3 bucket. You must specify
# the group/component where the arrays are stored within the Zarr file, and
# the order of the axes of the dataset.
my_dataset = zds.ZarrDataset(
    dict(
        modality="images",
        filenames=["https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/9836839.zarr"],
        source_axes="TCZYX",
        data_group="0"
    ),
)
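The source_axes string names each dimension of the stored array in order. As a rough illustration of what that order implies, the sketch below computes the permutation that rearranges one axis order into another. It is independent of zarrdataset: the helper name axis_permutation, the target order "TZYXC", and the array shape are hypothetical and chosen only for this example.

```python
def axis_permutation(source_axes, target_axes):
    """Return the index order that rearranges source_axes into target_axes."""
    if sorted(source_axes) != sorted(target_axes):
        raise ValueError("both strings must name the same set of axes")
    return tuple(source_axes.index(axis) for axis in target_axes)


# Hypothetical shape for an array stored with axes "TCZYX".
source_shape = (1, 2, 16, 512, 512)

# Move the channel axis to the end ("TZYXC").
permutation = axis_permutation("TCZYX", "TZYXC")
target_shape = tuple(source_shape[i] for i in permutation)

print(permutation)
print(target_shape)
```

The resulting tuple can be passed to a transpose call such as numpy.transpose to reorder an in-memory array.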
Integration
The ZarrDataset class is derived from PyTorch's IterableDataset class, and can be used with a DataLoader object to generate batches of inputs for machine learning training workflows.
from torch.utils.data import DataLoader
import zarrdataset as zds
my_dataset = zds.ZarrDataset(...)
# Generate batches of 16 images, using four workers.
# Pass the worker initialization function from zarrdataset to the DataLoader.
my_dataloader = DataLoader(my_dataset,
                           batch_size=16,
                           num_workers=4,
                           worker_init_fn=zds.zarrdataset_worker_init_fn)

for x, t in my_dataloader:
    # The training loop
    ...
    output = model(x)
    loss = criterion(output, t)
    ...
Multithread data loading
Loading data with multiple DataLoader workers (num_workers greater than zero) requires the zarrdataset_worker_init_fn function provided in this package. With it, each worker loads only a fraction of the dataset instead of the full dataset.
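To illustrate the idea of giving each worker only a fraction of the dataset, here is a minimal sketch of a strided split over a list of files. It is not the implementation of zarrdataset_worker_init_fn; the helper shard_for_worker and the file names are hypothetical, and the package may partition the data differently.

```python
def shard_for_worker(items, worker_id, num_workers):
    """Return the subset of items assigned to one worker (strided split)."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in the range [0, num_workers)")
    return items[worker_id::num_workers]


# Hypothetical list of ten files shared among four workers.
filenames = [f"image_{i}.zarr" for i in range(10)]
num_workers = 4

shards = [shard_for_worker(filenames, worker_id, num_workers)
          for worker_id in range(num_workers)]

for worker_id, shard in enumerate(shards):
    print(worker_id, shard)
```

Every file ends up in exactly one shard, so no sample is produced twice across workers.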
Patch sampling
By default, ZarrDataset retrieves the whole array contained in data_group. To retrieve patches from that array instead, use either of the two samplers provided in this package, or implement a custom one derived from the PatchSampler class.
The two existing samplers are PatchSampler and BlueNoisePatchSampler. PatchSampler retrieves patches from an evenly spaced grid of non-overlapping square patches of side patch_size. BlueNoisePatchSampler retrieves patches of side patch_size from random locations chosen by blue-noise sampling. A patch sampler can be integrated into a ZarrDataset object as follows.
import zarrdataset as zds
# Retrieve patches of size patch_size in an evenly spaced grid from the image.
my_patch_sampler = zds.PatchSampler(patch_size)
my_dataset = zds.ZarrDataset(...,
                             patch_sampler=my_patch_sampler)
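To make the grid of non-overlapping square patches concrete, the following sketch lists the top-left corners of such a grid for a 2D image. It does not reproduce the internals of PatchSampler: the helper grid_patch_corners is hypothetical, and dropping partial patches at the image border is an assumption made for this example only.

```python
def grid_patch_corners(height, width, patch_size):
    """List (y, x) top-left corners of non-overlapping square patches.

    Only patches that fit entirely inside the image are returned
    (an assumption of this sketch).
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    return [(y, x)
            for y in range(0, height - patch_size + 1, patch_size)
            for x in range(0, width - patch_size + 1, patch_size)]


# Hypothetical 100 x 70 image split into patches of side 32.
corners = grid_patch_corners(height=100, width=70, patch_size=32)
for y, x in corners:
    print(f"patch rows {y}:{y + 32}, columns {x}:{x + 32}")
```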
Examples of integrating the ZarrDataset class with PyTorch's DataLoader can be found in the [documentation](https://thejacksonlaboratory.github.io/zarrdataset/index.html).
Installation
This package can be installed from PyPI with the following command:
pip install zarrdataset