
A minibatch loader for AnnData stores

Project description

annbatch

[!CAUTION] This package does not have a stable API. However, we do not anticipate that the on-disk format will change in an incompatible manner.


A data loader and I/O utilities for minibatching on-disk AnnData, co-developed by Lamin and scverse.

Getting started

Please refer to the documentation, in particular the API documentation.

Installation

You need to have Python 3.12 or newer installed on your system. If you don't have Python installed, we recommend installing uv.

To install the latest release of annbatch from PyPI:

pip install annbatch

We provide extras in the `pyproject.toml` for torch, cupy-cuda12, cupy-cuda13, and zarrs-python. The cupy extras accelerate handling of the data via `preload_to_gpu` once it has been read off disk, and can be used independently of torch.
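As a quick illustration, the extras above would be installed like this (extra names taken from the list above; consult the `pyproject.toml` if they differ):

```shell
# Optional extras (names as listed above; check pyproject.toml for the
# authoritative set). Quoting avoids shell globbing on the brackets.
pip install "annbatch[torch]"        # torch DataLoader integration
pip install "annbatch[cupy-cuda12]"  # GPU preloading via preload_to_gpu (CUDA 12)
pip install "annbatch[zarrs]"        # fast sharded reads on a local filesystem
```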

[!IMPORTANT] zarrs-python gives the necessary performance boost for the sharded data produced by our preprocessing functions to be useful when loading data off a local filesystem.

Basic usage example

Basic preprocessing:

from annbatch import create_anndata_collection

import zarr
from pathlib import Path

# Using zarrs is necessary for local-filesystem performance.
# Ensure you installed it via our `[zarrs]` extra, i.e. `pip install annbatch[zarrs]`, to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)

create_anndata_collection(
    adata_paths=[
        "path/to/your/file1.h5ad",
        "path/to/your/file2.h5ad"
    ],
    output_path="path/to/output/collection", # a directory containing `dataset_{i}.zarr`
    shuffle=True,  # shuffling is needed if you want to use chunked access
)
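To see why `shuffle=True` matters for chunked access, here is a minimal pure-Python sketch (illustrative only, not annbatch internals): because rows are permuted once at write time, reading contiguous chunks in random order at load time already yields an approximately uniform sample, with no row-level random access needed.

```python
import random

# Illustrative stand-in for a pre-shuffled collection: rows are permuted
# once at write time (what `shuffle=True` above does on disk), so any
# contiguous chunk is already a random subset of the original data.
rows = list(range(1024))
random.shuffle(rows)

chunk_size = 32
chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

# At load time it then suffices to visit whole chunks in random order;
# each chunk read is a single contiguous (fast) I/O operation.
random.shuffle(chunks)
minibatch = [row for chunk in chunks[:4] for row in chunk]  # 4 chunks -> 128 rows
```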

Data loading:

from pathlib import Path

from annbatch import ZarrSparseDataset
import anndata as ad
import zarr

# Using zarrs is necessary for local-filesystem performance.
# Ensure you installed it via our `[zarrs]` extra, i.e. `pip install annbatch[zarrs]`, to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)

ds = ZarrSparseDataset(
    batch_size=4096,
    chunk_size=32,
    preload_nchunks=256,
).add_anndatas(
    [
        ad.AnnData(
            # note that you can open an AnnData file using any type of zarr store
            X=ad.io.sparse_dataset(zarr.open(p)["X"]),
            obs=ad.io.read_elem(zarr.open(p)["obs"]),
        )
        for p in Path("path/to/output/collection").glob("*.zarr")
    ],
    obs_keys="label_column",
)

# Iterate over the dataset (drop-in replacement for torch.utils.data.DataLoader)
for batch in ds:
    ...
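The three constructor parameters above can be read as a two-level buffering scheme. The following is a simplified sketch of that idea (illustrative only, not the actual annbatch implementation): groups of `preload_nchunks` chunks of `chunk_size` rows are fetched into a pool, and batches of `batch_size` rows are drained from it.

```python
def iter_batches(n_rows, batch_size, chunk_size, preload_nchunks):
    """Sketch of two-level buffered iteration: fetch whole chunks off
    disk in groups, then emit fixed-size batches from the in-memory pool."""
    chunk_starts = list(range(0, n_rows, chunk_size))
    pool = []
    for i in range(0, len(chunk_starts), preload_nchunks):
        # "Read" a group of chunks in one go (here: just row indices).
        for start in chunk_starts[i:i + preload_nchunks]:
            pool.extend(range(start, min(start + chunk_size, n_rows)))
        # Drain full batches from the pool.
        while len(pool) >= batch_size:
            yield pool[:batch_size]
            del pool[:batch_size]
    if pool:  # final partial batch
        yield pool

batches = list(iter_batches(n_rows=100, batch_size=32, chunk_size=8, preload_nchunks=4))
```

With these toy numbers, each group of 4 chunks supplies exactly one 32-row batch, and the trailing 4 rows come out as a final partial batch.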

For usage of our loader inside torch, please see this note for more info. At a minimum, be aware that deadlocking will occur on Linux unless you pass `multiprocessing_context="spawn"` to the `DataLoader`.

For a deeper dive into this example, please see the in-depth section of our docs.

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Project details


Download files

Download the file for your platform.

Source Distribution

annbatch-0.0.1.tar.gz (226.4 kB)

Uploaded Source

Built Distribution


annbatch-0.0.1-py3-none-any.whl (25.2 kB)

Uploaded Python 3

File details

Details for the file annbatch-0.0.1.tar.gz.

File metadata

  • Download URL: annbatch-0.0.1.tar.gz
  • Upload date:
  • Size: 226.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for annbatch-0.0.1.tar.gz:

  • SHA256: 4503c5f2b1df925549fea69f8b2731a7eaefbf65757f5b2b2f029da7eb88a0b8
  • MD5: 08d36f47977f7e5d396a842d4bc3c1d0
  • BLAKE2b-256: 206c2ceee1c79a949585d77ed830f72068c8c9f8102db5c15fb4c074188b5fef


Provenance

The following attestation bundles were made for annbatch-0.0.1.tar.gz:

Publisher: release.yaml on scverse/annbatch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file annbatch-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: annbatch-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 25.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for annbatch-0.0.1-py3-none-any.whl:

  • SHA256: adb05bafb78c6c41977e5503efc5c33d8aceb6db401884c32018cecd97dd58a9
  • MD5: 0c572977c480403ce20efa62dbfa2f05
  • BLAKE2b-256: 094d687a79504f9d546093319464134af3c7f6df39c2a918db727749b5865276


Provenance

The following attestation bundles were made for annbatch-0.0.1-py3-none-any.whl:

Publisher: release.yaml on scverse/annbatch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
