A minibatch loader for AnnData stores

annbatch

[!IMPORTANT] Until its first major release, this package will make breaking changes only in minor version releases.

A data loader and I/O utilities for minibatched loading of on-disk AnnData files, co-developed by Lamin Labs and scverse.

Getting started

Please refer to the documentation, in particular, the API documentation.

Installation

You need to have Python 3.12 or newer installed on your system. If you don't have Python installed, we recommend installing uv.

To install the latest release of annbatch from PyPI:

pip install "annbatch[zarrs]"

We provide extras for torch, cupy-cuda12, cupy-cuda13, and zarrs-python. cupy accelerates handling of the data via preload_to_gpu once it has been read off disk; it does not need to be used together with torch.

[!IMPORTANT] zarrs-python gives the necessary performance boost for the sharded data produced by our preprocessing functions to be useful when loading data off a local filesystem.

To install all optional dependencies:

pip install "annbatch[zarrs,torch,cupy-cuda13]"

(Note: Replace cupy-cuda13 with the extra matching your local CUDA version)
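
To check which of the optional extras are importable in the current environment, a minimal sketch such as the following may help. It assumes the extras expose the import names torch, cupy, and zarrs (the last is the module referenced in the zarr configuration in the usage example); the helper name available_extras is hypothetical and not part of annbatch.

```python
from importlib.util import find_spec


def available_extras(modules=("torch", "cupy", "zarrs")):
    """Return a mapping of import name -> whether that module can be found."""
    # find_spec returns None for a top-level module that is not installed.
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in available_extras().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

A missing entry only means the module cannot be found on the import path; it does not verify that the installed version matches what annbatch expects.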

Detailed tutorial

For a detailed tutorial, please see the in-depth section of our docs.

Basic usage example

Basic preprocessing:

from annbatch import DatasetCollection

import zarr
from pathlib import Path

# Using zarrs is necessary for local filesystem performance.
# Ensure you installed it using our `[zarrs]` extra i.e., `pip install "annbatch[zarrs]"` to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)

# Create a collection at the given path. The subgroups will all be anndata stores.
collection = DatasetCollection("path/to/output/collection.zarr")
collection.add_adata(
    adata_paths=[
        "path/to/your/file1.h5ad",
        "path/to/your/file2.h5ad"
    ],
    shuffle=True,  # shuffling is required for chunked access; it is already the default
)
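
When the input files live in one directory tree, the list passed as adata_paths could be built programmatically rather than typed by hand. The following is a small sketch using only the standard library; the helper name find_h5ad_files is hypothetical and not part of annbatch.

```python
from pathlib import Path


def find_h5ad_files(root):
    """Return the sorted paths (as strings) of all .h5ad files under root."""
    # rglob searches root and all of its subdirectories.
    return sorted(str(p) for p in Path(root).rglob("*.h5ad"))
```

The returned list could then be passed as the adata_paths argument of collection.add_adata shown above. Sorting keeps the input order reproducible across runs.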

Data loading:

[!IMPORTANT] Without custom loading via annbatch.Loader.use_collection or load_adata{s} or load_dataset{s}, all columns of the obs pandas.DataFrame will be loaded and yielded, potentially degrading performance.

from annbatch import DatasetCollection, Loader
import anndata as ad
import zarr

# Using zarrs is necessary for local filesystem performance.
# Ensure you installed it using our `[zarrs]` extra i.e., `pip install "annbatch[zarrs]"` to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)

# WARNING: Without custom loading *all* obs columns will be loaded and yielded, potentially degrading performance.
# `some_subset_of_columns_useful_for_training` is a placeholder: define it as the list of obs column names you need.
def custom_load_func(g: zarr.Group) -> ad.AnnData:
    return ad.AnnData(
        X=ad.io.sparse_dataset(g["layers"]["counts"]),
        obs=ad.io.read_elem(g["obs"])[some_subset_of_columns_useful_for_training]
    )

# A non-empty collection
collection = DatasetCollection("path/to/output/collection.zarr")
# This settings override ensures that you don't lose/alter your categorical codes when reading the data in!
with ad.settings.override(remove_unused_categories=False):
    ds = Loader(
        batch_size=4096,
        chunk_size=32,
        preload_nchunks=256,
        to_torch=True
    )
    # `use_collection` automatically uses the on-disk `X` and full `obs` in the `Loader`
    # but the `load_adata` arg can override this behavior
    # (see `custom_load_func` above for an example of customization).
    ds = ds.use_collection(collection, load_adata=custom_load_func)

# Iterate over the dataloader (drop-in replacement for torch.utils.data.DataLoader)
for batch in ds:
    x, obs = batch["X"], batch["obs"]
    # Important: For performance reasons convert to dense on GPU
    x = x.cuda().to_dense()
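
The interplay of batch_size, chunk_size, and preload_nchunks can be illustrated with a plain-Python sketch of chunk-based shuffling: contiguous chunks of rows are visited in random order, a group of them is held in memory and shuffled, and the result is cut into batches. With the settings above, one preloaded group holds 32 × 256 = 8192 rows, which is two batches of 4096. This is a conceptual illustration only, not annbatch's implementation; the function chunked_batches is hypothetical, and it assumes for simplicity that leftover rows are emitted as a final smaller batch.

```python
import random


def chunked_batches(n_obs, batch_size, chunk_size, preload_nchunks, seed=0):
    """Yield lists of row indices produced by chunk-wise shuffling."""
    rng = random.Random(seed)
    # Split the rows into contiguous chunks, which are cheap to read from disk.
    chunks = [
        range(start, min(start + chunk_size, n_obs))
        for start in range(0, n_obs, chunk_size)
    ]
    rng.shuffle(chunks)
    leftover = []
    for i in range(0, len(chunks), preload_nchunks):
        # "Preload" a group of chunks, then shuffle the rows within the group.
        rows = leftover + [r for c in chunks[i : i + preload_nchunks] for r in c]
        rng.shuffle(rows)
        n_full = len(rows) // batch_size * batch_size
        for j in range(0, n_full, batch_size):
            yield rows[j : j + batch_size]
        # Rows that do not fill a batch are carried into the next group.
        leftover = rows[n_full:]
    if leftover:
        yield leftover
```

Larger values of preload_nchunks mix rows from more chunks per batch at the cost of memory, while larger chunk_size values favor sequential reads over randomness.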

[!IMPORTANT] For usage of our loader inside of torch, please see this note for more info. At a minimum, be aware that deadlocking will occur on Linux unless you pass multiprocessing_context="spawn" to the torch.utils.data.DataLoader class.
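
As a sketch of the note above, the keyword arguments for wrapping the loader in torch.utils.data.DataLoader could be assembled as follows. The helper name torch_loader_kwargs is hypothetical. The choice of batch_size=None, which disables torch's own batching, is an assumption based on the loader already yielding full batches, so check the linked note for the recommended settings. torch only accepts multiprocessing_context when num_workers is greater than zero, hence the guard.

```python
def torch_loader_kwargs(num_workers):
    """Build keyword arguments for torch.utils.data.DataLoader."""
    # batch_size=None is an assumption: the annbatch loader already yields batches.
    kwargs = {"batch_size": None, "num_workers": num_workers}
    if num_workers > 0:
        # "spawn" avoids the deadlock on Linux described in the note above.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs


# Usage (requires torch; `ds` is the annbatch Loader from the example above):
# from torch.utils.data import DataLoader
# dataloader = DataLoader(ds, **torch_loader_kwargs(num_workers=4))
```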

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Download files

Source distribution: annbatch-0.1.0.tar.gz (250.1 kB)

Built distribution: annbatch-0.1.0-py3-none-any.whl (36.5 kB)

