annbatch

A minibatch loader for AnnData stores

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ilan-gold

These details have not been verified by PyPI

Project links

Documentation

Project description

[!IMPORTANT] Until its first major release, this package will make breaking changes only in minor version releases.

A data loader and I/O utilities for mini-batched loading of on-disk AnnData files, co-developed by Lamin Labs and scverse.

Getting started

Please refer to the documentation, in particular, the API documentation.

Installation

pip install annbatch

Please see our installation page for full documentation about extras, especially zarrs-python, which is essential for performance on local filesystems but should not be used for remote ones.
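Because the zarrs codec pipeline is meant for local stores only, it can help to decide per store whether to enable it. The following is a minimal sketch of that decision, not part of annbatch: the helper names `is_remote` and `codec_config_for` are hypothetical, and only the config key and value are taken from this README.

```python
import importlib.util
from urllib.parse import urlparse

# Config override shown in this README for local filesystem performance.
ZARRS_PIPELINE = {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}


def is_remote(store_path: str) -> bool:
    """Treat any URL scheme other than file:// as a remote store (an assumption)."""
    scheme = urlparse(store_path).scheme
    # A one-letter "scheme" is a Windows drive letter such as C:, not a URL.
    return len(scheme) > 1 and scheme != "file"


def codec_config_for(store_path: str, zarrs_available=None) -> dict:
    """Return the zarr config overrides to apply for this store (possibly empty)."""
    if zarrs_available is None:
        # Detect the optional dependency without importing it.
        zarrs_available = importlib.util.find_spec("zarrs") is not None
    if is_remote(store_path) or not zarrs_available:
        return {}
    return dict(ZARRS_PIPELINE)


# Usage (requires zarr to be installed):
#   import zarr
#   overrides = codec_config_for("path/to/output/collection.zarr")
#   if overrides:
#       zarr.config.set(overrides)
```

Returning an empty dict for remote stores leaves zarr's default pipeline in place, which matches the advice above that zarrs should not be used for remote file systems.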

Performance

We provide a speed comparison to other comparable dataloaders below:

A more in-depth comparison and performance analysis is available in our paper (from which the above figure originates, see our citation).

Detailed tutorial

For a detailed tutorial, please see the in-depth section of our docs.

Basic usage example

Basic preprocessing:

from annbatch import DatasetCollection

import zarr
from pathlib import Path

# Using zarrs is necessary for local filesystem performance.
# Ensure you installed it using our `[zarrs]` extra i.e., `pip install "annbatch[zarrs]"` to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)

# Create a collection at the given path. The subgroups will all be anndata stores.
collection = DatasetCollection("path/to/output/collection.zarr")
collection.add_adatas(
    adata_paths=[
        "path/to/your/file1.h5ad",
        "path/to/your/file2.h5ad"
    ],
    shuffle=True,  # shuffling is required for chunked access; it is also the default
)
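When a collection is built from many files, the `adata_paths` list can be assembled from a directory instead of being typed out. This is a small sketch under that assumption; `find_h5ad_files` is a hypothetical helper, not an annbatch function, and it only produces a list of path strings like the one passed to `add_adatas` above.

```python
from pathlib import Path


def find_h5ad_files(directory) -> list[str]:
    """Return the .h5ad files directly inside `directory`, sorted by name."""
    paths = sorted(Path(directory).glob("*.h5ad"))
    if not paths:
        # Fail early rather than creating an empty collection by accident.
        raise FileNotFoundError(f"no .h5ad files found in {directory}")
    return [str(p) for p in paths]


# Usage with the collection from the example above:
#   collection.add_adatas(adata_paths=find_h5ad_files("path/to/your"), shuffle=True)
```

Sorting keeps the input order reproducible across runs and machines, since directory listing order is not guaranteed.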

Data loading:

[!IMPORTANT] Without custom loading via annbatch.Loader.use_collection or load_adata{s} or load_dataset{s}, all columns of the (obs) pandas.DataFrame will be loaded and yielded, potentially degrading performance.
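One way to keep the obs payload small is to build the loading function from an explicit list of columns. The sketch below is a hypothetical factory, `make_load_func`, that wraps the same `ad.io.sparse_dataset` and `ad.io.read_elem` calls used in this README's data loading example; the `"counts"` layer default follows that example, and the column names in the usage comment are made up for illustration.

```python
def make_load_func(obs_columns, layer="counts"):
    """Build a loader that reads one layer and only the requested obs columns."""
    if isinstance(obs_columns, str):
        obs_columns = [obs_columns]
    columns = list(obs_columns)
    if not columns:
        raise ValueError("obs_columns must name at least one column")

    def load(g):
        # Imported lazily so that defining the factory does not require anndata.
        import anndata as ad

        return ad.AnnData(
            X=ad.io.sparse_dataset(g["layers"][layer]),
            obs=ad.io.read_elem(g["obs"])[columns],
        )

    return load


# Usage (column names are hypothetical):
#   custom_load_func = make_load_func(["cell_type", "batch"])
#   ds = ds.use_collection(collection, load_adata=custom_load_func)
```

Capturing the column list in a closure keeps the function signature at a single `zarr.Group` argument, which is the shape the example's `custom_load_func` has.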

from pathlib import Path

from annbatch import DatasetCollection, Loader
import anndata as ad
import zarr

# Using zarrs is necessary for local filesystem performance, but should not be used for remote file systems.
# Ensure you installed it using our `[zarrs]` extra i.e., `pip install "annbatch[zarrs]"` to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)


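# WARNING: Without custom loading, *all* obs columns will be loaded and yielded, potentially degrading performance.
# `some_subset_of_columns_useful_for_training` is a placeholder: define it as a list of obs column names before use.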
def custom_load_func(g: zarr.Group) -> ad.AnnData:
    return ad.AnnData(
        X=ad.io.sparse_dataset(g["layers"]["counts"]),
        obs=ad.io.read_elem(g["obs"])[some_subset_of_columns_useful_for_training]
    )


# A non empty collection
collection = DatasetCollection("path/to/output/collection.zarr")
# This settings override ensures that you don't lose/alter your categorical codes when reading the data in!
with ad.settings.override(remove_unused_categories=False):
    ds = Loader(
        batch_size=4096,
        chunk_size=32,
        preload_nchunks=256,
        to_torch=True
    )
    # `use_collection` automatically uses the on-disk `X` and full `obs` in the `Loader`
    # but the `load_adata` arg can override this behavior
    # (see `custom_load_func` above for an example of customization).
    ds = ds.use_collection(collection, load_adata=custom_load_func)

# Iterate over the loader (drop-in replacement for torch.utils.data.DataLoader)
for batch in ds:
    x, obs = batch["X"], batch["obs"]
    # Important: For performance reasons convert to dense on GPU
    x = x.cuda().to_dense()
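To build intuition for how `batch_size`, `chunk_size`, and `preload_nchunks` relate, the following pure-Python sketch simulates chunked shuffled sampling over row indices. It is a conceptual illustration only, not annbatch's implementation: the function `chunked_batches` and its buffering strategy are assumptions made for the example.

```python
import random


def chunked_batches(n_obs, batch_size, chunk_size, preload_nchunks, seed=0):
    """Yield lists of row indices drawn from randomly ordered contiguous chunks."""
    if min(n_obs, batch_size, chunk_size, preload_nchunks) <= 0:
        raise ValueError("all sizes must be positive")
    rng = random.Random(seed)
    # Split the rows into contiguous chunks and visit the chunks in random order.
    chunk_starts = list(range(0, n_obs, chunk_size))
    rng.shuffle(chunk_starts)
    leftover = []
    for i in range(0, len(chunk_starts), preload_nchunks):
        # "Preload" a group of chunks: these rows would be read from disk together.
        rows = leftover
        for start in chunk_starts[i : i + preload_nchunks]:
            rows.extend(range(start, min(start + chunk_size, n_obs)))
        # Shuffle the in-memory buffer so each batch mixes rows from many chunks.
        rng.shuffle(rows)
        n_full = len(rows) - len(rows) % batch_size
        for j in range(0, n_full, batch_size):
            yield rows[j : j + batch_size]
        # Carry rows that did not fill a batch into the next preload group.
        leftover = rows[n_full:]
    if leftover:
        yield leftover
```

With the values from the example above, one preload group holds 256 × 32 = 8192 rows, which is exactly two batches of 4096. Larger chunks favor contiguous reads, while more preloaded chunks improve mixing at the cost of memory.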

[!IMPORTANT] For details on using our loader inside torch, please see this note. At a minimum, be aware that deadlocks will occur on Linux unless you pass multiprocessing_context="spawn" to torch.utils.data.DataLoader. However, we strongly discourage using torch.utils.data.DataLoader; if you must, do not use workers, as annbatch is already multi-threaded.
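If wrapping the loader in `torch.utils.data.DataLoader` cannot be avoided, the advice above can be encoded in a small helper that builds the keyword arguments. This is a sketch, not an annbatch API: `dataloader_kwargs` is a hypothetical name, and `batch_size=None` (which disables torch's automatic batching) is an assumption based on the loader already yielding full batches.

```python
def dataloader_kwargs(num_workers: int = 0) -> dict:
    """Keyword arguments for torch.utils.data.DataLoader following the note above."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    # batch_size=None turns off automatic batching (assumes batches arrive pre-built).
    kwargs = {"batch_size": None, "num_workers": num_workers}
    if num_workers > 0:
        # torch only accepts multiprocessing_context together with worker processes.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs


# Usage (requires torch; `ds` is the loader from the example above):
#   from torch.utils.data import DataLoader
#   dl = DataLoader(ds, **dataloader_kwargs())  # recommended: no workers
```

The default of zero workers reflects the recommendation not to use workers; the `"spawn"` context is only added when workers are explicitly requested.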

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Citation

If you use annbatch in your work, please cite the annbatch publication as follows:

annbatch unlocks terabyte-scale training of biological data in anndata

Gold, I., Fischer, F., Arnoldt, L., Wolf, F. A., & Theis, F. J. (2026b). annbatch unlocks terabyte-scale training of biological data in anndata. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2604.01949

Project details

Release history

0.2.0 (Jun 12, 2026)
0.1.6 (Jun 3, 2026) [this version]
0.1.5 (May 7, 2026)
0.1.4 (May 4, 2026)
0.1.3 (Apr 15, 2026)
0.1.2 (Mar 26, 2026)
0.1.1 (Mar 25, 2026)
0.1.0 (Mar 18, 2026)
0.0.8 (Feb 12, 2026)
0.0.7 (Feb 2, 2026)
0.0.6 (Jan 26, 2026)
0.0.5 (Jan 23, 2026)
0.0.4 (Jan 21, 2026)
0.0.3 (Jan 19, 2026)
0.0.2 (Jan 16, 2026)
0.0.1 (Oct 30, 2025)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

annbatch-0.1.6.tar.gz (150.4 kB)

Uploaded Jun 3, 2026 Source

Built Distribution

annbatch-0.1.6-py3-none-any.whl (42.2 kB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file annbatch-0.1.6.tar.gz.

File metadata

Download URL: annbatch-0.1.6.tar.gz
Upload date: Jun 3, 2026
Size: 150.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for annbatch-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`299700f30f7492783b3501401102980d99d0e47052729aab924f3d02baa67c87`
MD5	`a2c843ad08c8e9effeeaa572798189dc`
BLAKE2b-256	`eb2653c0cab71fcc1a7f315c5ecdf7cf9fe066ddb11e95a358f983a7e3f73ce4`

See more details on using hashes here.
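A downloaded file can be checked against the SHA256 digest listed above using only the standard library. The sketch below is one way to do it; the helper names `sha256_of` and `matches_sha256` are hypothetical, and the file is assumed to be in the current working directory.

```python
import hashlib
import hmac


def sha256_of(path, block_size=1 << 20) -> str:
    """Compute the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_sha256(path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Usage with the source distribution digest listed above:
#   matches_sha256(
#       "annbatch-0.1.6.tar.gz",
#       "299700f30f7492783b3501401102980d99d0e47052729aab924f3d02baa67c87",
#   )
```

Reading in blocks keeps memory use constant regardless of file size.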

Provenance

The following attestation bundles were made for annbatch-0.1.6.tar.gz:

Publisher: release.yaml on scverse/annbatch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: annbatch-0.1.6.tar.gz
- Subject digest: 299700f30f7492783b3501401102980d99d0e47052729aab924f3d02baa67c87
- Sigstore transparency entry: 1710018345
- Sigstore integration time: Jun 3, 2026
Source repository:
- Permalink: scverse/annbatch@e3352fba362cb2158bcc6f2df823496039ffe61f
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/scverse
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@e3352fba362cb2158bcc6f2df823496039ffe61f
- Trigger Event: release

File details

Details for the file annbatch-0.1.6-py3-none-any.whl.

File metadata

Download URL: annbatch-0.1.6-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 42.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for annbatch-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cae7835eda806abf48afc7c6126fde2c9fbfe6a2fcc102b9c8ea25ae8f4a3adb`
MD5	`81c2e41ea3ebef2d0471019af62e670b`
BLAKE2b-256	`48c0d35db3a74f682317a4afbb5c97402cf6af92b0e4a32bbf59480d647ca8d4`

See more details on using hashes here.

Provenance

The following attestation bundles were made for annbatch-0.1.6-py3-none-any.whl:

Publisher: release.yaml on scverse/annbatch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: annbatch-0.1.6-py3-none-any.whl
- Subject digest: cae7835eda806abf48afc7c6126fde2c9fbfe6a2fcc102b9c8ea25ae8f4a3adb
- Sigstore transparency entry: 1710018378
- Sigstore integration time: Jun 3, 2026
Source repository:
- Permalink: scverse/annbatch@e3352fba362cb2158bcc6f2df823496039ffe61f
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/scverse
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@e3352fba362cb2158bcc6f2df823496039ffe61f
- Trigger Event: release

annbatch 0.1.6
