# annbatch

A minibatch loader for AnnData stores.

> [!CAUTION]
> This package does not have a stable API. However, we do not anticipate the on-disk format changing in an incompatible manner.

A data loader and I/O utilities for minibatching on-disk AnnData, co-developed by Lamin Labs and scverse.
## Getting started

Please refer to the documentation, in particular the API documentation.
## Installation

You need Python 3.12 or newer installed on your system. If you don't have Python installed, we recommend installing uv.

To install the latest release of annbatch from PyPI:

```bash
pip install annbatch
```
We provide extras for `torch`, `cupy-cuda12`, `cupy-cuda13`, and `zarrs-python`.

The `cupy` extras provide accelerated handling of the data via `preload_to_gpu` once it has been read off disk, and do not need to be used in conjunction with `torch`.

> [!IMPORTANT]
> `zarrs-python` provides the performance boost needed for the sharded data produced by our preprocessing functions to be useful when loading data off a local filesystem.
## Detailed tutorial

For a detailed tutorial, please see the in-depth section of our docs.

## Basic usage example

Basic preprocessing:
```python
import zarr

from annbatch import DatasetCollection

# Using zarrs is necessary for local filesystem performance.
# Ensure you installed it via our `[zarrs]` extra, i.e. `pip install annbatch[zarrs]`,
# to get the right version.
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

# Create a collection at the given path. The subgroups will all be AnnData stores.
collection = DatasetCollection("path/to/output/collection.zarr")
collection.add_adatas(
    adata_paths=[
        "path/to/your/file1.h5ad",
        "path/to/your/file2.h5ad",
    ],
    shuffle=True,  # shuffling is needed for chunked access, and is the default
)
```
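To see why `add_adatas` shuffles rows on disk by default, consider what chunked access looks like afterwards. The sketch below is illustrative only (plain Python lists, not annbatch internals): once rows from multiple input files are shuffled a single time up front, reading whole contiguous chunks in a random order is cheap and still recovers every row.

```python
import random

# Pretend these are rows from two input .h5ad files, concatenated on disk.
rows = [("file1", i) for i in range(8)] + [("file2", i) for i in range(8)]

random.seed(0)
random.shuffle(rows)  # one-time, "on-disk" shuffle at preprocessing time

# At load time, only the *chunk order* is randomized; each chunk is a
# contiguous read, which is what makes sharded zarr access fast.
chunk_size = 4
chunks = [rows[i : i + chunk_size] for i in range(0, len(rows), chunk_size)]
random.shuffle(chunks)

loaded = [row for chunk in chunks for row in chunk]
```

Because the expensive row-level shuffle happened once at preprocessing time, the loader never needs random row access at training time.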
Data loading:
```python
import anndata as ad
import zarr

from annbatch import Loader

# Using zarrs is necessary for local filesystem performance.
# Ensure you installed it via our `[zarrs]` extra, i.e. `pip install annbatch[zarrs]`,
# to get the right version.
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

# `some_subset_of_columns` is a placeholder for the obs columns you care about.
def custom_load_func(g: zarr.Group) -> ad.AnnData:
    return ad.AnnData(
        X=ad.io.sparse_dataset(g["layers"]["counts"]),
        obs=ad.io.read_elem(g["obs"])[some_subset_of_columns],
    )

# This settings override ensures that you don't lose/alter your categorical
# codes when reading the data in!
with ad.settings.override(remove_unused_categories=False):
    ds = Loader(
        batch_size=4096,
        chunk_size=32,
        preload_nchunks=256,
    )
    # `use_collection` automatically uses the on-disk `X` and full `obs` in the
    # `Loader`, but the `load_adata` arg can override this behavior
    # (see `custom_load_func` above for an example of customization).
    ds = ds.use_collection(collection)
    # Iterate over the dataloader (drop-in replacement for torch.utils.data.DataLoader)
    for batch in ds:
        ...
```
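To get a feel for the three `Loader` settings above, here is the arithmetic they imply under one plausible reading (an illustrative calculation, not a statement about annbatch internals): each preload pulls `preload_nchunks` on-disk chunks of `chunk_size` rows into memory, and minibatches of `batch_size` rows are then drawn from that in-memory buffer.

```python
batch_size = 4096
chunk_size = 32
preload_nchunks = 256

# Rows held in memory per preload cycle.
rows_per_preload = chunk_size * preload_nchunks

# Full minibatches that can be drawn from one preloaded buffer.
batches_per_preload = rows_per_preload // batch_size
```

With these defaults, each preload buffers 8192 rows, enough for two full minibatches, so you can trade memory for I/O granularity by tuning `chunk_size` and `preload_nchunks`.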
> [!IMPORTANT]
> For usage of our loader inside torch, please see this note for more info. At a minimum, be aware that deadlocking will occur on Linux unless you pass `multiprocessing_context="spawn"` to the `torch.utils.data.DataLoader` class.
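The reason `"spawn"` matters is that Linux defaults to the `"fork"` start method, which copies the parent process mid-state; if the parent holds locks (e.g. around I/O threads) at fork time, workers can deadlock. A stdlib-only sketch of what that argument selects (the commented torch wiring is illustrative and not executed here):

```python
import multiprocessing as mp

# "spawn" starts a fresh interpreter for each worker instead of forking,
# avoiding inherited lock state.
ctx = mp.get_context("spawn")

# A torch DataLoader would be wired up along these lines (illustrative):
# torch.utils.data.DataLoader(ds, multiprocessing_context="spawn", ...)
```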
## Release notes

See the changelog.

## Contact

For questions and help requests, you can reach out on the scverse Discourse. If you found a bug, please use the issue tracker.
## File details

Details for the file `annbatch-0.0.2.tar.gz` (source distribution).

### File metadata

- Download URL: annbatch-0.0.2.tar.gz
- Size: 227.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0ff6de952ca40d6acd91ba52c54f61bee96384be14b4a67c20131151f6059f1b |
| MD5 | 7757ce4ac84f76e298784d66167e7e8a |
| BLAKE2b-256 | 2d45175c4609b936d2085b634e7608267af30f4629eb6e81e583a33529edac0c |
### Provenance

The following attestation bundle was made for `annbatch-0.0.2.tar.gz`:

- Publisher: release.yaml on scverse/annbatch
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: annbatch-0.0.2.tar.gz
- Subject digest: 0ff6de952ca40d6acd91ba52c54f61bee96384be14b4a67c20131151f6059f1b
- Sigstore transparency entry: 831722086
- Permalink: scverse/annbatch@e0b2ebb1245c3f6dd22c401782e5c32fb2d01820
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/scverse
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@e0b2ebb1245c3f6dd22c401782e5c32fb2d01820
- Trigger Event: release
## File details

Details for the file `annbatch-0.0.2-py3-none-any.whl` (built distribution).

### File metadata

- Download URL: annbatch-0.0.2-py3-none-any.whl
- Size: 24.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7e8962f8ccfb8fa6f9949c935f2f49f757d7266c62dfeb00ba96830e49f60d7a |
| MD5 | bf3b00faf801493d99aac687a97a094e |
| BLAKE2b-256 | e9dc810037512266259200bbef3873f9b678e0d6dbda027b583bf67ec53a3029 |
### Provenance

The following attestation bundle was made for `annbatch-0.0.2-py3-none-any.whl`:

- Publisher: release.yaml on scverse/annbatch
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: annbatch-0.0.2-py3-none-any.whl
- Subject digest: 7e8962f8ccfb8fa6f9949c935f2f49f757d7266c62dfeb00ba96830e49f60d7a
- Sigstore transparency entry: 831722093
- Permalink: scverse/annbatch@e0b2ebb1245c3f6dd22c401782e5c32fb2d01820
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/scverse
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@e0b2ebb1245c3f6dd22c401782e5c32fb2d01820
- Trigger Event: release