annslicer
Out-of-core sharding and merging of large AnnData files with minimal memory usage.
Large single-cell datasets stored as .h5ad or .zarr files can easily exceed available RAM. annslicer slices them into manageable shards — and merges them back — without loading full matrices into memory. It uses best practices from anndata with a few small speed improvements for random shuffling.
Consolidates best practices into a simple command-line tool.
annslicer slice input.h5ad output_prefix
annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad
Features
- Shards and merges X, all layers, obs, var, obsm, and uns
- Handles both dense and sparse (CSR) matrices
- Constant, low memory footprint regardless of file size
- Input supports both .h5ad and .zarr formats for slicing
- Merge output supports both .h5ad and .zarr formats
- Optional cell shuffling (--shuffle) for representative shards without loading the full matrix
- Simple CLI and Python API
Installation
pip install annslicer
For Zarr input/output support (optional):
pip install annslicer[zarr]
CLI Usage
annslicer provides two subcommands: slice and merge.
Sharding a large file
annslicer slice input.h5ad output_prefix --size 10000
Both .h5ad and .zarr inputs are supported.
| Argument | Description |
|---|---|
| input.h5ad or input.zarr | Path to the source file |
| output_prefix | Prefix for output files (e.g. atlas → atlas_shard_0.h5ad, …) |
| --size N | Number of cells per shard (default: 10000) |
| --shuffle | Randomly assign cells to shards (each shard is a representative draw) |
| --seed N | Random seed for reproducible shuffling (requires --shuffle) |
Example — basic sharding:
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 20000
Example — shuffled sharding from a large h5ad:
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --shuffle --seed 0
Produces: atlas_shard_0.h5ad, atlas_shard_1.h5ad, …
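To make the relationship between --size and the output files concrete, here is a minimal sketch of how contiguous (unshuffled) shards could be planned. The helper name plan_shards is hypothetical and not part of annslicer; the filename pattern follows the example output above, and the assumption that the final shard simply holds the remaining cells is ours, not something the documentation states.

```python
import math


def plan_shards(n_cells: int, shard_size: int, prefix: str) -> list[tuple[str, int, int]]:
    """Return (filename, start_row, stop_row) for each contiguous shard.

    Hypothetical helper for illustration only; not annslicer's API.
    """
    n_shards = math.ceil(n_cells / shard_size)
    plan = []
    for i in range(n_shards):
        start = i * shard_size
        # Assumption: the last shard is smaller when n_cells is not a multiple of shard_size.
        stop = min(start + shard_size, n_cells)
        plan.append((f"{prefix}_shard_{i}.h5ad", start, stop))
    return plan


plan = plan_shards(n_cells=45_000, shard_size=20_000, prefix="atlas")
for filename, start, stop in plan:
    print(filename, start, stop)
```

Under these assumptions, 45,000 cells with --size 20000 would give three files, the last one covering rows 40,000 to 45,000.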
Merging shards back into one file
annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad shard_2.h5ad
Output format is inferred from the extension — use .zarr for Zarr output (requires annslicer[zarr]):
annslicer merge output.zarr shard_0.h5ad shard_1.h5ad shard_2.h5ad
Global options
| Flag | Description |
|---|---|
| --debug | Enable verbose debug-level logging |
Python API
from annslicer import shard_h5ad, merge_out_of_core
# Basic sharding (h5ad or zarr input)
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000)
shard_h5ad("large_atlas.zarr", "atlas", shard_size=20000) # requires annslicer[zarr]
# Shuffled sharding — cells are randomly distributed across shards
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, shuffle=True, seed=0)
# Merge shards back into one file
merge_out_of_core(["atlas_shard_0.h5ad", "atlas_shard_1.h5ad"], "merged.h5ad")
How it works
Slicing
- Opens the input file ("backed" AnnData for .h5ad; anndata.io.sparse_dataset for .zarr).
- If shuffle=True, generates a global cell permutation upfront using numpy.random.default_rng.
- For each shard, reads only the relevant rows from X and each layer via sorted fancy indexing; no full matrix is ever loaded into memory.
- When shuffling, rows are read in sorted index order (maximising sequential I/O) and then reordered in-memory to the desired shuffled order.
- Reassembles a valid AnnData object per shard and writes it to disk.
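The shuffle steps above can be sketched in a few lines of NumPy. This is an illustration of the sorted-read-then-reorder idea, not annslicer's actual code: the function name iter_shuffled_shards is hypothetical, and a small in-memory array stands in for the backed on-disk matrix.

```python
import numpy as np


def iter_shuffled_shards(matrix, shard_size, seed=0):
    """Yield (row_ids, rows) per shard, with rows in shuffled order.

    Hypothetical helper; `matrix` is any object supporting row fancy indexing.
    """
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(matrix.shape[0])  # global cell order, generated once
    for start in range(0, permutation.size, shard_size):
        wanted = permutation[start:start + shard_size]  # row ids for this shard
        order = np.argsort(wanted)                      # positions that sort those ids
        block = matrix[wanted[order]]                   # one read, ascending row order
        restore = np.empty_like(order)
        restore[order] = np.arange(order.size)          # inverse of the sort
        yield wanted, block[restore]                    # back to shuffled order, in memory


# Stand-in for an on-disk matrix: 10 cells x 3 genes, row i filled with the value i.
X = np.repeat(np.arange(10)[:, None], 3, axis=1)
shards = list(iter_shuffled_shards(X, shard_size=4, seed=0))
```

Only the rows of one shard are held in memory at a time, and each read touches the source in ascending row order.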
Merging
- Reads obs, var, and uns from the shards to build a skeleton output file.
- Scans shards to calculate total non-zero sizes for pre-allocation.
- Streams X, layers, and obsm data shard-by-shard directly into the pre-allocated output arrays.
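For sparse data, the scan, pre-allocate, and stream steps above might look roughly like the following sketch. It is an assumption-laden illustration rather than annslicer's implementation: merge_csr_shards is a hypothetical name, each shard is represented as a raw CSR triple (data, indices, indptr), and in-memory NumPy arrays stand in for pre-allocated on-disk datasets.

```python
import numpy as np


def merge_csr_shards(shards):
    """Row-concatenate CSR shards given as (data, indices, indptr) triples."""
    # Pass 1: scan the shards to size the output arrays.
    total_nnz = sum(data.size for data, _, _ in shards)
    total_rows = sum(indptr.size - 1 for _, _, indptr in shards)

    # Pre-allocate the output once (on disk in a real out-of-core merge).
    out_data = np.empty(total_nnz, dtype=shards[0][0].dtype)
    out_indices = np.empty(total_nnz, dtype=np.int64)
    out_indptr = np.zeros(total_rows + 1, dtype=np.int64)

    # Pass 2: stream each shard into its slot, shifting row pointers by the
    # number of non-zeros already written.
    nnz_offset = 0
    row_offset = 0
    for data, indices, indptr in shards:
        n_rows = indptr.size - 1
        out_data[nnz_offset:nnz_offset + data.size] = data
        out_indices[nnz_offset:nnz_offset + data.size] = indices
        out_indptr[row_offset + 1:row_offset + n_rows + 1] = indptr[1:] + nnz_offset
        nnz_offset += data.size
        row_offset += n_rows
    return out_data, out_indices, out_indptr


# Shard A holds rows [[1, 0, 2], [0, 0, 3]]; shard B holds the row [[0, 4, 0]].
shard_a = (np.array([1.0, 2.0, 3.0]), np.array([0, 2, 2]), np.array([0, 2, 3]))
shard_b = (np.array([4.0]), np.array([1]), np.array([0, 1]))
merged_data, merged_indices, merged_indptr = merge_csr_shards([shard_a, shard_b])
```

Because the output size is known before any values are copied, no array ever has to be grown or re-concatenated while streaming.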
Note: CSC (column-compressed) sparse matrices are not supported for out-of-core row-slicing. Convert to CSR before sharding.
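One way to see why the note above asks for CSR: in CSR the non-zeros of consecutive rows sit next to each other, so a block of rows is a single contiguous slice of the stored arrays, whereas in CSC the entries of one row are scattered across every column. The sketch below shows the CSR case on raw arrays; csr_row_block is a hypothetical helper, not an annslicer or anndata function.

```python
import numpy as np


def csr_row_block(data, indices, indptr, start, stop):
    """Extract rows [start, stop) from a CSR matrix stored as raw arrays."""
    lo, hi = indptr[start], indptr[stop]  # contiguous range of stored values
    # Re-base the row pointers so the block is a valid CSR matrix on its own.
    return data[lo:hi], indices[lo:hi], indptr[start:stop + 1] - lo


# CSR form of [[1, 0, 2], [0, 0, 3], [0, 4, 0]].
data = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 2, 2, 1])
indptr = np.array([0, 2, 3, 4])

block_data, block_indices, block_indptr = csr_row_block(data, indices, indptr, 1, 3)
```

Here rows 1 and 2 are recovered by reading only positions 2 to 4 of data and indices.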
Benchmarks
Benchmarks were run on a synthetic sparse AnnData object with 200k cells and 10k genes.
For h5ad format
| Slicing method | Mean runtime (s) | Peak memory (MB) |
|---|---|---|
| annslicer slice | 0.584 | 211.4 |
| anndata backed | 0.601 | 203.7 |
| annslicer slice --shuffle | 1.731 | 221.8 |
| anndata backed with shuffle | 3.830 | 209.1 |
For zarr format
| Slicing method | Mean runtime (s) | Peak memory (MB) |
|---|---|---|
| annslicer slice | 1.050 | 62.1 |
| anndata backed | 0.799 | 54.4 |
| annslicer slice --shuffle | 5.544 | 142.9 |
| anndata backed with shuffle | 6.591 | 151.4 |
Based on these benchmarks, for making randomly shuffled data shards, we recommend using annslicer slice --shuffle on an h5ad format file.
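As a quick check of that recommendation, the shuffled runtimes reported above can be turned into speedup ratios. The numbers are copied from the benchmark tables; the dictionary layout and key names are just for illustration.

```python
# Mean runtimes in seconds for the shuffled runs, copied from the benchmark tables.
runtimes = {
    "h5ad": {"annslicer": 1.731, "anndata_backed": 3.830},
    "zarr": {"annslicer": 5.544, "anndata_backed": 6.591},
}

# Speedup of annslicer slice --shuffle over anndata backed with shuffle.
speedups = {
    fmt: round(t["anndata_backed"] / t["annslicer"], 2)
    for fmt, t in runtimes.items()
}
print(speedups)  # {'h5ad': 2.21, 'zarr': 1.19}
```

On this dataset the shuffled h5ad run is about 2.2 times faster than the backed anndata equivalent, and it is also the fastest shuffled configuration in absolute terms.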
License
BSD 3-clause