Out-of-core sharding of large .h5ad AnnData files with minimal memory usage.

annslicer

Out-of-core sharding and merging of large AnnData files with minimal memory usage.

Large single-cell datasets stored as .h5ad or .zarr files can easily exceed available RAM. annslicer filters them, slices them into manageable shards, and merges them back, all without loading full matrices into memory. It consolidates best practices from anndata into a simple command-line tool, with a few small speed improvements for random shuffling.

annslicer slice input.h5ad output_prefix --size 10000

annslicer slice input.h5ad output_prefix --obs-column cell_type

annslicer filter input.h5ad filtered.h5ad --obs-column keep

annslicer merge output.h5ad shard_*.h5ad

Features

Shards, filters, and merges X, all layers, obs, var, obsm, and uns
Handles both dense and sparse (CSR) matrices
Constant, low memory footprint regardless of file size
Input supports both .h5ad and .zarr formats for slicing and filtering
Merge output supports both .h5ad and .zarr formats
Fixed-size sharding (--size) with optional random cell shuffling
Categorical sharding (--obs-column) — one shard per category value, named by category
Always-include cells — append control cells (e.g. non-targeting controls) to every shard
Auxiliary CSV metadata — provide extra obs columns from a CSV file without modifying the source
Cell filtering — keep only cells matching a boolean obs column, out-of-core
Simple CLI and Python API

Installation

pip install annslicer

For Zarr input/output support (optional):

pip install annslicer[zarr]

CLI Usage

annslicer provides three subcommands: slice, filter, and merge.

Sharding a large file

annslicer slice supports two sharding modes: fixed-size (default) and categorical.

Fixed-size sharding

Split the file into equal-sized shards by cell count:

annslicer slice input.h5ad output_prefix --size 10000

Both .h5ad and .zarr inputs are supported.

Argument	Description
`input.h5ad` or `input.zarr`	Path to the source file
`output_prefix`	Prefix for output files (e.g. `atlas` → `atlas_shard_0.h5ad`, …)
`--size N`	Number of cells per shard (default: `10000`)
`--shuffle`	Randomly assign cells to shards (each shard is a representative draw)
`--seed N`	Random seed for reproducible shuffling (requires `--shuffle`)
`--compression FILTER`	HDF5 compression filter for shard files (e.g. `gzip`, `lzf`); default: no compression

Example — basic sharding:

annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 20000

Example — shuffled sharding:

annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --shuffle --seed 0

Example — gzip-compressed shards:

annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --compression gzip

Produces: atlas_shard_0.h5ad, atlas_shard_1.h5ad, …
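
To make the naming scheme concrete, the sketch below computes the row ranges and output names that fixed-size sharding would yield. `plan_shards` is a hypothetical helper written for illustration, not part of annslicer, and it assumes the final shard simply holds whatever cells remain.

```python
def plan_shards(n_cells, shard_size, prefix):
    """Return (filename, start, stop) for each shard; stop is exclusive."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    plan = []
    for shard_id, start in enumerate(range(0, n_cells, shard_size)):
        # Assumption: the last shard may be smaller than shard_size.
        stop = min(start + shard_size, n_cells)
        plan.append((f"{prefix}_shard_{shard_id}.h5ad", start, stop))
    return plan


plan = plan_shards(n_cells=45_000, shard_size=20_000, prefix="atlas")
for name, start, stop in plan:
    print(name, start, stop)
```

With 45,000 cells and `--size 20000`, this plan has three entries, the last covering rows 40000 to 45000.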

Categorical sharding by obs column

Split cells into one shard per category value, named by the category:

annslicer slice input.h5ad output_prefix --obs-column cell_type

Argument	Description
`--obs-column COLUMN`	Categorical obs column to partition on
`--csv-file PATH`	Optional CSV file with extra per-cell metadata (see below)
`--join-column COLUMN`	Column in the CSV to use as the cell-barcode key (default: first column)
`--always-include VALUE [VALUE ...]`	Category values to copy into every shard (e.g. non-targeting controls); no dedicated file is written for these categories
`--compression FILTER`	HDF5 compression filter; default: no compression

The --obs-column column must be a pandas Categorical. If the column comes from --csv-file, it is coerced to categorical automatically.

Example — shard a perturbation dataset by perturbation, including controls in every shard:

annslicer slice perturb.h5ad /outputs/perturb \
    --obs-column perturbation \
    --always-include non-targeting

Produces: perturb_KRAS.h5ad, perturb_TP53.h5ad, … (one file per non-control perturbation, each containing the perturbation's cells plus all non-targeting cells).
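
The row selection behind this example can be sketched with plain NumPy. The toy `categories` and `codes` arrays stand in for a pandas Categorical obs column; the variable names are illustrative assumptions, not annslicer internals.

```python
import numpy as np

# Toy obs column: category names plus one integer code per cell.
categories = ["KRAS", "TP53", "non-targeting"]
codes = np.array([0, 2, 1, 0, 2, 1, 1])
always_include = ["non-targeting"]

# Rows belonging to the always-include categories go into every shard.
control_codes = [categories.index(name) for name in always_include]
control_rows = np.where(np.isin(codes, control_codes))[0]

shards = {}
for code, name in enumerate(categories):
    if name in always_include:
        continue  # no dedicated file is written for control categories
    rows = np.where(codes == code)[0]
    # Sort so the rows can be read from disk in ascending order.
    shards[name] = np.sort(np.concatenate([rows, control_rows]))

for name, rows in shards.items():
    print(f"perturb_{name}.h5ad", rows.tolist())
```

Here the KRAS shard receives rows 0, 1, 3, 4 and the TP53 shard receives rows 1, 2, 4, 5, 6; rows 1 and 4 are the controls shared by both.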

Example — obs column from an auxiliary CSV:

annslicer slice atlas.h5ad /outputs/atlas \
    --obs-column tissue \
    --csv-file metadata.csv

The CSV must contain one row per cell. Its first column (or --join-column) is matched to the h5ad obs index. All CSV columns are coerced to categorical.
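
As a rough illustration of the matching step, the sketch below aligns CSV rows to the obs index using only the standard library. `align_csv_to_obs` is a hypothetical helper, and raising an error for cells without a CSV row is an assumption; the source only states that the CSV must contain one row per cell.

```python
import csv
import io


def align_csv_to_obs(csv_text, obs_index, join_column=None):
    """Return {column: [value per cell]} ordered like obs_index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    key = join_column or reader.fieldnames[0]  # default: first CSV column
    by_barcode = {row[key]: row for row in reader}
    missing = [bc for bc in obs_index if bc not in by_barcode]
    if missing:
        # Assumption: unmatched cells are treated as an error.
        raise KeyError(f"{len(missing)} cells have no CSV row, e.g. {missing[0]!r}")
    columns = [name for name in reader.fieldnames if name != key]
    return {col: [by_barcode[bc][col] for bc in obs_index] for col in columns}


# The CSV rows are deliberately out of order relative to the obs index.
csv_text = "barcode,tissue\nCELL_B,liver\nCELL_A,brain\nCELL_C,lung\n"
obs_index = ["CELL_A", "CELL_B", "CELL_C"]
extra_obs = align_csv_to_obs(csv_text, obs_index)
print(extra_obs)
```

The result follows the obs order rather than the CSV order, which is the property that matters when the new column is used to pick rows.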

Filtering cells

Produce a single output file containing only cells for which a boolean obs column is True:

annslicer filter input.h5ad filtered.h5ad --obs-column keep

Argument	Description
`input_file`	Path to the source `.h5ad` or `.zarr` file
`output_file`	Path for the filtered output `.h5ad` file
`--obs-column COLUMN`	(required) Column whose truthy values determine which cells to keep
`--csv-file PATH`	Optional CSV file with extra per-cell metadata
`--join-column COLUMN`	Column in the CSV to use as the cell-barcode key (default: first column)
`--compression FILTER`	HDF5 compression filter; default: no compression

The filter column is interpreted leniently: bool dtype is used directly; numeric columns treat non-zero as True; string columns accept "true"/"false"/"1"/"0" (case-insensitive).
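
The lenient interpretation can be sketched in pure Python as follows. `coerce_to_bool` is a hypothetical name, and rejecting unrecognised values with an error is an assumption; the source does not say how other strings are handled.

```python
import numbers

TRUE_STRINGS = {"true", "1"}
FALSE_STRINGS = {"false", "0"}


def coerce_to_bool(value):
    """Interpret one obs value as a keep/drop flag."""
    if isinstance(value, bool):
        return value                      # bool is used directly
    if isinstance(value, numbers.Number):
        return value != 0                 # numeric: non-zero means True
    if isinstance(value, str):
        text = value.lower()              # strings are case-insensitive
        if text in TRUE_STRINGS:
            return True
        if text in FALSE_STRINGS:
            return False
    # Assumption: anything else is rejected rather than guessed at.
    raise ValueError(f"cannot interpret {value!r} as a boolean")


column = [True, 0, 2.5, "TRUE", "false", "1", "0"]
keep = [coerce_to_bool(v) for v in column]
kept_rows = [i for i, flag in enumerate(keep) if flag]
print(kept_rows)
```

For this toy column the kept rows are 0, 2, 3 and 5.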

Example — filter using a pre-existing obs column:

annslicer filter atlas.h5ad atlas_qc_pass.h5ad --obs-column qc_pass

Example — filter using a column from an auxiliary CSV:

annslicer filter atlas.h5ad atlas_filtered.h5ad \
    --obs-column keep \
    --csv-file cell_flags.csv

Merging shards back into one file

annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad shard_2.h5ad

Output format is inferred from the extension — use .zarr for Zarr output (requires annslicer[zarr]):

annslicer merge output.zarr shard_0.h5ad shard_1.h5ad shard_2.h5ad

Input files can also be specified as glob patterns, which are expanded in lexicographic order (so atlas_shard_10.h5ad sorts before atlas_shard_2.h5ad):

annslicer merge output.h5ad "shards/atlas_shard_*.h5ad"
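
Lexicographic expansion matters once there are ten or more shards. The sketch below, which uses only the standard library, contrasts lexicographic order with numeric order; `natural_key` is a hypothetical helper. If the cell order of the merged file matters for your use case, one option is to list the shard paths explicitly in the order you want.

```python
import re

names = [f"atlas_shard_{i}.h5ad" for i in (0, 1, 2, 10, 11)]

# Lexicographic order compares character by character, so "10" sorts before "2".
lexicographic = sorted(names)


def natural_key(name):
    """Split into text and integer chunks so numbers compare numerically."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


numeric = sorted(names, key=natural_key)
print(lexicographic)
print(numeric)
```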

Argument	Description
`output_file`	Path for the merged output file (`.h5ad` or `.zarr`)
`input_files`	One or more shard paths or glob patterns, in order
`--join {inner,outer}`	How to join var (gene) axes when files differ (default: `outer`)

When shards have different gene sets, --join outer (default) takes the union of all genes and fills missing entries with zeros; --join inner keeps only genes present in every shard. Layers absent from any shard are always dropped.
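
A small pure-Python sketch of the two join modes, under the assumption that the union keeps genes in first-seen order (the source does not specify the ordering). `remap_row` is a hypothetical helper showing how one cell's values land in the merged gene order.

```python
shard_genes = [["GAPDH", "ACTB", "TP53"], ["ACTB", "KRAS"]]

# Outer join: union of genes, in first-seen order (ordering is an assumption).
outer = list(dict.fromkeys(g for genes in shard_genes for g in genes))
# Inner join: only genes present in every shard.
common = set(shard_genes[0]).intersection(*shard_genes[1:])
inner = [g for g in outer if g in common]


def remap_row(row, genes, merged_genes):
    """Place one cell's values into the merged gene order, filling gaps with 0."""
    lookup = dict(zip(genes, row))
    return [lookup.get(g, 0) for g in merged_genes]


cell_from_second_shard = [7, 3]  # counts for ACTB, KRAS
outer_row = remap_row(cell_from_second_shard, shard_genes[1], outer)
inner_row = remap_row(cell_from_second_shard, shard_genes[1], inner)
print(outer, outer_row)
print(inner, inner_row)
```

With the outer join the cell gains zeros for GAPDH and TP53; with the inner join only its ACTB value survives.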

Global options

Flag	Description
`--debug`	Enable verbose debug-level logging

Python API

from annslicer import shard_h5ad, shard_by_obs_column, filter_h5ad, merge_out_of_core

# --- Fixed-size sharding ---

# Basic sharding (h5ad or zarr input)
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000)
shard_h5ad("large_atlas.zarr", "atlas", shard_size=20000)  # requires annslicer[zarr]

# Shuffled sharding — cells are randomly distributed across shards
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, shuffle=True, seed=0)

# Gzip-compressed shards — smaller files at the cost of write speed
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, compression="gzip")

# Custom output filenames — provide explicit paths instead of auto-generated names
shard_h5ad(
    "large_atlas.h5ad",
    "atlas",  # ignored when output_filenames is provided
    shard_size=20000,
    output_filenames=["batch_0.h5ad", "batch_1.h5ad", "batch_2.h5ad"],
)

# --- Categorical sharding by obs column ---

# Shard by a categorical obs column — one file per category, named by category value
shard_by_obs_column("atlas.h5ad", "atlas", obs_column="tissue")
# Produces: atlas_brain.h5ad, atlas_liver.h5ad, atlas_lung.h5ad, …

# Perturbation dataset — include control cells in every shard
shard_by_obs_column(
    "perturb.h5ad",
    "perturb",
    obs_column="perturbation",
    always_include=["non-targeting"],
)

# Obs column from an auxiliary CSV (coerced to categorical automatically)
shard_by_obs_column(
    "atlas.h5ad",
    "atlas",
    obs_column="tissue",
    csv_file="metadata.csv",  # first column matched to obs index
)

# --- Filtering ---

# Keep only cells where obs column is truthy (bool, 0/1, or 'True'/'False' strings)
filter_h5ad("atlas.h5ad", "atlas_qc.h5ad", obs_column="qc_pass")

# Filter using a boolean column from an auxiliary CSV
filter_h5ad("atlas.h5ad", "atlas_filtered.h5ad", obs_column="keep", csv_file="flags.csv")

# --- Merging ---

# Merge shards back into one file (identical-var fast path used automatically)
merge_out_of_core(["atlas_shard_0.h5ad", "atlas_shard_1.h5ad"], "merged.h5ad")

# Merge shards with different gene sets — outer join (union, fills absent genes with 0)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="outer")

# Merge shards with different gene sets — inner join (intersection only)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="inner")

How it works

Fixed-size slicing

Opens the input file ("backed" AnnData for .h5ad; anndata.io.sparse_dataset for .zarr).
If shuffle=True, generates a global cell permutation upfront using numpy.random.default_rng.
For each shard, reads only the relevant rows from X and each layer via sorted fancy indexing — no full matrix is ever loaded into memory.
When shuffling, rows are read in sorted index order (maximising sequential I/O) and then reordered in-memory to the desired shuffled order.
Reassembles a valid AnnData object per shard and writes it to disk.
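
The sorted-read-then-reorder step can be sketched with NumPy alone. Here an in-memory array stands in for the on-disk matrix, and the variable names are illustrative assumptions rather than annslicer internals.

```python
import numpy as np

n_cells, n_genes, shard_size = 10, 3, 4
# Stand-in for an on-disk matrix; row i starts with the value 3 * i.
X = np.arange(n_cells * n_genes).reshape(n_cells, n_genes)

rng = np.random.default_rng(0)
permutation = rng.permutation(n_cells)  # global shuffle, generated once upfront

shards = []
for start in range(0, n_cells, shard_size):
    wanted = permutation[start:start + shard_size]  # rows in shuffled order
    order = np.argsort(wanted)
    rows_sorted = X[wanted[order]]        # read in ascending row order
    shard = np.empty_like(rows_sorted)
    shard[order] = rows_sorted            # restore the shuffled order in memory
    shards.append(shard)

# Stacking the shards reproduces the globally shuffled matrix.
print(np.array_equal(np.vstack(shards), X[permutation]))
```

The key design point is that the disk only ever sees ascending indices, while the shuffle itself is applied to the small in-memory block.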

Categorical slicing

Opens the input file in the same backed/lazy mode as fixed-size slicing.
If --csv-file is provided, reads the CSV and merges it into adata.obs in memory (the backing file is never written to). All new columns are coerced to categorical.
Validates that the target column is categorical. Validates that any --always-include values exist in the category list.
Sanitises category names to safe filename fragments (re.sub(r'[^\w.-]', '_', name)); raises an error if two names collide after sanitisation.
For each non-always-include category: collects cell indices via numpy.where, appends always-include indices, sorts for sequential I/O, then writes. Empty categories are skipped with a warning.
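
The sanitisation and collision check can be sketched as below. The regular expression is the one quoted above; `sanitise` and `shard_filenames` are hypothetical helper names, and the exact error type is an assumption.

```python
import re


def sanitise(name):
    """Replace every character outside [A-Za-z0-9_.-] (and Unicode word chars) with '_'."""
    return re.sub(r"[^\w.-]", "_", name)


def shard_filenames(prefix, categories):
    """Map each category to an output filename; fail if two names collide."""
    seen = {}
    for category in categories:
        safe = sanitise(category)
        if safe in seen:
            raise ValueError(f"{category!r} and {seen[safe]!r} both map to {safe!r}")
        seen[safe] = category
    return {category: f"{prefix}_{safe}.h5ad" for safe, category in seen.items()}


filenames = shard_filenames("atlas", ["T cell", "NK/ILC", "B-cell"])
print(filenames)
```

A category list such as `["T cell", "T_cell"]` would raise, because both names sanitise to `T_cell`.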

Filtering

Opens the input file in backed/lazy mode.
Optionally merges an auxiliary CSV into adata.obs (same merge logic as categorical slicing).
Reads the filter column and coerces it to boolean leniently (bool → direct; numeric → non-zero; string → 'true'/'false'/'1'/'0').
Collects the indices of cells where the column is True and writes them to a new file.

Merging

Reads obs, var, and uns from all shards to build a skeleton output file.
Computes the merged var index: union (outer join) or intersection (inner join) of gene sets across all shards. If every shard shares the identical var, remapping is skipped entirely (fast path).
Scans shards to calculate total non-zero sizes for pre-allocation (for an inner join, entries for excluded genes are filtered during the scan).
Streams X, layers, and obsm data shard-by-shard directly into the pre-allocated output arrays, remapping column indices on the fly where needed.
Layers absent from any shard are dropped so every cell has consistent layer coverage.
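
The scan, pre-allocate and stream pattern can be illustrated for CSR data with plain Python lists. This is a simplified sketch under stated assumptions, not annslicer's implementation: `merge_csr` is a hypothetical function, each shard is a `(genes, data, indices, indptr)` tuple, and obs, layers and obsm are ignored.

```python
def merge_csr(shards, merged_genes):
    """Concatenate CSR shards row-wise into arrays allocated once upfront."""
    col_of = {gene: j for j, gene in enumerate(merged_genes)}

    # Pass 1: count surviving non-zeros and rows so the output can be pre-allocated.
    total_nnz = sum(
        sum(1 for j in indices if genes[j] in col_of)
        for genes, _data, indices, _indptr in shards
    )
    total_rows = sum(len(shard[3]) - 1 for shard in shards)
    data = [0] * total_nnz
    indices = [0] * total_nnz
    indptr = [0] * (total_rows + 1)

    # Pass 2: stream each shard into place, remapping column indices on the fly.
    pos, row = 0, 0
    for genes, s_data, s_indices, s_indptr in shards:
        for r in range(len(s_indptr) - 1):
            for k in range(s_indptr[r], s_indptr[r + 1]):
                new_col = col_of.get(genes[s_indices[k]])
                if new_col is not None:  # genes excluded by an inner join are dropped
                    data[pos], indices[pos] = s_data[k], new_col
                    pos += 1
            row += 1
            indptr[row] = pos
    return data, indices, indptr


shard_a = (["GAPDH", "ACTB"], [1, 2, 3], [0, 1, 1], [0, 2, 3])  # two cells
shard_b = (["ACTB", "KRAS"], [4], [1], [0, 1])                  # one cell
outer = merge_csr([shard_a, shard_b], ["GAPDH", "ACTB", "KRAS"])
inner = merge_csr([shard_a, shard_b], ["ACTB"])
print(outer)
print(inner)
```

Because the output arrays are sized in the first pass, the second pass only ever holds one shard's worth of data alongside the destination arrays.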

Note: CSC (compressed sparse column) matrices are not supported for out-of-core row slicing. Convert them to CSR before sharding.
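
For a matrix that fits in memory, the conversion is a single SciPy call. The sketch below uses a toy matrix; applying the same `.tocsr()` call to `adata.X` before writing the file is one possible approach, stated here as an assumption about your workflow rather than something annslicer does for you.

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 0, 2], [0, 0, 3], [4, 5, 0]])
csc = sparse.csc_matrix(dense)  # column-oriented: efficient column access, slow row slicing
csr = csc.tocsr()               # row-oriented copy holding the same values

print(csc.format, csr.format)
```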

Benchmarks

Benchmarks were run on a synthetic sparse AnnData object with 200k cells and 10k genes.

For h5ad format

Slicing method	Mean runtime (s)	Peak memory (MB)
`annslicer slice`	0.584	211.4
`anndata` backed	0.601	203.7
`annslicer slice --shuffle`	1.731	221.8
`anndata` backed with shuffle	3.830	209.1

For zarr format

Slicing method	Mean runtime (s)	Peak memory (MB)
`annslicer slice`	1.050	62.1
`anndata` backed	0.799	54.4
`annslicer slice --shuffle`	5.544	142.9
`anndata` backed with shuffle	6.591	151.4

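Based on these benchmarks, we recommend running annslicer slice --shuffle on an .h5ad file when making randomly shuffled shards. It was the fastest shuffled option measured (1.731 s, roughly 2.2× faster than anndata backed with shuffle at 3.830 s), with similar peak memory (221.8 MB vs 209.1 MB).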

License

BSD 3-clause

Release history

0.2.1: May 14, 2026
0.2.0: May 12, 2026 (this version)
0.1.6: Mar 18, 2026
0.1.5: Mar 11, 2026
0.1.4: Mar 6, 2026
0.1.3: Mar 6, 2026

Download files

Source Distribution: annslicer-0.2.0.tar.gz (59.1 kB), uploaded May 12, 2026

Built Distribution: annslicer-0.2.0-py3-none-any.whl (26.1 kB), uploaded May 12, 2026, Python 3

File details

Details for the file annslicer-0.2.0.tar.gz.

File metadata

Download URL: annslicer-0.2.0.tar.gz
Upload date: May 12, 2026
Size: 59.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for annslicer-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`84abf7a424d3fcbe352b8a2d5d3c5d29ffd953cdd68137774d86027aa672dc09`
MD5	`17a1f770994edda33e4420fdecc4299b`
BLAKE2b-256	`7531add291aeef07ae620e3a8c653aa1deca62e2c6b91b93e7aee5c268c6041f`

Provenance

The following attestation bundles were made for annslicer-0.2.0.tar.gz:

Publisher: publish.yml on cellarium-ai/annslicer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: annslicer-0.2.0.tar.gz
- Subject digest: 84abf7a424d3fcbe352b8a2d5d3c5d29ffd953cdd68137774d86027aa672dc09
- Sigstore transparency entry: 1520845302
- Sigstore integration time: May 12, 2026
Source repository:
- Permalink: cellarium-ai/annslicer@117b3a0ddf9d0e38a3eda33af4c8ab4c19c2ae9a
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/cellarium-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@117b3a0ddf9d0e38a3eda33af4c8ab4c19c2ae9a
- Trigger Event: release

File details

Details for the file annslicer-0.2.0-py3-none-any.whl.

File metadata

Download URL: annslicer-0.2.0-py3-none-any.whl
Upload date: May 12, 2026
Size: 26.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for annslicer-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bdd445336e3891a39e71f3c2d05140e849d483f711977b3b748ca415e09ecb89`
MD5	`a7cb5a8640c8ab9c723a0570f93b767c`
BLAKE2b-256	`a4df5e35126f0eadf554f215930b30a7e8b51cee963d800df75f4edfa9c20efc`

Provenance

The following attestation bundles were made for annslicer-0.2.0-py3-none-any.whl:

Publisher: publish.yml on cellarium-ai/annslicer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: annslicer-0.2.0-py3-none-any.whl
- Subject digest: bdd445336e3891a39e71f3c2d05140e849d483f711977b3b748ca415e09ecb89
- Sigstore transparency entry: 1520845389
- Sigstore integration time: May 12, 2026
Source repository:
- Permalink: cellarium-ai/annslicer@117b3a0ddf9d0e38a3eda33af4c8ab4c19c2ae9a
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/cellarium-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@117b3a0ddf9d0e38a3eda33af4c8ab4c19c2ae9a
- Trigger Event: release
