
Cell-centric ML training backend on LanceDB and sharded zarr


lancell

Multimodal single-cell database built on LanceDB and Zarr. Designed for building heterogeneous cell atlases and training foundation models on them.

Cell metadata lives in LanceDB — queryable with SQL predicates, vector search, and full-text search. Raw array data (count matrices, embeddings, images) lives in sharded Zarr. A PyTorch-native data loading layer reads directly from those stores without intermediate copies or format conversions.


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.12+.

pip install lancell          # core: atlas, querying, ingestion
pip install lancell[ml]      # + PyTorch dataloader
pip install lancell[bio]     # + scanpy, bionty, GEOparse
pip install lancell[io]      # + S3/GCS/Azure, image codecs
pip install lancell[viz]     # + marimo, matplotlib
pip install lancell[all]     # everything

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release

The RaggedAtlas

Real-world atlas building involves datasets that were not designed to be compatible — different gene panels, different assay types, different obs schemas. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).

Lancell's RaggedAtlas takes a different approach: each dataset occupies its own Zarr group with its own feature ordering. Every cell carries a pointer into its group. The reconstruction layer handles union/intersection/feature-filter logic at query time — no padding is stored, no information is discarded at ingest.

Cell table (shared)                Zarr (per-dataset)
──────────────────                 ──────────────────
cell A  gene_expression → pbmc3k/  pbmc3k/   1838 genes, 2638 cells
cell B  gene_expression → pbmc3k/  pbmc68k/   765 genes,  700 cells
cell C  gene_expression → pbmc68k/

At query time, the reconstruction layer joins the feature spaces: it computes the union or intersection of global feature indices, scatters each group's data into the right columns, and returns a single AnnData with every cell correctly placed.
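The scatter step can be sketched in plain Python. This is a toy illustration of the union/intersection idea, not lancell's implementation; the dataset and gene names are made up, and real data would be sparse arrays rather than dicts:

```python
# Toy sketch of a ragged feature join: each dataset keeps its own gene
# list; at query time we build a global column map and scatter values.

def feature_join(datasets, mode="union"):
    """datasets: list of (genes, rows) where each row maps gene -> count."""
    gene_sets = [set(genes) for genes, _ in datasets]
    if mode == "union":
        genes = sorted(set.union(*gene_sets))        # no information discarded
    else:
        genes = sorted(set.intersection(*gene_sets))  # only shared genes
    col = {g: j for j, g in enumerate(genes)}
    matrix = []
    for _, rows in datasets:
        for row in rows:
            out = [0] * len(genes)
            # scatter this dataset's values into the global columns
            for g, v in row.items():
                if g in col:
                    out[col[g]] = v
            matrix.append(out)
    return genes, matrix

pbmc3k = (["CD3D", "CD19"], [{"CD3D": 5}])
pbmc68k = (["CD3D", "MS4A1"], [{"MS4A1": 2}])
genes, X = feature_join([pbmc3k, pbmc68k], mode="union")
# genes == ['CD19', 'CD3D', 'MS4A1']; each cell lands in the right columns
```

No padding is ever stored: the union matrix exists only in the query result.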

Quickstart

import obstore.store
from lancell.atlas import RaggedAtlas
from lancell.schema import LancellBaseSchema, FeatureBaseSchema, SparseZarrPointer
from lancell.ingestion import add_from_anndata

class GeneFeature(FeatureBaseSchema):
    gene_symbol: str

class CellSchema(LancellBaseSchema):
    cell_type: str | None = None
    gene_expression: SparseZarrPointer | None = None

store = obstore.store.LocalStore("/data/atlas/arrays")
atlas = RaggedAtlas.create(
    db_uri="/data/atlas/db",
    cell_table_name="cells",
    cell_schema=CellSchema,
    store=store,
    registry_schemas={"gene_expression": GeneFeature},
)

atlas.register_features("gene_expression", features)
add_from_anndata(atlas, adata, feature_space="gene_expression",
                 zarr_layer="counts", dataset_record=record)
atlas.optimize()
atlas.snapshot()

atlas_r = RaggedAtlas.checkout_latest("/data/atlas/db", store=store)
adata = atlas_r.query().where("cell_type = 'T cells'").to_anndata()

Opening a public atlas

The CellxGene Census mouse atlas (~44M cells) is available on S3. No schema class or store construction needed — just db_uri and S3 config:

from lancell.atlas import RaggedAtlas

atlas = RaggedAtlas.checkout_latest(
    db_uri="s3://epiblast-public/cellxgene_mouse_lancell/lance_db",
    store_kwargs={"config": {"skip_signature": True, "region": "us-east-2"}},
)

atlas.query().count()                                           # 43,969,325
adata = atlas.query().where("cell_type = 'neural cell'").limit(5000).to_anndata()

Querying

The cell table is a LanceDB table — the full query surface is available without custom loaders.

# SQL filter
adata = atlas_r.query().where("tissue = 'lung' AND cell_type IS NOT NULL").to_anndata()

# Vector similarity search
hits = atlas_r.query().search(query_vec, vector_column_name="embedding").limit(50).to_anndata()

# Feature-filtered query — reads only the byte ranges for those genes (CSC index)
adata = atlas_r.query().features(["CD3D", "CD19", "MS4A1"], "gene_expression").to_anndata()

# Intersection across ragged datasets (only genes shared by all)
shared = atlas_r.query().feature_join("intersection").to_anndata()

# Count by cell type — cheap, only fetches the grouping column
atlas_r.query().count(group_by="cell_type")

For large results, .to_batches() provides a streaming iterator that avoids materialising everything at once. .to_mudata() returns one AnnData per modality for multimodal atlases.


Example Notebooks

The notebooks/ directory contains self-contained marimo notebooks that work after a plain pip install lancell — no repo clone needed.

  • scbasecount_ragged_atlas.py — Explore a small 7.3M-cell atlas built from scBaseCount data (human + C. elegans). Covers versioning, metadata queries, ragged union/intersection joins, feature selection, AnnData reconstruction, and the PyTorch dataloader.
  • cellxgene_tiledb_vs_lancell_benchmark.py — Load the 44M-cell CellxGene Census mouse atlas stored in lancell format and benchmark it against TileDB-SOMA for ML dataloader throughput and AnnData query latency.

Performance

Benchmarked against TileDB-SOMA on a ~44M cell mouse atlas (CellxGene Census), reading from S3.

ML dataloader throughput

CellDataset is a map-style PyTorch dataset, in contrast to TileDB-SOMA's iterable-style dataset. This lets it leverage PyTorch's DataLoader for parallelism and locality-aware batching. Lancell's dataloader achieves roughly an order of magnitude higher throughput than TileDB-SOMA on a single worker, even with fully random data shuffling.

Dataloader throughput: lancell vs TileDB-SOMA

Workers          TileDB-SOMA     lancell           Speedup
──────────────   ────────────    ──────────────    ───────
0 (in-process)   ~150 cells/s    ~1,600 cells/s    ~10x
4 workers        ~500 cells/s    ~3,150 cells/s    ~6x
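The locality-aware batching idea can be sketched as follows. This is a simplified illustration, not lancell's CellSampler: the shard mapping, function names, and sizes are all assumptions. The point is that shuffling shards and cells-within-shards gives randomness while keeping each batch's reads confined to few files:

```python
# Sketch of locality-aware batching: shuffle globally enough for SGD,
# but pack each batch from as few shards as possible so a worker
# touches few object-store files per batch.
import random

def locality_batches(cell_ids, shard_of, batch_size, seed=0):
    rng = random.Random(seed)
    by_shard = {}
    for c in cell_ids:
        by_shard.setdefault(shard_of(c), []).append(c)
    shards = list(by_shard.values())
    rng.shuffle(shards)        # randomize shard order...
    for s in shards:
        rng.shuffle(s)         # ...and cell order within each shard
    flat = [c for s in shards for c in s]
    return [flat[i:i + batch_size] for i in range(0, len(flat), batch_size)]

# 1000 cells in shards of 100: each batch of 50 stays inside ~1 shard,
# versus ~50 distinct shards for a fully random batch
batches = locality_batches(range(1000), lambda c: c // 100, batch_size=50)
```

A map-style dataset can hand such precomputed batches to DataLoader workers, which is what enables the throughput gap above.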

Query → AnnData latency

Three access patterns: cell-oriented (filter by cell type, full matrix), feature-oriented (subset genes across a population), and combined.

Query latency: lancell vs TileDB-SOMA

Lancell is 1.7–3x faster across patterns, with the largest margin on feature-oriented queries where the CSC index avoids scanning irrelevant cells entirely.
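A toy CSC layout shows why feature-oriented reads stay cheap. The arrays below are illustrative, not lancell's on-disk format; the point is that one gene's values are contiguous, so a column is a single byte range rather than a full-matrix scan:

```python
# Minimal CSC sketch: indptr marks column boundaries, so one gene's
# (cell_id, count) pairs live in one contiguous slice.

indptr  = [0, 2, 5, 6]          # column boundaries for 3 genes
indices = [0, 2, 0, 1, 2, 1]    # row (cell) ids
data    = [5, 1, 3, 2, 7, 6]    # toy counts

def read_gene(col):
    start, end = indptr[col], indptr[col + 1]
    # only this slice would need to be fetched from object storage
    return list(zip(indices[start:end], data[start:end]))

read_gene(1)   # [(0, 3), (1, 2), (2, 7)]
```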

Fast cloud reads: RustShardReader

Zarr's sharded format packs many chunks into a single object-store file, with an index recording each chunk's byte offset. The Python zarr stack issues one HTTP request per chunk even when chunks could be coalesced.

Lancell's RustShardReader handles shard reads in Rust: it batches all requested ranges, issues one get_ranges call per shard file, and decodes chunks in parallel via rayon. On S3 and GCS this typically cuts latency-dominated read time by an order of magnitude compared to sequential per-chunk fetches.
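The coalescing idea can be sketched in Python. This is a simplified model, not the Rust implementation; the gap threshold is an arbitrary assumption, and the real reader operates on shard-file byte ranges via object-store range requests:

```python
# Sketch of range coalescing: merge requested chunk byte ranges when the
# gap between them is small, so one request covers several chunks.

def coalesce_ranges(index, chunk_ids, gap=4096):
    """index: chunk_id -> (offset, length) from the shard's chunk index."""
    ranges = sorted(index[c] for c in chunk_ids)
    merged = []
    for off, length in ranges:
        if merged and off - (merged[-1][0] + merged[-1][1]) <= gap:
            last_off, _ = merged[-1]
            merged[-1] = (last_off, off + length - last_off)  # extend
        else:
            merged.append((off, length))
    return merged

index = {0: (0, 100), 1: (100, 100), 2: (10_000_000, 100)}
coalesce_ranges(index, [0, 1, 2])
# adjacent chunks 0 and 1 collapse into one range; the distant chunk
# 2 stays a separate request
```

Turning N small latency-bound requests into a handful of larger ones is what dominates the speedup on S3 and GCS.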

BP-128 bitpacking (from BPCells)

When ingesting integer count data, lancell automatically applies BP-128 bitpacking with delta encoding to the sparse indices array, and BP-128 (no delta) to the values array. BP-128 is a SIMD-accelerated codec that packs integers using the minimum number of bits required per 128-element block.

This delivers compression ratios comparable to zstd on typical single-cell count matrices while decoding at memory bandwidth speeds — making it strictly better than general-purpose codecs for this data type. Chunk sizes that are multiples of 128 align perfectly with the codec's block boundaries.
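The per-block bit-width idea can be illustrated in pure Python. This models only the size calculation, not the real SIMD codec or its exact framing:

```python
# Illustrative model of BP-128's core idea: per 128-element block, store
# each integer in the minimum bit width the block's largest value needs.
# Delta encoding first shrinks sorted index arrays dramatically.

def bits_needed(block):
    return max(v.bit_length() for v in block) or 1

def packed_bits(values, delta=False, block=128):
    total = 0
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        if delta:  # sorted indices: encode gaps, which are much smaller
            chunk = [chunk[0]] + [b - a for a, b in zip(chunk, chunk[1:])]
        total += bits_needed(chunk) * len(chunk)
    return total

indices = list(range(0, 1280, 10))          # 128 sorted, evenly spaced indices
packed_bits(indices)                        # 11 bits/value -> 1408 bits
packed_bits(indices, delta=True)            # gaps of 10 fit in 4 bits -> 512
```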


Versioning

Lancell separates the writable ingest path from the read/query path with an explicit snapshot model:

  1. Ingest — write Zarr arrays and cell records freely, in parallel if needed.
  2. optimize() — compact Lance fragments, assign global_index to newly registered features, rebuild FTS indexes.
  3. snapshot() — validate consistency and record the current Lance table versions. Returns a version number.
  4. checkout(version) — open a read-only atlas pinned to that snapshot. Every table is pinned to the exact Lance version recorded at snapshot time.

atlas.optimize()
v0 = atlas.snapshot()       # validate + commit; returns version int

# read-only handle pinned to the latest snapshot (v0 here);
# concurrent ingestion won't affect it
atlas_r = RaggedAtlas.checkout_latest("/data/atlas/db", store=store)

# inspect available snapshots
RaggedAtlas.list_versions("/data/atlas/db")

Queries and training runs execute against a frozen, reproducible view of the atlas. Concurrent ingestion into the live atlas does not affect any checked-out handle.


Documentation

  • Data Structure — LanceDB + Zarr layout, pointer types, _feature_layouts feature mapping, versioning model.
  • Building an Atlas — end-to-end walkthrough with two heterogeneous datasets.
  • Array Storage — add_from_anndata internals, BP-128 bitpacking, CSC column index for fast feature-filtered reads.
  • Querying — AtlasQuery fluent builder: filtering, feature reconstruction, union/intersection joins, terminal methods.
  • PyTorch Data Loading — CellDataset, CellSampler, locality-aware bin-packing, make_loader.
  • Versioning — snapshot lifecycle, parallel write safety, checkout(), list_versions().
  • Schemas — LancellBaseSchema, pointer types, FeatureBaseSchema, DatasetRecord.
  • Full docs site

Acknowledgements

Methods

Datasets
