
Cell-centric ML training backend on LanceDB and sharded Zarr

Project description

homeobox

Multimodal single-cell database built on LanceDB and Zarr. Designed for building heterogeneous cell atlases and training foundation models on them.

Cell metadata lives in LanceDB, queryable with SQL predicates, vector search, and full-text search. Raw array data (count matrices, embeddings, images) lives in sharded Zarr. A PyTorch-native data loading layer reads directly from those stores without intermediate copies or format conversions.


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.12 or newer.

pip install homeobox          # core: atlas, querying, ingestion
pip install homeobox[ml]      # + PyTorch dataloader
pip install homeobox[bio]     # + scanpy, GEOparse
pip install homeobox[io]      # + S3/GCS/Azure, image codecs
pip install homeobox[viz]     # + marimo, matplotlib
pip install homeobox[all]     # everything

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release

The RaggedAtlas

Real-world atlas building involves datasets that were not designed to be compatible: different gene panels, different assay types, different obs schemas. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).

Homeobox's RaggedAtlas takes a different approach: each dataset occupies its own Zarr group with its own feature ordering. Every cell carries a pointer into its group. The reconstruction layer handles union/intersection/feature-filter logic at query time. No padding is stored, no information is discarded at ingest.

Cell table (shared)                Zarr (per-dataset)
──────────────────                 ──────────────────
cell A  gene_expression → pbmc3k/  pbmc3k/   1838 genes, 2638 cells
cell B  gene_expression → pbmc3k/  pbmc68k/   765 genes,  700 cells
cell C  gene_expression → pbmc68k/

At query time, the reconstruction layer joins the feature spaces: it computes the union or intersection of global feature indices, scatters each group's data into the right columns, and returns a single AnnData with every cell correctly placed.
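The scatter step can be sketched in plain numpy. This is a toy two-dataset example: the gene names, matrices, and shapes are illustrative, not the registry's actual layout.

```python
import numpy as np

genes_a = ["CD3D", "CD19", "MS4A1"]   # dataset A's own feature ordering
genes_b = ["MS4A1", "CD3D", "NKG7"]   # dataset B's own feature ordering

# Global union feature space, with one column position per gene
union = sorted(set(genes_a) | set(genes_b))
col = {g: i for i, g in enumerate(union)}

x_a = np.array([1, 2, 3])   # one cell from dataset A, in A's ordering
x_b = np.array([4, 5, 6])   # one cell from dataset B, in B's ordering

# Scatter each dataset's columns into the union matrix; genes a dataset
# never measured stay zero, and nothing was padded at ingest time
out = np.zeros((2, len(union)))
out[0, [col[g] for g in genes_a]] = x_a
out[1, [col[g] for g in genes_b]] = x_b
```

An intersection join is the mirror image: keep only the columns present in every dataset and gather each group's matching columns instead of scattering.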

Quickstart

import os
import scanpy as sc
import homeobox as hox
from homeobox.schema import SparseZarrPointer, DatasetRecord

# 1. Define schemas: one for gene features, one for cell metadata
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str

class CellSchema(hox.HoxBaseSchema):
    gene_expression: SparseZarrPointer | None = None

# 2. Create an atlas
atlas_dir = "./hox_example_atlas/"
os.makedirs(atlas_dir, exist_ok=True)
atlas = hox.create_or_open_atlas(
    atlas_path=atlas_dir,
    cell_table_name="cells",
    cell_schema=CellSchema,
    dataset_table_name="datasets",
    dataset_schema=DatasetRecord,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset and register its genes
adata = sc.datasets.pbmc3k()  # 2,700 PBMCs, raw counts, sparse CSR
features = [GeneFeature(uid=g, gene_symbol=g) for g in adata.var_names]
atlas.register_features("gene_expression", features)

# 4. Prepare var and ingest
adata.var["global_feature_uid"] = adata.var_names
record = DatasetRecord(
    zarr_group="pbmc3k", feature_space="gene_expression", n_cells=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, feature_space="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 5. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 6. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest(atlas_dir)
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738

Opening a public atlas

The CellxGene Census mouse atlas (about 44M cells) is available on S3. No schema class or store construction needed, just db_uri and S3 config:

import homeobox as hox

atlas = hox.RaggedAtlas.checkout_latest(
    db_uri="s3://epiblast-public/cellxgene_mouse_homeobox/lance_db",
    store_kwargs={"config": {"skip_signature": True, "region": "us-east-2"}},
)

atlas.query().count()                                           # 43,969,325
adata = atlas.query().where("cell_type = 'neural cell'").limit(5000).to_anndata()

Querying

The cell table is a LanceDB table. The full query surface is available without custom loaders.

# SQL filter
adata = atlas_r.query().where("tissue = 'lung' AND cell_type IS NOT NULL").to_anndata()

# Vector similarity search
hits = atlas_r.query().search(query_vec, vector_column_name="embedding").limit(50).to_anndata()

# Feature-filtered query: reads only the byte ranges for those genes (CSC index)
adata = atlas_r.query().features(["CD3D", "CD19", "MS4A1"], "gene_expression").to_anndata()

# Intersection across ragged datasets (only genes shared by all)
shared = atlas_r.query().feature_join("intersection").to_anndata()

# Count by cell type (cheap, only fetches the grouping column)
atlas_r.query().count(group_by="cell_type")

For large results, .to_batches() provides a streaming iterator that avoids materialising everything at once. .to_mudata() returns one AnnData per modality for multimodal atlases.
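The streaming pattern looks like this. The batch source below is simulated with random data; with homeobox it would be the iterator returned by .to_batches().

```python
import numpy as np

def batches():
    # Stand-in for a .to_batches() iterator: four (cells, genes) chunks
    rng = np.random.default_rng(0)
    for _ in range(4):
        yield rng.poisson(1.0, size=(1000, 50))

# Aggregate incrementally; only one chunk is ever resident in memory
n_cells = 0
gene_sum = np.zeros(50)
for X in batches():
    n_cells += X.shape[0]
    gene_sum += X.sum(axis=0)

mean_expression = gene_sum / n_cells  # per-gene mean over all 4,000 cells
```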


Example Notebooks

The notebooks/ directory contains self-contained marimo notebooks that work after a plain pip install homeobox (no repo clone needed).

  • scbasecount_ragged_atlas.py: Explore a small 7.3M-cell atlas built from scBaseCount data (human + C. elegans). Covers versioning, metadata queries, ragged union/intersection joins, feature selection, AnnData reconstruction, and the PyTorch dataloader.
  • cellxgene_tiledb_vs_homeobox_benchmark.py: Load the 44M-cell CellxGene Census mouse atlas stored in homeobox format and benchmark it against TileDB-SOMA for ML dataloader throughput and AnnData query latency.

Performance

Benchmarked against TileDB-SOMA on a ~44M cell mouse atlas (CellxGene Census), reading from S3.

ML dataloader throughput

CellDataset is a map-style PyTorch dataset, in contrast to TileDB-SOMA's iterable-style dataset. This lets it leverage PyTorch's DataLoader for parallelism and locality-aware batching. Homeobox's dataloader achieves roughly an order of magnitude higher throughput than TileDB-SOMA on a single worker, even with fully random data shuffling.

Dataloader throughput: homeobox vs TileDB-SOMA

Workers         TileDB-SOMA    homeobox         Speedup
0 (in-process)  ~150 cells/s   ~1,600 cells/s   ~10x
4 workers       ~500 cells/s   ~3,150 cells/s   ~6x
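A minimal sketch of the locality-aware idea, using a toy packer rather than homeobox's actual sampler: after a global shuffle, cells are regrouped by shard so each batch touches as few object-store files as possible. Shard assignment and sizes here are invented for illustration.

```python
import random
from collections import defaultdict

random.seed(0)
cells = [(i, i // 1000) for i in range(10_000)]  # (cell_id, shard_id), 10 shards
random.shuffle(cells)                            # fully random epoch order

# Regroup the shuffled cells by shard
by_shard = defaultdict(list)
for cell_id, shard in cells:
    by_shard[shard].append(cell_id)

# Pack batches shard-by-shard: a batch then spans at most two shard files
batch_size = 256
batches, current = [], []
for shard in sorted(by_shard, key=lambda s: len(by_shard[s]), reverse=True):
    for cell_id in by_shard[shard]:
        current.append(cell_id)
        if len(current) == batch_size:
            batches.append(current)
            current = []
if current:
    batches.append(current)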

Query → AnnData latency

Three access patterns: cell-oriented (filter by cell type, full matrix), feature-oriented (subset genes across a population), and combined.

Query latency: homeobox vs TileDB-SOMA

Homeobox is 1.7–3x faster across patterns, with the largest margin on feature-oriented queries where the CSC index avoids scanning irrelevant cells entirely.
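The CSC mechanism can be illustrated with a tiny column-compressed triple (illustrative arrays; homeobox stores the equivalent index alongside its Zarr arrays): each gene's nonzeros are contiguous, so selecting genes means reading only those slices.

```python
import numpy as np

# CSC arrays for a toy 4-cell x 3-gene sparse matrix
data = np.array([5, 1, 2, 7, 3])     # nonzero values, column-major
indices = np.array([0, 2, 1, 3, 0])  # row (cell) index of each value
indptr = np.array([0, 2, 4, 5])      # gene g lives in data[indptr[g]:indptr[g+1]]

gene = 1                              # fetch only gene 1's slice
start, stop = indptr[gene], indptr[gene + 1]
vals, rows = data[start:stop], indices[start:stop]
# Genes 0 and 2 are never touched; on object storage that slice maps to a
# single byte range rather than a scan over every cell
```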

Fast cloud reads: RustShardReader

Zarr's sharded format packs many chunks into a single object-store file, with an index recording each chunk's byte offset. The Python zarr stack issues one HTTP request per chunk even when chunks could be coalesced.

Homeobox's RustShardReader handles shard reads in Rust: it batches all requested ranges, issues one get_ranges call per shard file, and decodes chunks in parallel via rayon. On S3 and GCS this typically cuts latency-dominated read time by an order of magnitude compared to sequential per-chunk fetches.
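The batching idea can be sketched in Python (an illustrative coalescer, not the Rust implementation): collect every requested chunk's byte range, group by shard file, and merge adjacent ranges so each shard needs one ranged request.

```python
def coalesce(requests, gap=0):
    """requests: list of (shard, offset, length) tuples.
    Returns {shard: [(offset, length), ...]} with touching ranges merged."""
    by_shard = {}
    for shard, off, length in sorted(requests):
        merged = by_shard.setdefault(shard, [])
        if merged and off <= merged[-1][0] + merged[-1][1] + gap:
            prev_off, prev_len = merged[-1]
            end = max(prev_off + prev_len, off + length)
            merged[-1] = (prev_off, end - prev_off)   # extend previous range
        else:
            merged.append((off, length))              # start a new range
    return by_shard

reqs = [("shard0", 0, 100), ("shard0", 100, 50), ("shard0", 400, 10),
        ("shard1", 0, 20)]
plan = coalesce(reqs)  # one ranged request per shard instead of one per chunk
```

Four chunk fetches collapse into three ranges across two shard files; on a latency-dominated store that is the difference between four round trips and two.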

BP-128 bitpacking (from BPCells)

When ingesting integer count data, homeobox automatically applies BP-128 bitpacking with delta encoding to the sparse indices array, and BP-128 (no delta) to the values array. BP-128 is a SIMD-accelerated codec that packs integers using the minimum number of bits required per 128-element block.

This delivers compression ratios comparable to zstd on typical single-cell count matrices while decoding at memory bandwidth speeds, making it strictly better than general-purpose codecs for this data type. Chunk sizes that are multiples of 128 align perfectly with the codec's block boundaries.
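A back-of-the-envelope sketch of the per-block bit-width idea (pure numpy arithmetic, not the SIMD codec): delta-encode a sorted index array, then charge each 128-element block the bit width of its largest delta.

```python
import numpy as np

rng = np.random.default_rng(0)
indices = np.sort(rng.choice(100_000, size=1024, replace=False))

# Delta encoding turns large absolute indices into small gaps
deltas = np.diff(indices, prepend=indices[:1])
blocks = deltas.reshape(-1, 128)          # 8 blocks of 128 deltas each

# Each block is packed at the bit width of its own largest delta
bits = [max(int(b.max()).bit_length(), 1) for b in blocks]
packed_bits = sum(w * 128 for w in bits)
raw_bits = 32 * len(indices)              # uncompressed int32 baseline
# packed_bits comes out far below raw_bits because the deltas are small
```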


Versioning

Homeobox separates the writable ingest path from the read/query path with an explicit snapshot model:

  1. Ingest: write Zarr arrays and cell records freely, in parallel if needed.
  2. optimize(): compact Lance fragments, assign global_index to newly registered features, rebuild FTS indexes.
  3. snapshot(): validate consistency and record the current Lance table versions. Returns a version number.
  4. checkout(version): open a read-only atlas pinned to that snapshot. Every table is pinned to the exact Lance version recorded at snapshot time.

atlas.optimize()
v0 = atlas.snapshot()       # validate + commit; returns version int

# read-only handle pinned to the latest snapshot (v0 here);
# concurrent ingestion won't affect it
atlas_r = hox.RaggedAtlas.checkout_latest("/data/atlas/db")

# inspect available snapshots
hox.RaggedAtlas.list_versions("/data/atlas/db")

Queries and training runs execute against a frozen, reproducible view of the atlas. Concurrent ingestion into the live atlas does not affect any checked-out handle.
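The lifecycle above can be modelled with a toy class (plain dicts standing in for Lance tables; this is not homeobox's API): a snapshot records each table's version, and a checkout reads through that frozen mapping.

```python
class ToyAtlas:
    """Toy model of snapshot semantics: versions per table, frozen on snapshot."""

    def __init__(self):
        self.table_versions = {"cells": 0, "datasets": 0}
        self.snapshots = []

    def ingest(self, table):
        self.table_versions[table] += 1       # live writes bump the version

    def snapshot(self):
        self.snapshots.append(dict(self.table_versions))
        return len(self.snapshots) - 1        # returns a version number

    def checkout(self, version):
        return self.snapshots[version]        # frozen view of table versions

atlas = ToyAtlas()
atlas.ingest("cells")
v0 = atlas.snapshot()
frozen = atlas.checkout(v0)
atlas.ingest("cells")   # concurrent ingest after the snapshot
# frozen still records "cells" at the version captured by snapshot()
```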


Documentation

  • Data Structure: LanceDB + Zarr layout, pointer types, _feature_layouts feature mapping, versioning model.
  • Building an Atlas: end-to-end walkthrough with two heterogeneous datasets.
  • Array Storage: add_from_anndata internals, BP-128 bitpacking, CSC column index for fast feature-filtered reads.
  • Querying: AtlasQuery fluent builder, filtering, feature reconstruction, union/intersection joins, terminal methods.
  • PyTorch Data Loading: CellDataset, CellSampler, locality-aware bin-packing, make_loader.
  • Versioning: snapshot lifecycle, parallel write safety, checkout(), list_versions().
  • Schemas: HoxBaseSchema, pointer types, FeatureBaseSchema, DatasetRecord.
  • Full docs site

Acknowledgements

Methods

Datasets



Download files

Download the file for your platform.

Source Distribution

homeobox-0.2.1.tar.gz (648.4 kB)

Uploaded: Source

Built Distributions


homeobox-0.2.1-cp312-abi3-win_amd64.whl (5.6 MB)

Uploaded: CPython 3.12+, Windows x86-64

homeobox-0.2.1-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB)

Uploaded: CPython 3.12+, manylinux: glibc 2.17+ x86-64

homeobox-0.2.1-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.8 MB)

Uploaded: CPython 3.12+, manylinux: glibc 2.17+ ARM64

homeobox-0.2.1-cp312-abi3-macosx_11_0_arm64.whl (6.0 MB)

Uploaded: CPython 3.12+, macOS 11.0+ ARM64

homeobox-0.2.1-cp312-abi3-macosx_10_12_x86_64.whl (6.3 MB)

Uploaded: CPython 3.12+, macOS 10.12+ x86-64

File details

Details for the file homeobox-0.2.1.tar.gz.

File metadata

  • Download URL: homeobox-0.2.1.tar.gz
  • Size: 648.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for homeobox-0.2.1.tar.gz
Algorithm Hash digest
SHA256 6bd4dfc5c967b123107591040bcd75c8d7231da903424795d7fbbfd9da0e6ef5
MD5 0f2b2124e8f7d19be64162c36cdd02cc
BLAKE2b-256 82ace65aab1ab699ed5ea169a7ff8625d86cb033d72750c4f072f129186df257


Provenance

The following attestation bundles were made for homeobox-0.2.1.tar.gz:

Publisher: release.yml on epiblastai/homeobox

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

