Cell-centric ML training backend on LanceDB and sharded Zarr


homeobox

Designed for building heterogeneous biomedical data atlases for interactive analysis and ML training.

Cell metadata lives in LanceDB, queryable with SQL predicates, vector search, and full-text search. Raw array data (count matrices, embeddings, images) lives in sharded Zarr.


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.12 or newer.

pip install homeobox          # core: atlas, querying, ingestion
pip install homeobox[ml]      # + PyTorch dataloader
pip install homeobox[bio]     # + scanpy, GEOparse
pip install homeobox[io]      # + S3/GCS/Azure, image codecs
pip install homeobox[viz]     # + marimo, matplotlib
pip install homeobox[all]     # everything

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release

The RaggedAtlas

Real-world atlas building involves datasets that were not designed to be compatible: different gene panels, different assay types, different obs schemas. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).

Homeobox's RaggedAtlas takes a different approach: each dataset occupies its own Zarr group with its own feature ordering. Every cell carries a pointer into its group.

Cell table (shared)                Zarr (per-dataset)
──────────────────                 ──────────────────
cell A  gene_expression → pbmc3k/  pbmc3k/   1838 genes, 2638 cells
cell B  gene_expression → pbmc3k/  pbmc68k/   765 genes,  700 cells
cell C  gene_expression → pbmc68k/

At query time, the reconstruction layer joins the feature spaces: it computes the union or intersection of global feature indices, scatters each group's data into the right columns, and returns a single AnnData with every cell correctly placed.
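The reconstruction step can be sketched in plain scipy. This is an illustrative toy, not homeobox internals: it uses gene symbols as the global index where homeobox uses registered global feature indices, and the `scatter` helper is hypothetical. Given two datasets with different gene panels, build a column index over the union and scatter each block into place:

```python
import numpy as np
from scipy import sparse

# Two ragged datasets with different gene panels and orderings.
genes_a = ["CD3D", "CD19", "MS4A1"]   # dataset A's ordering
genes_b = ["MS4A1", "NKG7", "CD3D"]   # dataset B's ordering

# Global column index over the union of both panels.
union = sorted(set(genes_a) | set(genes_b))
col_of = {g: j for j, g in enumerate(union)}

def scatter(block: sparse.csr_matrix, genes: list[str]) -> sparse.csr_matrix:
    """Place a dataset's columns into the union feature space."""
    out = sparse.lil_matrix((block.shape[0], len(union)), dtype=block.dtype)
    for local_j, g in enumerate(genes):
        out[:, col_of[g]] = block[:, local_j].toarray()
    return out.tocsr()

x_a = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))  # 2 cells x 3 genes
x_b = sparse.csr_matrix(np.array([[5, 0, 1]]))             # 1 cell  x 3 genes

# 3 cells x len(union) genes, each count in its correct global column.
joined = sparse.vstack([scatter(x_a, genes_a), scatter(x_b, genes_b)])
```

An intersection join is the same idea with the shared gene set instead of the union, dropping columns absent from any dataset.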

Quickstart

import os
import scanpy as sc
import homeobox as hox
from homeobox.schema import SparseZarrPointer

# 1. Define schemas: one for gene features, one for cell metadata
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str

class CellSchema(hox.HoxBaseSchema):
    gene_expression: SparseZarrPointer | None = None

# 2. Create an atlas
atlas_dir = "./hox_example_atlas/"
os.makedirs(atlas_dir, exist_ok=True)
atlas = hox.create_or_open_atlas(
    atlas_path=atlas_dir,
    cell_table_name="cells",
    cell_schema=CellSchema,
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetRecord,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset and register its genes
adata = sc.datasets.pbmc3k()  # 2,700 PBMCs, raw counts, sparse CSR
features = [GeneFeature(uid=g, gene_symbol=g) for g in adata.var_names]
atlas.register_features("gene_expression", features)

# 4. Prepare var and ingest
adata.var["global_feature_uid"] = adata.var_names
record = hox.DatasetRecord(
    zarr_group="pbmc3k", feature_space="gene_expression", n_cells=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, feature_space="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 5. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 6. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest(atlas_dir)
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738

Opening a public atlas

The CellxGene Census mouse atlas (about 44M cells) is available on S3. No schema class or store construction is needed; just pass the db_uri and S3 config:

import homeobox as hox

atlas = hox.RaggedAtlas.checkout_latest(
    db_uri="s3://epiblast-public/cellxgene_mouse_homeobox/lance_db",
    store_kwargs={"config": {"skip_signature": True, "region": "us-east-2"}},
)
adata = atlas.query().where("cell_type = 'neural cell'").limit(5000).to_anndata()

Querying

The cell table is a LanceDB table. The full query surface is available without custom loaders.

# SQL filter
adata = atlas_r.query().where("tissue = 'lung' AND cell_type IS NOT NULL").to_anndata()

# Vector similarity search
hits = atlas_r.query().search(query_vec, vector_column_name="embedding").limit(50).to_anndata()

# Feature-filtered query: reads only the byte ranges for those genes (CSC index)
adata = atlas_r.query().features(["CD3D", "CD19", "MS4A1"], "gene_expression").to_anndata()

# Intersection across ragged datasets (only genes shared by all)
shared = atlas_r.query().feature_join("intersection").to_anndata()

# Count by cell type (cheap, only fetches the grouping column)
atlas_r.query().count(group_by="cell_type")

For large results, .to_batches() provides a streaming iterator that avoids materializing everything at once. For multimodal atlases, .to_mudata() returns one AnnData per modality.
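The CSC index mentioned above pays off because of how CSC stores data: a column's nonzeros occupy one contiguous slice of the values array, so a gene subset maps to a handful of byte ranges rather than a full scan. A toy illustration (`column_byte_ranges` is a hypothetical helper, not homeobox API):

```python
import numpy as np
from scipy import sparse

# 3 cells x 4 genes, stored column-major (CSC).
x = sparse.csc_matrix(np.array([
    [1, 0, 2, 0],
    [0, 0, 3, 4],
    [5, 0, 6, 0],
], dtype=np.float32))

def column_byte_ranges(m, cols, itemsize=4):
    """Byte ranges into the values buffer covering the requested columns.
    In CSC, indptr[j]:indptr[j+1] bounds column j's nonzeros."""
    return [(int(m.indptr[j]) * itemsize,
             int(m.indptr[j + 1] - m.indptr[j]) * itemsize)
            for j in cols]

# Only columns 0 and 2 are fetched; the rest of the matrix is never read.
ranges = column_byte_ranges(x, [0, 2])
```

In a cell-major (CSR) layout the same gene subset would be scattered across every row, forcing a scan of all cells.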


Example Notebooks

The notebooks/ directory contains self-contained marimo notebooks that work after a plain pip install homeobox (no repo clone needed).

  • scbasecount_ragged_atlas.py: Explore a small 7.3M-cell atlas built from scBaseCount data (human + C. elegans). Covers versioning, metadata queries, ragged union/intersection joins, feature selection, AnnData reconstruction, and the PyTorch dataloader.
  • cellxgene_tiledb_vs_homeobox_benchmark.py: Load the 44M-cell CellxGene Census mouse atlas stored in homeobox format and benchmark it against TileDB-SOMA for ML dataloader throughput and AnnData query latency.

Performance

Benchmarked against TileDB-SOMA on a ~44M cell mouse atlas (CellxGene Census), reading from S3.

ML dataloader throughput

CellDataset is a map-style PyTorch dataset, in contrast to TileDB-SOMA's iterable-style dataset. This lets it leverage PyTorch's DataLoader for parallelism and locality-aware batching. Homeobox's dataloader achieves roughly an order of magnitude higher throughput than TileDB-SOMA on a single worker, even with fully random data shuffling.

Dataloader throughput: homeobox vs TileDB-SOMA

Workers          TileDB-SOMA     homeobox          Speedup
0 (in-process)   ~150 cells/s    ~1,600 cells/s    ~10x
4 workers        ~500 cells/s    ~3,150 cells/s    ~6x
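A minimal sketch of what locality-aware batching buys a map-style dataset: group sampled rows by the shard they live in, then emit batches that stay shard-local so each worker touches as few files as possible. This is illustrative only; the `shard_size` parameter and the greedy grouping are assumptions, not homeobox's actual sampler.

```python
from collections import defaultdict

def locality_bins(cell_rows, shard_size=1024, batch_size=256):
    """Greedy bin-packing sketch: bucket row indices by shard,
    then emit fixed-size batches walking shards in order."""
    by_shard = defaultdict(list)
    for r in cell_rows:
        by_shard[r // shard_size].append(r)
    batches, current = [], []
    for shard in sorted(by_shard):
        for r in by_shard[shard]:
            current.append(r)
            if len(current) == batch_size:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches

# A shuffled sample still yields batches that each hit one shard.
batches = locality_bins([5000, 10, 2050, 20, 5100, 2060],
                        shard_size=1024, batch_size=2)
```

Random shuffling happens at the sampling stage; the bin-packing only reorders within the sampled set, so each epoch still sees a random permutation while each batch reads far fewer shard files.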

Query → AnnData latency

Three access patterns: cell-oriented (filter by cell type, full matrix), feature-oriented (subset genes across a population), and combined.

Query latency: homeobox vs TileDB-SOMA

Homeobox is 1.7–3x faster across patterns, with the largest margin on feature-oriented queries where the CSC index avoids scanning irrelevant cells entirely.

Fast cloud reads: RustShardReader

Zarr's sharded format packs many chunks into a single object-store file, with an index recording each chunk's byte offset. The Python zarr stack issues one HTTP request per chunk even when chunks could be coalesced.

Homeobox's RustShardReader handles shard reads in Rust: it batches all requested ranges, issues one get_ranges call per shard file, and decodes chunks in parallel via rayon. On S3 and GCS this typically cuts latency-dominated read time by an order of magnitude compared to sequential per-chunk fetches.
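The batching idea is easy to sketch in Python: sort the (offset, length) pairs taken from the shard index and merge ranges that touch or sit within a small gap, so several chunks come back in one request. `coalesce_ranges` is an illustrative helper, not the RustShardReader API.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (offset, length) byte ranges that are adjacent or
    within max_gap bytes, so one GET can fetch several chunks."""
    out = []
    for off, length in sorted(ranges):
        if out and off <= out[-1][0] + out[-1][1] + max_gap:
            # Extend the previous range to cover this one.
            end = max(out[-1][0] + out[-1][1], off + length)
            out[-1] = (out[-1][0], end - out[-1][0])
        else:
            out.append((off, length))
    return out
```

Allowing a nonzero `max_gap` trades a few wasted bytes for fewer round trips, which is usually a win on high-latency object stores.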

BP-128 bitpacking (from BPCells)

When ingesting integer count data, homeobox automatically applies BP-128 bitpacking with delta encoding to the sparse indices array, and BP-128 (no delta) to the values array. BP-128 is a SIMD-accelerated codec that packs integers using the minimum number of bits required per 128-element block.

This delivers compression ratios comparable to zstd on typical single-cell count matrices while decoding at memory bandwidth speeds, making it strictly better than general-purpose codecs for this data type. Chunk sizes that are multiples of 128 align perfectly with the codec's block boundaries.
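The effect of delta encoding on sorted indices is easy to see with a toy bit-width calculation. This shows only the packing arithmetic; the real codec is BPCells' SIMD implementation.

```python
import numpy as np

def bp128_bits_per_block(values: np.ndarray, delta: bool = False) -> list[int]:
    """Bits per element each 128-value block needs:
    the bit length of the block's largest value."""
    v = np.diff(values, prepend=values[:1]) if delta else values
    return [int(v[i:i + 128].max()).bit_length() for i in range(0, len(v), 128)]

# Sorted CSR indices: the raw values here need 8 bits each, but the
# deltas between consecutive indices fit in 2 bits per element.
idx = np.arange(0, 256, 2)                         # 128 sorted even indices
raw = bp128_bits_per_block(idx)                    # [8]
packed = bp128_bits_per_block(idx, delta=True)     # [2]
```

On real count matrices the gaps between consecutive nonzero indices are similarly small, which is where most of the compression comes from.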


Versioning

Homeobox separates the writable ingest path from the read/query path with an explicit snapshot model:

  1. Ingest: write Zarr arrays and cell records freely, in parallel if needed.
  2. optimize(): compact Lance fragments, assign global_index to newly registered features, rebuild FTS indexes.
  3. snapshot(): validate consistency and record the current Lance table versions. Returns a version number.
  4. checkout(version): open a read-only atlas pinned to that snapshot. Every table is pinned to the exact Lance version recorded at snapshot time.

atlas.optimize()
v0 = atlas.snapshot()       # validate + commit; returns a version int

# read-only handle pinned to the latest snapshot (v0 here);
# concurrent ingestion won't affect it
atlas_r = hox.RaggedAtlas.checkout_latest("/data/atlas/db")

# inspect available snapshots
hox.RaggedAtlas.list_versions("/data/atlas/db")

Queries and training runs execute against a frozen, reproducible view of the atlas. Concurrent ingestion into the live atlas does not affect any checked-out handle.


Documentation

  • Data Structure: LanceDB + Zarr layout, pointer types, _feature_layouts feature mapping, versioning model.
  • Building an Atlas: end-to-end walkthrough with two heterogeneous datasets.
  • Array Storage: add_from_anndata internals, BP-128 bitpacking, CSC column index for fast feature-filtered reads.
  • Querying: AtlasQuery fluent builder, filtering, feature reconstruction, union/intersection joins, terminal methods.
  • PyTorch Data Loading: CellDataset, CellSampler, locality-aware bin-packing, make_loader.
  • Versioning: snapshot lifecycle, parallel write safety, checkout(), list_versions().
  • Schemas: HoxBaseSchema, pointer types, FeatureBaseSchema, DatasetRecord.
  • Full docs site

