
Cell-centric ML training backend on LanceDB and sharded zarr


homeobox

Designed for building heterogeneous biomedical data atlases for interactive analysis and ML training.

Cell metadata lives in LanceDB, queryable with SQL predicates, vector search, and full-text search. Raw array data (count matrices, embeddings, images) lives in sharded Zarr.


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.12 or newer.

pip install homeobox            # core: atlas, querying, ingestion
pip install "homeobox[ml]"      # + PyTorch dataloader
pip install "homeobox[bio]"     # + scanpy, GEOparse
pip install "homeobox[io]"      # + S3/GCS/Azure, image codecs
pip install "homeobox[viz]"     # + marimo, matplotlib
pip install "homeobox[all]"     # everything

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release

The RaggedAtlas

Real-world atlas building involves datasets that were not designed to be compatible: different gene panels, different assay types, different obs schemas. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).

Homeobox's RaggedAtlas takes a different approach: each dataset occupies its own Zarr group with its own feature ordering. Every cell carries a pointer into its group.

Cell table (shared)                Zarr (per-dataset)
──────────────────                 ──────────────────
cell A  gene_expression → pbmc3k/  pbmc3k/   1838 genes, 2638 cells
cell B  gene_expression → pbmc3k/  pbmc68k/   765 genes,  700 cells
cell C  gene_expression → pbmc68k/

At query time, the reconstruction layer joins the feature spaces: it computes the union or intersection of global feature indices, scatters each group's data into the right columns, and returns a single AnnData with every cell correctly placed.
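The union join can be sketched in a few lines of NumPy. The toy panels, matrices, and scatter logic below are illustrative only, not homeobox's internal API:

```python
import numpy as np

# Two toy datasets with different, overlapping gene panels
panel_a = ["CD3D", "MS4A1", "NKG7"]          # group A's feature order
panel_b = ["MS4A1", "CD19", "CD3D"]          # group B's feature order
X_a = np.array([[1, 0, 2]])                  # one cell from group A
X_b = np.array([[5, 3, 0]])                  # one cell from group B

# Union join: global column order is the sorted union of feature UIDs
union = sorted(set(panel_a) | set(panel_b))  # ['CD19', 'CD3D', 'MS4A1', 'NKG7']
col = {g: i for i, g in enumerate(union)}

# Scatter each group's columns into the global positions
out = np.zeros((2, len(union)), dtype=X_a.dtype)
out[0, [col[g] for g in panel_a]] = X_a[0]
out[1, [col[g] for g in panel_b]] = X_b[0]
```

An intersection join is the same idea with `set(panel_a) & set(panel_b)` and no zero-padding, which is why it is lossless for shared genes but drops the rest.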

Quickstart

import os
import scanpy as sc
import homeobox as hox
from homeobox.schema import SparseZarrPointer

# 1. Define schemas: one for gene features, one for cell metadata
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str

class CellSchema(hox.HoxBaseSchema):
    gene_expression: SparseZarrPointer | None = None

# 2. Create an atlas
atlas_dir = "./hox_example_atlas/"
os.makedirs(atlas_dir, exist_ok=True)
atlas = hox.create_or_open_atlas(
    atlas_path=atlas_dir,
    cell_table_name="cells",
    cell_schema=CellSchema,
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetRecord,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset and register its genes
adata = sc.datasets.pbmc3k()  # 2,700 PBMCs, raw counts, sparse CSR
features = [GeneFeature(uid=g, gene_symbol=g) for g in adata.var_names]
atlas.register_features("gene_expression", features)

# 4. Prepare var and ingest
adata.var["global_feature_uid"] = adata.var_names
record = hox.DatasetRecord(
    zarr_group="pbmc3k", feature_space="gene_expression", n_cells=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, feature_space="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 5. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 6. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest(atlas_dir)
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738

Opening a public atlas

The CellxGene Census mouse atlas (about 44M cells) is available on S3. No schema class or store construction is needed; just pass db_uri and the S3 config:

import homeobox as hox

atlas = hox.RaggedAtlas.checkout_latest(
    db_uri="s3://epiblast-public/cellxgene_mouse_homeobox/lance_db",
    store_kwargs={"config": {"skip_signature": True, "region": "us-east-2"}},
)
adata = atlas.query().where("cell_type = 'neural cell'").limit(5000).to_anndata()

Querying

The cell table is a LanceDB table. The full query surface is available without custom loaders.

# SQL filter
adata = atlas_r.query().where("tissue = 'lung' AND cell_type IS NOT NULL").to_anndata()

# Vector similarity search
hits = atlas_r.query().search(query_vec, vector_column_name="embedding").limit(50).to_anndata()

# Feature-filtered query: reads only the byte ranges for those genes (CSC index)
adata = atlas_r.query().features(["CD3D", "CD19", "MS4A1"], "gene_expression").to_anndata()

# Intersection across ragged datasets (only genes shared by all)
shared = atlas_r.query().feature_join("intersection").to_anndata()

# Count by cell type (cheap, only fetches the grouping column)
atlas_r.query().count(group_by="cell_type")

For large results, .to_batches() provides a streaming iterator that avoids materialising everything at once. .to_mudata() returns one AnnData per modality for multimodal atlases.
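The CSC column index behind feature-filtered reads can be illustrated with a toy in-memory version. The arrays and `column` helper here are hypothetical, not homeobox code; the point is that each gene's entries occupy one contiguous indptr-delimited slice, so a feature query maps to a few contiguous byte ranges instead of a full scan:

```python
import numpy as np

# Toy CSC matrix with 3 columns (genes): indptr[j]..indptr[j+1]
# bounds column j's entries in the indices/data arrays.
indptr = np.array([0, 2, 2, 5])
indices = np.array([0, 3, 1, 2, 4])    # row ids (cells)
data = np.array([7, 1, 4, 4, 2])       # counts

def column(j):
    """Read one gene: a single contiguous slice, i.e. one ranged read."""
    lo, hi = indptr[j], indptr[j + 1]
    return indices[lo:hi], data[lo:hi]

rows, vals = column(2)                 # third gene's cells and counts
```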


Example Notebooks

The notebooks/ directory contains self-contained marimo notebooks that work after a plain pip install homeobox (no repo clone needed).

  • scbasecount_ragged_atlas.py: Explore a small 7.3M-cell atlas built from scBaseCount data (human + C. elegans). Covers versioning, metadata queries, ragged union/intersection joins, feature selection, AnnData reconstruction, and the PyTorch dataloader.
  • cellxgene_tiledb_vs_homeobox_benchmark.py: Load the 44M-cell CellxGene Census mouse atlas stored in homeobox format and benchmark it against TileDB-SOMA for ML dataloader throughput and AnnData query latency.

Performance

Benchmarked against TileDB-SOMA on a ~44M cell mouse atlas (CellxGene Census), reading from S3.

ML dataloader throughput

CellDataset is a map-style PyTorch dataset, unlike TileDB-SOMA's iterable-style dataset. This lets it leverage PyTorch's DataLoader for parallelism and locality-aware batching. Homeobox's dataloader achieves roughly an order of magnitude higher throughput than TileDB-SOMA on a single worker, even with fully random data shuffling.

Dataloader throughput: homeobox vs TileDB-SOMA

Workers         TileDB-SOMA    homeobox         Speedup
0 (in-process)  ~150 cells/s   ~1,600 cells/s   ~10x
4               ~500 cells/s   ~3,150 cells/s   ~6x
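The locality-aware batching idea can be sketched without homeobox: shuffle cells globally for the epoch, then order each batch by shard so its reads cluster into few shard files. The `shard_of` assignment and batching logic below are illustrative assumptions, not the library's implementation:

```python
import random

def locality_batches(cell_ids, shard_of, batch_size, seed=0):
    """Fully random epoch order, but each batch's reads grouped by shard."""
    rng = random.Random(seed)
    ids = list(cell_ids)
    rng.shuffle(ids)                     # global shuffle for the epoch
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        batch.sort(key=shard_of)         # contiguous reads per shard
        yield batch

shard_of = lambda cid: cid // 100        # toy shard assignment: 100 cells/shard
batches = list(locality_batches(range(1000), shard_of, batch_size=64))
```

Because each batch is fetched shard-by-shard, a map-style dataset like this composes with DataLoader workers: each worker pulls whole batches of indices and issues its own batched reads.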

Query → AnnData latency

Three access patterns: cell-oriented (filter by cell type, full matrix), feature-oriented (subset genes across a population), and combined.

Query latency: homeobox vs TileDB-SOMA

Homeobox is 1.7–3x faster across patterns, with the largest margin on feature-oriented queries where the CSC index avoids scanning irrelevant cells entirely.

Fast cloud reads: RustShardReader

Zarr's sharded format packs many chunks into a single object-store file, with an index recording each chunk's byte offset. The Python zarr stack issues one HTTP request per chunk even when chunks could be coalesced.

Homeobox's RustShardReader handles shard reads in Rust: it batches all requested ranges, issues one get_ranges call per shard file, and decodes chunks in parallel via rayon. On S3 and GCS this typically cuts latency-dominated read time by an order of magnitude compared to sequential per-chunk fetches.
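The payoff of batching ranges can be illustrated with a toy coalescing function. The real reader works in Rust via `get_ranges`; everything below is a Python sketch of the idea:

```python
def coalesce(ranges, gap=0):
    """Merge (start, end) byte ranges whose gap is <= `gap`,
    turning many per-chunk reads into few ranged GETs per shard."""
    ranges = sorted(ranges)
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Three chunk reads against one shard collapse into two ranged requests
coalesce([(0, 10), (10, 20), (100, 120)])
```

On a latency-dominated object store, request count, not bytes, dominates wall time, which is why merging adjacent ranges (even across small gaps) is worth a little over-read.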

BP-128 bitpacking (from BPCells)

When ingesting integer count data, homeobox automatically applies BP-128 bitpacking with delta encoding to the sparse indices array, and BP-128 (no delta) to the values array. BP-128 is a SIMD-accelerated codec that packs integers using the minimum number of bits required per 128-element block.

This delivers compression ratios comparable to zstd on typical single-cell count matrices while decoding at memory bandwidth speeds, making it strictly better than general-purpose codecs for this data type. Chunk sizes that are multiples of 128 align perfectly with the codec's block boundaries.
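A toy calculation shows why delta encoding helps: sorted sparse indices have small gaps, so each 128-element block needs only a few bits. The helper below illustrates the per-block bit-width rule, not the SIMD codec itself:

```python
def bits_per_block(values, block=128):
    """Minimum bits to store each `block`-sized run of non-negative
    integers, as a BP-128-style packer would size them."""
    widths = []
    for i in range(0, len(values), block):
        m = max(values[i:i + block])
        widths.append(m.bit_length())    # 0 for an all-zero block
    return widths

# Sorted indices of every 4th gene: raw values grow, deltas stay tiny
indices = list(range(0, 512, 4))         # 128 indices, max 508
deltas = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]

bits_per_block(indices)                  # [9]  -> 9 bits per index
bits_per_block(deltas)                   # [3]  -> 3 bits per index
```

A 3x reduction on the index array before any entropy coding, at a decode cost of one prefix sum, which is why the indices get delta encoding while the values (not sorted, so deltas would not help) are packed directly.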


Versioning

Homeobox separates the writable ingest path from the read/query path with an explicit snapshot model:

  1. Ingest: write Zarr arrays and cell records freely, in parallel if needed.
  2. optimize(): compact Lance fragments, assign global_index to newly registered features, rebuild FTS indexes.
  3. snapshot(): validate consistency and record the current Lance table versions. Returns a version number.
  4. checkout(version): open a read-only atlas pinned to that snapshot. Every table is pinned to the exact Lance version recorded at snapshot time.

atlas.optimize()
v0 = atlas.snapshot()       # validate + commit; returns version int

# read-only handle pinned to the latest snapshot (v0 here);
# concurrent ingestion won't affect it
atlas_r = hox.RaggedAtlas.checkout_latest("/data/atlas/db", store=store)

# inspect available snapshots
hox.RaggedAtlas.list_versions("/data/atlas/db")
Queries and training runs execute against a frozen, reproducible view of the atlas. Concurrent ingestion into the live atlas does not affect any checked-out handle.


Documentation

  • Data Structure: LanceDB + Zarr layout, pointer types, _feature_layouts feature mapping, versioning model.
  • Building an Atlas: end-to-end walkthrough with two heterogeneous datasets.
  • Array Storage: add_from_anndata internals, BP-128 bitpacking, CSC column index for fast feature-filtered reads.
  • Querying: AtlasQuery fluent builder, filtering, feature reconstruction, union/intersection joins, terminal methods.
  • PyTorch Data Loading: CellDataset, CellSampler, locality-aware bin-packing, make_loader.
  • Versioning: snapshot lifecycle, parallel write safety, checkout(), list_versions().
  • Schemas: HoxBaseSchema, pointer types, FeatureBaseSchema, DatasetRecord.
  • Full docs site

Acknowledgements

Methods

Datasets



Download files

Download the file for your platform.

Source Distribution

  • homeobox-0.2.3.tar.gz (660.4 kB): Source

Built Distributions

  • homeobox-0.2.3-cp312-abi3-win_amd64.whl (5.6 MB): CPython 3.12+, Windows x86-64
  • homeobox-0.2.3-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB): CPython 3.12+, manylinux glibc 2.17+, x86-64
  • homeobox-0.2.3-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.8 MB): CPython 3.12+, manylinux glibc 2.17+, ARM64
  • homeobox-0.2.3-cp312-abi3-macosx_11_0_arm64.whl (6.0 MB): CPython 3.12+, macOS 11.0+, ARM64
  • homeobox-0.2.3-cp312-abi3-macosx_10_12_x86_64.whl (6.3 MB): CPython 3.12+, macOS 10.12+, x86-64

File details

All files were uploaded using Trusted Publishing via twine/6.1.0 on CPython 3.13.7, each with an attestation bundle from release.yml on epiblastai/homeobox. Attestation values reflect the state when the release was signed and may no longer be current.

homeobox-0.2.3.tar.gz
  SHA256       1da2bcbfc4661824403d00ee51a7d39c57f99df47b8aef96658d8ba43187eaf3
  MD5          7faa119d853b87b46a990e8bf3d26359
  BLAKE2b-256  5e3e34c61b1157b1065aa87fe7ff5c5e55ea14415e1436bffc6c93b09c07d387

homeobox-0.2.3-cp312-abi3-win_amd64.whl
  SHA256       fcccd5ec91176cca09cba3e5735e1dab5457442cceba0692f8c09921e7074213
  MD5          d06026fe1c5a5441bb95a8168f22f4ca
  BLAKE2b-256  076d29e1f808fd82dae414f7358600232af8a6c924f0f293e527a99547e62858

homeobox-0.2.3-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       902178c1f4c3ed761e111d4ed3f62835f7656171ff6c31aa6189d0a5127f89d6
  MD5          15b09f3e2645110aa9e37b4514079ff8
  BLAKE2b-256  ed66e4cf23ea63044ecb5dfc63840a90352b479a6c1185b47e349edfccc5e5a2

homeobox-0.2.3-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256       5b2ab85338293bf18547d3ad9296aaeb80cb0486bb94576c2d4eca45c2db3b79
  MD5          6805952c488101e91bea6686b305154b
  BLAKE2b-256  f5b87d59c54390107a638340b0e2c4e429d6b05cad01f66e9391c9484a691052

homeobox-0.2.3-cp312-abi3-macosx_11_0_arm64.whl
  SHA256       303cbba7216ffc43497aeb719efc9dcc642ed4cc80d6bcad809945a828b0cb21
  MD5          5b91d927f47817a88e1bda9333da2629
  BLAKE2b-256  5c832ab7c4ebb4984402dc10ed41df89ef6fcc18b1f3a5638ffdc06981e22153

homeobox-0.2.3-cp312-abi3-macosx_10_12_x86_64.whl
  SHA256       82faf7497162eb546fd5b102255f42d6abf029b273703c6892344dd8251cae77
  MD5          5561c96c3321d9b64a44d072387602df
  BLAKE2b-256  008173586491ae794da23bdfab95f244aefab954571132467572bbab72be0826
