Cell-centric ML training backend on LanceDB and sharded zarr
homeobox
Multimodal single-cell database built on LanceDB and Zarr. Designed for building heterogeneous cell atlases and training foundation models on them.
Cell metadata lives in LanceDB, queryable with SQL predicates, vector search, and full-text search. Raw array data (count matrices, embeddings, images) lives in sharded Zarr. A PyTorch-native data loading layer reads directly from those stores without intermediate copies or format conversions.
Installation
Prebuilt wheels are available on PyPI. Requires Python 3.13.
pip install homeobox            # core: atlas, querying, ingestion
pip install "homeobox[ml]"      # + PyTorch dataloader
pip install "homeobox[bio]"     # + scanpy, GEOparse
pip install "homeobox[io]"      # + S3/GCS/Azure, image codecs
pip install "homeobox[viz]"     # + marimo, matplotlib
pip install "homeobox[all]"     # everything (quotes keep shells such as zsh from expanding the brackets)
To build from source (requires a Rust toolchain):
curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release
The RaggedAtlas
Real-world atlas building involves datasets that were not designed to be compatible: different gene panels, different assay types, different obs schemas. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).
Homeobox's RaggedAtlas takes a different approach: each dataset occupies its own Zarr group with its own feature ordering. Every cell carries a pointer into its group. The reconstruction layer handles union/intersection/feature-filter logic at query time. No padding is stored, no information is discarded at ingest.
Cell table (shared)                   Zarr (per-dataset)
──────────────────                    ──────────────────
cell A  gene_expression → pbmc3k/     pbmc3k/   1838 genes, 2638 cells
cell B  gene_expression → pbmc3k/     pbmc68k/   765 genes,  700 cells
cell C  gene_expression → pbmc68k/
At query time, the reconstruction layer joins the feature spaces: it computes the union or intersection of global feature indices, scatters each group's data into the right columns, and returns a single AnnData with every cell correctly placed.
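The join described above can be illustrated with a toy, pure-Python sketch. It is not homeobox code: the `groups` and `cells` structures and the `reconstruct` helper are hypothetical stand-ins for the Zarr groups, the cell table pointers, and the reconstruction layer, and dense lists replace the sparse arrays a real atlas would hold.

```python
# Toy ragged join. All names are illustrative assumptions, not homeobox API.

# Each group stores its own local column order as global feature indices.
groups = {
    "pbmc3k": {"global_idx": [0, 2, 3], "rows": [[5, 0, 1], [0, 2, 0]]},
    "pbmc68k": {"global_idx": [3, 1], "rows": [[4, 7]]},
}
# Every cell carries a pointer: (group name, row within that group).
cells = [("pbmc3k", 0), ("pbmc3k", 1), ("pbmc68k", 0)]


def joined_features(groups, how):
    """Return sorted global feature indices for a union or intersection join."""
    sets = [set(group["global_idx"]) for group in groups.values()]
    if how == "union":
        combined = set.union(*sets)
    elif how == "intersection":
        combined = set.intersection(*sets)
    else:
        raise ValueError(f"unknown join: {how!r}")
    return sorted(combined)


def reconstruct(cells, groups, how="union"):
    """Scatter each cell's values into the columns of the joined feature space."""
    features = joined_features(groups, how)
    column_of = {global_idx: j for j, global_idx in enumerate(features)}
    matrix = []
    for group_name, row in cells:
        group = groups[group_name]
        out = [0] * len(features)
        for local_col, global_idx in enumerate(group["global_idx"]):
            j = column_of.get(global_idx)
            if j is not None:  # feature dropped by an intersection join
                out[j] = group["rows"][row][local_col]
        matrix.append(out)
    return features, matrix


union_features, union_matrix = reconstruct(cells, groups, "union")
shared_features, shared_matrix = reconstruct(cells, groups, "intersection")
```

In the union case the cell from pbmc68k gets zeros in the columns its dataset never measured; in the intersection case only global feature 3 survives, since it is the one feature both groups share.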
Quickstart
import os
import scanpy as sc
import homeobox as hox
# DatasetRecord is used in steps 2 and 4; it is assumed here to be importable
# from homeobox.schema alongside the pointer types.
from homeobox.schema import DatasetRecord, SparseZarrPointer
# 1. Define schemas: one for gene features, one for cell metadata
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str
class CellSchema(hox.HoxBaseSchema):
    gene_expression: SparseZarrPointer | None = None
# 2. Create an atlas
atlas_dir = "./hox_example_atlas/"
os.makedirs(atlas_dir, exist_ok=True)
atlas = hox.create_or_open_atlas(
    atlas_path=atlas_dir,
    cell_table_name="cells",
    cell_schema=CellSchema,
    dataset_table_name="datasets",
    dataset_schema=DatasetRecord,
    registry_schemas={"gene_expression": GeneFeature},
)
# 3. Load a dataset and register its genes
adata = sc.datasets.pbmc3k() # 2 700 PBMCs, raw counts, sparse CSR
features = [GeneFeature(uid=g, gene_symbol=g) for g in adata.var_names]
atlas.register_features("gene_expression", features)
# 4. Prepare var and ingest
adata.var["global_feature_uid"] = adata.var_names
record = DatasetRecord(
    zarr_group="pbmc3k", feature_space="gene_expression", n_cells=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, feature_space="gene_expression",
    zarr_layer="counts", dataset_record=record,
)
# 5. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()
# 6. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest(atlas_dir)
result = atlas_r.query().limit(500).to_anndata()
print(result) # AnnData object with n_obs × n_vars = 500 × 32738
Opening a public atlas
The CellxGene Census mouse atlas (about 44M cells) is available on S3.
No schema class or store construction needed, just db_uri and S3 config:
import homeobox as hox
atlas = hox.RaggedAtlas.checkout_latest(
    db_uri="s3://epiblast-public/cellxgene_mouse_homeobox/lance_db",
    store_kwargs={"config": {"skip_signature": True, "region": "us-east-2"}},
)
atlas.query().count() # 43,969,325
adata = atlas.query().where("cell_type = 'neural cell'").limit(5000).to_anndata()
Querying
The cell table is a LanceDB table. The full query surface is available without custom loaders.
# SQL filter
adata = atlas_r.query().where("tissue = 'lung' AND cell_type IS NOT NULL").to_anndata()
# Vector similarity search
hits = atlas_r.query().search(query_vec, vector_column_name="embedding").limit(50).to_anndata()
# Feature-filtered query: reads only the byte ranges for those genes (CSC index)
adata = atlas_r.query().features(["CD3D", "CD19", "MS4A1"], "gene_expression").to_anndata()
# Intersection across ragged datasets (only genes shared by all)
shared = atlas_r.query().feature_join("intersection").to_anndata()
# Count by cell type (cheap, only fetches the grouping column)
atlas_r.query().count(group_by="cell_type")
For large results, .to_batches() provides a streaming iterator that avoids materialising everything at once. .to_mudata() returns one AnnData per modality for multimodal atlases.
Example Notebooks
The notebooks/ directory contains self-contained marimo notebooks that work after a plain pip install homeobox (no repo clone needed).
| Notebook | Description |
|---|---|
| scbasecount_ragged_atlas.py | Explore a small 7.3M-cell atlas built from scBaseCount data (human + C. elegans). Covers versioning, metadata queries, ragged union/intersection joins, feature selection, AnnData reconstruction, and the PyTorch dataloader. |
| cellxgene_tiledb_vs_homeobox_benchmark.py | Load the 44M-cell CellxGene Census mouse atlas stored in homeobox format and benchmark it against TileDB-SOMA for ML dataloader throughput and AnnData query latency. |
Performance
Benchmarked against TileDB-SOMA on a ~44M cell mouse atlas (CellxGene Census), reading from S3.
ML dataloader throughput
CellDataset is a map-style PyTorch dataset, in contrast to TileDB-SOMA's iterable-style dataset. This lets it use PyTorch's DataLoader for parallelism and locality-aware batching. With no worker processes (in-process loading), homeobox's dataloader reaches roughly 10x the throughput of TileDB-SOMA, even with fully random shuffling.
| Workers | TileDB-SOMA | homeobox | Speedup |
|---|---|---|---|
| 0 (in-process) | ~150 cells/s | ~1,600 cells/s | ~10x |
| 4 workers | ~500 cells/s | ~3,150 cells/s | ~6x |
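To make "locality-aware batching" concrete, here is a minimal sketch of one way to keep shuffled batches close to a few shards. It is a simple windowed shuffle under assumed parameters (`shard_size`, `shards_per_window`), not the bin-packing that homeobox's CellSampler performs, and the function name is hypothetical.

```python
import random


def locality_aware_batches(n_cells, shard_size, batch_size, shards_per_window, seed=0):
    """Yield-style helper: shuffle shard order, then shuffle cells inside small windows.

    Every batch only touches cells from at most `shards_per_window` shards,
    so a reader can fetch a few shards and serve the whole batch from them.
    """
    rng = random.Random(seed)
    n_shards = (n_cells + shard_size - 1) // shard_size
    shard_order = list(range(n_shards))
    rng.shuffle(shard_order)  # randomise which shards are visited together

    batches = []
    for start in range(0, n_shards, shards_per_window):
        window = []
        for shard in shard_order[start:start + shards_per_window]:
            lo = shard * shard_size
            hi = min(lo + shard_size, n_cells)
            window.extend(range(lo, hi))
        rng.shuffle(window)  # randomise cell order inside the window
        for i in range(0, len(window), batch_size):
            batches.append(window[i:i + batch_size])
    return batches


batches = locality_aware_batches(
    n_cells=1000, shard_size=100, batch_size=32, shards_per_window=2
)
# Number of distinct shards each batch would need to read.
shards_touched = [len({index // 100 for index in batch}) for batch in batches]
```

A list of index batches like this can be handed to a map-style dataset through the `batch_sampler` argument of `torch.utils.data.DataLoader`. The trade-off is that randomness is limited to the window, which is why a larger window gives better mixing at the cost of more shards per batch.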
Query → AnnData latency
Three access patterns: cell-oriented (filter by cell type, full matrix), feature-oriented (subset genes across a population), and combined.
Homeobox is 1.7–3x faster across patterns, with the largest margin on feature-oriented queries where the CSC index avoids scanning irrelevant cells entirely.
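The reason a CSC index helps feature-oriented queries can be shown with a small pure-Python sketch. The helpers `to_csc` and `read_columns` are hypothetical and operate on in-memory lists; the point is only that a column pointer array turns "give me these genes" into a handful of contiguous slices instead of a scan over every cell.

```python
def to_csc(dense):
    """Convert a dense cells x genes matrix into CSC arrays."""
    n_rows, n_cols = len(dense), len(dense[0])
    indptr, row_indices, values = [0], [], []
    for col in range(n_cols):
        for row in range(n_rows):
            value = dense[row][col]
            if value != 0:
                row_indices.append(row)
                values.append(value)
        indptr.append(len(values))  # end offset of this column
    return indptr, row_indices, values


def read_columns(indptr, row_indices, values, columns):
    """Read only the slices belonging to the requested columns."""
    out, touched = {}, 0
    for col in columns:
        start, stop = indptr[col], indptr[col + 1]
        out[col] = list(zip(row_indices[start:stop], values[start:stop]))
        touched += stop - start  # stored entries actually read
    return out, touched


counts = [
    [0, 3, 0, 1],
    [2, 0, 0, 0],
    [0, 0, 0, 4],
    [0, 5, 0, 0],
]
indptr, row_indices, values = to_csc(counts)
selected, touched = read_columns(indptr, row_indices, values, columns=[1, 3])
```

Here the two requested genes are served by reading four of the five stored entries, and the slice boundaries come straight from `indptr`. On an object store those boundaries correspond to byte ranges, which is what makes a feature-filtered read cheap.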
Fast cloud reads: RustShardReader
Zarr's sharded format packs many chunks into a single object-store file, with an index recording each chunk's byte offset. The Python zarr stack issues one HTTP request per chunk even when chunks could be coalesced.
Homeobox's RustShardReader handles shard reads in Rust: it batches all requested ranges, issues one get_ranges call per shard file, and decodes chunks in parallel via rayon. On S3 and GCS this typically cuts latency-dominated read time by an order of magnitude compared to sequential per-chunk fetches.
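The batching step can be sketched in a few lines of Python. This is an illustration of the idea, not the Rust implementation: `plan_shard_reads` is a hypothetical helper, and merging ranges that sit within `max_gap` bytes of each other is an assumption added here to show how adjacent chunks collapse into one read.

```python
from collections import defaultdict


def plan_shard_reads(chunk_requests, max_gap=0):
    """Group (shard, offset, length) requests by shard and merge nearby ranges.

    Returns {shard: [(start, end), ...]}, i.e. one batched range request per shard.
    """
    by_shard = defaultdict(list)
    for shard, offset, length in chunk_requests:
        by_shard[shard].append((offset, offset + length))

    plan = {}
    for shard, ranges in by_shard.items():
        ranges.sort()
        merged = [list(ranges[0])]
        for start, end in ranges[1:]:
            if start <= merged[-1][1] + max_gap:
                # Close enough to the previous range: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        plan[shard] = [tuple(r) for r in merged]
    return plan


requests = [
    ("shard_0", 0, 100),
    ("shard_0", 100, 50),
    ("shard_0", 400, 25),
    ("shard_1", 10, 10),
    ("shard_0", 160, 40),
]
plan = plan_shard_reads(requests, max_gap=16)
```

Five chunk requests become two shard-level calls carrying three byte ranges in total. With per-request latency dominating on S3 or GCS, the saving comes from the reduced number of round trips rather than from fewer bytes.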
BP-128 bitpacking (from BPCells)
When ingesting integer count data, homeobox automatically applies BP-128 bitpacking with delta encoding to the sparse indices array, and BP-128 (no delta) to the values array. BP-128 is a SIMD-accelerated codec that packs integers using the minimum number of bits required per 128-element block.
This delivers compression ratios comparable to zstd on typical single-cell count matrices while decoding at close to memory-bandwidth speed, which makes it a better fit than general-purpose codecs for this data type. Chunk sizes that are multiples of 128 align with the codec's block boundaries.
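The block-wise bit-width idea can be demonstrated with a slow, pure-Python sketch. It is not the SIMD codec used by homeobox or BPCells: the byte layout is made up, values are assumed non-negative, and the delta step assumes a globally sorted sequence, which real sparse indices only satisfy within a row.

```python
BLOCK = 128  # elements per block, as in BP-128


def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    previous, out = 0, []
    for value in values:
        out.append(value - previous)
        previous = value
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    total, out = 0, []
    for delta in deltas:
        total += delta
        out.append(total)
    return out


def pack_block(block):
    """Pack one block using the minimum bit width that fits its largest value."""
    width = max(value.bit_length() for value in block)
    packed = 0
    for i, value in enumerate(block):
        packed |= value << (i * width)
    n_bytes = (width * len(block) + 7) // 8
    return width, len(block), packed.to_bytes(n_bytes, "little")


def unpack_block(width, count, payload):
    packed = int.from_bytes(payload, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


def encode(values):
    return [pack_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]


def decode(blocks):
    out = []
    for width, count, payload in blocks:
        out.extend(unpack_block(width, count, payload))
    return out


def payload_bytes(blocks):
    return sum(len(payload) for _, _, payload in blocks)


indices = [3 * i for i in range(256)]          # sorted, evenly spaced indices
raw_blocks = encode(indices)                   # bitpacking only
delta_blocks = encode(delta_encode(indices))   # delta encoding, then bitpacking
raw_bytes = payload_bytes(raw_blocks)
delta_bytes = payload_bytes(delta_blocks)
roundtrip = delta_decode(decode(delta_blocks))
```

For this input the plain bitpacked payload is 304 bytes (9 and 10 bits per value in the two blocks), while the delta-encoded payload is 64 bytes (2 bits per value), compared with 1,024 bytes as 32-bit integers. That gap is why delta encoding suits sorted index arrays, while count values, which are not sorted, are packed without it.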
Versioning
Homeobox separates the writable ingest path from the read/query path with an explicit snapshot model:
- Ingest: write Zarr arrays and cell records freely, in parallel if needed.
- optimize(): compact Lance fragments, assign global_index to newly registered features, rebuild FTS indexes.
- snapshot(): validate consistency and record the current Lance table versions. Returns a version number.
- checkout(version): open a read-only atlas pinned to that snapshot. Every table is pinned to the exact Lance version recorded at snapshot time.
atlas.optimize()
v0 = atlas.snapshot() # validate + commit; returns version int
# read-only handle pinned to the latest snapshot (v0 here); concurrent ingestion won't affect it
# `store` is assumed to be an already constructed object store for the Zarr data
atlas_r = hox.RaggedAtlas.checkout_latest("/data/atlas/db", store=store)
# inspect available snapshots
hox.RaggedAtlas.list_versions("/data/atlas/db")
Queries and training runs execute against a frozen, reproducible view of the atlas. Concurrent ingestion into the live atlas does not affect any checked-out handle.
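The snapshot model can be pictured with a toy in-memory version. `ToyTable` and `ToyAtlas` are hypothetical classes written for this illustration; they mimic only the idea that a snapshot records one version number per table and that a checkout reads those exact versions.

```python
class ToyTable:
    """Append-only table where every write creates a new immutable version."""

    def __init__(self):
        self._versions = [()]  # version 0 is the empty table

    @property
    def latest(self):
        return len(self._versions) - 1

    def append(self, row):
        self._versions.append(self._versions[-1] + (row,))

    def at(self, version):
        return self._versions[version]


class ToyAtlas:
    def __init__(self, table_names):
        self.tables = {name: ToyTable() for name in table_names}
        self.snapshots = []

    def snapshot(self):
        """Record the current version of every table; return the snapshot number."""
        pinned = {name: table.latest for name, table in self.tables.items()}
        self.snapshots.append(pinned)
        return len(self.snapshots) - 1

    def checkout(self, version):
        """Return a read-only view of every table at its pinned version."""
        pinned = self.snapshots[version]
        return {name: self.tables[name].at(v) for name, v in pinned.items()}


atlas = ToyAtlas(["cells", "datasets"])
atlas.tables["cells"].append("cell A")
atlas.tables["datasets"].append("pbmc3k")
v0 = atlas.snapshot()
view = atlas.checkout(v0)

atlas.tables["cells"].append("cell B")  # ingestion continues after the snapshot
v1 = atlas.snapshot()
```

The view checked out at `v0` still contains only "cell A" after the later write, while a checkout of `v1` sees both cells. Pinning every table in a single record is what keeps the cell table and the dataset table consistent with each other for a given training run.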
Documentation
- Data Structure: LanceDB + Zarr layout, pointer types, _feature_layouts feature mapping, versioning model.
- Building an Atlas: end-to-end walkthrough with two heterogeneous datasets.
- Array Storage: add_from_anndata internals, BP-128 bitpacking, CSC column index for fast feature-filtered reads.
- Querying: AtlasQuery fluent builder, filtering, feature reconstruction, union/intersection joins, terminal methods.
- PyTorch Data Loading: CellDataset, CellSampler, locality-aware bin-packing, make_loader.
- Versioning: snapshot lifecycle, parallel write safety, checkout(), list_versions().
- Schemas: HoxBaseSchema, pointer types, FeatureBaseSchema, DatasetRecord.
- Full docs site
Acknowledgements
Methods
- BPCells: Parks and Greenleaf, Scalable high-performance single cell data analysis with BPCells, bioRxiv 2025. BP-128 bitpacking in homeobox is inspired by this work. https://www.biorxiv.org/content/10.1101/2025.03.27.645853v1.full
Datasets
- CellxGene Census: Chan Zuckerberg Initiative, CellxGene Census. The mouse atlas used in the benchmark. https://chanzuckerberg.github.io/cellxgene-census/
- scBaseCount: Youngblut et al., scBaseCount: an AI agent-curated, uniformly processed, and autonomously updated single cell data repository, bioRxiv 2025. https://www.biorxiv.org/content/10.1101/2025.02.27.640494v3