Multimodal biomedical atlas builder
homeobox
Homeobox is a multimodal database for interactive analysis and ML training at scale. It combines the search and versioning capabilities of LanceDB with the scalable array storage of Zarr.
A single homeobox atlas can hold sparse single-cell gene expression, dense protein and embedding features, 2D/3D/4D/5D images, biomolecular structures, and free text. A single dataloader streams batches across all of them with no intermediate ML-only copies and no special modality-specific entrypoints. Our design philosophy is to be extremely flexible while remaining fast.
Why homeobox
Motivating cases
- Hundreds or thousands of h5ad or h5mu files from different assays, panels, and organisms that you want to query and train on as a single collection.
- Repositories of large images stored in Zarr / OME-Zarr, DICOM, or TIFF (2D, 3D, or 4D, sometimes >1 TB each), with associated text descriptions.
- Single-cell images, masks, and associated feature data (e.g. CellProfiler vectors).
- Any combination of the above, in one queryable store.
Existing tools tend to optimise for single large datasets from a single modality, often through a laborious standardisation step that drops or duplicates data to fit a rectangular schema. Homeobox's RaggedAtlas unifies heterogeneous data into a single store that supports SQL / vector / full-text search, interactive AnnData / MuData reconstruction, and ML streaming — without that flattening step.
Ragged feature spaces, unified obs
Real-world atlases pull together datasets that were not designed to be compatible: different feature panels, different assays and imaging modalities, different metadata fields. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).
A RaggedAtlas keeps a single shared obs table while letting each dataset retain its own feature axis (or no features at all, for raw images). The obs table lives in LanceDB; each dataset occupies its own Zarr group with its own feature ordering; every row carries a pointer into its group.
At query time, the reconstruction layer joins the feature spaces on the fly: it computes the union or intersection of global feature indices, scatters each group's data into the right columns, and returns a single AnnData / MuData with every row correctly placed. Nothing is dropped at ingest, and there is no ambiguity about whether a value is a true zero or padding.
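The join described above can be illustrated with a small, library-free sketch. It is not homeobox's implementation: `ToyGroup` and `reconstruct` are hypothetical names, features are keyed by name rather than by global index, and rows are dense lists for readability. It shows the scatter step, plus a per-cell mask that separates measured zeros from columns a dataset never had.

```python
from dataclasses import dataclass


@dataclass
class ToyGroup:
    """Hypothetical stand-in for one dataset: its own feature order plus dense rows."""
    features: list[str]
    rows: list[list[int]]


def reconstruct(groups: list[ToyGroup], how: str = "union"):
    """Join ragged feature spaces into one matrix with a shared column order."""
    feature_sets = [set(g.features) for g in groups]
    if how == "union":
        selected = set().union(*feature_sets)
    elif how == "intersection":
        selected = set.intersection(*feature_sets)
    else:
        raise ValueError(f"unknown join: {how!r}")

    columns = sorted(selected)
    column_of = {name: j for j, name in enumerate(columns)}

    matrix, measured = [], []
    for group in groups:
        # Map each local feature position to its output column (None if dropped).
        targets = [column_of.get(name) for name in group.features]
        for row in group.rows:
            out = [0] * len(columns)
            seen = [False] * len(columns)
            for value, j in zip(row, targets):
                if j is not None:
                    out[j] = value
                    seen[j] = True
            matrix.append(out)
            measured.append(seen)
    return columns, matrix, measured


panel_a = ToyGroup(features=["CD3E", "MS4A1"], rows=[[5, 0], [1, 2]])
panel_b = ToyGroup(features=["MS4A1", "NKG7", "CD3E"], rows=[[7, 3, 0]])

columns, matrix, measured = reconstruct([panel_a, panel_b], how="union")
print(columns)      # ['CD3E', 'MS4A1', 'NKG7']
print(matrix)       # [[5, 0, 0], [1, 2, 0], [0, 7, 3]]
print(measured[0])  # [True, True, False]
```

In the first row, the zero under MS4A1 is a measured zero, while the zero under NKG7 is fill for a feature that panel never contained; the mask makes that distinction explicit.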
Installation
Prebuilt wheels are available on PyPI. Requires Python 3.13.
```bash
pip install homeobox          # core: atlas, querying, ingestion
pip install "homeobox[ml]"    # + PyTorch dataloader
pip install "homeobox[io]"    # + S3/GCS/Azure
pip install "homeobox[viz]"   # + marimo, matplotlib
pip install "homeobox[all]"   # everything
```
To build from source (requires a Rust toolchain):
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release
```
Quickstart
```python
import numpy as np
import pandas as pd
import polars as pl
import scanpy as sc

import homeobox as hox

# 1. Define schemas: one for gene features, one for cell metadata.
#    `StableUIDField` marks `gene_symbol` as the deterministic source of
#    `uid` (so parallel ingest jobs converge on the same uid for the same
#    gene). Each pointer column is declared with `PointerField.declare`,
#    which binds the column name to a registered feature_space.
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str = hox.StableUIDField.declare(default=...)


class CellSchema(hox.HoxBaseSchema):
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )


# 2. Create an atlas
atlas = hox.create_or_open_atlas(
    atlas_path="./hox_example_atlas",
    obs_schemas={"cells": CellSchema},
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetSchema,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset
adata = sc.datasets.pbmc3k()  # 2 700 PBMCs, raw counts, sparse CSR
adata.X = adata.X.astype(np.uint32)  # the counts layer must be np.uint32

# 4. Build the var DataFrame (one row per local feature, columns matching
#    the registry schema + `uid`), use it for both feature registration and
#    as adata.var. `compute_stable_uids` writes deterministic uids in place.
var_df = pd.DataFrame(
    {"gene_symbol": adata.var_names.tolist()},
    index=adata.var_names,
)
GeneFeature.compute_stable_uids(var_df)
atlas.register_features("gene_expression", pl.from_pandas(var_df))
adata.var = var_df

# 5. Ingest. `field_name` selects the cell-schema column to populate;
#    its feature_space is resolved from PointerField.declare.
record = hox.DatasetSchema(
    zarr_group="pbmc3k", feature_space="gene_expression", n_rows=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, field_name="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 6. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 7. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest("./hox_example_atlas")
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738
```
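The quickstart relies on uids that are derived deterministically from `gene_symbol`, so that independent ingest jobs agree without coordination. How homeobox derives them is not stated here; the sketch below only illustrates the general idea using name-based UUIDs (UUIDv5) from the standard library. `FEATURE_NAMESPACE`, `stable_uid`, and `register` are hypothetical names, not part of the homeobox API.

```python
import uuid

# Hypothetical namespace for one feature space; any fixed UUID would do.
FEATURE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "gene_expression.example.invalid")


def stable_uid(gene_symbol: str) -> str:
    """Derive the same uid for the same symbol on every machine and every run."""
    return str(uuid.uuid5(FEATURE_NAMESPACE, gene_symbol))


def register(registry: dict[str, str], symbols: list[str]) -> None:
    """Idempotent registration: an already-known uid is left untouched."""
    for symbol in symbols:
        registry.setdefault(stable_uid(symbol), symbol)


# Two "parallel" jobs with overlapping panels write into the same registry.
registry: dict[str, str] = {}
register(registry, ["CD3E", "MS4A1"])
register(registry, ["MS4A1", "NKG7"])

print(len(registry))                               # 3
print(stable_uid("CD3E") == stable_uid("CD3E"))    # True
```

Because the uid is a pure function of the symbol, the overlapping gene is registered once no matter which job sees it first.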
Multimodal in one row
The same shape scales to any number of modalities — declare one pointer column per feature space on a single obs schema:
```python
class MultimodalCell(hox.HoxBaseSchema):
    # Shared obs fields
    cell_type: str | None
    tissue: str | None

    # Optional pointers: cells measured by only one assay are first-class,
    # no padding rows, no presence flags inserted at ingest.
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )
    protein_abundance: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="protein_abundance"
    )
    image_tiles: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="image_tiles"
    )
```
A query against this atlas streams within-row multimodal batches through a single DataLoader, regardless of how many modalities each cell has. See homeobox_examples/multimodal_perturbation_atlas/schema.py for a five-modality production schema (gene expression, chromatin accessibility, protein abundance, image features, image tiles) plus perturbation, publication, and donor tables.
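To make "within-row multimodal batches" concrete, here is a minimal sketch of how rows with optional modalities might be grouped into one batch. The batch layout homeobox actually produces is not described in this README, so `collate` and the dictionary shape below are assumptions made purely for illustration.

```python
def collate(rows: list[dict], modalities: list[str]) -> dict:
    """Group a batch of rows by modality, keeping only rows that have that modality."""
    batch = {}
    for name in modalities:
        present = [i for i, row in enumerate(rows) if row.get(name) is not None]
        batch[name] = {
            "row_index": present,                        # positions within the batch
            "values": [rows[i][name] for i in present],  # payloads for those rows
        }
    return batch


rows = [
    {"cell_type": "T cell", "gene_expression": [5, 0, 1], "protein_abundance": [0.2, 0.9]},
    {"cell_type": "B cell", "gene_expression": [0, 7, 3], "protein_abundance": None},
    {"cell_type": None, "gene_expression": None, "protein_abundance": [0.4, 0.1]},
]

batch = collate(rows, ["gene_expression", "protein_abundance"])
print(batch["gene_expression"]["row_index"])    # [0, 1]
print(batch["protein_abundance"]["row_index"])  # [0, 2]
```

Each modality carries its own row index, so a cell measured by only one assay contributes to that modality alone and needs no padding row in the others.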
Example Notebooks
The notebooks/ directory contains self-contained marimo notebooks that work after a plain pip install homeobox (no repo clone needed).
| Notebook | Description |
|---|---|
| multimodal_perturbation_atlas.py | Explore an agent-curated atlas of 120M+ cells with over 130,000 genetic, chemical, and biologic perturbations and 5 modalities. |
Performance
Beyond raw numbers, the case for homeobox is generality and integration. One library handles cell tables, sparse matrices, dense features, images, embeddings, and text, so there is no separate stack for non-tabular modalities. New modalities are added by writing a feature-space spec, not by waiting for upstream support. And because storage is plain LanceDB + Zarr, homeobox interoperates directly with the broader Python + Rust data ecosystem (Lance, DuckDB, Polars, zarrs).
On a 1M-cell × 20k-gene synthetic atlas, the homeobox iterable dataloader sustains ~70k cells/sec on local NVMe and ~40k cells/sec streaming from S3 with a single worker, saturating local disk and running roughly an order of magnitude faster than the next remote-capable system in the sweep.
See docs/dataloader_benchmark.md for the full sweep across nine dataloaders (among them SLAF, scDataset, BioNeMo SCDL, annbatch, TileDB-SOMA, cell-load, and the two homeobox surfaces), covering local, remote, and perturbation workloads, memory profiles, and reproducible scripts.
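As a rough sense of scale, the quoted throughputs translate into per-epoch wall-clock time as follows. This is back-of-the-envelope arithmetic on the approximate figures above, assuming the rate is sustained for a full pass over the 1M-cell atlas; `epoch_seconds` is a helper defined only for this example.

```python
def epoch_seconds(n_cells: int, cells_per_sec: float) -> float:
    """Time for one full pass over the data at a sustained throughput."""
    return n_cells / cells_per_sec


N_CELLS = 1_000_000

local_nvme = epoch_seconds(N_CELLS, 70_000)  # ~70k cells/sec, local NVMe
remote_s3 = epoch_seconds(N_CELLS, 40_000)   # ~40k cells/sec, streaming from S3

print(f"local NVMe: {local_nvme:.1f} s per epoch")  # local NVMe: 14.3 s per epoch
print(f"S3:         {remote_s3:.1f} s per epoch")   # S3:         25.0 s per epoch
```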
Versioning
Homeobox separates the writable ingest path from the read/query path with an explicit snapshot model: ingest writes Zarr arrays and cell records freely (in parallel if needed), optimize() compacts Lance fragments and rebuilds indexes, snapshot() validates consistency and records the current Lance table versions, and checkout(version) opens a read-only atlas pinned to that snapshot. Queries and training runs execute against a frozen, reproducible view; concurrent ingestion does not affect any checked-out handle. See docs/versioning.md for the full lifecycle.
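The snapshot model can be pictured with a toy, in-memory analogue. This is not how homeobox or Lance store versions; `ToyTable` and `ToyAtlas` are hypothetical classes that only mirror the lifecycle described above: writes create new table versions, `snapshot()` records the current version of every table, and `checkout()` returns a read-only view pinned to those versions.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass
class ToyTable:
    """Append-only table; every write produces a new immutable version."""
    versions: list[tuple] = field(default_factory=lambda: [()])

    def append(self, row) -> None:
        self.versions.append(self.versions[-1] + (row,))

    @property
    def latest(self) -> int:
        return len(self.versions) - 1

    def at(self, version: int) -> tuple:
        return self.versions[version]


@dataclass
class ToyAtlas:
    tables: dict[str, ToyTable]
    snapshots: list[dict[str, int]] = field(default_factory=list)

    def snapshot(self) -> int:
        """Record the current version of every table; return the snapshot id."""
        self.snapshots.append({name: t.latest for name, t in self.tables.items()})
        return len(self.snapshots) - 1

    def checkout(self, snapshot_id: int):
        """Open a read-only view pinned to the versions recorded in a snapshot."""
        pinned = self.snapshots[snapshot_id]
        return MappingProxyType(
            {name: self.tables[name].at(version) for name, version in pinned.items()}
        )


atlas = ToyAtlas(tables={"cells": ToyTable()})
atlas.tables["cells"].append({"cell_type": "T cell"})
atlas.tables["cells"].append({"cell_type": "B cell"})

first = atlas.snapshot()
frozen = atlas.checkout(first)

# Ingestion continues after the snapshot; the pinned view does not change.
atlas.tables["cells"].append({"cell_type": "NK cell"})

print(len(frozen["cells"]))                            # 2
print(len(atlas.checkout(atlas.snapshot())["cells"]))  # 3
```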
Project details
Download files
Release 0.2.4 consists of one source distribution and six built distributions. All files were uploaded using Trusted Publishing, via twine/6.1.0 CPython/3.13.12.

| File | Type | Size | Tags |
|---|---|---|---|
| homeobox-0.2.4.tar.gz | Source distribution | 931.2 kB | Source |
| homeobox-0.2.4-cp312-abi3-win_amd64.whl | Built distribution | 5.6 MB | CPython 3.12+, Windows x86-64 |
| homeobox-0.2.4-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | Built distribution | 6.7 MB | CPython 3.12+, manylinux: glibc 2.17+ x86-64 |
| homeobox-0.2.4-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | Built distribution | 6.8 MB | CPython 3.12+, manylinux: glibc 2.17+ ARM64 |
| homeobox-0.2.4-cp312-abi3-macosx_11_0_arm64.whl | Built distribution | 6.0 MB | CPython 3.12+, macOS 11.0+ ARM64 |
| homeobox-0.2.4-cp312-abi3-macosx_10_12_x86_64.whl | Built distribution | 6.3 MB | CPython 3.12+, macOS 10.12+ x86-64 |

File hashes
| File | SHA256 | MD5 | BLAKE2b-256 |
|---|---|---|---|
| homeobox-0.2.4.tar.gz | 55119b63bef7433fa3bcbba92bc6bb1d150e9a80a978c9b64e159e8f4ae68dbd | cfe9a3247820c7fb00cd38c24ecd4500 | f67406602a3ebac22340f0e3830cf01e7e15172afc7c31a0ca0ad845d4e17d28 |
| homeobox-0.2.4-cp312-abi3-win_amd64.whl | 6bbb523942aaaddb3902934839b1d6d29a78245c6c1187045af6c01b69938057 | bd99c5b11ea9302b4a28eee2110b05d0 | f71c4492a85fe73c1cc13bfa1cc00488fce26ecfa7e4dbb4c540f3928eb00ed9 |
| homeobox-0.2.4-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 16d55a1c578b662acd7b851015fde71dc8d1a9fe2cc32d184634a674c9436fcb | ea57526bcc4ab2e1d4f3fb5bb61291eb | 9e1fe3b921281d12d3052eab1d9ee8a5da271b984514d26b4314935c9574a2ca |
| homeobox-0.2.4-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | a62b48f92633e6f4cf9c7d6fcf962ef9fe483fab098cb4af49622823dd904327 | 77206ee37184354daaab6210bc9720e5 | 929efacd53b8079d1eecf95be5bef31fdf7f0c13d95f481b1fd9f3836b7e69cc |
| homeobox-0.2.4-cp312-abi3-macosx_11_0_arm64.whl | 3e32c322228bfee807d616c9ee3d99b2dc56b33c891808d01eb54edfa9d4da4d | 679f3e03d70890cc28328692284ae573 | 0d2286a4ce0ba5338b83b51596958c76a188779c6d6ea37bc11819fa99b395d0 |
| homeobox-0.2.4-cp312-abi3-macosx_10_12_x86_64.whl | 7c7e0683d0105a20395c3ace9f31e8715ae0323383274e71cfff15a927821d37 | f3441daeefc61e701118c1bd5b2bb5d8 | 2871be32386fbd2d90382ad89088ee7a5aa18639aecdfd1b738109edbe9244af |

Provenance
An attestation bundle was made for each file. The subject name of each attestation is the file name and the subject digest is the file's SHA256 hash. All bundles share the following fields:
- Publisher: release.yml on epiblastai/homeobox
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Permalink: epiblastai/homeobox@2951e5208c83cffe5bcb23c6ef9d376afeffb599
- Branch / Tag: refs/tags/v0.2.4
- Owner: https://github.com/epiblastai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2951e5208c83cffe5bcb23c6ef9d376afeffb599
- Trigger Event: release

| File | Sigstore transparency entry |
|---|---|
| homeobox-0.2.4.tar.gz | 1540272388 |
| homeobox-0.2.4-cp312-abi3-win_amd64.whl | 1540272549 |
| homeobox-0.2.4-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 1540272634 |
| homeobox-0.2.4-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | 1540272745 |
| homeobox-0.2.4-cp312-abi3-macosx_11_0_arm64.whl | 1540272876 |
| homeobox-0.2.4-cp312-abi3-macosx_10_12_x86_64.whl | 1540272470 |