Multimodal biomedical atlas builder
Homeobox
Homeobox is a database for multimodal biomedical atlases that do not fit cleanly into one matrix, one modality, or one shared feature space.
A single Homeobox atlas can hold sparse single-cell gene expression, dense protein and embedding features, 2D/3D/4D/5D images, biomolecular structures, free text, and auxiliary metadata tables. You can query it, snapshot it, reconstruct results as AnnData / MuData, and stream batches to PyTorch without creating separate ML-only copies.
Under the hood, Homeobox combines the search and versioning capabilities of LanceDB with the array storage of Zarr.
- Quick install: pip install homeobox
- Documentation
How it compares to existing tools
| If your main problem is... | You probably want... |
|---|---|
| Querying, versioning, reconstructing, and training from many heterogeneous biomedical datasets with different feature spaces | Homeobox |
| Dissatisfaction with TileDB's ML support and developer experience | Homeobox |
| Analyzing one clean matrix or a small number of aligned modalities | AnnData / MuData directly |
| Metadata, vector, or text search without large array payloads | LanceDB, a vector database, or a regular database |
At a glance
| Multimodal storage | ML-ready access |
|---|---|
| Gene expression, chromatin accessibility, protein abundance; images, image features, embeddings; biomolecular structures and text | Fully random iterable streaming for throughput; map-style random access for arbitrary samplers; no intermediate training-only copies |

| Query and reconstruction | Reproducibility |
|---|---|
| SQL / vector / full-text search over LanceDB metadata; reconstruct query results as AnnData or MuData; Zarr-backed sparse and dense payloads | Explicit snapshot() / checkout(version) lifecycle; read-only atlas views for training and analysis; see docs/versioning.md |
The Ragged Atlas
The core abstraction in Homeobox is the Ragged Atlas, which is designed to support heterogeneous datasets without shared feature spaces. Some motivating use cases are:
- Hundreds or thousands of h5ad or h5mu files from different assays, panels, and organisms that you want to query and train on as a single collection.
- Repositories of large images stored in Zarr / OME-Zarr, DICOM, or TIFF (2D, 3D, or 4D, sometimes >1 TB each) with associated text descriptions.
- Single-cell images, masks, and associated feature data (e.g. CellProfiler vectors).
- Any combination of the above, in one queryable store.
Existing tools optimize for single large datasets from one modality. Homeobox's RaggedAtlas allows a shared obs table and search indexes while letting each dataset retain its own feature axis.
At query time, reconstruction joins the feature spaces on the fly and returns a single AnnData / MuData with every column correctly placed.
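The idea of joining feature spaces at query time can be illustrated with a small, library-free sketch. This is not Homeobox's implementation; the function names and the dict-based sparse rows are assumptions made for illustration. Each dataset keeps its own local feature axis, and reconstruction maps local column indices onto a union axis.

```python
def build_union_axis(datasets):
    """Return the union feature list and one local->global column map per dataset."""
    union = {}
    for features, _rows in datasets:
        for name in features:
            union.setdefault(name, len(union))  # first-seen order defines the global axis
    remaps = [[union[name] for name in features] for features, _rows in datasets]
    return list(union), remaps


def reconstruct(datasets):
    """Stack sparse rows from all datasets in union-axis coordinates."""
    var_names, remaps = build_union_axis(datasets)
    rows = []
    for (_features, sparse_rows), remap in zip(datasets, remaps):
        for row in sparse_rows:
            # Each row is {local_column: value}; translate keys to global columns.
            rows.append({remap[col]: value for col, value in row.items()})
    return var_names, rows


# Hypothetical data: dataset A measures three genes, dataset B overlaps on one.
ds_a = (["CD3E", "CD19", "MS4A1"], [{0: 5, 2: 1}, {1: 3}])
ds_b = (["MS4A1", "NKG7"], [{0: 2, 1: 7}])

var_names, rows = reconstruct([ds_a, ds_b])
print(var_names)  # ['CD3E', 'CD19', 'MS4A1', 'NKG7']
print(rows)       # [{0: 5, 2: 1}, {1: 3}, {2: 2, 3: 7}]
```

No dataset is padded at rest; the union axis exists only in the reconstructed result.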
Installation
Prebuilt wheels are available on PyPI. Requires Python 3.12 or newer.
pip install homeobox # core: atlas, querying, ingestion
pip install "homeobox[ml]" # + PyTorch dataloader
pip install "homeobox[io]" # + S3/GCS/Azure
pip install "homeobox[viz]" # + marimo, matplotlib
pip install "homeobox[all]" # everything (quotes keep shells such as zsh from expanding the brackets)
The quickstart below also uses scanpy to fetch a small example dataset:
pip install scanpy
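A quick way to confirm that an environment meets the requirements above is sketched below. It only checks the interpreter version and whether the optional packages named in the extras list are importable. The mapping from extras to module names (torch, marimo, matplotlib) is an assumption based on the comments above, not something read from the package metadata.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 12)  # stated minimum in the installation notes

# Assumed module names behind each extra; adjust if the package metadata differs.
OPTIONAL_MODULES = {"ml": ["torch"], "viz": ["marimo", "matplotlib"]}


def check_environment(version_info=sys.version_info):
    """Report whether Python is new enough and which extras look importable."""
    report = {"python_ok": tuple(version_info[:2]) >= MIN_PYTHON}
    for extra, modules in OPTIONAL_MODULES.items():
        report[extra] = all(find_spec(name) is not None for name in modules)
    return report


print(check_environment())
```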
To build from source (requires a Rust toolchain):
curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
uv run maturin develop --release
Example Notebooks
| Notebook | Description |
|---|---|
| explore_perturbation_atlas_colab.py (Colab) | Explore an atlas with 120M+ cells, over 130,000 genetic, chemical, and biologic perturbations, and 5 modalities. |
Quickstart
import numpy as np
import pandas as pd
import polars as pl
import scanpy as sc
import homeobox as hox
# 1. Define schemas: one for gene features, one for cell metadata.
#    `StableUIDField` marks `gene_symbol` as the deterministic source of
#    `uid` (so parallel ingest jobs converge on the same uid for the same
#    gene). Each pointer column is declared with `PointerField.declare`,
#    which binds the column name to a registered feature_space.
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str = hox.StableUIDField.declare(default=...)
class CellSchema(hox.HoxBaseSchema):
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )
# 2. Create an atlas
atlas = hox.create_or_open_atlas(
    atlas_path="./hox_example_atlas",
    obs_schemas={"cells": CellSchema},
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetSchema,
    registry_schemas={"gene_expression": GeneFeature},
)
# 3. Load a dataset
adata = sc.datasets.pbmc3k() # 2,700 PBMCs, raw counts, sparse CSR
adata.X = adata.X.astype(np.uint32) # the counts layer must be np.uint32
# 4. Build the var DataFrame (one row per local feature, columns matching
#    the registry schema + `uid`), use it for both feature registration and
#    as adata.var. `compute_stable_uids` writes deterministic uids in place.
var_df = pd.DataFrame(
    {"gene_symbol": adata.var_names.tolist()},
    index=adata.var_names,
)
GeneFeature.compute_stable_uids(var_df)
atlas.register_features("gene_expression", pl.from_pandas(var_df))
adata.var = var_df
# 5. Ingest. `field_name` selects the cell-schema column to populate;
#    its feature_space is resolved from PointerField.declare.
record = hox.DatasetSchema(
    zarr_group="pbmc3k", feature_space="gene_expression", n_rows=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, field_name="gene_expression",
    zarr_layer="counts", dataset_record=record,
)
# 6. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()
# 7. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest("./hox_example_atlas")
result = atlas_r.query().limit(500).to_anndata()
print(result) # AnnData object with n_obs × n_vars = 500 × 32738
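Step 1 of the quickstart relies on uids that are derived deterministically from a field, so that independent ingest jobs agree without coordination. The sketch below illustrates that property with the standard library's name-based UUIDs. It is not the algorithm `compute_stable_uids` uses; the namespace string and the `stable_uid` helper are hypothetical.

```python
import uuid

# Hypothetical namespace for one feature space; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid/gene_expression")


def stable_uid(gene_symbol: str) -> str:
    """Derive a uid from the gene symbol alone, so it is reproducible anywhere."""
    return uuid.uuid5(NAMESPACE, gene_symbol).hex


# Two ingest jobs that never communicate still agree on the shared gene.
job_a = {gene: stable_uid(gene) for gene in ["CD3E", "MS4A1"]}
job_b = {gene: stable_uid(gene) for gene in ["MS4A1", "NKG7"]}
print(job_a["MS4A1"] == job_b["MS4A1"])  # True
```

Because the uid depends only on its input, registering the same gene twice is idempotent, which is the property parallel ingestion needs.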
Multimodal in one row
The same shape scales to any number of modalities — declare one pointer column per feature space on a single obs schema:
class MultimodalCell(hox.HoxBaseSchema):
    # Shared obs fields
    cell_type: str | None
    tissue: str | None
    # Optional pointers: cells measured by only one assay are first-class,
    # no padding rows, no presence flags inserted at ingest.
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )
    protein_abundance: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="protein_abundance"
    )
    image_tiles: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="image_tiles"
    )
A query against this atlas streams within-row multimodal batches through a single DataLoader, regardless of how many modalities each cell has. See homeobox_examples/multimodal_perturbation_atlas/schema.py for a five-modality production schema (gene expression, chromatin accessibility, protein abundance, image features, image tiles) plus perturbation, publication, and donor tables.
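To make the "optional pointer" idea concrete, here is a library-free sketch of how a batch of rows with missing modalities might be grouped per modality. The `CellRow` dataclass, the string pointers, and the `collate` function are all hypothetical stand-ins, not Homeobox types or its actual batching logic.

```python
from dataclasses import dataclass
from typing import Optional

MODALITIES = ("gene_expression", "protein_abundance")


@dataclass
class CellRow:
    cell_id: str
    gene_expression: Optional[str] = None    # stand-in for a sparse array pointer
    protein_abundance: Optional[str] = None  # stand-in for a dense array pointer


def collate(rows):
    """Group pointers by modality, remembering which batch rows carry each one."""
    batch = {m: {"row_index": [], "pointer": []} for m in MODALITIES}
    for i, row in enumerate(rows):
        for modality in MODALITIES:
            pointer = getattr(row, modality)
            if pointer is not None:  # absent modalities are skipped, not padded
                batch[modality]["row_index"].append(i)
                batch[modality]["pointer"].append(pointer)
    return batch


rows = [
    CellRow("c0", gene_expression="rna/0"),
    CellRow("c1", gene_expression="rna/1", protein_abundance="adt/0"),
    CellRow("c2", protein_abundance="adt/1"),
]
batch = collate(rows)
print(batch["gene_expression"]["row_index"])    # [0, 1]
print(batch["protein_abundance"]["row_index"])  # [1, 2]
```

The `row_index` lists let a model align modalities within a batch without storing placeholder rows for unmeasured assays.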
Dataloaders and performance
Homeobox is intended to be the source of truth for analysis and model training, not just a staging format. The same snapshot you query can feed a PyTorch training loop.
Homeobox exposes two PyTorch dataset surfaces over the same atlas:
- Homeobox-Iter: fully random iterable streaming. It reads large shuffled I/O blocks through a background prefetcher and slices training batches from that queue, which maximizes throughput for standard full-atlas training epochs.
- Homeobox-Map: map-style random access. It supports __getitem__(indices) so regular PyTorch samplers, group-aware samplers, custom subsets, and perturbation-style batches can read arbitrary rows.
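The difference between the two access patterns can be sketched without PyTorch. The code below is a simplified illustration under stated assumptions, not Homeobox's dataloader: it shuffles the order of contiguous blocks and the rows within each block, and it omits the background prefetcher described above. `block_shuffled_batches` and `MapStyle` are hypothetical names.

```python
import random


def block_shuffled_batches(n_rows, block_size, batch_size, seed=0):
    """Iterable style: read contiguous blocks in random order, slice batches from a buffer."""
    rng = random.Random(seed)
    starts = list(range(0, n_rows, block_size))
    rng.shuffle(starts)  # randomize which block is read next
    buffer = []
    for start in starts:
        block = list(range(start, min(start + block_size, n_rows)))  # one large sequential read
        rng.shuffle(block)  # randomize rows inside the block
        buffer.extend(block)
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:
        yield buffer  # final partial batch


class MapStyle:
    """Map style: the caller chooses the exact rows, so any sampler can drive it."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        return [self.data[i] for i in indices]


data = [f"cell_{i}" for i in range(10)]
batches = list(block_shuffled_batches(len(data), block_size=4, batch_size=3))
print(batches)
print(MapStyle(data)[[7, 2]])  # ['cell_7', 'cell_2']
```

The iterable path trades sampling control for large sequential reads; the map path does the opposite, which is consistent with the throughput gap between the two surfaces in the benchmark tables.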
Capability summary from the benchmark suite:
| System | Map-style | Remote storage | Training-only format | Versioned snapshots | Ragged features |
|---|---|---|---|---|---|
| Homeobox-Map | ✓ | ✓ | – | ✓ | ✓ |
| Homeobox-Iter | – | ✓ | – | ✓ | ✓ |
| SLAF | – | ✓ | – | ✓ | – |
| scDataset | – | – | – | – | – |
| AnnDataLoader | ✓ | – | – | – | – |
| AnnLoader | ✓ | – | – | – | – |
| BioNeMo SCDL | ✓ | – | ✓ | – | – |
| annbatch | – | ✓ | ✓ | – | – |
| TileDB-SOMA | – | ✓ | – | ✓ | – |
| cell-load | – | – | ✓ | – | – |
In this table, "training-only format" means the data must be copied into a layout that exists only to feed a training loop; a dash is better. "Ragged features" means datasets with different feature sets can coexist without padding to a union or intersecting to common features.
On a 1M-cell × 20k-gene synthetic atlas, the Homeobox iterable dataloader sustains ~70k cells/sec on local NVMe and ~40k cells/sec streaming from S3 at workers=0, saturating local disk and running roughly 4x to 7x faster than the next-fastest remote-capable system in the sweep, depending on batch size.
Local throughput on NVMe, cells/sec at workers=0:
| System | b=64 | b=512 | b=4096 |
|---|---|---|---|
| Homeobox-Iter | 69,658 | 73,171 | 72,548 |
| annbatch | 56,154 | 67,459 | 76,314 |
| BioNeMo SCDL | 5,455 | 72,570 | 66,124 |
| scDataset | 28,151 | 41,525 | 52,923 |
| SLAF | 30,118 | 33,374 | 37,940 |
| AnnDataLoader | 21,446 | 25,926 | 26,403 |
| Homeobox-Map | 9,553 | 22,749 | 25,049 |
| TileDB-SOMA | 11,268 | 11,972 | 12,153 |
| AnnLoader | 10,509 | 12,699 | 10,656 |
Remote throughput from S3, cells/sec at workers=0:
| System | b=64 | b=512 | b=4096 |
|---|---|---|---|
| Homeobox-Iter | 40,378 | 42,344 | 41,453 |
| SLAF | 3,611 | 4,233 | 10,320 |
| TileDB-SOMA | 5,873 | 5,845 | 5,945 |
| Homeobox-Map | 576 | 1,884 | 3,300 |
| annbatch | 1,050 | 1,314 | 1,594 |
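The relative speedups implied by the remote table can be computed directly from its numbers. The snippet below copies the table values and, for each batch size, divides the Homeobox-Iter rate by the fastest other system; the helper name is an illustrative choice.

```python
# Remote S3 throughput in cells/sec at workers=0, copied from the table above.
remote = {
    "Homeobox-Iter": {64: 40378, 512: 42344, 4096: 41453},
    "SLAF": {64: 3611, 512: 4233, 4096: 10320},
    "TileDB-SOMA": {64: 5873, 512: 5845, 4096: 5945},
    "Homeobox-Map": {64: 576, 512: 1884, 4096: 3300},
    "annbatch": {64: 1050, 512: 1314, 4096: 1594},
}


def speedup_over_next_best(table, system, batch_size):
    """Return (runner-up name, ratio of `system` to the runner-up) at one batch size."""
    others = {name: rates[batch_size] for name, rates in table.items() if name != system}
    runner_up = max(others, key=others.get)
    return runner_up, table[system][batch_size] / others[runner_up]


for b in (64, 512, 4096):
    name, ratio = speedup_over_next_best(remote, "Homeobox-Iter", b)
    print(f"b={b}: {ratio:.1f}x over {name}")
# b=64: 6.9x over TileDB-SOMA
# b=512: 7.2x over TileDB-SOMA
# b=4096: 4.0x over SLAF
```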
Perturbation-style group-aware random reads, cells/sec at workers=0:
| System | b=64 | b=512 | b=1024 |
|---|---|---|---|
| Homeobox-Map | 9,842 | 13,677 | 12,265 |
| cell-load | 4,936 | 26,678 | 27,096 |
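A "group-aware" read means every batch is drawn from a single group, for example one perturbation. The sketch below builds such batches as lists of row indices, which is the kind of input a map-style `__getitem__(indices)` accepts. The `group_batches` function and the labels are hypothetical; this is neither the benchmark's sampler nor a Homeobox API.

```python
import random
from collections import defaultdict


def group_batches(group_labels, batch_size, seed=0):
    """Return shuffled batches of row indices where each batch stays within one group."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, label in enumerate(group_labels):
        by_group[label].append(index)
    batches = []
    for indices in by_group.values():
        rng.shuffle(indices)  # random order inside the group
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    rng.shuffle(batches)  # interleave groups across the epoch
    return batches


labels = ["ctrl", "KO_A", "ctrl", "KO_B", "KO_A", "ctrl", "KO_B"]
batches = group_batches(labels, batch_size=2)
for batch in batches:
    print(batch, {labels[i] for i in batch})
```

Each printed batch maps to exactly one label, and the scattered indices explain why this workload needs random access instead of sequential streaming.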
See docs/dataloader_benchmark.md for the full sweep across the dataloaders compared in the tables above, including local/remote/perturbation workloads, memory profiles, and reproducible scripts.