Multimodal biomedical atlas builder

Homeobox

[Figure: Multimodal schema with auxiliary metadata tables]

Homeobox is a database for multimodal biomedical atlases that do not fit cleanly into one matrix, one modality, or one shared feature space.

A single Homeobox atlas can hold sparse single-cell gene expression, dense protein and embedding features, 2D/3D/4D/5D images, biomolecular structures, free text, and auxiliary metadata tables. You can query it, snapshot it, reconstruct results as AnnData / MuData, and stream batches to PyTorch without creating separate ML-only copies.

Under the hood, Homeobox combines the search and versioning capabilities of LanceDB with the array storage of Zarr.


How it compares to existing tools

If your main problem is... | You probably want...
Querying, versioning, reconstructing, and training from many heterogeneous biomedical datasets with different feature spaces | Homeobox
Dissatisfaction with TileDB ML-support and developer experience | Homeobox
Analyzing one clean matrix or a small number of aligned modalities | AnnData / MuData directly
Metadata, vector, or text search without large array payloads | LanceDB, a vector database, or a regular database

At a glance

Multimodal storage
  • Gene expression, chromatin accessibility, protein abundance
  • Images, image features, embeddings
  • Biomolecular structures and text

ML-ready access
  • Fully random iterable streaming for throughput
  • Map-style random access for arbitrary samplers
  • No intermediate training-only copies

Query and reconstruction
  • SQL / vector / full-text search over LanceDB metadata
  • Reconstruct query results as AnnData or MuData
  • Zarr-backed sparse and dense payloads

Reproducibility
  • Explicit snapshot() / checkout(version) lifecycle
  • Read-only atlas views for training and analysis
  • See docs/versioning.md

The Ragged Atlas

The core abstraction in Homeobox is the Ragged Atlas, which is designed to support heterogeneous datasets without shared feature spaces. Some motivating use cases are:

  • Hundreds or thousands of h5ad or h5mu files from different assays, panels, and organisms that you want to query and train on as a single collection.
  • Repositories of large images stored in Zarr / OME-Zarr, DICOM, or TIFF — 2D, 3D, or 4D, sometimes >1 TB each, with associated text descriptions.
  • Single-cell images, masks, and associated feature data (e.g. CellProfiler vectors).
  • Any combination of the above, in one queryable store.

Existing tools optimize for single large datasets from one modality. Homeobox's RaggedAtlas allows a shared obs table and search indexes while letting each dataset retain its own feature axis.

At query time, reconstruction joins the feature spaces on the fly and returns a single AnnData / MuData with every column correctly placed.
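To make the idea of joining feature spaces concrete, here is a minimal conceptual sketch in NumPy. It is not Homeobox's implementation: the helper `join_feature_spaces`, the toy gene panels, and the choice to fill unmeasured features with 0 are all assumptions made for illustration.

```python
import numpy as np


def join_feature_spaces(datasets):
    """Stack datasets with different feature lists onto one union feature axis.

    `datasets` is a list of (feature_names, matrix) pairs, where each matrix
    has shape (n_cells, len(feature_names)). Features a dataset did not
    measure are filled with 0 (an assumption of this sketch).
    """
    # Build the union axis, preserving first-seen order.
    union, seen = [], set()
    for names, _ in datasets:
        for name in names:
            if name not in seen:
                seen.add(name)
                union.append(name)
    column_of = {name: j for j, name in enumerate(union)}

    n_rows = sum(matrix.shape[0] for _, matrix in datasets)
    joined = np.zeros((n_rows, len(union)), dtype=np.uint32)

    # Copy each dataset's columns into their positions on the union axis.
    row = 0
    for names, matrix in datasets:
        cols = [column_of[name] for name in names]
        joined[row:row + matrix.shape[0], cols] = matrix
        row += matrix.shape[0]
    return union, joined


# Two toy datasets measured on overlapping but different gene panels.
panel_a = (["CD3E", "CD19"], np.array([[5, 0], [2, 1]], dtype=np.uint32))
panel_b = (["CD19", "MS4A1", "NKG7"], np.array([[7, 3, 0]], dtype=np.uint32))

features, X = join_feature_spaces([panel_a, panel_b])
print(features)
print(X)
```

Each dataset keeps its own feature list on disk; only the query result is laid out on a shared axis.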


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.12 or newer.

pip install homeobox            # core: atlas, querying, ingestion
pip install "homeobox[ml]"      # + PyTorch dataloader
pip install "homeobox[io]"      # + S3/GCS/Azure
pip install "homeobox[viz]"     # + marimo, matplotlib
pip install "homeobox[all]"     # everything (quotes keep shells such as zsh from expanding the brackets)

The quickstart below also uses scanpy to fetch a small example dataset:

pip install scanpy

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
uv run maturin develop --release

Example Notebooks

Notebook | Description
explore_perturbation_atlas_colab.py (Colab) | Explore an atlas with 120M+ cells, over 130,000 genetic, chemical, and biologic perturbations, and 5 modalities.

Quickstart

import numpy as np
import pandas as pd
import polars as pl
import scanpy as sc
import homeobox as hox

# 1. Define schemas: one for gene features, one for cell metadata.
#    `StableUIDField` marks `gene_symbol` as the deterministic source of
#    `uid` (so parallel ingest jobs converge on the same uid for the same
#    gene). Each pointer column is declared with `PointerField.declare`,
#    which binds the column name to a registered feature_space.
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str = hox.StableUIDField.declare(default=...)

class CellSchema(hox.HoxBaseSchema):
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )

# 2. Create an atlas
atlas = hox.create_or_open_atlas(
    atlas_path="./hox_example_atlas",
    obs_schemas={"cells": CellSchema},
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetSchema,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset
adata = sc.datasets.pbmc3k()  # 2,700 PBMCs, raw counts, sparse CSR
adata.X = adata.X.astype(np.uint32)  # the counts layer must be np.uint32

# 4. Build the var DataFrame (one row per local feature, columns matching
#    the registry schema + `uid`), use it for both feature registration and
#    as adata.var. `compute_stable_uids` writes deterministic uids in place.
var_df = pd.DataFrame(
    {"gene_symbol": adata.var_names.tolist()},
    index=adata.var_names,
)
GeneFeature.compute_stable_uids(var_df)
atlas.register_features("gene_expression", pl.from_pandas(var_df))
adata.var = var_df

# 5. Ingest. `field_name` selects the cell-schema column to populate;
#    its feature_space is resolved from PointerField.declare.
record = hox.DatasetSchema(
    zarr_group="pbmc3k", feature_space="gene_expression", n_rows=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, field_name="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 6. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 7. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest("./hox_example_atlas")
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738
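Step 1 relies on uids being deterministic so that parallel ingest jobs agree on the uid for the same gene. How Homeobox derives them is not shown here; the sketch below illustrates the general idea with name-based UUIDs from the standard library. The namespace string and the `stable_uid` helper are hypothetical.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as every job uses the same one.
GENE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org/gene_symbol")


def stable_uid(gene_symbol: str) -> str:
    """Derive a deterministic uid from a gene symbol (illustrative only)."""
    return uuid.uuid5(GENE_NAMESPACE, gene_symbol).hex


# Two independent jobs see the same genes in a different order.
job_a = [stable_uid(symbol) for symbol in ["CD3E", "MS4A1"]]
job_b = [stable_uid(symbol) for symbol in ["MS4A1", "CD3E"]]

# Both jobs produce the same uid for the same gene, with no coordination.
print(set(job_a) == set(job_b))
```

Because the uid is a pure function of the gene symbol, no shared counter or lock is needed between jobs.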

Multimodal in one row

The same shape scales to any number of modalities — declare one pointer column per feature space on a single obs schema:

class MultimodalCell(hox.HoxBaseSchema):
    # Shared obs fields
    cell_type: str | None
    tissue: str | None

    # Optional pointers — cells measured by only one assay are first-class,
    # no padding rows, no presence flags inserted at ingest.
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )
    protein_abundance: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="protein_abundance"
    )
    image_tiles: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="image_tiles"
    )

A query against this atlas streams within-row multimodal batches through a single DataLoader, regardless of how many modalities each cell has. See homeobox_examples/multimodal_perturbation_atlas/schema.py for a five-modality production schema (gene expression, chromatin accessibility, protein abundance, image features, image tiles) plus perturbation, publication, and donor tables.
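As a rough illustration of what a within-row multimodal batch can look like when pointers are optional, the sketch below groups a batch by modality and records which rows carry each one. The `Row` dataclass and `collate` function are hypothetical stand-ins written in plain Python; they are not the Homeobox dataloader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """One obs row; a modality that was not measured is simply None."""
    cell_id: str
    gene_expression: Optional[list] = None
    protein_abundance: Optional[list] = None


MODALITIES = ("gene_expression", "protein_abundance")


def collate(rows):
    """Group a batch by modality, keeping only the rows that carry each one."""
    batch = {}
    for name in MODALITIES:
        present = [
            (i, getattr(row, name))
            for i, row in enumerate(rows)
            if getattr(row, name) is not None
        ]
        batch[name] = {
            "row_index": [i for i, _ in present],
            "values": [values for _, values in present],
        }
    return batch


rows = [
    Row("c1", gene_expression=[0, 3, 1]),
    Row("c2", gene_expression=[2, 0, 0], protein_abundance=[0.4, 1.2]),
    Row("c3", protein_abundance=[0.9, 0.1]),
]
batch = collate(rows)
print(batch["gene_expression"]["row_index"])    # rows that carry RNA
print(batch["protein_abundance"]["row_index"])  # rows that carry protein
```

No padding rows are needed: a cell measured by one assay simply does not appear in the other modality's index list.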


Dataloaders and performance

Homeobox is intended to be the source of truth for analysis and model training, not just a staging format. The same snapshot you query can feed a PyTorch training loop.

Homeobox exposes two PyTorch dataset surfaces over the same atlas:

  • Homeobox-Iter: fully random iterable streaming. It reads large shuffled I/O blocks through a background prefetcher and slices training batches from that queue, which maximizes throughput for standard full-atlas training epochs.
  • Homeobox-Map: map-style random access. It supports __getitem__(indices) so regular PyTorch samplers, group-aware samplers, custom subsets, and perturbation-style batches can read arbitrary rows.
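The difference between the two access patterns can be sketched in plain Python. This is a simplification under stated assumptions: the class names are hypothetical, there is no background prefetcher, and shuffling happens only across and within blocks, whereas Homeobox-Iter is described as fully random. In a real training loop these would typically subclass PyTorch's `IterableDataset` and `Dataset`.

```python
import random


class BlockShuffledIterable:
    """Iterable-style access: visit blocks in random order, shuffle inside each block."""

    def __init__(self, n_rows, block_size, batch_size, seed=0):
        self.n_rows = n_rows
        self.block_size = block_size
        self.batch_size = batch_size
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        starts = list(range(0, self.n_rows, self.block_size))
        rng.shuffle(starts)  # large contiguous reads, taken in random order
        for start in starts:
            stop = min(start + self.block_size, self.n_rows)
            block = list(range(start, stop))
            rng.shuffle(block)  # shuffle rows inside the block
            for i in range(0, len(block), self.batch_size):
                yield block[i:i + self.batch_size]


class MapStyle:
    """Map-style access: any sampler can request arbitrary row indices."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, indices):
        return [self.rows[i] for i in indices]


stream = BlockShuffledIterable(n_rows=10, block_size=4, batch_size=2)
visited = sorted(i for batch in stream for i in batch)
print(visited == list(range(10)))  # every row is visited once per epoch

table = MapStyle([f"cell_{i}" for i in range(10)])
print(table[[7, 0, 3]])
```

The iterable pattern trades sampling flexibility for sequential I/O, while the map-style pattern accepts slower scattered reads in exchange for arbitrary samplers.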

Capability summary from the benchmark suite:

System | Map-style | Remote storage | Training-only format | Versioned snapshots | Ragged features
Homeobox-Map
Homeobox-Iter
SLAF
scDataset
AnnDataLoader
AnnLoader
BioNeMo SCDL
annbatch
TileDB-SOMA
cell-load

In this table, "training-only format" means the data must be copied into a layout that exists only to feed a training loop; a dash is better. "Ragged features" means datasets with different feature sets can coexist without padding to a union or intersecting to common features.

On a 1M-cell × 20k-gene synthetic atlas, the Homeobox iterable dataloader sustains ~70k cells/sec on local NVMe, saturating local disk, and ~40k cells/sec streaming from S3 with workers=0. In the S3 table below, that is roughly 4× to 7× the throughput of the next-fastest remote-capable system, depending on batch size.

Local throughput on NVMe, cells/sec at workers=0:

System | b=64 | b=512 | b=4096
Homeobox-Iter | 69,658 | 73,171 | 72,548
annbatch | 56,154 | 67,459 | 76,314
BioNeMo SCDL | 5,455 | 72,570 | 66,124
scDataset | 28,151 | 41,525 | 52,923
SLAF | 30,118 | 33,374 | 37,940
AnnDataLoader | 21,446 | 25,926 | 26,403
Homeobox-Map | 9,553 | 22,749 | 25,049
TileDB-SOMA | 11,268 | 11,972 | 12,153
AnnLoader | 10,509 | 12,699 | 10,656

Remote throughput from S3, cells/sec at workers=0:

System | b=64 | b=512 | b=4096
Homeobox-Iter | 40,378 | 42,344 | 41,453
SLAF | 3,611 | 4,233 | 10,320
TileDB-SOMA | 5,873 | 5,845 | 5,945
Homeobox-Map | 576 | 1,884 | 3,300
annbatch | 1,050 | 1,314 | 1,594
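The gap between Homeobox-Iter and the other remote-capable systems can be worked out directly from the S3 numbers above. The short script below copies those figures and, for each batch size, divides Homeobox-Iter by the best other system; the helper name is just for illustration.

```python
# Remote S3 throughput in cells/sec, copied from the table above.
remote = {
    "Homeobox-Iter": {64: 40378, 512: 42344, 4096: 41453},
    "SLAF": {64: 3611, 512: 4233, 4096: 10320},
    "TileDB-SOMA": {64: 5873, 512: 5845, 4096: 5945},
    "Homeobox-Map": {64: 576, 512: 1884, 4096: 3300},
    "annbatch": {64: 1050, 512: 1314, 4096: 1594},
}


def speedup_over_runner_up(table, system):
    """For each batch size, divide `system` by the best of the other systems."""
    ratios = {}
    for batch_size, value in table[system].items():
        runner_up = max(
            rates[batch_size] for name, rates in table.items() if name != system
        )
        ratios[batch_size] = value / runner_up
    return ratios


for batch_size, ratio in speedup_over_runner_up(remote, "Homeobox-Iter").items():
    print(f"b={batch_size}: {ratio:.1f}x")
```

The runner-up changes with batch size (TileDB-SOMA at the two smaller sizes, SLAF at b=4096), which is why the ratio narrows from about 7x to about 4x.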

Perturbation-style group-aware random reads, cells/sec at workers=0:

System | b=64 | b=512 | b=1024
Homeobox-Map | 9,842 | 13,677 | 12,265
cell-load | 4,936 | 26,678 | 27,096
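For readers unfamiliar with group-aware reads, the sketch below shows one way a sampler can build batches in which every row shares a single perturbation label. It assumes that this is what "perturbation-style" means in the benchmark; the function and the toy labels are hypothetical and do not come from Homeobox or cell-load.

```python
import random
from collections import defaultdict


def group_aware_batches(labels, batch_size, seed=0):
    """Return batches of row indices in which every row shares one label."""
    rng = random.Random(seed)

    # Bucket row indices by their group label.
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)

    # Shuffle within each group, then cut each group into batches.
    batches = []
    for indices in by_group.values():
        rng.shuffle(indices)
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])

    rng.shuffle(batches)  # interleave groups across the epoch
    return batches


labels = ["KO_A", "KO_B", "KO_A", "ctrl", "KO_B", "KO_A", "ctrl"]
for batch in group_aware_batches(labels, batch_size=2):
    print(batch, {labels[i] for i in batch})
```

Batches like these request scattered rows, which is the access pattern a map-style dataset with `__getitem__(indices)` is meant to serve.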

See docs/dataloader_benchmark.md for the full sweep across the benchmarked dataloaders, including SLAF, scDataset, BioNeMo SCDL, annbatch, TileDB-SOMA, cell-load, and the two Homeobox surfaces, with local/remote/perturbation workloads, memory profiles, and reproducible scripts.

Download files

Source Distribution

homeobox-0.2.7.tar.gz (906.1 kB view details)

Uploaded Source

Built Distributions

homeobox-0.2.7-cp312-abi3-win_amd64.whl (5.6 MB view details)

Uploaded CPython 3.12+Windows x86-64

homeobox-0.2.7-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB view details)

Uploaded CPython 3.12+manylinux: glibc 2.17+ x86-64

homeobox-0.2.7-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.8 MB view details)

Uploaded CPython 3.12+manylinux: glibc 2.17+ ARM64

homeobox-0.2.7-cp312-abi3-macosx_11_0_arm64.whl (6.0 MB view details)

Uploaded CPython 3.12+macOS 11.0+ ARM64

homeobox-0.2.7-cp312-abi3-macosx_10_12_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.12+macOS 10.12+ x86-64

File details

Details for the file homeobox-0.2.7.tar.gz.

File metadata

  • Download URL: homeobox-0.2.7.tar.gz
  • Upload date:
  • Size: 906.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for homeobox-0.2.7.tar.gz
Algorithm Hash digest
SHA256 e91a87d917626913e61151b62db752e7f3618328f271196bcb3564e12bbbbde7
MD5 470c1ed4f0bfe8b9d5893690d66013da
BLAKE2b-256 15832f1efbe84e9ce9bb2830f062b1147ff40397441fb10af47cbd57ec2df2b3

See more details on using hashes here.
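A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib`. This is a minimal sketch: the helper names are illustrative, and the local path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib

# SHA256 digest listed above for homeobox-0.2.7.tar.gz.
EXPECTED = "e91a87d917626913e61151b62db752e7f3618328f271196bcb3564e12bbbbde7"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED):
    """True if the file at `path` hashes to the expected digest."""
    return sha256_of(path) == expected


# Usage, assuming the archive is in the current directory:
#   matches_published_digest("homeobox-0.2.7.tar.gz")
```

Reading in chunks keeps memory use flat regardless of file size.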

Provenance

The following attestation bundles were made for homeobox-0.2.7.tar.gz:

Publisher: release.yml on epiblastai/homeobox

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file homeobox-0.2.7-cp312-abi3-win_amd64.whl.

File metadata

  • Download URL: homeobox-0.2.7-cp312-abi3-win_amd64.whl
  • Upload date:
  • Size: 5.6 MB
  • Tags: CPython 3.12+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for homeobox-0.2.7-cp312-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 e54a4112a69bc7e9bd4228204447c865cb592468ac5e4ffcd6ff4b44dc2a5cfe
MD5 d4c92dde1fca179a8ec0a9b3bdd50b8d
BLAKE2b-256 256571dd971c8b3fa16c48e506079adef30e0eb9a13829d419782f45fe4e097e

Provenance

The following attestation bundles were made for homeobox-0.2.7-cp312-abi3-win_amd64.whl:

Publisher: release.yml on epiblastai/homeobox

File details

Details for the file homeobox-0.2.7-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for homeobox-0.2.7-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9172b4982a49433cc2b6bb7bb2f84a4a0787fe505f1ab6dbbb9910c17ed765bb
MD5 729a0fc6bd926f4dd59e2af9a22281e6
BLAKE2b-256 4158ec39e3e970c4863677ccf209560e234d439a625ada0233959a3416924d8e

Provenance

The following attestation bundles were made for homeobox-0.2.7-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on epiblastai/homeobox

File details

Details for the file homeobox-0.2.7-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for homeobox-0.2.7-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 687c9691c33bb61f54ff1873846c49d2bf36ab9aed43591fe4a6418b11cdc826
MD5 bc8fc7eeba72e1a3dd64ffbdc2f8d1e2
BLAKE2b-256 6b1e0f5ecd50346f88441a3a0c02e1ec6bd24da86d1f521924d1e1bc881d6d9d

Provenance

The following attestation bundles were made for homeobox-0.2.7-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on epiblastai/homeobox

File details

Details for the file homeobox-0.2.7-cp312-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for homeobox-0.2.7-cp312-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d01acedd909392f1f9441d31ad671388386fddd2e54f1ef8af6375190589ac1d
MD5 d71ba029bd1a24244735b4d3c5c83997
BLAKE2b-256 b52358fa79deb8cf3e5dd179e3a2940cac119f1cb46e9c4d27bc2f9e5b563e11

Provenance

The following attestation bundles were made for homeobox-0.2.7-cp312-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on epiblastai/homeobox

File details

Details for the file homeobox-0.2.7-cp312-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for homeobox-0.2.7-cp312-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 e6da2700a57d5ed29095461733b5031f42e2695b87313466de1e49714262c0af
MD5 1a6fc85a5bffca9d06e888a690d11a6b
BLAKE2b-256 cfca4f4bac8825004c495ceb55d63cca14f9e1f0a5973a413c6ee9bca51bb811

Provenance

The following attestation bundles were made for homeobox-0.2.7-cp312-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on epiblastai/homeobox
