Multimodal biomedical atlas builder

homeobox

Homeobox is a multimodal database for interactive analysis and ML training at scale. It combines the search and versioning capabilities of LanceDB with the scalable array storage of Zarr.

A single homeobox atlas can hold sparse single-cell gene expression, dense protein and embedding features, 2D/3D/4D/5D images, biomolecular structures, and free text. A single dataloader streams batches across all of them with no intermediate ML-only copies and no modality-specific entrypoints. Our design philosophy is to be extremely flexible while remaining fast.


Why homeobox

[Figure: Multimodal schema with auxiliary metadata tables]

Motivating cases

  • Hundreds or thousands of h5ad or h5mu files from different assays, panels, and organisms that you want to query and train on as a single collection.
  • Repositories of large images stored in Zarr / OME-Zarr, DICOM, or TIFF — 2D, 3D, or 4D, sometimes >1 TB each, with associated text descriptions.
  • Single-cell images, masks, and associated feature data (e.g. CellProfiler vectors).
  • Any combination of the above, in one queryable store.

Existing tools tend to optimise for single large datasets from a single modality, often through a laborious standardisation step that drops or duplicates data to fit a rectangular schema. Homeobox's RaggedAtlas unifies heterogeneous data into a single store that supports SQL / vector / full-text search, interactive AnnData / MuData reconstruction, and ML streaming — without that flattening step.

Ragged feature spaces, unified obs

Real-world atlases pull together datasets that were not designed to be compatible: different feature panels, different assays and imaging modalities, different metadata fields. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).

A RaggedAtlas keeps a single shared obs table while letting each dataset retain its own feature axis (or no features at all, for raw images). The obs table lives in LanceDB; each dataset occupies its own Zarr group with its own feature ordering; every row carries a pointer into its group.

At query time, the reconstruction layer joins the feature spaces on the fly: it computes the union or intersection of global feature indices, scatters each group's data into the right columns, and returns a single AnnData / MuData with every row correctly placed. Nothing is dropped at ingest, and there is no ambiguity about whether a value is a true zero or padding.
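
The join described above can be sketched with plain NumPy. This is a toy illustration under stated assumptions, not homeobox's reconstruction layer: the `groups` dictionary, the `reconstruct` helper, and the `present` mask are hypothetical names, and gene symbols stand in for global feature indices.

```python
import numpy as np

# Hypothetical toy data: two datasets measured on different gene panels.
# Each group keeps its own local feature order.
groups = {
    "ds_a": {"features": ["CD3E", "CD4", "MS4A1"],
             "X": np.array([[1, 0, 2], [0, 3, 0]])},
    "ds_b": {"features": ["MS4A1", "CD8A"],
             "X": np.array([[5, 1]])},
}


def reconstruct(groups, how="union"):
    """Join per-group matrices on one shared feature axis.

    Returns (matrix, feature_names, present). `present` marks entries that
    were actually measured, so a stored zero is distinguishable from a
    column the dataset never had.
    """
    feature_sets = [set(g["features"]) for g in groups.values()]
    if how == "union":
        joined = set.union(*feature_sets)
    elif how == "intersection":
        joined = set.intersection(*feature_sets)
    else:
        raise ValueError(f"unknown join: {how!r}")

    names = sorted(joined)
    col_of = {name: j for j, name in enumerate(names)}

    n_rows = sum(g["X"].shape[0] for g in groups.values())
    out = np.zeros((n_rows, len(names)), dtype=np.int64)
    present = np.zeros_like(out, dtype=bool)

    row = 0
    for g in groups.values():
        # Local columns that survive the join, and where they land globally.
        local = [i for i, f in enumerate(g["features"]) if f in col_of]
        target = [col_of[g["features"][i]] for i in local]
        n = g["X"].shape[0]
        # Scatter this group's block into the joined column positions.
        out[row:row + n, target] = g["X"][:, local]
        present[row:row + n, target] = True
        row += n
    return out, names, present


X_union, names_union, present = reconstruct(groups, how="union")
X_inter, names_inter, _ = reconstruct(groups, how="intersection")
print(names_union, X_union.shape)
print(names_inter, X_inter.shape)
```

In the union result, the `ds_b` row has `present` set only for `CD8A` and `MS4A1`, which is the distinction between a measured zero and a missing feature that the paragraph above refers to.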


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.13.

pip install homeobox            # core: atlas, querying, ingestion
pip install "homeobox[ml]"      # + PyTorch dataloader
pip install "homeobox[io]"      # + S3/GCS/Azure
pip install "homeobox[viz]"     # + marimo, matplotlib
pip install "homeobox[all]"     # everything (quotes keep shells such as zsh from expanding the brackets)

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release

Quickstart

import numpy as np
import pandas as pd
import polars as pl
import scanpy as sc
import homeobox as hox

# 1. Define schemas: one for gene features, one for cell metadata.
#    `StableUIDField` marks `gene_symbol` as the deterministic source of
#    `uid` (so parallel ingest jobs converge on the same uid for the same
#    gene). Each pointer column is declared with `PointerField.declare`,
#    which binds the column name to a registered feature_space.
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str = hox.StableUIDField.declare(default=...)

class CellSchema(hox.HoxBaseSchema):
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )

# 2. Create an atlas
atlas = hox.create_or_open_atlas(
    atlas_path="./hox_example_atlas",
    obs_schemas={"cells": CellSchema},
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetSchema,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset
adata = sc.datasets.pbmc3k()  # 2 700 PBMCs, raw counts, sparse CSR
adata.X = adata.X.astype(np.uint32)  # the counts layer must be np.uint32

# 4. Build the var DataFrame (one row per local feature, columns matching
#    the registry schema + `uid`) and use it both for feature registration
#    and as adata.var. `compute_stable_uids` writes deterministic uids in place.
var_df = pd.DataFrame(
    {"gene_symbol": adata.var_names.tolist()},
    index=adata.var_names,
)
GeneFeature.compute_stable_uids(var_df)
atlas.register_features("gene_expression", pl.from_pandas(var_df))
adata.var = var_df

# 5. Ingest. `field_name` selects the cell-schema column to populate;
#    its feature_space is resolved from PointerField.declare.
record = hox.DatasetSchema(
    zarr_group="pbmc3k", feature_space="gene_expression", n_rows=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, field_name="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 6. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 7. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest("./hox_example_atlas")
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738
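
The comment in step 1 says that `StableUIDField` lets parallel ingest jobs converge on the same uid for the same gene. One common way to get that property is to derive the id from the key itself with a name-based UUID. The sketch below uses the standard library's `uuid.uuid5` to show the idea only; the namespace, the `stable_uid` helper, and the hex format are assumptions, not homeobox's actual derivation.

```python
import uuid

# Hypothetical namespace. Any fixed UUID works as long as every worker
# uses the same one.
GENE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid/gene_expression")


def stable_uid(gene_symbol: str) -> str:
    """Derive a deterministic id from the gene symbol alone."""
    return uuid.uuid5(GENE_NAMESPACE, gene_symbol).hex


# Two independent "workers" that see overlapping gene panels.
worker_a = {g: stable_uid(g) for g in ["CD3E", "MS4A1"]}
worker_b = {g: stable_uid(g) for g in ["MS4A1", "CD8A"]}

# The shared gene gets the same uid without any coordination.
print(worker_a["MS4A1"] == worker_b["MS4A1"])
```

Because the id depends only on the input string, no lock or central counter is needed to keep concurrent jobs consistent.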

Multimodal in one row

The same shape scales to any number of modalities — declare one pointer column per feature space on a single obs schema:

class MultimodalCell(hox.HoxBaseSchema):
    # Shared obs fields
    cell_type: str | None
    tissue: str | None

    # Optional pointers — cells measured by only one assay are first-class,
    # no padding rows, no presence flags inserted at ingest.
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )
    protein_abundance: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="protein_abundance"
    )
    image_tiles: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="image_tiles"
    )

A query against this atlas streams within-row multimodal batches through a single DataLoader, regardless of how many modalities each cell has. See homeobox_examples/multimodal_perturbation_atlas/schema.py for a five-modality production schema (gene expression, chromatin accessibility, protein abundance, image features, image tiles) plus perturbation, publication, and donor tables.
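
To make "within-row multimodal batches" concrete, here is a minimal sketch of how rows with optional modalities could be collated without padding. It assumes rows are plain dictionaries and is not the batch format of the homeobox DataLoader; `rows`, `MODALITIES`, and `collate` are hypothetical names.

```python
import numpy as np

MODALITIES = ("gene_expression", "protein_abundance")

# Hypothetical rows: each cell carries only the modalities it was measured with.
rows = [
    {"cell_type": "T cell",
     "gene_expression": np.array([1, 0, 2]),
     "protein_abundance": np.array([0.5, 1.5])},
    {"cell_type": "B cell",
     "gene_expression": np.array([0, 4, 0]),
     "protein_abundance": None},
    {"cell_type": None,
     "gene_expression": None,
     "protein_abundance": np.array([2.0, 0.1])},
]


def collate(rows):
    """Stack each modality over the rows that have it.

    For every modality the batch keeps the stacked values and the positions
    of the contributing rows, so nothing is padded and the row alignment
    can still be recovered.
    """
    batch = {"cell_type": [r["cell_type"] for r in rows]}
    for name in MODALITIES:
        idx = [i for i, r in enumerate(rows) if r[name] is not None]
        batch[name] = {
            "rows": np.array(idx, dtype=np.int64),
            "values": np.stack([rows[i][name] for i in idx]) if idx else None,
        }
    return batch


batch = collate(rows)
for name in MODALITIES:
    print(name, batch[name]["rows"], batch[name]["values"].shape)
```

A model can then process each modality as a dense block and use the `rows` index to merge per-cell outputs back together.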


Example Notebooks

The notebooks/ directory contains self-contained marimo notebooks that work after a plain pip install homeobox (no repo clone needed).

Notebook | Description
multimodal_perturbation_atlas.py | Explore an agent-curated atlas of 120M+ cells with over 130,000 genetic, chemical, and biologic perturbations across 5 modalities.

Performance

Beyond raw numbers, the case for homeobox is generality and integration. One library handles cell tables, sparse matrices, dense features, images, embeddings, and text, so there is no separate stack for non-tabular modalities. New modalities are added by writing a feature-space spec, not by waiting for upstream support. And because storage is plain LanceDB + Zarr, homeobox interoperates directly with the broader Python + Rust data ecosystem (Lance, DuckDB, Polars, zarrs).

On a 1M-cell × 20k-gene synthetic atlas, the homeobox iterable dataloader sustains ~70k cells/sec on local NVMe and ~40k cells/sec streaming from S3 with a single worker. The local figure saturates the disk, and the S3 figure is roughly an order of magnitude faster than the next remote-capable system in the sweep.

[Figure: Remote throughput vs batch size]

See docs/dataloader_benchmark.md for the full sweep across nine dataloaders, including SLAF, scDataset, BioNeMo SCDL, annbatch, TileDB-SOMA, cell-load, and the two homeobox surfaces. The sweep covers local, remote, and perturbation workloads, memory profiles, and reproducible scripts.


Versioning

Homeobox separates the writable ingest path from the read/query path with an explicit snapshot model:

  • Ingest writes Zarr arrays and cell records freely (in parallel if needed).
  • optimize() compacts Lance fragments and rebuilds indexes.
  • snapshot() validates consistency and records the current Lance table versions.
  • checkout(version) opens a read-only atlas pinned to that snapshot.

Queries and training runs execute against a frozen, reproducible view; concurrent ingestion does not affect any checked-out handle. See docs/versioning.md for the full lifecycle.
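
The snapshot model can be illustrated with a small in-memory toy. This is a sketch of the general idea (append-only table versions plus a record of which version each snapshot pins), not homeobox's or Lance's implementation; `ToyAtlas` and its methods are hypothetical, and compaction via `optimize()` is not modelled.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass
class ToyAtlas:
    # table name -> list of versions; each version is an immutable tuple of rows
    tables: dict = field(default_factory=dict)
    # each snapshot maps table name -> pinned version number
    snapshots: list = field(default_factory=list)

    def append(self, table, row):
        """Every write creates a new table version; old versions are kept."""
        versions = self.tables.setdefault(table, [()])
        versions.append(versions[-1] + (row,))

    def snapshot(self):
        """Record the current version of every table and return a snapshot id."""
        pinned = {name: len(v) - 1 for name, v in self.tables.items()}
        self.snapshots.append(pinned)
        return len(self.snapshots) - 1

    def checkout(self, version):
        """Return a read-only view pinned to the table versions of a snapshot."""
        pinned = self.snapshots[version]
        return MappingProxyType(
            {name: self.tables[name][v] for name, v in pinned.items()}
        )


atlas = ToyAtlas()
atlas.append("cells", {"id": 1})
v0 = atlas.snapshot()
view = atlas.checkout(v0)

# Ingestion continues after the checkout...
atlas.append("cells", {"id": 2})

# ...but the pinned view still sees exactly one row.
print(len(view["cells"]))
print(len(atlas.checkout(atlas.snapshot())["cells"]))
```

The checked-out view never changes because it refers to a specific stored version, which is what makes queries and training runs against it reproducible.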
