Multimodal biomedical atlas builder

Homeobox

[Figure: multimodal schema with auxiliary metadata tables]

Homeobox is a database for multimodal biomedical atlases that do not fit cleanly into one matrix, one modality, or one shared feature space.

A single Homeobox atlas can hold sparse single-cell gene expression, dense protein and embedding features, 2D/3D/4D/5D images, biomolecular structures, free text, and auxiliary metadata tables. You can query it, snapshot it, reconstruct results as AnnData / MuData, and stream batches to PyTorch without creating separate ML-only copies.

Under the hood, Homeobox combines the search and versioning capabilities of LanceDB with the array storage of Zarr.


How it compares to existing tools

If your main problem is... | You probably want...
Querying, versioning, reconstructing, and training from many heterogeneous biomedical datasets with different feature spaces | Homeobox
Dissatisfaction with TileDB's ML support and developer experience | Homeobox
Analyzing one clean matrix or a small number of aligned modalities | AnnData / MuData directly
Metadata, vector, or text search without large array payloads | LanceDB, a vector database, or a regular database

At a glance

Multimodal storage:

  • Gene expression, chromatin accessibility, protein abundance
  • Images, image features, embeddings
  • Biomolecular structures and text

ML-ready access:

  • Fully random iterable streaming for throughput
  • Map-style random access for arbitrary samplers
  • No intermediate training-only copies

Query and reconstruction:

  • SQL / vector / full-text search over LanceDB metadata
  • Reconstruct query results as AnnData or MuData
  • Zarr-backed sparse and dense payloads

Reproducibility:

  • Explicit snapshot() / checkout(version) lifecycle
  • Read-only atlas views for training and analysis
  • See docs/versioning.md

The Ragged Atlas

The core abstraction in Homeobox is the Ragged Atlas, which is designed to support heterogeneous datasets without shared feature spaces. Some motivating use cases are:

  • Hundreds or thousands of h5ad or h5mu files from different assays, panels, and organisms that you want to query and train on as a single collection.
  • Repositories of large images stored in Zarr / OME-Zarr, DICOM, or TIFF — 2D, 3D, or 4D, sometimes >1 TB each, with associated text descriptions.
  • Single-cell images, masks, and associated feature data (e.g. CellProfiler vectors).
  • Any combination of the above, in one queryable store.

Existing tools optimize for single large datasets from one modality. Homeobox's RaggedAtlas allows a shared obs table and search indexes while letting each dataset retain its own feature axis.

At query time, reconstruction joins the feature spaces on the fly and returns a single AnnData / MuData with every column correctly placed.
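
To make the reconstruction step concrete, here is a minimal sketch in plain Python of an outer join over per-dataset feature axes. It is not Homeobox's implementation: `reconstruct_union` and the toy dataset layout are hypothetical, and unmeasured features are filled with 0 purely for illustration.

```python
def reconstruct_union(datasets):
    """Join datasets with different feature axes into one dense table.

    `datasets` maps a dataset name to (feature_names, rows), where each row
    is a list of values aligned to that dataset's own feature_names.
    Returns (union_features, obs_labels, matrix).
    """
    # Build the union feature axis, preserving first-seen order.
    union = []
    position = {}
    for features, _ in datasets.values():
        for name in features:
            if name not in position:
                position[name] = len(union)
                union.append(name)

    obs_labels, matrix = [], []
    for ds_name, (features, rows) in datasets.items():
        # Map this dataset's local columns onto the union columns.
        cols = [position[name] for name in features]
        for i, row in enumerate(rows):
            out = [0] * len(union)  # illustrative fill for unmeasured features
            for col, value in zip(cols, row):
                out[col] = value
            matrix.append(out)
            obs_labels.append(f"{ds_name}:{i}")
    return union, obs_labels, matrix


toy = {
    "panel_a": (["CD3E", "CD4"], [[5, 1], [0, 2]]),
    "panel_b": (["CD4", "MS4A1", "CD19"], [[3, 7, 4]]),
}
features, obs, X = reconstruct_union(toy)
print(features)  # ['CD3E', 'CD4', 'MS4A1', 'CD19']
print(X)         # [[5, 1, 0, 0], [0, 2, 0, 0], [0, 3, 7, 4]]
```

Each dataset keeps its own column order on disk; only the lookup from local column to union column is computed at query time, which is the property the paragraph above describes.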


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.12 or newer.

pip install homeobox            # core: atlas, querying, ingestion
pip install "homeobox[ml]"      # + PyTorch dataloader
pip install "homeobox[io]"      # + S3/GCS/Azure
pip install "homeobox[viz]"     # + marimo, matplotlib
pip install "homeobox[all]"     # everything (quotes keep shells such as zsh from expanding the brackets)

The quickstart below also uses scanpy to fetch a small example dataset:

pip install scanpy
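
After installing, a quick sanity check can confirm the Python version and which optional modules are importable. The sketch below uses only the standard library. `check_environment` is a hypothetical helper, and the module list is an assumption drawn from the extras comments above (`torch` is PyTorch's import name); the `io` extra is left out because its packages are not named here.

```python
import importlib.util
import sys

# Modules named in the extras comments; scanpy is needed only by the quickstart.
OPTIONAL_MODULES = ["torch", "marimo", "matplotlib", "scanpy"]


def check_environment(modules=OPTIONAL_MODULES):
    """Report whether the interpreter and optional modules look usable."""
    report = {"python>=3.12": sys.version_info >= (3, 12)}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item:<14} {'ok' if ok else 'missing'}")
```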

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
uv run maturin develop --release

Example Notebooks

Notebook | Description
explore_perturbation_atlas_colab.py (Colab) | Explore an atlas with 120M+ cells, over 130,000 genetic, chemical, and biologic perturbations, and 5 modalities.

Quickstart

import numpy as np
import pandas as pd
import polars as pl
import scanpy as sc
import homeobox as hox

# 1. Define schemas: one for gene features, one for cell metadata.
#    `StableUIDField` marks `gene_symbol` as the deterministic source of
#    `uid` (so parallel ingest jobs converge on the same uid for the same
#    gene). Each pointer column is declared with `PointerField.declare`,
#    which binds the column name to a registered feature_space.
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str = hox.StableUIDField.declare(default=...)

class CellSchema(hox.HoxBaseSchema):
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )

# 2. Create an atlas
atlas = hox.create_or_open_atlas(
    atlas_path="./hox_example_atlas",
    obs_schemas={"cells": CellSchema},
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetSchema,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset
adata = sc.datasets.pbmc3k()  # 2,700 PBMCs, raw counts, sparse CSR
adata.X = adata.X.astype(np.uint32)  # the counts layer must be np.uint32

# 4. Build the var DataFrame (one row per local feature, columns matching
#    the registry schema + `uid`), use it for both feature registration and
#    as adata.var. `compute_stable_uids` writes deterministic uids in place.
var_df = pd.DataFrame(
    {"gene_symbol": adata.var_names.tolist()},
    index=adata.var_names,
)
GeneFeature.compute_stable_uids(var_df)
atlas.register_features("gene_expression", pl.from_pandas(var_df))
adata.var = var_df

# 5. Ingest. `field_name` selects the cell-schema column to populate;
#    its feature_space is resolved from PointerField.declare.
record = hox.DatasetSchema(
    zarr_group="pbmc3k", feature_space="gene_expression", n_rows=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, field_name="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 6. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 7. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest("./hox_example_atlas")
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738
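
Step 1 of the quickstart relies on uids being deterministic so that parallel ingest jobs agree on the same uid for the same gene. The sketch below shows one generic way to get that property, by hashing the key field. It is an assumption for illustration, not Homeobox's actual uid scheme; `stable_uid`, the namespace string, and the 16-character length are hypothetical.

```python
import hashlib


def stable_uid(value: str, namespace: str = "gene_expression") -> str:
    """Derive a uid that depends only on the namespace and the key value."""
    digest = hashlib.sha256(f"{namespace}:{value}".encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability; hypothetical length


# Two "workers" that see overlapping genes in different orders.
worker_a = {g: stable_uid(g) for g in ["CD3E", "CD4", "MS4A1"]}
worker_b = {g: stable_uid(g) for g in ["MS4A1", "CD19", "CD4"]}

shared = worker_a.keys() & worker_b.keys()
assert all(worker_a[g] == worker_b[g] for g in shared)
print(sorted(shared))  # ['CD4', 'MS4A1']
```

Because the uid is a pure function of its inputs, no coordination or shared counter is needed between jobs.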

Multimodal in one row

The same shape scales to any number of modalities — declare one pointer column per feature space on a single obs schema:

class MultimodalCell(hox.HoxBaseSchema):
    # Shared obs fields
    cell_type: str | None
    tissue: str | None

    # Optional pointers — cells measured by only one assay are first-class,
    # no padding rows, no presence flags inserted at ingest.
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )
    protein_abundance: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="protein_abundance"
    )
    image_tiles: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="image_tiles"
    )

A query against this atlas streams within-row multimodal batches through a single DataLoader, regardless of how many modalities each cell has. See homeobox_examples/multimodal_perturbation_atlas/schema.py for a five-modality production schema (gene expression, chromatin accessibility, protein abundance, image features, image tiles) plus perturbation, publication, and donor tables.
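
The idea of optional pointers can be sketched without Homeobox at all. In the toy model below, each row carries `None` for modalities it lacks, and a collate step groups the pointers that are present while remembering which row they came from. `ToyCell`, `present_modalities`, and `collate` are hypothetical names, and the string keys stand in for real Zarr pointers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCell:
    """Stand-in for an obs row; each pointer is a string key or None."""
    cell_type: Optional[str] = None
    gene_expression: Optional[str] = None
    protein_abundance: Optional[str] = None
    image_tiles: Optional[str] = None


POINTER_FIELDS = ("gene_expression", "protein_abundance", "image_tiles")


def present_modalities(cell: ToyCell) -> tuple:
    """Return the names of the pointer columns that are set on this row."""
    return tuple(name for name in POINTER_FIELDS if getattr(cell, name) is not None)


def collate(cells):
    """Group pointers by modality, keeping row indices so rows stay aligned."""
    batch = {name: [] for name in POINTER_FIELDS}
    for row, cell in enumerate(cells):
        for name in present_modalities(cell):
            batch[name].append((row, getattr(cell, name)))
    return batch


cells = [
    ToyCell("T cell", gene_expression="rna/0"),
    ToyCell("B cell", gene_expression="rna/1", protein_abundance="adt/0"),
    ToyCell(None, image_tiles="img/0"),
]
print(collate(cells))
```

No padding rows are created: a cell with one modality contributes one entry, and the row index lets a consumer line modalities back up within a batch.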


Dataloaders and performance

Homeobox is intended to be the source of truth for analysis and model training, not just a staging format. The same snapshot you query can feed a PyTorch training loop.

Homeobox exposes two PyTorch dataset surfaces over the same atlas:

  • Homeobox-Iter: fully random iterable streaming. It reads large shuffled I/O blocks through a background prefetcher and slices training batches from that queue, which maximizes throughput for standard full-atlas training epochs.
  • Homeobox-Map: map-style random access. It supports __getitem__(indices) so regular PyTorch samplers, group-aware samplers, custom subsets, and perturbation-style batches can read arbitrary rows.
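
The two access patterns can be illustrated with a small standard-library sketch. It is a simplified model, not Homeobox's dataloader: it shuffles at the block level and within each block, omits the background prefetcher, and works on row indices only. `block_shuffled_batches` and `ToyMapDataset` are hypothetical names.

```python
import random


def block_shuffled_batches(n_rows, block_size, batch_size, seed=0):
    """Yield batches of row indices using block-level then in-block shuffling.

    Rows are read in contiguous blocks (cheap I/O), the block order is
    shuffled, and rows inside each loaded block are shuffled before slicing.
    """
    rng = random.Random(seed)
    starts = list(range(0, n_rows, block_size))
    rng.shuffle(starts)  # randomize which block is read next
    pending = []
    for start in starts:
        block = list(range(start, min(start + block_size, n_rows)))
        rng.shuffle(block)  # randomize order within the loaded block
        pending.extend(block)
        while len(pending) >= batch_size:
            yield pending[:batch_size]
            pending = pending[batch_size:]
    if pending:
        yield pending  # final partial batch


class ToyMapDataset:
    """Map-style access: any list of indices can be requested directly."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, indices):
        return [self.rows[i] for i in indices]


batches = list(block_shuffled_batches(n_rows=10, block_size=4, batch_size=3))
assert sorted(i for b in batches for i in b) == list(range(10))  # each row once

data = ToyMapDataset([f"cell_{i}" for i in range(10)])
print(data[[7, 2, 2]])  # ['cell_7', 'cell_2', 'cell_2']
```

The iterable form trades sampling flexibility for large sequential reads, while the map form accepts whatever indices a sampler produces at the cost of scattered reads.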

Capability summary from the benchmark suite:

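[Capability table. Columns: Map-style, Remote storage, Training-only format, Versioned snapshots, Ragged features. Rows: Homeobox-Map, Homeobox-Iter, SLAF, scDataset, AnnDataLoader, AnnLoader, BioNeMo SCDL, annbatch, TileDB-SOMA, cell-load. The per-cell marks did not survive in this text rendering.]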

In this table, "training-only format" means the data must be copied into a layout that exists only to feed a training loop; a dash is better. "Ragged features" means datasets with different feature sets can coexist without padding to a union or intersecting to common features.

On a 1M-cell × 20k-gene synthetic atlas, the Homeobox iterable dataloader (Homeobox-Iter) sustains ~70k cells/sec on local NVMe, saturating local disk, and ~40k cells/sec streaming from S3 in a single process (workers=0). In the remote table below, that is roughly 4x to 7x the throughput of the next-fastest remote-capable system, depending on batch size.

Local throughput on NVMe, cells/sec at workers=0:

System b=64 b=512 b=4096
Homeobox-Iter 69,658 73,171 72,548
annbatch 56,154 67,459 76,314
BioNeMo SCDL 5,455 72,570 66,124
scDataset 28,151 41,525 52,923
SLAF 30,118 33,374 37,940
AnnDataLoader 21,446 25,926 26,403
Homeobox-Map 9,553 22,749 25,049
TileDB-SOMA 11,268 11,972 12,153
AnnLoader 10,509 12,699 10,656

Remote throughput from S3, cells/sec at workers=0:

System b=64 b=512 b=4096
Homeobox-Iter 40,378 42,344 41,453
SLAF 3,611 4,233 10,320
TileDB-SOMA 5,873 5,845 5,945
Homeobox-Map 576 1,884 3,300
annbatch 1,050 1,314 1,594
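
The gap between Homeobox-Iter and the runner-up in the remote table can be computed directly from the numbers above. The snippet below is plain arithmetic on the table values; `speedup_over_runner_up` is a hypothetical helper name.

```python
# Remote (S3) throughput in cells/sec, copied from the table above.
remote = {
    "Homeobox-Iter": {64: 40378, 512: 42344, 4096: 41453},
    "SLAF": {64: 3611, 512: 4233, 4096: 10320},
    "TileDB-SOMA": {64: 5873, 512: 5845, 4096: 5945},
    "Homeobox-Map": {64: 576, 512: 1884, 4096: 3300},
    "annbatch": {64: 1050, 512: 1314, 4096: 1594},
}


def speedup_over_runner_up(table, system, batch_size):
    """Ratio of `system` to the fastest other system at one batch size."""
    others = {name: row[batch_size] for name, row in table.items() if name != system}
    runner_up = max(others, key=others.get)
    return runner_up, table[system][batch_size] / others[runner_up]


for b in (64, 512, 4096):
    name, ratio = speedup_over_runner_up(remote, "Homeobox-Iter", b)
    print(f"b={b}: {ratio:.1f}x over {name}")
```

The runner-up changes with batch size: TileDB-SOMA at b=64 and b=512, SLAF at b=4096.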

Perturbation-style group-aware random reads, cells/sec at workers=0:

System b=64 b=512 b=1024
Homeobox-Map 9,842 13,677 12,265
cell-load 4,936 26,678 27,096
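
A perturbation-style workload needs batches in which every row belongs to the same group. The sketch below builds such batches of row indices with the standard library only; the resulting index lists are the kind of input a map-style `__getitem__(indices)` can serve. `group_batches` is a hypothetical name and is not the sampler used in the benchmark.

```python
import random
from collections import defaultdict


def group_batches(labels, batch_size, seed=0):
    """Yield index batches in which every row shares one group label.

    `labels[i]` is the group (for example, a perturbation id) of row i.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)

    batches = []
    for indices in by_group.values():
        rng.shuffle(indices)  # random order within the group
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    rng.shuffle(batches)  # interleave groups across the epoch
    yield from batches


labels = ["ctrl", "KO_A", "ctrl", "KO_B", "KO_A", "ctrl", "ctrl"]
for batch in group_batches(labels, batch_size=2):
    assert len({labels[i] for i in batch}) == 1  # one group per batch
    print(batch, labels[batch[0]])
```

Reads of this kind are scattered across the store by construction, which is why they are benchmarked separately from full-epoch streaming.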

See docs/dataloader_benchmark.md for the full sweep across the dataloaders listed above (SLAF, scDataset, AnnDataLoader, AnnLoader, BioNeMo SCDL, annbatch, TileDB-SOMA, cell-load, and the two Homeobox surfaces), including local/remote/perturbation workloads, memory profiles, and reproducible scripts.

Download files

Source Distribution

  • homeobox-0.2.5.tar.gz (889.3 kB)

Built Distributions

  • homeobox-0.2.5-cp312-abi3-win_amd64.whl (5.6 MB): CPython 3.12+, Windows x86-64
  • homeobox-0.2.5-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB): CPython 3.12+, manylinux (glibc 2.17+) x86-64
  • homeobox-0.2.5-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.8 MB): CPython 3.12+, manylinux (glibc 2.17+) ARM64
  • homeobox-0.2.5-cp312-abi3-macosx_11_0_arm64.whl (6.0 MB): CPython 3.12+, macOS 11.0+ ARM64
  • homeobox-0.2.5-cp312-abi3-macosx_10_12_x86_64.whl (6.3 MB): CPython 3.12+, macOS 10.12+ x86-64
