Multimodal biomedical atlas builder

Homeobox

[Figure: multimodal schema with auxiliary metadata tables]

Homeobox is a database for multimodal biomedical atlases that do not fit cleanly into one matrix, one modality, or one shared feature space.

A single Homeobox atlas can hold sparse single-cell gene expression, dense protein and embedding features, 2D/3D/4D/5D images, biomolecular structures, free text, and auxiliary metadata tables. You can query it, snapshot it, reconstruct results as AnnData / MuData, and stream batches to PyTorch without creating separate ML-only copies.

Under the hood, Homeobox combines the search and versioning capabilities of LanceDB with the array storage of Zarr.


How it compares to existing tools

If your main problem is one of the following, you probably want the tool named after the colon:

  • Querying, versioning, reconstructing, and training from many heterogeneous biomedical datasets with different feature spaces: Homeobox
  • Dissatisfaction with TileDB's ML support and developer experience: Homeobox
  • Analyzing one clean matrix or a small number of aligned modalities: AnnData / MuData directly
  • Metadata, vector, or text search without large array payloads: LanceDB, a vector database, or a regular database

At a glance

Multimodal storage

  • Gene expression, chromatin accessibility, protein abundance
  • Images, image features, embeddings
  • Biomolecular structures and text

ML-ready access

  • Fully random iterable streaming for throughput
  • Map-style random access for arbitrary samplers
  • No intermediate training-only copies

Query and reconstruction

  • SQL / vector / full-text search over LanceDB metadata
  • Reconstruct query results as AnnData or MuData
  • Zarr-backed sparse and dense payloads

Reproducibility

  • Explicit snapshot() / checkout(version) lifecycle
  • Read-only atlas views for training and analysis
  • See docs/versioning.md

The Ragged Atlas

The core abstraction in Homeobox is the Ragged Atlas, which is designed to support heterogeneous datasets without shared feature spaces. Some motivating use cases are:

  • Hundreds or thousands of h5ad or h5mu files from different assays, panels, and organisms that you want to query and train on as a single collection.
  • Repositories of large images stored in Zarr / OME-Zarr, DICOM, or TIFF — 2D, 3D, or 4D, sometimes >1 TB each, with associated text descriptions.
  • Single-cell images, masks, and associated feature data (e.g. CellProfiler vectors).
  • Any combination of the above, in one queryable store.

Existing tools are optimized for a single large dataset from one modality. Homeobox's RaggedAtlas keeps a shared obs table and shared search indexes while letting each dataset retain its own feature axis.

At query time, reconstruction joins the feature spaces on the fly and returns a single AnnData / MuData with every column correctly placed.
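To make the idea of joining feature spaces concrete, here is a minimal pure-Python sketch. It assumes a union join and a toy sparse-row representation; `join_feature_spaces` and the gene panels are hypothetical and are not part of the Homeobox API, which may join differently.

```python
def join_feature_spaces(datasets):
    """Union-join per-dataset sparse rows into one dense matrix.

    `datasets` is a list of (feature_names, rows) pairs. Each row is a dict
    mapping a local column index to a nonzero value.
    """
    # Build the joined feature axis in first-seen order.
    joined = []
    position = {}
    for feature_names, _ in datasets:
        for name in feature_names:
            if name not in position:
                position[name] = len(joined)
                joined.append(name)

    matrix = []
    for feature_names, rows in datasets:
        # Map each local column index to its joined column once per dataset.
        local_to_global = [position[name] for name in feature_names]
        for row in rows:
            out = [0] * len(joined)
            for local_idx, value in row.items():
                out[local_to_global[local_idx]] = value
            matrix.append(out)
    return joined, matrix


# Two datasets with different, partially overlapping feature axes.
panel_a = (["CD3E", "CD19", "MS4A1"], [{0: 5, 2: 1}, {1: 3}])
panel_b = (["MS4A1", "NKG7"], [{0: 2, 1: 7}])

features, X = join_feature_spaces([panel_a, panel_b])
print(features)  # ['CD3E', 'CD19', 'MS4A1', 'NKG7']
print(X)         # [[5, 0, 1, 0], [0, 3, 0, 0], [0, 0, 2, 7]]
```

Each dataset keeps its own narrow feature axis at rest; the wide matrix exists only in the query result.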


Installation

Prebuilt wheels are available on PyPI. Requires Python 3.12 or newer.

pip install homeobox            # core: atlas, querying, ingestion
pip install "homeobox[ml]"      # + PyTorch dataloader
pip install "homeobox[io]"      # + S3/GCS/Azure
pip install "homeobox[viz]"     # + marimo, matplotlib
pip install "homeobox[all]"     # everything (quotes keep shells like zsh from globbing the brackets)

The quickstart below also uses scanpy to fetch a small example dataset:

pip install scanpy

To build from source (requires a Rust toolchain):

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
uv run maturin develop --release

Example Notebooks

  • explore_perturbation_atlas_colab.py (Colab): Explore an atlas with 120M+ cells, over 130,000 genetic, chemical, and biologic perturbations, and 5 modalities.

Quickstart

import numpy as np
import pandas as pd
import polars as pl
import scanpy as sc
import homeobox as hox

# 1. Define schemas: one for gene features, one for cell metadata.
#    `StableUIDField` marks `gene_symbol` as the deterministic source of
#    `uid` (so parallel ingest jobs converge on the same uid for the same
#    gene). Each pointer column is declared with `PointerField.declare`,
#    which binds the column name to a registered feature_space.
class GeneFeature(hox.FeatureBaseSchema):
    gene_symbol: str = hox.StableUIDField.declare(default=...)

class CellSchema(hox.HoxBaseSchema):
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )

# 2. Create an atlas
atlas = hox.create_or_open_atlas(
    atlas_path="./hox_example_atlas",
    obs_schemas={"cells": CellSchema},
    dataset_table_name="datasets",
    dataset_schema=hox.DatasetSchema,
    registry_schemas={"gene_expression": GeneFeature},
)

# 3. Load a dataset
adata = sc.datasets.pbmc3k()  # 2,700 PBMCs, raw counts, sparse CSR
adata.X = adata.X.astype(np.uint32)  # the counts layer must be np.uint32

# 4. Build the var DataFrame (one row per local feature, columns matching
#    the registry schema + `uid`), use it for both feature registration and
#    as adata.var. `compute_stable_uids` writes deterministic uids in place.
var_df = pd.DataFrame(
    {"gene_symbol": adata.var_names.tolist()},
    index=adata.var_names,
)
GeneFeature.compute_stable_uids(var_df)
atlas.register_features("gene_expression", pl.from_pandas(var_df))
adata.var = var_df

# 5. Ingest. `field_name` selects the cell-schema column to populate;
#    its feature_space is resolved from PointerField.declare.
record = hox.DatasetSchema(
    zarr_group="pbmc3k", feature_space="gene_expression", n_rows=adata.n_obs,
)
hox.add_from_anndata(
    atlas, adata, field_name="gene_expression",
    zarr_layer="counts", dataset_record=record,
)

# 6. Optimize tables and create a snapshot
atlas.optimize()
atlas.snapshot()

# 7. Open the atlas and query
atlas_r = hox.RaggedAtlas.checkout_latest("./hox_example_atlas")
result = atlas_r.query().limit(500).to_anndata()
print(result)  # AnnData object with n_obs × n_vars = 500 × 32738
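
The quickstart comments describe `uid` as a deterministic function of `gene_symbol`, so that parallel ingest jobs agree on it. The sketch below illustrates that property only. It is not how `StableUIDField` is implemented; `GENE_NAMESPACE` and `stable_uid` are hypothetical names, and name-based UUIDs are just one way to get the behavior.

```python
import uuid

# Hypothetical fixed namespace. Every ingest job must share the same value.
GENE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "genes.example.invalid")


def stable_uid(gene_symbol: str) -> str:
    """Derive a uid from the symbol alone, with no shared counter or lock."""
    return uuid.uuid5(GENE_NAMESPACE, gene_symbol).hex


# Two independent "jobs" see overlapping genes in different orders.
job_a = {symbol: stable_uid(symbol) for symbol in ["CD3E", "MS4A1", "NKG7"]}
job_b = {symbol: stable_uid(symbol) for symbol in ["NKG7", "CD3E"]}

# Shared genes get identical uids, so the two registries merge without conflicts.
shared = job_a.keys() & job_b.keys()
print(all(job_a[s] == job_b[s] for s in shared))  # True
```

Because the uid depends only on the input value, no coordination between workers is needed at ingest time.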

Multimodal in one row

The same shape scales to any number of modalities — declare one pointer column per feature space on a single obs schema:

class MultimodalCell(hox.HoxBaseSchema):
    # Shared obs fields
    cell_type: str | None
    tissue: str | None

    # Optional pointers — cells measured by only one assay are first-class,
    # no padding rows, no presence flags inserted at ingest.
    gene_expression: hox.SparseZarrPointer | None = hox.PointerField.declare(
        feature_space="gene_expression"
    )
    protein_abundance: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="protein_abundance"
    )
    image_tiles: hox.DenseZarrPointer | None = hox.PointerField.declare(
        feature_space="image_tiles"
    )

A query against this atlas streams within-row multimodal batches through a single DataLoader, regardless of how many modalities each cell has. See homeobox_examples/multimodal_perturbation_atlas/schema.py for a five-modality production schema (gene expression, chromatin accessibility, protein abundance, image features, image tiles) plus perturbation, publication, and donor tables.
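
As a rough picture of what optional pointers imply for batching, the sketch below collates rows where some modalities are missing. It uses plain dataclasses and strings as stand-ins for pointers; `CellRow`, `MODALITIES`, and `collate` are hypothetical and do not reflect the actual Homeobox DataLoader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CellRow:
    cell_type: Optional[str]
    gene_expression: Optional[str] = None    # stand-in for a sparse pointer
    protein_abundance: Optional[str] = None  # stand-in for a dense pointer


MODALITIES = ("gene_expression", "protein_abundance")


def collate(rows):
    """Group a mixed batch by modality, keeping only rows that have it."""
    batch = {}
    for name in MODALITIES:
        present = [i for i, row in enumerate(rows) if getattr(row, name) is not None]
        batch[name] = {
            "row_index": present,  # positions within the batch
            "pointers": [getattr(rows[i], name) for i in present],
        }
    return batch


rows = [
    CellRow("T cell", gene_expression="ge/0"),
    CellRow(None, protein_abundance="pa/0"),
    CellRow("B cell", gene_expression="ge/1", protein_abundance="pa/1"),
]
batch = collate(rows)
print(batch["gene_expression"]["row_index"])    # [0, 2]
print(batch["protein_abundance"]["row_index"])  # [1, 2]
```

Missing modalities are simply absent from the per-modality index, so no padding rows are created.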


Dataloaders and performance

Homeobox is intended to be the source of truth for analysis and model training, not just a staging format. The same snapshot you query can feed a PyTorch training loop.

Homeobox exposes two PyTorch dataset surfaces over the same atlas:

  • Homeobox-Iter: fully random iterable streaming. It reads large shuffled I/O blocks through a background prefetcher and slices training batches from that queue, which maximizes throughput for standard full-atlas training epochs.
  • Homeobox-Map: map-style random access. It supports __getitem__(indices) so regular PyTorch samplers, group-aware samplers, custom subsets, and perturbation-style batches can read arbitrary rows.
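
The two access patterns can be sketched without PyTorch. The classes below are hypothetical and only mirror the iterable and map-style protocols. The stream shuffles at block level and omits the background prefetcher, so it is a simplification of the behavior described above, not Homeobox's implementation.

```python
import random


class BlockShuffleStream:
    """Iterable style: visit blocks in random order, shuffle inside each block."""

    def __init__(self, n_rows, block_size, batch_size, seed=0):
        self.n_rows = n_rows
        self.block_size = block_size
        self.batch_size = batch_size
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        starts = list(range(0, self.n_rows, self.block_size))
        rng.shuffle(starts)  # random block order
        for start in starts:
            # One contiguous read per block, then shuffle rows in memory.
            block = list(range(start, min(start + self.block_size, self.n_rows)))
            rng.shuffle(block)
            for i in range(0, len(block), self.batch_size):
                yield block[i:i + self.batch_size]


class MapStyleRows:
    """Map style: any sampler can request an arbitrary list of row indices."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, indices):
        return [self.rows[i] for i in indices]


stream_batches = list(BlockShuffleStream(n_rows=10, block_size=4, batch_size=2))
print(sorted(i for b in stream_batches for i in b) == list(range(10)))  # True

table = MapStyleRows([f"cell_{i}" for i in range(10)])
print(table[[7, 0, 3]])  # ['cell_7', 'cell_0', 'cell_3']
```

The iterable form trades sampler flexibility for large sequential reads; the map form pays per-request cost in exchange for arbitrary index sets.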

Capability summary from the benchmark suite:

[Table: capability matrix. Rows: Homeobox-Map, Homeobox-Iter, SLAF, scDataset, AnnDataLoader, AnnLoader, BioNeMo SCDL, annbatch, TileDB-SOMA, cell-load. Columns: Map-style, Remote storage, Training-only format, Versioned snapshots, Ragged features. Per-cell marks are not preserved here.]

In this table, "training-only format" means the data must be copied into a layout that exists only to feed a training loop; a dash is better. "Ragged features" means datasets with different feature sets can coexist without padding to a union or intersecting to common features.

On a 1M-cell × 20k-gene synthetic atlas, the Homeobox iterable dataloader sustains ~70k cells/sec on local NVMe, saturating local disk, and ~40k cells/sec streaming from S3, both at workers=0. In the S3 results below, that is roughly 4x to 7x the throughput of the next-fastest system, depending on batch size.

Local throughput on NVMe, cells/sec at workers=0:

System           b=64    b=512   b=4096
Homeobox-Iter  69,658   73,171   72,548
annbatch       56,154   67,459   76,314
BioNeMo SCDL    5,455   72,570   66,124
scDataset      28,151   41,525   52,923
SLAF           30,118   33,374   37,940
AnnDataLoader  21,446   25,926   26,403
Homeobox-Map    9,553   22,749   25,049
TileDB-SOMA    11,268   11,972   12,153
AnnLoader      10,509   12,699   10,656

Remote throughput from S3, cells/sec at workers=0:

System           b=64    b=512   b=4096
Homeobox-Iter  40,378   42,344   41,453
SLAF            3,611    4,233   10,320
TileDB-SOMA     5,873    5,845    5,945
Homeobox-Map      576    1,884    3,300
annbatch        1,050    1,314    1,594
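
For readers who want the relative numbers, the short script below computes the ratio between Homeobox-Iter and the fastest other system at each batch size, using only the S3 figures from the table above. The helper name `speedup_over_runner_up` is ours.

```python
# Cells/sec from the S3 table above, keyed by batch size.
REMOTE_CELLS_PER_SEC = {
    "Homeobox-Iter": {64: 40378, 512: 42344, 4096: 41453},
    "SLAF": {64: 3611, 512: 4233, 4096: 10320},
    "TileDB-SOMA": {64: 5873, 512: 5845, 4096: 5945},
    "Homeobox-Map": {64: 576, 512: 1884, 4096: 3300},
    "annbatch": {64: 1050, 512: 1314, 4096: 1594},
}


def speedup_over_runner_up(table, system, batch_size):
    """Ratio of `system` to the fastest other system at one batch size."""
    ours = table[system][batch_size]
    runner_up = max(
        rates[batch_size] for name, rates in table.items() if name != system
    )
    return ours / runner_up


for b in (64, 512, 4096):
    ratio = speedup_over_runner_up(REMOTE_CELLS_PER_SEC, "Homeobox-Iter", b)
    print(b, round(ratio, 1))
# 64 6.9
# 512 7.2
# 4096 4.0
```

The runner-up is TileDB-SOMA at the two smaller batch sizes and SLAF at b=4096.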

Perturbation-style group-aware random reads, cells/sec at workers=0:

System           b=64    b=512   b=1024
Homeobox-Map    9,842   13,677   12,265
cell-load       4,936   26,678   27,096
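
A group-aware sampler is the kind of consumer that needs map-style access. The sketch below assumes "group-aware" means every batch is drawn from a single perturbation group; that reading, the function `group_batches`, and the toy labels are our assumptions rather than the benchmark's definition.

```python
import random
from collections import defaultdict


def group_batches(groups, batch_size, seed=0):
    """Return index batches in which all rows share one group label."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    batches = []
    for indices in by_group.values():
        rng.shuffle(indices)  # random order within the group
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])
    rng.shuffle(batches)  # interleave groups across the epoch
    return batches


labels = ["ctrl", "KO_A", "ctrl", "KO_B", "KO_A", "ctrl"]
batches = group_batches(labels, batch_size=2)
print(len(batches))  # 4
print(all(len({labels[i] for i in b}) == 1 for b in batches))  # True
```

Each batch is an arbitrary, non-contiguous index list, which is the access pattern a `__getitem__(indices)` surface is meant to serve.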

See docs/dataloader_benchmark.md for the full sweep across the systems listed above (the two Homeobox surfaces, SLAF, scDataset, AnnDataLoader, AnnLoader, BioNeMo SCDL, annbatch, TileDB-SOMA, and cell-load), including local/remote/perturbation workloads, memory profiles, and reproducible scripts.

Download files

Source Distribution

  • homeobox-0.2.8.tar.gz (925.5 kB): source

Built Distributions

  • homeobox-0.2.8-cp312-abi3-win_amd64.whl (5.6 MB): CPython 3.12+, Windows x86-64
  • homeobox-0.2.8-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB): CPython 3.12+, manylinux (glibc 2.17+) x86-64
  • homeobox-0.2.8-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.8 MB): CPython 3.12+, manylinux (glibc 2.17+) ARM64
  • homeobox-0.2.8-cp312-abi3-macosx_11_0_arm64.whl (6.0 MB): CPython 3.12+, macOS 11.0+ ARM64
  • homeobox-0.2.8-cp312-abi3-macosx_10_12_x86_64.whl (6.3 MB): CPython 3.12+, macOS 10.12+ x86-64

File details

Details for the file homeobox-0.2.8.tar.gz.

File metadata

  • Download URL: homeobox-0.2.8.tar.gz
  • Upload date:
  • Size: 925.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for homeobox-0.2.8.tar.gz
Algorithm Hash digest
SHA256 66644a7652419087f71eb57e32d9395dc0b879b58a897f69058c969e7487bcc2
MD5 2765eedc4bd98ddf682cb2244424417d
BLAKE2b-256 9f316e45e28a73e5f12085df9e1c1041a8c77b39f182048ba0ce309782de76ca

Provenance

The following attestation bundles were made for homeobox-0.2.8.tar.gz:

Publisher: release.yml on epiblastai/homeobox

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
