
A package for biological embeddings in the perturbation experimental space


embpy


embpy is a Python toolkit for generating biological embeddings with one unified API.

Use it to embed genes, proteins, small molecules, morphology perturbations, and single cells; annotate the resulting objects; and compare embeddings with scverse-friendly plotting and analysis utilities.

[Figure: embpy architecture]

What embpy Does

  • Embeds biological entities through BioEmbedder.embed(...).
  • Resolves biological identifiers into model-ready inputs, such as gene sequences, protein sequences, SMILES strings, and morphology images.
  • Returns AnnData, tables, or payloads with provenance and canonical IDs.
  • Stores generated embeddings outside .X, using .obsm, .varm, or .uns according to the entity type.
  • Adds real metadata annotations for genes, proteins, molecules, and cell lines.
  • Provides plotting and comparison helpers for embedding quality checks.
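As a rough illustration of what comparing embeddings involves, the sketch below computes cosine similarity between plain Python vectors. It is not embpy's comparison helper; the function name `cosine_similarity` and the toy vectors are assumptions made for this example only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values, not model output).
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing the same way score close to 1, and orthogonal vectors score 0, regardless of vector length.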

Install

Pixi is recommended for development and GPU work:

pixi install -e default
pixi run -e default verify

For a pip install:

pip install embpy

For optional GPU/model extras, see the technical guide.
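After either install route, a quick check from Python can confirm that the package is visible in the active environment. This is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption for this example and not part of embpy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


embpy_version = installed_version("embpy")
if embpy_version is None:
    print("embpy is not installed in this environment")
else:
    print(f"embpy {embpy_version} is installed")
```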

Quick Start

from embpy import BioEmbedder

embedder = BioEmbedder(device="auto", organism="human")

Embed genes with multiple model families:

gene_adata = embedder.embed(
    ["TP53", "EGFR", "MYC"],
    entity_type="gene",
    id_type="symbol",
    model=["hyenadna_tiny_1k", "esm2_8M", "minilm_l6_v2"],
    output="anndata",
)

gene_adata.varm.keys()
gene_adata.uns["embeddings"].keys()

Embed gene perturbation labels as row-aligned action embeddings:

# pert_adata.obs["perturbation"] contains symbols such as TP53/MYC.
pert_adata = embedder.embed(
    pert_adata,
    entity_type="gene",
    obs_column="perturbation",
    id_type="symbol",
    model="esm2_650M",
    output="anndata",
    is_perturbation=True,
    key="X_pert_esm2_650M",
)

pert_adata.obsm["X_pert_esm2_650M"]
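One way to picture "row-aligned" is that each unique perturbation label has a single embedding, and every row receives the embedding for its own label. The sketch below shows that lookup with plain Python lists. It is a conceptual illustration only, not how embpy implements it; `row_align` and the toy vectors are hypothetical.

```python
def row_align(labels, entity_embeddings):
    """Return one embedding per row, looked up by that row's label."""
    missing = sorted(set(labels) - set(entity_embeddings))
    if missing:
        raise KeyError(f"no embedding for labels: {missing}")
    # Copy each vector so rows do not share the same list object.
    return [list(entity_embeddings[label]) for label in labels]


# Per-row perturbation labels, as they might appear in an obs column.
labels = ["TP53", "MYC", "TP53"]

# One toy vector per unique label (hypothetical values, not model output).
entity_embeddings = {"TP53": [0.1, 0.2], "MYC": [0.3, 0.4]}

rows = row_align(labels, entity_embeddings)
```

The result has as many rows as there are labels, so repeated labels produce repeated vectors, which is the shape an `.obsm` entry requires.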

Embed proteins:

protein_adata = embedder.embed(
    ["TP53", "EGFR", "BRCA1"],
    entity_type="protein",
    id_type="symbol",
    model="esm2_8M",
    output="anndata",
)

Embed small molecules:

smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
]

molecule_adata = embedder.embed(
    smiles,
    entity_type="molecule",
    id_type="smiles",
    model="morgan_fp",
    output="anndata",
    key="X_morgan_fp",
)

Embed cells from AnnData with model-aware preprocessing:

# adata is an existing AnnData object holding your single-cell data.
cell_adata = embedder.embed(
    adata,
    entity_type="cell",
    model="pca",
    preprocessing="auto",
    output="anndata",
    key="X_pca",
)

cell_adata.obsm["X_pca"]
cell_adata.uns["embpy_cell_embeddings"]

Annotate and plot:

from embpy import tl, pl

molecule_adata.obs["smiles"] = molecule_adata.obs_names
molecule_adata = tl.annotate_molecules(
    molecule_adata,
    column="smiles",
    sources=["structural", "bioactivity", "ontology"],
)

pl.plot_embedding_space(
    molecule_adata,
    obsm_key="X_morgan_fp",
    method="pca",
    color="mol_logp",
)

Tutorials

The tutorials are organized by biological entity.

Each notebook uses real BioEmbedder.embed(...) calls, real annotation APIs, and embpy plotting/comparison utilities.

Model Families

embpy supports models across:

  • DNA and regulatory sequence models
  • protein language and structure models
  • small-molecule fingerprints and chemical language models
  • single-cell foundation models and classical baselines
  • morphology models for HPA and JUMP-style images
  • text models for biological descriptions

Use:

embedder.list_available_models()

for the model keys available in your environment.
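Because the available keys depend on the environment, it can help to fall back to a smaller model when a preferred one is missing. The sketch below assumes that `list_available_models()` yields model keys as strings, which the text above does not confirm; `pick_model` is a hypothetical helper and works on any iterable of strings.

```python
def pick_model(available, preferred):
    """Return the first key in `preferred` that also appears in `available`."""
    available_set = set(available)
    for key in preferred:
        if key in available_set:
            return key
    raise LookupError(f"none of {list(preferred)} are available")


# Stand-in for the keys reported by your environment (hypothetical subset).
available = ["esm2_8M", "morgan_fp", "pca"]

# Prefer the larger protein model, fall back to the smaller one.
model_key = pick_model(available, ["esm2_650M", "esm2_8M"])
```

The chosen key could then be passed as the `model` argument of `embedder.embed(...)`.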

Output Contract

BioEmbedder.embed(...) follows a scverse-friendly output contract:

  • genes are feature-like and live in .varm by default
  • gene perturbation labels use is_perturbation=True and live in .obsm
  • proteins are feature-like and live in .varm
  • molecules, text, sequences, and cells are observation-like and live in .obsm
  • perturbation/action embeddings can be kept entity-aligned in .uns
  • .X remains expression/count-like data or a sparse placeholder

See the technical guide for the full contract.
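The defaults listed above can be summarized as a small lookup table. The sketch below only restates those bullet points in code. It is not part of embpy, `default_slot` is a hypothetical helper, and the `"text"` and `"sequence"` entity-type strings are assumptions, since the Quick Start only shows `"gene"`, `"protein"`, `"molecule"`, and `"cell"`.

```python
# Default AnnData slot per entity type, as described in the list above.
DEFAULT_SLOT = {
    "gene": "varm",      # feature-like
    "protein": "varm",   # feature-like
    "molecule": "obsm",  # observation-like
    "text": "obsm",      # observation-like (entity-type string assumed)
    "sequence": "obsm",  # observation-like (entity-type string assumed)
    "cell": "obsm",      # observation-like
}


def default_slot(entity_type, is_perturbation=False):
    """Return the AnnData slot name where an embedding lands by default."""
    if entity_type not in DEFAULT_SLOT:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # Gene perturbation labels are row-aligned, so they move to .obsm.
    if entity_type == "gene" and is_perturbation:
        return "obsm"
    return DEFAULT_SLOT[entity_type]


gene_slot = default_slot("gene")
pert_slot = default_slot("gene", is_perturbation=True)
```

Entity-aligned copies in `.uns` and the contents of `.X` are left out, because the list above describes them as options rather than defaults.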

Documentation

Citation

If you use embpy in your work, please cite the repository for now. A formal citation will be added when the package is released.

Contact

For questions, issues, or feature requests, open a GitHub issue or contact the maintainers listed in the package metadata.


