A package for biological embeddings in the perturbation experimental space

embpy

embpy is a Python toolkit for generating biological embeddings with one unified API.

Use it to embed genes, proteins, small molecules, morphology perturbations, and single cells; annotate the resulting objects; and compare embeddings with scverse-friendly plotting and analysis utilities.

[Figure: embpy architecture]

What embpy Does

  • Embeds biological entities through BioEmbedder.embed(...).
  • Resolves biological identifiers into model-ready inputs, such as gene sequences, protein sequences, SMILES strings, and morphology images.
  • Returns AnnData, tables, or payloads with provenance and canonical IDs.
  • Stores generated embeddings outside .X, using .obsm, .varm, or .uns according to the entity type.
  • Annotates genes, proteins, molecules, and cell lines with metadata.
  • Provides plotting and comparison helpers for embedding quality checks.

Install

Pixi is recommended for development and GPU work:

pixi install -e default
pixi run -e default verify

To install with pip:

pip install embpy

For optional GPU/model extras, see the technical guide.

Quick Start

from embpy import BioEmbedder

embedder = BioEmbedder(device="auto", organism="human")

Embed genes with multiple model families:

gene_adata = embedder.embed(
    ["TP53", "EGFR", "MYC"],
    entity_type="gene",
    id_type="symbol",
    model=["hyenadna_tiny_1k", "esm2_8M", "minilm_l6_v2"],
    output="anndata",
)

gene_adata.varm.keys()
gene_adata.uns["embeddings"].keys()

Embed gene perturbation labels as row-aligned action embeddings:

# pert_adata is an existing AnnData object whose obs["perturbation"] column
# holds gene symbols such as TP53 or MYC.
pert_adata = embedder.embed(
    pert_adata,
    entity_type="gene",
    obs_column="perturbation",
    id_type="symbol",
    model="esm2_650M",
    output="anndata",
    is_perturbation=True,
    key="X_pert_esm2_650M",
)

pert_adata.obsm["X_pert_esm2_650M"]
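
To make "row-aligned" concrete: each observation row receives the embedding of the label in its perturbation column, so rows that share a label share a vector. The sketch below illustrates that lookup with plain Python lists. `align_to_rows` and the toy vectors are hypothetical and are not part of embpy, whose internals may differ.

```python
def align_to_rows(labels, entity_embeddings):
    """Return one embedding per row by looking up each row's label.

    Hypothetical illustration only; not embpy's implementation.
    """
    missing = sorted(set(labels) - set(entity_embeddings))
    if missing:
        raise KeyError(f"no embedding for labels: {missing}")
    # Copy each vector so rows do not share mutable state.
    return [list(entity_embeddings[label]) for label in labels]


# Toy 2-dimensional vectors standing in for real model output.
entity_embeddings = {"TP53": [0.1, 0.9], "MYC": [0.7, 0.2]}
row_labels = ["TP53", "MYC", "TP53"]  # e.g. values of obs["perturbation"]

row_aligned = align_to_rows(row_labels, entity_embeddings)
print(len(row_aligned))  # one vector per row
```

The result has as many entries as there are rows, which is the shape a matrix stored in .obsm needs.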

Embed proteins:

protein_adata = embedder.embed(
    ["TP53", "EGFR", "BRCA1"],
    entity_type="protein",
    id_type="symbol",
    model="esm2_8M",
    output="anndata",
)

Embed small molecules:

smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
]

molecule_adata = embedder.embed(
    smiles,
    entity_type="molecule",
    id_type="smiles",
    model="morgan_fp",
    output="anndata",
    key="X_morgan_fp",
)

Embed cells from AnnData with model-aware preprocessing:

# adata is an existing AnnData object with your single-cell data.
cell_adata = embedder.embed(
    adata,
    entity_type="cell",
    model="pca",
    preprocessing="auto",
    output="anndata",
    key="X_pca",
)

cell_adata.obsm["X_pca"]
cell_adata.uns["embpy_cell_embeddings"]

Annotate and plot:

from embpy import tl, pl

molecule_adata.obs["smiles"] = molecule_adata.obs_names
molecule_adata = tl.annotate_molecules(
    molecule_adata,
    column="smiles",
    sources=["structural", "bioactivity", "ontology"],
)

pl.plot_embedding_space(
    molecule_adata,
    obsm_key="X_morgan_fp",
    method="pca",
    color="mol_logp",
)

Tutorials

The tutorials are organized by biological entity. Each notebook runs actual BioEmbedder.embed(...) calls and annotation APIs, along with embpy plotting and comparison utilities.

Model Families

embpy supports models across:

  • DNA and regulatory sequence models
  • protein language and structure models
  • small-molecule fingerprints and chemical language models
  • single-cell foundation models and classical baselines
  • morphology models for HPA and JUMP-style images
  • text models for biological descriptions

Use:

embedder.list_available_models()

for the model keys available in your environment.
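
Because the available keys depend on the environment, it can help to check the requested keys before starting a long embedding run. A minimal sketch, assuming `list_available_models()` returns an iterable of string keys (the return type is not documented here). `missing_models` is a hypothetical helper, not an embpy function.

```python
def missing_models(requested, available):
    """Return the requested model keys that are not in `available`, in order."""
    available_set = set(available)
    return [key for key in requested if key not in available_set]


# Stand-in for embedder.list_available_models(); the real list will differ.
available = ["esm2_8M", "morgan_fp", "pca"]
requested = ["esm2_8M", "hyenadna_tiny_1k"]

unavailable = missing_models(requested, available)
if unavailable:
    print(f"Not available in this environment: {unavailable}")
```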

Output Contract

BioEmbedder.embed(...) follows a scverse-friendly output contract:

  • genes are feature-like and live in .varm by default
  • gene perturbation labels use is_perturbation=True and live in .obsm
  • proteins are feature-like and live in .varm
  • molecules, text, sequences, and cells are observation-like and live in .obsm
  • perturbation/action embeddings can be kept entity-aligned in .uns
  • .X remains expression/count-like data or a sparse placeholder
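
The default placements listed above can be summarized as a small lookup. This is a reading aid under stated assumptions, not embpy code: `default_slot` is a hypothetical name, and the optional entity-aligned .uns placement for perturbation embeddings is left out.

```python
# Default AnnData slot per entity type, as stated in the contract above.
DEFAULT_SLOTS = {
    "gene": "varm",
    "protein": "varm",
    "molecule": "obsm",
    "text": "obsm",
    "sequence": "obsm",
    "cell": "obsm",
}


def default_slot(entity_type, is_perturbation=False):
    """Return the AnnData attribute where embeddings land by default."""
    if entity_type == "gene" and is_perturbation:
        # Gene perturbation labels are row-aligned, so they go in .obsm.
        return "obsm"
    return DEFAULT_SLOTS[entity_type]


print(default_slot("gene"))                        # varm
print(default_slot("gene", is_perturbation=True))  # obsm
```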

See the technical guide for the full contract.

Documentation

Citation

If you use embpy in your work, please cite the repository for now. A formal citation will be added when the package is released.

Contact

For questions, issues, or feature requests, open a GitHub issue or contact the maintainers listed in the package metadata.

