Protein Embedding Model for Structure Search

Maintainers

joan.segura


RCSB Embedding Model

Version 0.0.51

Overview

RCSB Embedding Model is a neural network architecture designed to encode macromolecular 3D structures into fixed-length vector embeddings for efficient large-scale structure similarity search.

Preprint: Multi-scale structural similarity embedding search across entire proteomes.

A web-based implementation using this model for structure similarity search is available at rcsb-embedding-search.

If you are interested in training the model with a new dataset, visit the rcsb-embedding-search repository, which provides scripts and documentation for training.

Features

Residue-level embeddings computed using the ESM3 protein language model
Structure-level embeddings aggregated via a transformer-based aggregator network
Command-line interface implemented with Typer for high-throughput inference workflows
Python API for interactive embedding computation and integration into analysis pipelines
High-performance inference leveraging PyTorch Lightning, with multi-node and multi-GPU support

Installation

From PyPI

pip install rcsb-embedding-model

From Source (Development)

git clone https://github.com/rcsb/rcsb-embedding-model.git
cd rcsb-embedding-model
pip install -e .

Requirements:

Python ≥ 3.12
ESM 3.2.3
Lightning 2.6.1
Typer 0.24.1
Biotite 1.6.0
FAISS 1.13.2
igraph 1.0.0
leidenalg 0.11.0
PyTorch with CUDA support (recommended for GPU acceleration)

Optional Dependencies:

faiss-gpu for GPU-accelerated similarity search (instead of faiss-cpu)

Download Pre-trained Models

Before using the package, download the pre-trained ESM3 and aggregator models:

inference download-models

Usage

The package provides two main interfaces:

Command-line Interface (CLI) for batch processing and high-throughput workflows
Python API for interactive use and integration into custom pipelines

Command-Line Interface (CLI)

The CLI provides two main command groups: inference for computing embeddings and search for similarity search operations.

Inference Commands

`inference residue-embedding`

Calculate residue-level embeddings using ESM3. Outputs are stored as PyTorch tensor files.

inference residue-embedding \
  --src-file data/structures.csv \
  --output-path results/residue_embeddings \
  --structure-format mmcif \
  --batch-size 8 \
  --devices auto

Key Options:

--src-file: CSV file with 4 columns: Structure Name | File Path/URL | Chain ID | Output Name
--output-path: Directory to store tensor files
--output-format: separated (individual files) or grouped (single JSON)
--output-name: Filename when using grouped format (default: inference)
--structure-format: mmcif, binarycif, or pdb
--min-res-n: Minimum residue count for chain filtering (default: 0)
--batch-size: Batch size for processing (default: 1)
--num-workers: Data loader workers (default: 0)
--num-nodes: Number of nodes for distributed inference (default: 1)
--accelerator: Device type - auto, cpu, cuda, gpu (default: auto)
--devices: Device indices (can specify multiple with --devices 0 --devices 1) or auto
--strategy: Lightning distribution strategy (default: auto)
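
The four-column `--src-file` layout described above can be generated from Python. This is a minimal sketch, assuming a comma-delimited file with no header row (neither detail is stated here); the structure names, file paths, and output names are hypothetical placeholders.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical entries: (structure name, file path or URL, chain ID, output name).
rows = [
    ("1abc", "data/1abc.cif", "A", "1abc.A"),
    ("1abc", "data/1abc.cif", "B", "1abc.B"),
    ("2xyz", "data/2xyz.cif", "A", "2xyz.A"),
]


def write_src_file(path, entries):
    """Write one row per chain; assumes comma delimiter and no header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for entry in entries:
            if len(entry) != 4:
                raise ValueError(f"expected 4 columns, got {len(entry)}: {entry}")
            writer.writerow(entry)


# Written to a temporary directory so the sketch leaves no files behind in the project.
src_file = Path(tempfile.mkdtemp()) / "structures.csv"
write_src_file(src_file, rows)
print(src_file.read_text())
```

The resulting file can then be passed to `--src-file`. If the installed version expects a header or a different delimiter, adjust `write_src_file` accordingly.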

`inference structure-embedding`

Calculate complete structure embeddings (residue + aggregator) from structural files. Outputs are stored as a single DataFrame.

inference structure-embedding \
  --src-file data/structures.csv \
  --output-path results/structure_embeddings \
  --output-name embeddings \
  --batch-size 4 \
  --devices 0 --devices 1

Key Options:

Same as residue-embedding, plus:
--output-name: Output DataFrame filename (default: inference)

`inference chain-embedding`

Aggregate residue embeddings into chain-level embeddings. Requires pre-computed residue embeddings.

inference chain-embedding \
  --src-file data/structures.csv \
  --res-embedding-location results/residue_embeddings \
  --output-path results/chain_embeddings \
  --batch-size 4

Key Options:

--res-embedding-location: Directory containing residue embedding tensor files
All other options are similar to those of residue-embedding

`inference assembly-embedding`

Aggregate residue embeddings into assembly-level embeddings.

inference assembly-embedding \
  --src-file data/assemblies.csv \
  --res-embedding-location results/residue_embeddings \
  --output-path results/assembly_embeddings \
  --min-res-n 10 \
  --max-res-n 10000

Key Options:

--src-file: CSV with columns: Structure Name | File Path/URL | Assembly ID | Output Name
--res-embedding-location: Directory with pre-computed residue embeddings
--min-res-n: Minimum residues per chain (default: 0)
--max-res-n: Maximum total residues for assembly (default: unlimited)

`inference complete-embedding`

End-to-end pipeline: compute residue, chain, and assembly embeddings in one command.

inference complete-embedding \
  --src-chain-file data/chains.csv \
  --src-assembly-file data/assemblies.csv \
  --output-res-path results/residues \
  --output-chain-path results/chains \
  --output-assembly-path results/assemblies \
  --batch-size-res 8 \
  --batch-size-chain 4 \
  --batch-size-assembly 2

Key Options:

--src-chain-file: Chain input CSV
--src-assembly-file: Assembly input CSV
--output-res-path, --output-chain-path, --output-assembly-path: Output directories
--batch-size-res, --num-workers-res, --num-nodes-res: Residue embedding settings
--batch-size-chain, --num-workers-chain: Chain embedding settings
--batch-size-assembly, --num-workers-assembly, --num-nodes-assembly: Assembly settings
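
When the pipeline is launched from a script rather than a shell, the same invocation can be assembled as an argument list. The sketch below only builds and prints the command; it uses the flags listed above and nothing else, and `build_command` is a hypothetical helper, not part of the package.

```python
import shlex

# Option values mirror the example invocation; adjust paths to your own data.
options = {
    "--src-chain-file": "data/chains.csv",
    "--src-assembly-file": "data/assemblies.csv",
    "--output-res-path": "results/residues",
    "--output-chain-path": "results/chains",
    "--output-assembly-path": "results/assemblies",
    "--batch-size-res": 8,
    "--batch-size-chain": 4,
    "--batch-size-assembly": 2,
}


def build_command(opts):
    """Flatten an option mapping into an argument list suitable for subprocess.run."""
    args = ["inference", "complete-embedding"]
    for flag, value in opts.items():
        args += [flag, str(value)]
    return args


command = build_command(options)
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.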

`inference download-models`

Download ESM3 and aggregator models from Hugging Face.

inference download-models

Search Commands

`search build-db`

Build a FAISS database from structure files for similarity search.

search build-db \
  --structure-dir data/pdb_files \
  --output-db databases/my_structures \
  --tmp-dir tmp \
  --granularity chain \
  --min-res 10 \
  --use-gpu-index

Key Options:

--structure-dir: Directory containing structure files
--output-db: Database path (prefix for .index and .metadata files)
--tmp-dir: Temporary directory for intermediate files
--structure-format: mmcif, binarycif, or pdb
--granularity: chain or assembly level embeddings
--file-extension: Filter files by extension (e.g., .cif, .bcif, .pdb)
--min-res: Minimum residue count (default: 10)
--use-gpu-index: Use GPU for FAISS index construction
--accelerator, --devices, --strategy: Inference device settings
--batch-size-res, --num-workers-res, --num-nodes-res: Residue embedding settings
--batch-size-aggregator, --num-workers-aggregator, --num-nodes-aggregator: Aggregator settings
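
Because `--output-db` is a prefix rather than a single file, a script may want to confirm that both files exist before running a query. A minimal sketch, assuming the files are named exactly `<prefix>.index` and `<prefix>.metadata`; both helper functions are hypothetical.

```python
from pathlib import Path


def database_files(db_prefix):
    """Return the two paths assumed to make up a database: <prefix>.index and <prefix>.metadata."""
    prefix = str(db_prefix)
    return Path(prefix + ".index"), Path(prefix + ".metadata")


def database_exists(db_prefix):
    """True only if both assumed files are present on disk."""
    return all(path.is_file() for path in database_files(db_prefix))


index_path, metadata_path = database_files("databases/my_structures")
print(index_path)
print(metadata_path)
print(database_exists("databases/my_structures"))
```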

`search query`

Search the database for structures similar to a query structure.

search query \
  --db-path databases/my_structures \
  --query-structure query.cif \
  --structure-format mmcif \
  --granularity chain \
  --top-k 100 \
  --threshold 0.8 \
  --output-csv results.csv

Key Options:

--db-path: Path to FAISS database
--query-structure: Query structure file
--structure-format: mmcif or pdb
--granularity: chain or assembly search mode
--chain-id: Specific chain to search (optional)
--assembly-id: Specific assembly ID (optional)
--top-k: Number of results per query (default: 100)
--threshold: Minimum similarity score (default: 0.8)
--output-csv: Export results to CSV (optional)
--min-res: Minimum residue filter (default: 10)
--max-res: Maximum residue filter (optional)
--device: cuda, cpu, or auto
--use-gpu-index: Use GPU for FAISS search
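
The interaction of `--top-k` and `--threshold` can be illustrated with a brute-force search over toy vectors. This is a sketch of the filtering semantics only, assuming the similarity score is the cosine similarity between embeddings; it is not the FAISS-based implementation, and the entry names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def search(query, database, top_k=100, threshold=0.8):
    """Keep hits scoring >= threshold, best first, and return at most top_k of them."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:top_k]


# Toy 3-dimensional vectors standing in for 1,536-dimensional embeddings.
database = {
    "1abc.A": [1.0, 0.0, 0.0],
    "2xyz.A": [0.9, 0.1, 0.0],
    "3def.B": [0.0, 1.0, 0.0],
}
results = search([1.0, 0.0, 0.0], database, top_k=2, threshold=0.8)
for name, score in results:
    print(f"{name}\t{score:.3f}")
```

The threshold is applied before the cut to `top_k`, so a query can return fewer than `top_k` rows when few entries are similar enough.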

`search query-db`

Compare all entries from a query database against a subject database.

search query-db \
  --query-db-path databases/query_set \
  --subject-db-path databases/target_set \
  --top-k 100 \
  --threshold 0.8 \
  --output-csv comparisons.csv

Key Options:

--query-db-path: Query database path
--subject-db-path: Subject database to search
--top-k: Results per query (default: 100)
--threshold: Similarity threshold (default: 0.8)
--output-csv: Export results to CSV
--use-gpu-index: Use GPU acceleration

`search stats`

Display database statistics.

search stats --db-path databases/my_structures

`search cluster`

Cluster database embeddings using the Leiden algorithm.

search cluster \
  --db-path databases/my_structures \
  --threshold 0.8 \
  --resolution 1.0 \
  --output clusters.csv \
  --max-neighbors 1000 \
  --min-cluster-size 5

Key Options:

--db-path: Database path
--threshold: Similarity threshold for edge creation (default: 0.8)
--resolution: Leiden resolution parameter - higher values create more clusters (default: 1.0)
--output: Output file (.csv or .json)
--max-neighbors: Maximum neighbors per chain (default: 1000)
--min-cluster-size: Filter out smaller clusters (optional)
--use-gpu-index: Use GPU for FAISS operations
--seed: Random seed for reproducibility (optional)
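
The roles of `--threshold`, `--max-neighbors`, and `--min-cluster-size` can be sketched on a toy similarity graph. To stay dependency-free, the example below groups nodes by connected components, which is a deliberately simpler stand-in for the Leiden algorithm: Leiden also optimizes a quality function governed by `--resolution` and may split a connected component. The pairwise scores and the way neighbor lists are capped are assumptions for illustration.

```python
def build_edges(similarity, threshold=0.8, max_neighbors=1000):
    """Keep pairs scoring >= threshold, at most max_neighbors best partners per node."""
    neighbors = {}
    for (a, b), score in similarity.items():
        if score >= threshold:
            neighbors.setdefault(a, []).append((score, b))
            neighbors.setdefault(b, []).append((score, a))
    edges = set()
    for node, partners in neighbors.items():
        for _, other in sorted(partners, reverse=True)[:max_neighbors]:
            edges.add(tuple(sorted((node, other))))
    return edges


def connected_components(nodes, edges):
    """Group nodes with union-find (stand-in for Leiden community detection)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairwise similarity scores between six database entries.
similarity = {
    ("a", "b"): 0.95, ("b", "c"): 0.85, ("a", "c"): 0.60,
    ("d", "e"): 0.90, ("c", "d"): 0.40,
}
nodes = ["a", "b", "c", "d", "e", "f"]
edges = build_edges(similarity, threshold=0.8)
clusters = connected_components(nodes, edges)
min_cluster_size = 2
kept = [cluster for cluster in clusters if len(cluster) >= min_cluster_size]
print(kept)
```

Raising the threshold removes edges and tends to fragment clusters, while the minimum cluster size only filters the final output.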

Python API

The RcsbStructureEmbedding class provides methods for computing embeddings programmatically.

Basic Usage

from rcsb_embedding_model import RcsbStructureEmbedding

# Initialize model
model = RcsbStructureEmbedding(min_res=10, max_res=5000)

# Load models (optional - loads automatically on first use)
model.load_models()  # Auto-detects CUDA
# or specify device:
# import torch
# model.load_models(device=torch.device("cuda:0"))

Methods

`load_models(device=None)`

Load both residue and aggregator models.

import torch
model.load_models(device=torch.device("cuda"))

`load_residue_embedding(device=None)`

Load only the ESM3 residue embedding model.

model.load_residue_embedding()

`load_aggregator_embedding(device=None)`

Load only the aggregator model.

model.load_aggregator_embedding()

`residue_embedding(src_structure, structure_format='mmcif', chain_id=None, assembly_id=None)`

Compute per-residue embeddings for a structure.

Parameters:

src_structure: File path, URL, or file-like object
structure_format: 'mmcif', 'binarycif', or 'pdb'
chain_id: Specific chain ID (optional, uses all chains if None)
assembly_id: Assembly ID for biological assembly (optional)

Returns: torch.Tensor of shape [num_residues, embedding_dim]

# Single chain
residue_emb = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)

# All chains concatenated
all_residues = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif"
)

# Biological assembly
assembly_residues = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    assembly_id="1"
)

`residue_embedding_by_chain(src_structure, structure_format='mmcif', chain_id=None)`

Compute per-residue embeddings separately for each chain.

Returns: dict[str, torch.Tensor] mapping chain IDs to embeddings

chain_embeddings = model.residue_embedding_by_chain(
    src_structure="1abc.cif",
    structure_format="mmcif"
)
# Returns: {'A': tensor(...), 'B': tensor(...), ...}

# Get specific chain
chain_a = model.residue_embedding_by_chain(
    src_structure="1abc.cif",
    chain_id="A"
)

`residue_embedding_by_assembly(src_structure, structure_format='mmcif', assembly_id=None)`

Compute residue embeddings for an assembly.

Returns: dict[str, torch.Tensor] mapping assembly ID to concatenated embeddings

assembly_emb = model.residue_embedding_by_assembly(
    src_structure="1abc.cif",
    structure_format="mmcif",
    assembly_id="1"
)
# Returns: {'1': tensor(...)}

`sequence_embedding(sequence)`

Compute residue embeddings from amino acid sequence (no structural information).

Parameters:

sequence: Amino acid sequence string (plain or FASTA format)

Returns: torch.Tensor of shape [sequence_length, embedding_dim]

# Plain sequence
seq_emb = model.sequence_embedding("ACDEFGHIKLMNPQRSTVWY")

# FASTA format
fasta = """>Protein1
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY"""
seq_emb = model.sequence_embedding(fasta)
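
How the package parses FASTA input internally is not shown here. The sketch below illustrates one plausible normalization, assuming that header lines are dropped and the sequence lines of the first record are joined; `fasta_to_sequence` is a hypothetical helper, useful for checking the sequence length you expect in the returned tensor.

```python
def fasta_to_sequence(text):
    """Return the residues of the first FASTA record, or the text itself if it is plain."""
    sequence_lines = []
    seen_header = False
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header or sequence_lines:
                break  # a second record starts here; stop reading
            seen_header = True
            continue
        sequence_lines.append(line)
    return "".join(sequence_lines).upper()


fasta = """>Protein1
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY"""
sequence = fasta_to_sequence(fasta)
print(len(sequence), sequence)
```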

`aggregator_embedding(residue_embedding)`

Aggregate residue embeddings into a single structure-level vector.

Parameters:

residue_embedding: torch.Tensor from residue embedding methods

Returns: torch.Tensor of shape [1536]

residue_emb = model.residue_embedding("1abc.cif", chain_id="A")
structure_emb = model.aggregator_embedding(residue_emb)

`structure_embedding(src_structure, structure_format='mmcif', chain_id=None, assembly_id=None)`

End-to-end: compute residue embeddings and aggregate in one call.

# Complete structure embedding
structure_emb = model.structure_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)
# Returns: tensor of shape [1536]

Complete Example

from rcsb_embedding_model import RcsbStructureEmbedding
import torch

# Initialize
model = RcsbStructureEmbedding(min_res=10, max_res=5000)

# Option 1: Full structure embedding (one-shot)
embedding = model.structure_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)

# Option 2: Step-by-step with residue embeddings
residue_emb = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)
structure_emb = model.aggregator_embedding(residue_emb)

# Option 3: Process multiple chains
chain_embeddings = model.residue_embedding_by_chain(
    src_structure="1abc.cif"
)
for chain_id, res_emb in chain_embeddings.items():
    chain_emb = model.aggregator_embedding(res_emb)
    print(f"Chain {chain_id}: {chain_emb.shape}")

# Sequence-based embedding
seq_emb = model.sequence_embedding("ACDEFGHIKLMNPQRSTVWY")
structure_from_seq = model.aggregator_embedding(seq_emb)

See the examples/ and tests/ directories for more use cases.

Model Architecture

The embedding model is trained to predict structural similarity by approximating TM-scores using cosine distances between embeddings. It consists of two main components:

Protein Language Model (PLM): Computes residue-level embeddings from a given 3D structure.
Residue Embedding Aggregator: A transformer-based neural network that aggregates these residue-level embeddings into a single vector.
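
The training target can be written compactly. The notation is introduced here for illustration and is not taken from the source: f_θ denotes the full model (PLM followed by the aggregator) and A, B are two structures. The expression assumes the quantity compared with the TM-score is the cosine similarity (the corresponding cosine distance is one minus this value); the loss function used for training is not stated here, so only the approximation target is shown.

```latex
\[
e_A = f_\theta(A), \qquad e_B = f_\theta(B), \qquad
\cos(e_A, e_B) = \frac{e_A \cdot e_B}{\lVert e_A \rVert \, \lVert e_B \rVert}
\;\approx\; \mathrm{TM}(A, B)
\]
```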

[Figure: Embedding model architecture]

Protein Language Model (PLM)

Residue-wise embeddings of protein structures are computed using the ESM3 generative protein language model.

Residue Embedding Aggregator

The aggregation component consists of six transformer encoder layers, each with a 3,072-neuron feedforward layer and ReLU activations. After processing through these layers, a summation pooling operation is applied, followed by 12 fully connected residual layers that refine the embeddings into a single 1,536-dimensional vector.
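
Summation pooling is the step that turns a variable number of residue vectors into one fixed-length vector. The sketch below shows only that step in plain Python, with a toy width of 4 in place of the real embedding width; the transformer encoder layers and the residual layers are omitted, and `sum_pool` is a hypothetical name.

```python
def sum_pool(residue_vectors):
    """Sum per-residue vectors element-wise into one vector of the same width."""
    if not residue_vectors:
        raise ValueError("at least one residue vector is required")
    width = len(residue_vectors[0])
    pooled = [0.0] * width
    for vector in residue_vectors:
        if len(vector) != width:
            raise ValueError("all residue vectors must have the same width")
        for i, value in enumerate(vector):
            pooled[i] += value
    return pooled


# Two chains of very different length yield vectors of identical length.
short_chain = [[1.0, 2.0, 3.0, 4.0]] * 5
long_chain = [[0.5, 0.5, 0.5, 0.5]] * 300
print(len(sum_pool(short_chain)), len(sum_pool(long_chain)))
```

Because the output length depends only on the vector width, chains and assemblies of any size map to vectors that can be compared directly.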

Testing

After installation, run the test suite:

pytest

Citation

Segura, J., et al. (2026). Multi-scale structural similarity embedding search across entire proteomes. (https://doi.org/10.1093/bioinformatics/btag058)

License

This project uses the EvolutionaryScale ESM-3 model and is distributed under the Cambrian Non-Commercial License Agreement.


Release history

Version	Release date
0.0.51 (this version)	Mar 26, 2026
0.0.50	Mar 26, 2026
0.0.49	Feb 3, 2026
0.0.48	Dec 19, 2025
0.0.47	Dec 19, 2025
0.0.46	Dec 18, 2025
0.0.45	Dec 18, 2025
0.0.44	Dec 17, 2025
0.0.43	Nov 29, 2025
0.0.42	Nov 25, 2025
0.0.41	Nov 25, 2025
0.0.40	Nov 21, 2025
0.0.39	Nov 18, 2025
0.0.38	Oct 16, 2025
0.0.36	Jun 26, 2025
0.0.35	Jun 21, 2025
0.0.34	Jun 20, 2025
0.0.33	Jun 19, 2025
0.0.32	Jun 18, 2025
0.0.31	Jun 17, 2025
0.0.30	Jun 17, 2025
0.0.29	Jun 16, 2025
0.0.28	Jun 13, 2025
0.0.27	Jun 13, 2025
0.0.26	May 16, 2025
0.0.25	May 16, 2025
0.0.24	May 16, 2025
0.0.23	May 15, 2025
0.0.22	May 14, 2025
0.0.21	May 14, 2025
0.0.20	May 13, 2025
0.0.19	May 9, 2025
0.0.18	May 9, 2025
0.0.17	May 5, 2025
0.0.16	May 2, 2025
0.0.15	May 2, 2025
0.0.14	Apr 30, 2025
0.0.13	Apr 29, 2025
0.0.12	Apr 29, 2025
0.0.11	Apr 28, 2025
0.0.10	Apr 28, 2025
0.0.9	Apr 28, 2025
0.0.8	Apr 28, 2025
0.0.7	Apr 24, 2025
0.0.6	Apr 24, 2025
0.0.5	Apr 23, 2025
0.0.4	Apr 22, 2025
0.0.3	Apr 18, 2025
0.0.2	Apr 17, 2025
0.0.1	Apr 17, 2025

Download files


Source Distribution

rcsb_embedding_model-0.0.51.tar.gz (4.5 MB view details)

Uploaded Mar 26, 2026 Source

Built Distribution


rcsb_embedding_model-0.0.51-py3-none-any.whl (49.7 kB view details)

Uploaded Mar 26, 2026 Python 3

File details

Details for the file rcsb_embedding_model-0.0.51.tar.gz.

File metadata

Download URL: rcsb_embedding_model-0.0.51.tar.gz
Upload date: Mar 26, 2026
Size: 4.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rcsb_embedding_model-0.0.51.tar.gz
Algorithm	Hash digest
SHA256	`7509d85523fc8b5fdcbeb59c2dc44ba170cd9a2ebcfabcc5457ea98510577fb9`
MD5	`2884bfdcea9c223fa64d00be91d92f82`
BLAKE2b-256	`a543bcf8d614c8143b4589f5cfaa94efec4fe9005b03e2976070fadcef9adf81`


Provenance

The following attestation bundles were made for rcsb_embedding_model-0.0.51.tar.gz:

Publisher: publish.yaml on rcsb/rcsb-embedding-model

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rcsb_embedding_model-0.0.51.tar.gz
- Subject digest: 7509d85523fc8b5fdcbeb59c2dc44ba170cd9a2ebcfabcc5457ea98510577fb9
- Sigstore transparency entry: 1186421181
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: rcsb/rcsb-embedding-model@56a88bcd4b8fa13306c672803955051581b79ca6
- Branch / Tag: refs/tags/v0.0.51
- Owner: https://github.com/rcsb
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@56a88bcd4b8fa13306c672803955051581b79ca6
- Trigger Event: release

File details

Details for the file rcsb_embedding_model-0.0.51-py3-none-any.whl.

File metadata

Download URL: rcsb_embedding_model-0.0.51-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 49.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rcsb_embedding_model-0.0.51-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3f0c8d1cde35745d374d79e86955f2616a65d2b7217df1c994caf54128ca657e`
MD5	`e0ffe591516e3add2db62efb7a452aa0`
BLAKE2b-256	`fa25446a692c6a859bce8d0dd0e28983539970763b4585e2e14dbe2714e1cdf1`


Provenance

The following attestation bundles were made for rcsb_embedding_model-0.0.51-py3-none-any.whl:

Publisher: publish.yaml on rcsb/rcsb-embedding-model

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rcsb_embedding_model-0.0.51-py3-none-any.whl
- Subject digest: 3f0c8d1cde35745d374d79e86955f2616a65d2b7217df1c994caf54128ca657e
- Sigstore transparency entry: 1186421185
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: rcsb/rcsb-embedding-model@56a88bcd4b8fa13306c672803955051581b79ca6
- Branch / Tag: refs/tags/v0.0.51
- Owner: https://github.com/rcsb
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@56a88bcd4b8fa13306c672803955051581b79ca6
- Trigger Event: release
