
Protein Embedding Model for Structure Search


FoldMatch

Version 0.7.1

Overview

FoldMatch is a Python toolkit to encode macromolecular 3D structures into fixed-length vector embeddings for efficient large-scale structure similarity search and clustering.

Reference: Multi-scale structural similarity embedding search across entire proteomes.

A web-based implementation using this tool for structure similarity search is available at rcsb-embedding-search.

If you are interested in training a new model with a new structure dataset, visit the rcsb-embedding-search repository, which provides scripts and documentation for training.

Features

  • Residue-level embeddings computed using the ESM3 protein language model
  • Sequence-based embeddings from FASTA files without requiring 3D structures
  • Structure-level embeddings aggregated via a transformer-based aggregator network
  • Fast and efficient FAISS-based similarity search
  • Two-stage sequence search — an embedding prefilter followed by exact pairwise Smith-Waterman alignment, reporting sequence identity, coverage, and approximate significance
  • Structural clustering using the Leiden algorithm for biological assembly identification
  • Command-line interface implemented with Typer for high-throughput inference workflows
  • Python API for interactive embedding computation and integration into analysis pipelines
  • High-performance inference leveraging PyTorch Lightning, with multi-node and multi-GPU support

Installation

From PyPI

pip install foldmatch

From Source (Development)

git clone https://github.com/rcsb/foldmatch.git
cd foldmatch
pip install -e .

Requirements:

  • Python ≥ 3.12
  • ESM 3.2.3
  • Lightning 2.6.1
  • Typer 0.24.1
  • Biotite 1.6.0
  • FAISS 1.13.2
  • igraph 1.0.0
  • leidenalg 0.11.0
  • PyTorch with CUDA support (recommended for GPU acceleration)

Optional Dependencies:

  • faiss-gpu for GPU-accelerated similarity search (instead of faiss-cpu)

Usage

The package provides two main interfaces:

  1. Command-line Interface (CLI) for batch processing and high-throughput workflows
  2. Python API for interactive use and integration into custom pipelines

Command-Line Interface (CLI)

The toolkit ships three CLIs. Run any of them with --help for the full option documentation; the canonical examples below are enough to get started.

fm-embedding — compute embeddings

Two subcommand groups reflect input modality:

# Residue / chain / assembly embeddings from a folder of 3D structures
fm-embedding from-structures residue  --src-folder data/pdb --output-path out --structure-format mmcif
fm-embedding from-structures chain    --src-folder data/pdb --output-path out --structure-format mmcif
fm-embedding from-structures assembly --src-folder data/pdb --output-path out --structure-format mmcif

# Residue / chain embeddings from protein sequences in a FASTA file (no 3D required)
fm-embedding from-sequences  residue  --fasta-file seqs.fasta --output-path out
fm-embedding from-sequences  chain    --fasta-file seqs.fasta --output-path out

# One-shot model download
fm-embedding download-models

Assembly-level embeddings are only available under from-structures — there is no assembly concept for a bare sequence.

Run fm-embedding [from-structures|from-sequences] [command] --help for full options (batch size, accelerator, devices, output format, distributed settings, etc.).

fm-search — build and query FAISS databases

# Build a similarity-search database from structures, FASTA, or pre-computed embeddings
fm-search build structures  --structure-folder data/pdb --output-db dbs/my_db --tmp-embedding-folder tmp
fm-search build sequences   --fasta-file seqs.fasta     --output-db dbs/my_db --tmp-embedding-folder tmp
fm-search build embeddings  --embedding-folder out      --output-db dbs/my_db

# Query the database
fm-search query structure   --db-path dbs/my_db --query-structure q.cif
fm-search query sequences   --db-path dbs/my_db --fasta-file q.fasta --tmp-embedding-folder tmp
fm-search query embedding   --db-path dbs/my_db --embedding-file q.pt
fm-search query db          --query-db-path dbs/queries --subject-db-path dbs/my_db

# Inspect, cluster, export
fm-search stats             --db-path dbs/my_db
fm-search cluster           --db-path dbs/my_db --output clusters.csv
fm-search similarity-graph  --db-path dbs/my_db --output graph.graphml

All build commands accept --index-type [auto|flat|hnsw|ivf_pq] and IVF-PQ tuning flags (--ivf-nlist, --ivf-nprobe). See fm-search <subcommand> --help for the full surface.
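The --ivf-nlist and --ivf-nprobe flags map onto the two knobs of an inverted-file index: how many cells the vectors are partitioned into, and how many of those cells a query visits. The toy below illustrates that trade-off in plain Python. It is not how FAISS or FoldMatch implement it: a real IVF index trains its centroids with k-means, and ivf_pq additionally compresses vectors with product quantization. The function names and the shortcut of using the first nlist vectors as centroids are assumptions made for brevity.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, nlist):
    """Partition vectors into nlist cells around crude centroids.

    Assumption: the first nlist vectors act as centroids; a real index
    would learn them with k-means.
    """
    centroids = [list(v) for v in vectors[:nlist]]
    cells = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        cells[nearest].append(idx)
    return centroids, cells


def search_ivf(query, vectors, centroids, cells, nprobe, k):
    """Scan only the nprobe cells closest to the query; return top-k ids."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k]


vectors = [[0.0, 0.0], [10.0, 10.0], [0.5, 0.2], [9.5, 10.2], [0.1, 0.9], [5.0, 5.2]]
centroids, cells = build_ivf(vectors, nlist=2)
query = [4.9, 5.0]
hits_fast = search_ivf(query, vectors, centroids, cells, nprobe=1, k=2)
hits_full = search_ivf(query, vectors, centroids, cells, nprobe=2, k=2)
print(hits_fast, hits_full)
```

With nprobe=1 the search skips the cell that holds the true nearest neighbour (vector 5), while nprobe=2 visits every cell and finds it. Raising nprobe trades speed for recall in the same way on a real index.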

Two-stage sequence search (exact identity)

build sequences also writes a sidecar {db}.sequences store next to the FAISS index. This lets sequence-built databases report exact sequence identity, not just embedding similarity. When you run query sequences (or query db) against such a database, a second stage pairwise-aligns each embedding hit (local Smith-Waterman, BLOSUM62) and adds the following columns: SeqIdentity_aln, SeqIdentity_shorter, QueryCoverage, SubjectCoverage, AlnLen, AlnScore, and Pvalue_approx/Evalue_approx. Surviving hits are then re-ranked by identity.
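To make the second stage concrete, here is a minimal Smith-Waterman sketch in plain Python. It is an illustration only, not FoldMatch's implementation: it scores with a flat match/mismatch value and a linear gap penalty instead of BLOSUM62 with separate gap-open and gap-extend costs. The dictionary keys reuse the column names above, but the definitions behind them (identity over alignment length, identity over the shorter sequence, coverage as aligned residues over sequence length) are assumptions inferred from those names.

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty; returns summary statistics."""

    def sub(i, j):
        # Toy substitution score; the real pipeline uses BLOSUM62.
        return match if query[i - 1] == subject[j - 1] else mismatch

    rows, cols = len(query) + 1, len(subject) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the local alignment starts (score 0).
    i, j = best_pos
    matches = aln_len = q_aligned = s_aligned = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            matches += query[i - 1] == subject[j - 1]
            q_aligned += 1
            s_aligned += 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            q_aligned += 1  # query residue aligned against a gap
            i -= 1
        else:
            s_aligned += 1  # subject residue aligned against a gap
            j -= 1
        aln_len += 1

    if aln_len == 0:
        return None  # no positive-scoring local alignment exists
    return {
        "AlnScore": best,
        "AlnLen": aln_len,
        "SeqIdentity_aln": matches / aln_len,
        "SeqIdentity_shorter": matches / min(len(query), len(subject)),
        "QueryCoverage": q_aligned / len(query),
        "SubjectCoverage": s_aligned / len(subject),
    }


result = smith_waterman("MKACDEFGW", "ACDEFG")
print(result)
```

Here the subject is fully covered while only six of the nine query residues take part in the alignment, which is why identity and coverage are reported as separate columns.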

# Stage 2 turns on automatically when the database has a sequence store
fm-search query sequences --db-path dbs/my_db --fasta-file q.fasta --tmp-embedding-folder tmp

  • Auto by default: Stage 2 runs when the database(s) carry a sequence store and falls back to embedding-only otherwise. Force it with --seq-identity (errors if no store is present) or disable with --no-seq-identity. query db requires both databases to have sequence stores.
  • Hits below --min-seq-identity (default 0.3) or --min-coverage are dropped.
  • Tuning: --gap-open, --gap-extend, and --align-workers (defaults to all CPUs on the node).
  • Pvalue_approx/Evalue_approx are an approximate, relative-only significance signal (sampled Karlin–Altschul λ/K) — useful for ranking within FoldMatch, but not calibrated like BLAST/mmseqs2 E-values.
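The significance and filtering steps in the list above can be sketched as follows. The Karlin-Altschul relation E = K·m·n·exp(-λS), with P = 1 - exp(-E), is standard. Everything else here is an assumption for illustration: the λ and K values are placeholders (FoldMatch samples its own), the helper names are hypothetical, the coverage threshold is applied to both query and subject, its 0.0 default is invented, and hits are re-ranked on SeqIdentity_aln.

```python
import math


def karlin_altschul(score, query_len, subject_len, lam, k):
    """Return (E-value, P-value) for a local alignment score."""
    evalue = k * query_len * subject_len * math.exp(-lam * score)
    pvalue = -math.expm1(-evalue)  # 1 - exp(-E), stable for very small E
    return evalue, pvalue


def filter_and_rerank(hits, min_seq_identity=0.3, min_coverage=0.0):
    """Drop hits below the thresholds, then sort by identity, best first."""
    kept = [
        h for h in hits
        if h["SeqIdentity_aln"] >= min_seq_identity
        and h["QueryCoverage"] >= min_coverage
        and h["SubjectCoverage"] >= min_coverage
    ]
    return sorted(kept, key=lambda h: h["SeqIdentity_aln"], reverse=True)


# Made-up hits for demonstration only.
hits = [
    {"id": "a", "SeqIdentity_aln": 0.45, "QueryCoverage": 0.9, "SubjectCoverage": 0.8},
    {"id": "b", "SeqIdentity_aln": 0.22, "QueryCoverage": 0.9, "SubjectCoverage": 0.9},
    {"id": "c", "SeqIdentity_aln": 0.91, "QueryCoverage": 0.7, "SubjectCoverage": 0.6},
]
ranked_ids = [h["id"] for h in filter_and_rerank(hits)]

# Placeholder lambda/K; a higher score yields a smaller E-value.
e_low, _ = karlin_altschul(score=40, query_len=200, subject_len=300, lam=0.267, k=0.041)
e_high, _ = karlin_altschul(score=80, query_len=200, subject_len=300, lam=0.267, k=0.041)
print(ranked_ids, e_low, e_high)
```

Because λ and K are estimated rather than calibrated, values from a formula like this are best read as a ranking signal within one run, in line with the caveat above.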

inference — low-level inference subcommands

Lower-level entry point exposing individual inference passes (residue-embedding, structure-embedding, chain-embedding, assembly-embedding, complete-embedding). Mostly useful for advanced workflows that compose inference stages explicitly. Run inference --help for the command list.


Python API

The FoldMatch class provides methods for computing embeddings programmatically.

Basic Usage

from foldmatch import FoldMatch

# Initialize model
model = FoldMatch(min_res=10, max_res=5000)

# Load models (optional - loads automatically on first use)
model.load_models()  # Auto-detects CUDA
# or specify device:
# import torch
# model.load_models(device=torch.device("cuda:0"))

Methods

load_models(device=None)

Load both residue and aggregator models.

import torch
model.load_models(device=torch.device("cuda"))

load_residue_embedding(device=None)

Load only the ESM3 residue embedding model.

model.load_residue_embedding()

load_aggregator_embedding(device=None)

Load only the aggregator model.

model.load_aggregator_embedding()

residue_embedding(src_structure, structure_format='mmcif', chain_id=None, assembly_id=None)

Compute per-residue embeddings for a structure.

Parameters:

  • src_structure: File path, URL, or file-like object
  • structure_format: 'mmcif', 'binarycif', or 'pdb'
  • chain_id: Specific chain ID (optional, uses all chains if None)
  • assembly_id: Assembly ID for biological assembly (optional)

Returns: torch.Tensor of shape [num_residues, embedding_dim]

# Single chain
residue_emb = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)

# All chains concatenated
all_residues = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif"
)

# Biological assembly
assembly_residues = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    assembly_id="1"
)

residue_embedding_by_chain(src_structure, structure_format='mmcif', chain_id=None)

Compute per-residue embeddings separately for each chain.

Returns: dict[str, torch.Tensor] mapping chain IDs to embeddings

chain_embeddings = model.residue_embedding_by_chain(
    src_structure="1abc.cif",
    structure_format="mmcif"
)
# Returns: {'A': tensor(...), 'B': tensor(...), ...}

# Get specific chain
chain_a = model.residue_embedding_by_chain(
    src_structure="1abc.cif",
    chain_id="A"
)

residue_embedding_by_assembly(src_structure, structure_format='mmcif', assembly_id=None)

Compute residue embeddings for an assembly.

Returns: dict[str, torch.Tensor] mapping assembly ID to concatenated embeddings

assembly_emb = model.residue_embedding_by_assembly(
    src_structure="1abc.cif",
    structure_format="mmcif",
    assembly_id="1"
)
# Returns: {'1': tensor(...)}

sequence_embedding(sequence)

Compute residue embeddings from amino acid sequence (no structural information).

Parameters:

  • sequence: Amino acid sequence string (plain or FASTA format)

Returns: torch.Tensor of shape [sequence_length, embedding_dim]

# Plain sequence
seq_emb = model.sequence_embedding("ACDEFGHIKLMNPQRSTVWY")

# FASTA format
fasta = """>Protein1
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY"""
seq_emb = model.sequence_embedding(fasta)
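The two accepted input forms differ only in the header line and the line breaks. The helper below is a hypothetical sketch of how a FASTA record reduces to the same plain string; how sequence_embedding parses its input internally is not documented here, so treat this as an illustration of the input formats rather than the library's behaviour.

```python
def fasta_to_sequence(text):
    """Return the residue string from plain or single-record FASTA text."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # Skip blank lines and the '>' header; join wrapped sequence lines.
    residues = [line for line in lines if line and not line.startswith(">")]
    return "".join(residues).upper()


fasta = """>Protein1
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY"""
plain = "ACDEFGHIKLMNPQRSTVWY"

from_fasta = fasta_to_sequence(fasta)
from_plain = fasta_to_sequence(plain)
print(len(from_fasta), len(from_plain))
```

Note that the FASTA example above wraps one 40-residue sequence over two lines, so it is twice as long as the plain 20-residue example.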

aggregator_embedding(residue_embedding)

Aggregate residue embeddings into a single structure-level vector.

Parameters:

  • residue_embedding: torch.Tensor from residue embedding methods

Returns: torch.Tensor of shape [1536]

residue_emb = model.residue_embedding("1abc.cif", chain_id="A")
structure_emb = model.aggregator_embedding(residue_emb)

structure_embedding(src_structure, structure_format='mmcif', chain_id=None, assembly_id=None)

End-to-end: compute residue embeddings and aggregate in one call.

# Complete structure embedding
structure_emb = model.structure_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)
# Returns: tensor of shape [1536]

Complete Example

from foldmatch import FoldMatch
import torch

# Initialize
model = FoldMatch(min_res=10, max_res=5000)

# Option 1: Full structure embedding (one-shot)
embedding = model.structure_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)

# Option 2: Step-by-step with residue embeddings
residue_emb = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)
structure_emb = model.aggregator_embedding(residue_emb)

# Option 3: Process multiple chains
chain_embeddings = model.residue_embedding_by_chain(
    src_structure="1abc.cif"
)
for chain_id, res_emb in chain_embeddings.items():
    chain_emb = model.aggregator_embedding(res_emb)
    print(f"Chain {chain_id}: {chain_emb.shape}")

# Sequence-based embedding
seq_emb = model.sequence_embedding("ACDEFGHIKLMNPQRSTVWY")
structure_from_seq = model.aggregator_embedding(seq_emb)

See the examples/ and tests/ directories for more use cases.


Model Architecture

The embedding model is trained to predict structural similarity by approximating TM-scores using cosine distances between embeddings. It consists of two main components:

  • Protein Language Model (PLM): Computes residue-level embeddings from a given 3D structure.
  • Residue Embedding Aggregator: A transformer-based neural network that aggregates these residue-level embeddings into a single vector.
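Since the model is trained so that the cosine relationship between two embeddings tracks their TM-score, comparing two structures reduces to one vector operation. The sketch below uses plain Python lists and tiny made-up vectors in place of the 1,536-dimensional tensors the API returns; the function name is an assumption, not part of FoldMatch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Made-up 3-dimensional stand-ins for structure embeddings.
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [3.0, 0.0, -1.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
dist_ab = 1.0 - sim_ab  # cosine distance
print(sim_ab, sim_ac, dist_ab)
```

With real embeddings, torch.nn.functional.cosine_similarity computes the same quantity directly on tensors.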

[Figure: Embedding model architecture]

Protein Language Model (PLM)

Residue-wise embeddings of protein structures are computed using the ESM3 generative protein language model.

Residue Embedding Aggregator

The aggregation component consists of six transformer encoder layers, each with a 3,072-neuron feedforward layer and ReLU activations. After processing through these layers, a summation pooling operation is applied, followed by 12 fully connected residual layers that refine the embeddings into a single 1,536-dimensional vector.
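The summation pooling step is what turns a variable-length residue matrix into a fixed-length vector. The toy below shows only that step, on nested lists with a made-up dimension of 3; the transformer encoder layers and the residual layers are omitted, and the function name is an assumption.

```python
def sum_pool(residue_embeddings):
    """Collapse [num_residues, dim] into [dim] by summing over residues."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    return [sum(row[d] for row in residue_embeddings) for d in range(dim)]


# Two chains of different length (2 and 4 residues), same embedding width.
short_chain = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
long_chain = short_chain + [[0.0, 1.0, 1.0], [2.0, 2.0, 2.0]]

pooled_short = sum_pool(short_chain)
pooled_long = sum_pool(long_chain)
print(pooled_short, pooled_long)
```

Both outputs have the same length regardless of chain size, which is what lets structures of any size share one similarity index.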


Testing

After installation, run the test suite:

pytest

macOS notes

The problem. PyPI wheels for faiss-cpu and torch (pulled in via lightning) each bundle their own copy of libomp.dylib. On macOS, both copies get loaded into the same Python process. Whenever FAISS enters an OpenMP-parallel section (batched search with more than one query vector, IndexHNSWFlat graph construction, IVF-PQ training), the second OpenMP runtime fails in pthread_mutex_init and the call deadlocks, so the CLI appears to hang indefinitely. Linux installs are unaffected because both libraries share a single OpenMP runtime.

Affected commands on macOS without mitigation:

  • fm-search build with --index-type hnsw or auto past ~10k vectors, and any --index-type ivf_pq.
  • fm-search query embedding with a multi-row .parquet file.
  • fm-search query sequences with more than one input sequence.
  • fm-search query db (database-to-database).

Single-query paths (fm-search query structure, small --index-type flat builds) are unaffected.
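To check whether an environment carries more than one bundled OpenMP runtime, a small scan of the installed packages can help. The sketch below is an assumption-laden diagnostic, not part of FoldMatch: the helper name is hypothetical, and the demo runs against a throwaway directory whose layout is invented for the example. On a real install you would pass your site-packages path, for instance sysconfig.get_paths()["purelib"].

```python
import tempfile
from pathlib import Path


def find_openmp_copies(root, pattern="libomp*.dylib"):
    """Return sorted paths under root whose file name matches pattern."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


# Demo on a temporary tree with an invented layout holding two copies.
with tempfile.TemporaryDirectory() as tmp:
    for package_dir in ("torch/lib", "faiss"):
        target = Path(tmp) / package_dir
        target.mkdir(parents=True)
        (target / "libomp.dylib").touch()
    demo_copies = [p.relative_to(tmp).as_posix() for p in find_openmp_copies(tmp)]

print(demo_copies)
```

More than one match suggests the environment may be exposed to the hang described above.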

Possible fixes.

  1. Fix the install environment — install both libraries against a unified OpenMP runtime. On conda-forge:

    conda install -c conda-forge faiss-cpu pytorch llvm-openmp
    

    Once a single libomp is loaded, FAISS's parallel paths just work and you keep the full multi-threaded performance.

  2. Force single-threaded FAISS via environment variable — set OMP_NUM_THREADS=1 before invoking Python:

    export OMP_NUM_THREADS=1
    fm-search query db ...
    

    This sidesteps the parallel sections entirely. The toolkit still works, but FAISS runs single-threaded, so large builds and queries are slower.

What this package does by default. To prevent macOS users from hitting a silent hang out of the box, foldmatch/__init__.py calls os.environ.setdefault("OMP_NUM_THREADS", "1") on darwin only — before any torch or faiss import. This is option 2 above, applied automatically. Linux installs are not touched (the branch is skipped). A user on macOS who has fixed their environment per option 1 can opt back into parallelism by exporting OMP_NUM_THREADS=N before launching Python — setdefault respects an existing value.
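The default described above relies on setdefault only writing a value when none exists. The sketch below reproduces that logic in isolation; the function name and its parameters are hypothetical (the package applies the call at import time in foldmatch/__init__.py), and plain dictionaries stand in for os.environ so the three cases can be shown side by side.

```python
import os
import sys


def apply_macos_omp_default(platform=sys.platform, environ=os.environ):
    """Pin OMP_NUM_THREADS to 1 on macOS unless the user already set it."""
    if platform == "darwin":
        environ.setdefault("OMP_NUM_THREADS", "1")
    return environ.get("OMP_NUM_THREADS")


# macOS, nothing exported: the single-threaded default is applied.
mac_default = apply_macos_omp_default(platform="darwin", environ={})

# macOS, user exported a value beforehand: it is left untouched.
mac_override = apply_macos_omp_default(
    platform="darwin", environ={"OMP_NUM_THREADS": "8"}
)

# Linux: the branch is skipped and the variable stays unset.
linux_value = apply_macos_omp_default(platform="linux", environ={})
print(mac_default, mac_override, linux_value)
```

The ordering matters: the variable has to be in place before torch or faiss are imported, which is why the package applies it at the top of its __init__.py.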


Citation

Segura, J., et al. (2026). Multi-scale structural similarity embedding search across entire proteomes. (https://doi.org/10.1093/bioinformatics/btag058)


License

This project uses the EvolutionaryScale ESM-3 model and is distributed under the Cambrian Non-Commercial License Agreement.
