
ProtEmbedder


Extract protein embeddings from FASTA files using ESM-2, ProtT5, and ProtBert protein language models.

Installation

pip install protembedder

Requirements: Python ≥ 3.8, PyTorch ≥ 1.12, fair-esm ≥ 2.0, transformers ≥ 4.30

For development / editable install: pip install -e .
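
To verify the install, import the package's main class (a quick sanity check; ProteinEmbedder is the API documented below):

python -c "from protembedder import ProteinEmbedder; print('ok')"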

Quick Start

CLI Usage

# Per-protein embeddings with ESM-2 650M (default)
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt

# Per-protein embeddings with ProtT5-XL
protembedder -m prot_t5_xl -i proteins.fasta -o embeddings.pt

# Per-protein embeddings with ProtBert
protembedder -m prot_bert -i proteins.fasta -o embeddings.pt

# Per-residue (per amino acid) embeddings
protembedder -m prot_bert -i proteins.fasta -o embeddings.pt --per-residue

# GPU with custom batch size
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --device cuda --batch-size 16

# Small ESM-2 model for quick testing
protembedder -m esm2_t6_8M -i proteins.fasta -o embeddings.pt -v

CLI Flags

Flag           Short  Required  Default  Description
--model        -m     Yes       -        Model name (see tables below)
--input        -i     Yes       -        Input FASTA file path
--output       -o     Yes       -        Output .pt file path
--per-residue         No        False    Per-residue (per amino acid) embeddings
--device              No        auto     Device: cpu, cuda, cuda:0, etc.
--batch-size          No        8        Sequences per batch
--verbose      -v     No        False    Verbose logging
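
The auto default for --device typically means picking CUDA when it is available. A minimal sketch of that selection logic (a common pattern; whether protembedder resolves "auto" exactly this way is an assumption):

import torch

# Common auto-selection pattern; protembedder's exact behavior may differ.
device = "cuda" if torch.cuda.is_available() else "cpu"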

Available Models

ESM-2 (Meta AI)

Model           Parameters  Embedding Dim  Layers
esm2_t6_8M      8M          320            6
esm2_t12_35M    35M         480            12
esm2_t30_150M   150M        640            30
esm2_t33_650M   650M        1280           33
esm2_t36_3B     3B          2560           36
esm2_t48_15B    15B         5120           48

ProtT5 (Rostlab)

Model       Parameters  Embedding Dim  HuggingFace Repo
prot_t5_xl  3B          1024           Rostlab/prot_t5_xl_half_uniref50-enc

ProtBert (Rostlab)

Model      Parameters  Embedding Dim  HuggingFace Repo
prot_bert  420M        1024           Rostlab/prot_bert
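
As a quick sanity check, a per-protein vector's length should match the Embedding Dim column for the chosen model. A minimal sketch using the Python API shown below (the dict-of-tensors return value is inferred from the Output Format section):

from protembedder import ProteinEmbedder

# Smallest ESM-2 model for a fast check; per the table, its embedding dim is 320.
embedder = ProteinEmbedder("esm2_t6_8M", device="cpu")
emb = embedder.embed_sequences([("demo", "MKTAYIAKQRQISFVKSH")], per_residue=False)
print(emb["demo"].shape)  # expected: torch.Size([320])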

Python API

import torch
from protembedder import ProteinEmbedder

# ESM-2
embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")

# ProtT5-XL
embedder = ProteinEmbedder("prot_t5_xl", device="cuda")

# ProtBert
embedder = ProteinEmbedder("prot_bert", device="cuda")

# From FASTA file — per-protein embeddings (default)
embeddings = embedder.embed_fasta("proteins.fasta", per_residue=False)

# From sequence list — per-residue embeddings
sequences = [
    ("protein_1", "MKTAYIAKQRQISFVKSH"),
    ("protein_2", "MDEVLQAELPAEG"),
]
embeddings = embedder.embed_sequences(sequences, per_residue=True, batch_size=4)

# Save / Load
torch.save(embeddings, "embeddings.pt")
loaded = torch.load("embeddings.pt")

Output Format

The output .pt file contains a Python dict: {header: tensor}.

  • Per-protein (default): tensor shape is (embed_dim,)
  • Per-residue (--per-residue): tensor shape is (seq_len, embed_dim)

emb = torch.load("embeddings.pt")
for name, tensor in emb.items():
    print(f"{name}: {tensor.shape}")
# protein_1: torch.Size([1280])        # ESM-2 650M, per-protein
# protein_1: torch.Size([18, 1024])    # ProtBert, per-residue
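
If you start from per-residue output and later need one vector per protein, mean-pooling over the residue axis is a common reduction (whether this matches the package's own per-protein mode exactly is an assumption):

import torch

emb = torch.load("embeddings.pt")  # per-residue: {header: (seq_len, embed_dim)}
# Average over the sequence length axis to get one (embed_dim,) vector per protein.
pooled = {name: t.mean(dim=0) for name, t in emb.items()}
torch.save(pooled, "embeddings_pooled.pt")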

OOM Handling

If a batch causes an out-of-memory error on GPU, the package automatically falls back to processing sequences one at a time. You can also reduce --batch-size manually.
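
The fallback pattern described above looks roughly like the sketch below. This is illustrative only, not the package's actual code; embed_batch_with_fallback is a hypothetical helper, and the dict return type of embed_sequences is inferred from the Output Format section.

import torch

def embed_batch_with_fallback(embedder, batch):
    # Hypothetical helper illustrating the described OOM fallback.
    try:
        return embedder.embed_sequences(batch, batch_size=len(batch))
    except RuntimeError as e:
        if "out of memory" not in str(e).lower():
            raise
        torch.cuda.empty_cache()  # release the failed batch's cached allocations
        results = {}
        for item in batch:  # retry one sequence at a time
            results.update(embedder.embed_sequences([item], batch_size=1))
        return results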

References

Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130. https://doi.org/10.1126/science.ade2574

Elnaggar, A., et al. "ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127. https://doi.org/10.1109/TPAMI.2021.3095381

License

MIT
