
ProtEmbedder


Extract protein embeddings from FASTA files using ESM-2 and ProtT5 protein language models.

Installation

pip install protembedder

Requirements: Python ≥ 3.8, PyTorch ≥ 1.12, fair-esm ≥ 2.0, transformers ≥ 4.30

For development / editable install: pip install -e .

Quick Start

CLI Usage

# Per-protein embeddings with ESM-2 650M (default)
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt

# Per-protein embeddings with ProtT5-XL
protembedder -m prot_t5_xl -i proteins.fasta -o embeddings.pt

# Per-residue (per amino acid) embeddings
protembedder -m prot_t5_xl -i proteins.fasta -o embeddings.pt --per-residue

# GPU with custom batch size
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --device cuda --batch-size 16

# Small ESM-2 model for quick testing
protembedder -m esm2_t6_8M -i proteins.fasta -o embeddings.pt -v

CLI Flags

| Flag | Short | Required | Default | Description |
|---|---|---|---|---|
| `--model` | `-m` | Yes | | Model name (see tables below) |
| `--input` | `-i` | Yes | | Input FASTA file path |
| `--output` | `-o` | Yes | | Output `.pt` file path |
| `--per-residue` | | No | False | Emit per-residue (per amino acid) embeddings |
| `--device` | | No | auto | `cpu`, `cuda`, `cuda:0`, etc. |
| `--batch-size` | | No | 8 | Sequences per batch |
| `--verbose` | `-v` | No | False | Verbose logging |

Available Models

ESM-2 (Meta AI)

| Model | Parameters | Embedding Dim | Layers |
|---|---|---|---|
| `esm2_t6_8M` | 8M | 320 | 6 |
| `esm2_t12_35M` | 35M | 480 | 12 |
| `esm2_t30_150M` | 150M | 640 | 30 |
| `esm2_t33_650M` | 650M | 1280 | 33 |
| `esm2_t36_3B` | 3B | 2560 | 36 |
| `esm2_t48_15B` | 15B | 5120 | 48 |
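The embedding dimension also determines output size: per-protein float32 embeddings take `num_proteins × embed_dim × 4` bytes. A quick back-of-the-envelope check (plain Python; the dims are copied from the table above):

```python
# Embedding dimensions from the ESM-2 table above.
ESM2_DIMS = {
    "esm2_t6_8M": 320,
    "esm2_t12_35M": 480,
    "esm2_t30_150M": 640,
    "esm2_t33_650M": 1280,
    "esm2_t36_3B": 2560,
    "esm2_t48_15B": 5120,
}

def per_protein_output_mib(model: str, n_proteins: int) -> float:
    """Approximate size of a per-protein float32 embedding file in MiB."""
    return n_proteins * ESM2_DIMS[model] * 4 / 2**20

# 100k proteins with the 650M model come to roughly 488 MiB of embeddings.
print(round(per_protein_output_mib("esm2_t33_650M", 100_000)))
```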

ProtT5 (Rostlab)

| Model | Parameters | Embedding Dim | HuggingFace Repo |
|---|---|---|---|
| `prot_t5_xl` | 3B | 1024 | `Rostlab/prot_t5_xl_half_uniref50-enc` |
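The ProtT5 tokenizer expects residues separated by spaces, with rare or ambiguous amino acids (U, Z, O, B) mapped to X. protembedder handles this internally, but if you ever feed sequences to the Rostlab model directly, the preprocessing looks roughly like:

```python
import re

def preprocess_for_prot_t5(seq: str) -> str:
    """Map rare/ambiguous amino acids (U, Z, O, B) to X and
    space-separate residues, as the ProtT5 tokenizer expects."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return " ".join(seq)

print(preprocess_for_prot_t5("MKTUayi"))  # -> "M K T X A Y I"
```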

Python API

import torch
from protembedder import ProteinEmbedder

# ESM-2
embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")

# ProtT5-XL
embedder = ProteinEmbedder("prot_t5_xl", device="cuda")

# From FASTA file — per-protein embeddings (default)
embeddings = embedder.embed_fasta("proteins.fasta", per_residue=False)

# From sequence list — per-residue embeddings
sequences = [
    ("protein_1", "MKTAYIAKQRQISFVKSH"),
    ("protein_2", "MDEVLQAELPAEG"),
]
embeddings = embedder.embed_sequences(sequences, per_residue=True, batch_size=4)

# Save / Load
torch.save(embeddings, "embeddings.pt")
loaded = torch.load("embeddings.pt")
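`embed_sequences` takes `(header, sequence)` tuples like those above. If your data lives in FASTA but you want to pre-filter it in Python first, a minimal parser to build that list could look like this (a sketch; protembedder's own FASTA handling may differ, e.g. in how it treats header descriptions):

```python
from typing import List, Tuple

def read_fasta(path: str) -> List[Tuple[str, str]]:
    """Parse a FASTA file into (header, sequence) tuples.
    Headers keep everything after '>' up to the first whitespace."""
    records, header, chunks = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```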

Output Format

The output `.pt` file contains a Python dict mapping each FASTA header to a tensor: `{header: tensor}`.

  • Per-protein (default): tensor shape is (embed_dim,)
  • Per-residue (--per-residue): tensor shape is (seq_len, embed_dim)

emb = torch.load("embeddings.pt")
for name, tensor in emb.items():
    print(f"{name}: {tensor.shape}")
# protein_1: torch.Size([1280])        # ESM-2 650M, per-protein
# protein_1: torch.Size([18, 1024])    # ProtT5-XL, per-residue
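Per-protein embeddings drop straight into downstream tasks. For example, stacking the dict into a matrix and computing pairwise cosine similarities (a sketch in plain PyTorch; the random tensors stand in for a loaded `embeddings.pt`):

```python
import torch

# Stand-in for torch.load("embeddings.pt"): two fake per-protein embeddings.
emb = {"protein_1": torch.randn(1280), "protein_2": torch.randn(1280)}

names = list(emb)
matrix = torch.stack([emb[n] for n in names])          # (n_proteins, embed_dim)
normed = torch.nn.functional.normalize(matrix, dim=1)  # unit-length rows
cosine = normed @ normed.T                             # pairwise cosine similarity

print(names, cosine.shape)  # ['protein_1', 'protein_2'] torch.Size([2, 2])
```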

OOM Handling

If a batch causes an out-of-memory error on GPU, the package automatically falls back to processing sequences one at a time. You can also reduce --batch-size manually.
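The fallback amounts to a standard pattern: attempt the batch, and on a CUDA out-of-memory `RuntimeError` retry that batch one sequence at a time. A generic sketch (not protembedder's actual internals; `embed_batch` stands in for any batched embedding call):

```python
def embed_with_fallback(embed_batch, sequences, batch_size=8):
    """Embed in batches; on an out-of-memory error, retry the failing
    batch one sequence at a time instead of aborting the whole run."""
    results = []
    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]
        try:
            results.extend(embed_batch(batch))
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: propagate
            # On a real GPU run, torch.cuda.empty_cache() would go here.
            for seq in batch:
                results.extend(embed_batch([seq]))
    return results
```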

References

Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130. https://doi.org/10.1126/science.ade2574

Elnaggar, A., et al. "ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127. https://doi.org/10.1109/TPAMI.2021.3095381

License

MIT
