# ProtEmbedder

Extract protein embeddings from FASTA files using ESM-2 and ProtT5 protein language models.
## Installation

```bash
pip install protembedder
```

Requirements: Python ≥ 3.8, PyTorch ≥ 1.12, fair-esm ≥ 2.0, transformers ≥ 4.30.

For a development / editable install:

```bash
pip install -e .
```
## Quick Start

### CLI Usage

```bash
# Per-protein embeddings with ESM-2 650M (default)
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt

# Per-protein embeddings with ProtT5-XL
protembedder -m prot_t5_xl -i proteins.fasta -o embeddings.pt

# Per-residue (per amino acid) embeddings
protembedder -m prot_t5_xl -i proteins.fasta -o embeddings.pt --per-residue

# GPU with a custom batch size
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --device cuda --batch-size 16

# Small ESM-2 model for quick testing, with verbose logging
protembedder -m esm2_t6_8M -i proteins.fasta -o embeddings.pt -v
```
### CLI Flags

| Flag | Short | Required | Default | Description |
|---|---|---|---|---|
| `--model` | `-m` | Yes | — | Model name (see tables below) |
| `--input` | `-i` | Yes | — | Input FASTA file path |
| `--output` | `-o` | Yes | — | Output `.pt` file path |
| `--per-residue` | — | No | `False` | Per-amino-acid embeddings |
| `--device` | — | No | auto | `cpu`, `cuda`, `cuda:0`, etc. |
| `--batch-size` | — | No | `8` | Sequences per batch |
| `--verbose` | `-v` | No | `False` | Verbose logging |
## Available Models

### ESM-2 (Meta AI)

| Model | Parameters | Embedding Dim | Layers |
|---|---|---|---|
| `esm2_t6_8M` | 8M | 320 | 6 |
| `esm2_t12_35M` | 35M | 480 | 12 |
| `esm2_t30_150M` | 150M | 640 | 30 |
| `esm2_t33_650M` | 650M | 1280 | 33 |
| `esm2_t36_3B` | 3B | 2560 | 36 |
| `esm2_t48_15B` | 15B | 5120 | 48 |
### ProtT5 (Rostlab)

| Model | Parameters | Embedding Dim | HuggingFace Repo |
|---|---|---|---|
| `prot_t5_xl` | 3B | 1024 | Rostlab/prot_t5_xl_half_uniref50-enc |
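The embedding dimensions above determine the shape of every output tensor, so they are handy for sanity-checking downstream code. The helper below is a hypothetical snippet (not part of the package); the dimension values are taken from the tables above:

```python
# Embedding dimension per model, copied from the tables above.
# Illustrative helper only; not part of the protembedder package.
EMBED_DIMS = {
    "esm2_t6_8M": 320,
    "esm2_t12_35M": 480,
    "esm2_t30_150M": 640,
    "esm2_t33_650M": 1280,
    "esm2_t36_3B": 2560,
    "esm2_t48_15B": 5120,
    "prot_t5_xl": 1024,
}

def expected_shape(model, seq_len=None):
    """Expected output shape: (embed_dim,) per protein, or (seq_len, embed_dim) per residue."""
    dim = EMBED_DIMS[model]
    return (dim,) if seq_len is None else (seq_len, dim)
```

For example, a per-residue ProtT5-XL embedding of an 18-residue protein should have shape `expected_shape("prot_t5_xl", 18)`, i.e. `(18, 1024)`.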
## Python API

```python
import torch

from protembedder import ProteinEmbedder

# ESM-2
embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")

# ProtT5-XL
embedder = ProteinEmbedder("prot_t5_xl", device="cuda")

# From a FASTA file — per-protein embeddings (default)
embeddings = embedder.embed_fasta("proteins.fasta", per_residue=False)

# From a sequence list — per-residue embeddings
sequences = [
    ("protein_1", "MKTAYIAKQRQISFVKSH"),
    ("protein_2", "MDEVLQAELPAEG"),
]
embeddings = embedder.embed_sequences(sequences, per_residue=True, batch_size=4)

# Save / load
torch.save(embeddings, "embeddings.pt")
loaded = torch.load("embeddings.pt")
```
## Output Format

The output `.pt` file contains a Python dict mapping each FASTA header to a tensor: `{header: tensor}`.

- Per-protein (default): tensor shape is `(embed_dim,)`
- Per-residue (`--per-residue`): tensor shape is `(seq_len, embed_dim)`

```python
emb = torch.load("embeddings.pt")
for name, tensor in emb.items():
    print(f"{name}: {tensor.shape}")
# protein_1: torch.Size([1280])      # ESM-2 650M, per-protein
# protein_1: torch.Size([18, 1024])  # ProtT5-XL, per-residue
```
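A saved per-residue dict can be reduced to per-protein vectors by averaging over the sequence dimension. The sketch below shows this mean-pooling on dummy tensors; whether protembedder uses mean-pooling internally for its per-protein mode is an assumption here, not something documented above:

```python
import torch

def mean_pool(per_residue):
    """Reduce {header: (seq_len, embed_dim)} to {header: (embed_dim,)} by averaging over residues."""
    return {name: t.mean(dim=0) for name, t in per_residue.items()}

# Dummy tensors shaped like ProtT5-XL per-residue output (embed_dim = 1024)
emb = {
    "protein_1": torch.randn(18, 1024),
    "protein_2": torch.randn(13, 1024),
}
pooled = mean_pool(emb)  # each value now has shape (1024,)
```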
## OOM Handling

If a batch causes an out-of-memory error on the GPU, the package automatically falls back to processing sequences one at a time. You can also reduce `--batch-size` manually.
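The fallback described above follows a common pattern: try the whole batch, and on a memory error retry that batch one sequence at a time. This is an illustrative sketch of the pattern, not the package's actual implementation; `embed_with_fallback` and `embed_fn` are hypothetical names:

```python
def embed_with_fallback(embed_fn, sequences, batch_size=8):
    """Embed (name, seq) pairs in batches; on an OOM-style RuntimeError,
    retry the failing batch one sequence at a time."""
    results = {}
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        try:
            results.update(embed_fn(batch))
        except RuntimeError:
            # PyTorch surfaces GPU OOM as a RuntimeError
            # (torch.cuda.OutOfMemoryError in newer versions).
            for item in batch:
                results.update(embed_fn([item]))
    return results
```

Processing one sequence at a time is slower but bounds peak memory at a single sequence, which is why reducing `--batch-size` manually has the same effect.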
## References

- Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." *Science* 379.6637 (2023): 1123–1130. https://doi.org/10.1126/science.ade2574
- Elnaggar, A., et al. "ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning." *IEEE Transactions on Pattern Analysis and Machine Intelligence* 44.10 (2021): 7112–7127. https://doi.org/10.1109/TPAMI.2021.3095381
## License

MIT
## Download files
### Source distribution: protembedder-0.2.0.tar.gz

- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `6ec1b24c0b7a450cf24426ab863d4fa2bf306e62fbc4f258b7e3bc008072775a` |
| MD5 | `b8268e448b2651bfa41e7eb00eb16f96` |
| BLAKE2b-256 | `872a00ccd3abe324911e88d1f864cd6ef0bbf3918b23851ec891e37801db43a2` |
### Built distribution: protembedder-0.2.0-py3-none-any.whl

- Size: 10.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `6d0b85bf94621a7b0c6796f9c5b4b08415e8f4c0e44532712762ecf359bfb897` |
| MD5 | `959faa76bdf166ea8414931e521807fd` |
| BLAKE2b-256 | `7c552e50367ae1f2a3d2ecc43484a9dc9eb75b47b011b16bbc16753240f88c5b` |