# ProtEmbedder

Extract protein embeddings from FASTA files using ESM-2 protein language models.
## Installation

```bash
pip install -e .
```

Requirements: Python ≥ 3.8, PyTorch ≥ 1.12, fair-esm ≥ 2.0
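To confirm the dependencies resolved, a quick import check works; note that the `fair-esm` package installs under the module name `esm`:

```python
# Minimal install check: both core dependencies should import cleanly.
import torch
import esm  # fair-esm installs under the module name "esm"

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```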
## Quick Start

### CLI Usage

```bash
# Per-protein embeddings (default) — one vector per sequence
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt

# Per-residue embeddings — one vector per amino acid
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --per-residue

# GPU with custom batch size
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --device cuda --batch-size 16

# Small model for quick testing
protembedder -m esm2_t6_8M -i proteins.fasta -o embeddings.pt -v
```
### CLI Flags

| Flag | Short | Required | Default | Description |
|---|---|---|---|---|
| `--model` | `-m` | Yes | — | ESM-2 model name (see table below) |
| `--input` | `-i` | Yes | — | Input FASTA file path |
| `--output` | `-o` | Yes | — | Output `.pt` file path |
| `--per-residue` | — | No | False | Per-amino-acid embeddings |
| `--device` | — | No | auto | `cpu`, `cuda`, `cuda:0`, etc. |
| `--batch-size` | — | No | 8 | Sequences per batch |
| `--verbose` | `-v` | No | False | Verbose logging |
### Available Models

| Model | Parameters | Embedding Dim | Layers |
|---|---|---|---|
| `esm2_t6_8M` | 8M | 320 | 6 |
| `esm2_t12_35M` | 35M | 480 | 12 |
| `esm2_t30_150M` | 150M | 640 | 30 |
| `esm2_t33_650M` | 650M | 1280 | 33 |
| `esm2_t36_3B` | 3B | 2560 | 36 |
| `esm2_t48_15B` | 15B | 5120 | 48 |
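The embedding dimension sets the width of every output tensor, so it is a useful thing to assert on when mixing models in a pipeline. Below is a minimal sanity-check sketch; the `MODEL_DIMS` mapping is just the table above transcribed into Python, not a package API:

```python
import torch

# Embedding widths from the table above, keyed by model name.
MODEL_DIMS = {
    "esm2_t6_8M": 320,
    "esm2_t12_35M": 480,
    "esm2_t30_150M": 640,
    "esm2_t33_650M": 1280,
    "esm2_t36_3B": 2560,
    "esm2_t48_15B": 5120,
}

def check_embedding_widths(path: str, model_name: str) -> None:
    """Assert that saved embeddings match the model's expected width."""
    expected = MODEL_DIMS[model_name]
    for name, tensor in torch.load(path).items():
        # The last dimension is the embedding width for both output modes.
        assert tensor.shape[-1] == expected, (
            f"{name}: got width {tensor.shape[-1]}, expected {expected}"
        )

check_embedding_widths("embeddings.pt", "esm2_t33_650M")
```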
### Python API

```python
import torch
from protembedder import ProteinEmbedder

# Initialize
embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")

# From FASTA file
embeddings = embedder.embed_fasta("proteins.fasta", per_residue=False)

# From sequence list
sequences = [
    ("protein_1", "MKTAYIAKQRQISFVKSH"),
    ("protein_2", "MDEVLQAELPAEG"),
]
embeddings = embedder.embed_sequences(sequences, per_residue=True, batch_size=4)

# Save / Load
torch.save(embeddings, "embeddings.pt")
loaded = torch.load("embeddings.pt")
```
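The `device` argument accepts the same values as the CLI's `--device` flag. To replicate the CLI's `auto` behavior explicitly (assuming auto means "GPU if visible, else CPU", which is the standard PyTorch convention):

```python
import torch
from protembedder import ProteinEmbedder

# Pick the GPU when one is visible, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = ProteinEmbedder("esm2_t6_8M", device=device)
```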
## Output Format

The output `.pt` file contains a Python dict mapping each FASTA header to a tensor: `{header: tensor}`.

- Per-protein (default): tensor shape is `(embed_dim,)`
- Per-residue (`--per-residue`): tensor shape is `(seq_len, embed_dim)`

```python
emb = torch.load("embeddings.pt")
for name, tensor in emb.items():
    print(f"{name}: {tensor.shape}")
# protein_1: torch.Size([1280])      # per-protein
# protein_1: torch.Size([18, 1280])  # per-residue
```
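Per-residue output can always be pooled back down to one vector per protein; mean pooling over the sequence dimension is a common choice. A short sketch of that plus a cosine-similarity comparison (the pooling step is a standard downstream trick, not a package API):

```python
import torch
import torch.nn.functional as F

# Assumes a per-residue output file: {header: (seq_len, embed_dim)}.
emb = torch.load("embeddings.pt")

# Mean-pool each (seq_len, embed_dim) tensor to a single (embed_dim,) vector.
pooled = {name: tensor.mean(dim=0) for name, tensor in emb.items()}

# Cosine similarity between two proteins' pooled embeddings.
sim = F.cosine_similarity(pooled["protein_1"], pooled["protein_2"], dim=0)
print(f"similarity: {sim.item():.3f}")
```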
## OOM Handling

If a batch triggers an out-of-memory error on the GPU, the package automatically falls back to processing that batch's sequences one at a time. You can also reduce `--batch-size` manually.
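The fallback follows a common PyTorch pattern, sketched below; this is an illustration of the technique, not the package's internal code, and `embed_batch` is a hypothetical callable:

```python
import torch

def embed_with_fallback(embed_batch, batch):
    """Try the whole batch; on CUDA OOM, retry the sequences one at a time.

    `embed_batch` is a hypothetical callable mapping a list of sequences
    to a list of embedding tensors.
    """
    try:
        return embed_batch(batch)
    except RuntimeError as err:
        if "out of memory" not in str(err).lower():
            raise  # not an OOM error: propagate
        torch.cuda.empty_cache()  # release cached blocks before retrying
        return [embed_batch([seq])[0] for seq in batch]
```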
## Reference

Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130. https://doi.org/10.1126/science.ade2574
## License

MIT
## Download files
### protembedder-0.1.0.tar.gz (source distribution)

- Size: 9.7 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `03d0d21c9e5f222d99a974a896c50aa2d22b0382de756cdf065455dbd2d8660f` |
| MD5 | `0af984a36c6816a873811a4f7e3a3660` |
| BLAKE2b-256 | `7dbf4f87ced0fe67359dad29d4b61a7cf92c27ca784c5fbc96402cd019a19e77` |
### protembedder-0.1.0-py3-none-any.whl (built distribution, Python 3)

- Size: 9.1 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4560bd3ba5a54d45801c7bfdc57a3c43ec964ea7a7daa1fb218a28ecbfa6aa51` |
| MD5 | `34d852870a5f9b3e18720a65be8f08f5` |
| BLAKE2b-256 | `2ec2492e100b2a06d906c45bdf2badffb032e0a4bdd278a13fd067d1d873eca9` |