
ProtEmbedder


Extract protein embeddings from FASTA files using ESM-2, ProtT5, and ProtBert protein language models.

Installation

pip install protembedder

Requirements: Python ≥ 3.8, PyTorch ≥ 1.12, fair-esm ≥ 2.0, transformers ≥ 4.30

For development / editable install: pip install -e .

Quick Start

List available models

protembedder list-models

# With details (family, embedding dim, repo)
protembedder list-models --verbose

CLI Usage

# Per-protein embeddings — PyTorch output (default)
protembedder embed -m esm2_t33_650M -i proteins.fasta -o embeddings.pt

# HDF5 output
protembedder embed -m prot_t5_xl -i proteins.fasta -o embeddings.h5 --format h5

# NumPy .npz output
protembedder embed -m prot_bert -i proteins.fasta -o embeddings.npz --format npz

# CSV output
protembedder embed -m prot_bert -i proteins.fasta -o embeddings.csv --format csv

# Per-residue (per amino acid) embeddings
protembedder embed -m prot_bert -i proteins.fasta -o embeddings.pt --per-residue

# GPU with custom batch size, disable progress bar
protembedder embed -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --device cuda --batch-size 16 --no-progress
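
To embed many FASTA files in one go, the CLI can be driven from a script. A minimal sketch using only the flags documented below (the fasta_dir layout and the one-output-per-input naming are hypothetical choices, not package conventions):

import subprocess
from pathlib import Path

# Run the CLI once per FASTA file, writing one .pt file per input
for fasta in sorted(Path("fasta_dir").glob("*.fasta")):
    subprocess.run(
        ["protembedder", "embed",
         "-m", "esm2_t33_650M",
         "-i", str(fasta),
         "-o", str(fasta.with_suffix(".pt")),
         "--batch-size", "16"],
        check=True,  # raise if the CLI exits non-zero
    )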

CLI Flags (embed subcommand)

| Flag | Short | Required | Default | Description |
|------|-------|----------|---------|-------------|
| --model | -m | Yes | | Model name (see list-models) |
| --input | -i | Yes | | Input FASTA file path |
| --output | -o | Yes | | Output file path |
| --format | | No | pt | Output format: pt, h5, npz, csv |
| --per-residue | | No | False | Emit per-residue (per amino acid) embeddings |
| --device | | No | auto | Compute device: cpu, cuda, cuda:0, etc. |
| --batch-size | | No | 8 | Sequences per batch |
| --no-progress | | No | False | Disable the tqdm progress bar |
| --verbose | -v | No | False | Verbose logging |

Output Formats

| Format | Extension | Description | Load with |
|--------|-----------|-------------|-----------|
| pt | .pt | PyTorch dict {header: tensor} | torch.load() |
| h5 | .h5 | HDF5: embeddings/ group plus headers dataset | h5py / protembedder.io.load_embeddings() |
| npz | .npz | NumPy archive: embeddings array plus headers array | np.load() / protembedder.io.load_embeddings() |
| csv | .csv | CSV table: header, emb_0, emb_1, ... (per-protein) or header, residue_idx, emb_0, ... (per-residue) | pandas.read_csv() |
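
Each format can be read back with standard tooling. A short sketch (file names are placeholders; dataset and array names follow the table above):

import h5py
import numpy as np
import pandas as pd
import torch

# pt: dict mapping FASTA header -> tensor
emb = torch.load("embeddings.pt")

# h5: headers dataset plus embeddings/ group (headers may be stored as bytes)
with h5py.File("embeddings.h5", "r") as f:
    headers = [h.decode() for h in f["headers"][:]]

# npz: parallel headers and embeddings arrays
npz = np.load("embeddings.npz")
headers, matrix = npz["headers"], npz["embeddings"]

# csv: one row per protein (or per residue with --per-residue)
df = pd.read_csv("embeddings.csv")

# Or, uniformly for pt/h5/npz: protembedder.io.load_embeddings(path)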

Available Models

ESM-2 (Meta AI)

| Model | Parameters | Embedding Dim | Layers |
|-------|------------|---------------|--------|
| esm2_t6_8M | 8M | 320 | 6 |
| esm2_t12_35M | 35M | 480 | 12 |
| esm2_t30_150M | 150M | 640 | 30 |
| esm2_t33_650M | 650M | 1280 | 33 |
| esm2_t36_3B | 3B | 2560 | 36 |
| esm2_t48_15B | 15B | 5120 | 48 |

ProtT5 (Rostlab)

| Model | Parameters | Embedding Dim | HuggingFace Repo |
|-------|------------|---------------|------------------|
| prot_t5_xl | 3B | 1024 | Rostlab/prot_t5_xl_half_uniref50-enc |

ProtBert (Rostlab)

| Model | Parameters | Embedding Dim | HuggingFace Repo |
|-------|------------|---------------|------------------|
| prot_bert | 420M | 1024 | Rostlab/prot_bert |

Python API

import torch
from protembedder import ProteinEmbedder
from protembedder.io import save_embeddings, load_embeddings

# Load any supported model
embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")
# embedder = ProteinEmbedder("prot_t5_xl")
# embedder = ProteinEmbedder("prot_bert")

# From FASTA file with progress bar
embeddings = embedder.embed_fasta("proteins.fasta", per_residue=False, show_progress=True)

# From a sequence list
sequences = [("prot_1", "MKTAYIAKQRQISFVKSH"), ("prot_2", "MDEVLQAELPAEG")]
embeddings = embedder.embed_sequences(sequences, per_residue=True, batch_size=4)

# Save in any format
save_embeddings(embeddings, "embeddings.h5",   fmt="h5")
save_embeddings(embeddings, "embeddings.npz",  fmt="npz")
save_embeddings(embeddings, "embeddings.csv",  fmt="csv")
save_embeddings(embeddings, "embeddings.pt",   fmt="pt")

# Load back (pt, h5, npz)
loaded = load_embeddings("embeddings.h5")

# List all models
print(ProteinEmbedder.list_models())
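
Embeddings produced this way plug straight into ordinary PyTorch workflows, for instance comparing two proteins by cosine similarity. A minimal sketch, assuming per-protein embeddings (per_residue=False) and that the in-memory result is the same {header: tensor} dict the pt format uses; prot_1 and prot_2 are the names from the example above:

import torch.nn.functional as F

# Per-protein vectors, shape (embed_dim,)
e1, e2 = embeddings["prot_1"], embeddings["prot_2"]
sim = F.cosine_similarity(e1.unsqueeze(0), e2.unsqueeze(0)).item()
print(f"cosine similarity: {sim:.3f}")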

Output Format Details

  • Per-protein (default): tensor shape (embed_dim,)
  • Per-residue (--per-residue): tensor shape (seq_len, embed_dim)

For example:
emb = load_embeddings("embeddings.pt")
for name, tensor in emb.items():
    print(f"{name}: {tensor.shape}")
# prot_1: torch.Size([1280])        # ESM-2 650M, per-protein
# prot_1: torch.Size([18, 1024])    # ProtBert, per-residue
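
Per-residue output can be reduced to a per-protein vector by pooling over the sequence dimension. Mean pooling is a common convention, though whether it matches the package's internal per-protein pooling is an assumption, and the file name below is hypothetical:

# Hypothetical file produced with --per-residue
per_residue = load_embeddings("embeddings_residue.pt")
# Average over the residue axis: (seq_len, embed_dim) -> (embed_dim,)
pooled = {name: t.mean(dim=0) for name, t in per_residue.items()}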

OOM Handling

If a batch causes an out-of-memory error on GPU, the package automatically falls back to processing sequences one at a time. You can also reduce --batch-size manually.
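The pattern behind this fallback is easy to replicate in your own batching code. A simplified sketch, not the package's actual internals; that embed_sequences returns a {header: tensor} dict is an assumption:

import torch

def embed_batch_with_fallback(embedder, batch):
    """Try a whole batch; on CUDA OOM, retry one sequence at a time."""
    try:
        return embedder.embed_sequences(batch)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()  # release the failed batch's allocations
        results = {}
        for item in batch:
            results.update(embedder.embed_sequences([item], batch_size=1))
        return results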

References

Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130. https://doi.org/10.1126/science.ade2574

Elnaggar, A., et al. "ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127. https://doi.org/10.1109/TPAMI.2021.3095381

License

MIT
