ProtEmbedder
Extract protein embeddings from FASTA files using ESM-2, ProtT5, and ProtBert protein language models.
Installation

```bash
pip install protembedder
```

Requirements: Python ≥ 3.8, PyTorch ≥ 1.12, fair-esm ≥ 2.0, transformers ≥ 4.30

For a development / editable install:

```bash
pip install -e .
```
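The larger checkpoints listed below are only practical on a GPU, so before downloading weights it can help to confirm that PyTorch sees one. This is plain PyTorch, nothing ProtEmbedder-specific:

```python
import torch

# True when a CUDA device is visible; --device auto presumably selects it in that case
print(torch.cuda.is_available())
```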
Quick Start

List the available models:

```bash
protembedder list-models

# With details (family, embedding dim, repo)
protembedder list-models --verbose
```
CLI Usage

```bash
# Per-protein embeddings — PyTorch output (default)
protembedder embed -m esm2_t33_650M -i proteins.fasta -o embeddings.pt

# HDF5 output
protembedder embed -m prot_t5_xl -i proteins.fasta -o embeddings.h5 --format h5

# NumPy .npz output
protembedder embed -m prot_bert -i proteins.fasta -o embeddings.npz --format npz

# CSV output
protembedder embed -m prot_bert -i proteins.fasta -o embeddings.csv --format csv

# Per-residue (per amino acid) embeddings
protembedder embed -m prot_bert -i proteins.fasta -o embeddings.pt --per-residue

# GPU with custom batch size, disable progress bar
protembedder embed -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --device cuda --batch-size 16 --no-progress
```
CLI Flags (embed subcommand)

| Flag | Short | Required | Default | Description |
|---|---|---|---|---|
| `--model` | `-m` | Yes | — | Model name (see `list-models`) |
| `--input` | `-i` | Yes | — | Input FASTA file path |
| `--output` | `-o` | Yes | — | Output file path |
| `--format` | — | No | `pt` | Output format: `pt`, `h5`, `npz`, `csv` |
| `--per-residue` | — | No | False | Per-residue (per amino acid) embeddings |
| `--device` | — | No | auto | `cpu`, `cuda`, `cuda:0`, etc. |
| `--batch-size` | — | No | 8 | Sequences per batch |
| `--no-progress` | — | No | False | Disable the tqdm progress bar |
| `--verbose` | `-v` | No | False | Verbose logging |
Output Formats

| Flag | Extension | Description | Load with |
|---|---|---|---|
| `pt` | `.pt` | PyTorch dict `{header: tensor}` | `torch.load()` |
| `h5` | `.h5` | HDF5: `embeddings/` group + `headers` dataset | `h5py` / `protembedder.io.load_embeddings()` |
| `npz` | `.npz` | NumPy archive: `embeddings` array + `headers` array | `np.load()` / `protembedder.io.load_embeddings()` |
| `csv` | `.csv` | CSV table: `header, emb_0, emb_1, ...` (per-protein) or `header, residue_idx, emb_0, ...` (per-residue) | `pandas.read_csv()` |
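For reference, each format can also be read back without ProtEmbedder. A minimal sketch, assuming the layouts in the table above; in particular, the assumption that the HDF5 `embeddings/` group holds one array per protein keyed by header is not confirmed by the table, so the exact key scheme may differ:

```python
import torch
import h5py
import numpy as np
import pandas as pd

# PyTorch: a plain dict mapping FASTA headers to tensors
emb_pt = torch.load("embeddings.pt")

# HDF5 (assumed layout): a headers dataset plus one array per protein
# under the embeddings/ group, keyed by header
with h5py.File("embeddings.h5", "r") as f:
    headers = [h.decode() if isinstance(h, bytes) else h for h in f["headers"][:]]
    emb_h5 = {h: f["embeddings"][h][:] for h in headers}

# NumPy archive: parallel headers and embeddings arrays
npz = np.load("embeddings.npz", allow_pickle=True)
emb_npz = dict(zip(npz["headers"], npz["embeddings"]))

# CSV: a header column plus emb_0, emb_1, ... columns
df = pd.read_csv("embeddings.csv", index_col="header")
```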
Available Models
ESM-2 (Meta AI)
| Model | Parameters | Embedding Dim | Layers |
|---|---|---|---|
| `esm2_t6_8M` | 8M | 320 | 6 |
| `esm2_t12_35M` | 35M | 480 | 12 |
| `esm2_t30_150M` | 150M | 640 | 30 |
| `esm2_t33_650M` | 650M | 1280 | 33 |
| `esm2_t36_3B` | 3B | 2560 | 36 |
| `esm2_t48_15B` | 15B | 5120 | 48 |
ProtT5 (Rostlab)
| Model | Parameters | Embedding Dim | HuggingFace Repo |
|---|---|---|---|
| `prot_t5_xl` | 3B | 1024 | `Rostlab/prot_t5_xl_half_uniref50-enc` |
ProtBert (Rostlab)
| Model | Parameters | Embedding Dim | HuggingFace Repo |
|---|---|---|---|
| `prot_bert` | 420M | 1024 | `Rostlab/prot_bert` |
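The embedding dimensions above determine the shape of everything downstream. A small sanity check with the lightest ESM-2 checkpoint, as a sketch: it uses the Python API shown in the next section and assumes `embed_sequences` defaults to per-protein mode, as the CLI does:

```python
from protembedder import ProteinEmbedder

embedder = ProteinEmbedder("esm2_t6_8M", device="cpu")
emb = embedder.embed_sequences([("toy", "MKTAYIAKQR")])

# Per-protein default: one 320-dim vector, matching the table above
assert tuple(emb["toy"].shape) == (320,)
```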
Python API

```python
import torch
from protembedder import ProteinEmbedder
from protembedder.io import save_embeddings, load_embeddings

# Load any supported model
embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")
# embedder = ProteinEmbedder("prot_t5_xl")
# embedder = ProteinEmbedder("prot_bert")

# From a FASTA file, with a progress bar
embeddings = embedder.embed_fasta("proteins.fasta", per_residue=False, show_progress=True)

# From a sequence list
sequences = [("prot_1", "MKTAYIAKQRQISFVKSH"), ("prot_2", "MDEVLQAELPAEG")]
embeddings = embedder.embed_sequences(sequences, per_residue=True, batch_size=4)

# Save in any format
save_embeddings(embeddings, "embeddings.h5", fmt="h5")
save_embeddings(embeddings, "embeddings.npz", fmt="npz")
save_embeddings(embeddings, "embeddings.csv", fmt="csv")
save_embeddings(embeddings, "embeddings.pt", fmt="pt")

# Load back (pt, h5, npz)
loaded = load_embeddings("embeddings.h5")

# List all models
print(ProteinEmbedder.list_models())
```
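Per-protein embeddings drop straight into standard tensor workflows. A short sketch of one common downstream use, all-vs-all cosine similarity; it assumes per-protein mode, i.e. a dict of `(embed_dim,)` tensors as documented under Output Format Details below:

```python
import torch
import torch.nn.functional as F

from protembedder import ProteinEmbedder

embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")
emb = embedder.embed_fasta("proteins.fasta")  # {header: tensor of shape (embed_dim,)}

# Stack into an (n_proteins, embed_dim) matrix
names = list(emb)
X = torch.stack([emb[n] for n in names])

# Normalize rows, then the Gram matrix is pairwise cosine similarity
X = F.normalize(X, dim=1)
sim = X @ X.T
print(f"{names[0]} vs {names[1]}: {sim[0, 1].item():.3f}")
```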
Output Format Details
- Per-protein (default): tensor of shape `(embed_dim,)`
- Per-residue (`--per-residue`): tensor of shape `(seq_len, embed_dim)`

```python
emb = load_embeddings("embeddings.pt")
for name, tensor in emb.items():
    print(f"{name}: {tensor.shape}")
# prot_1: torch.Size([1280])      # ESM-2 650M, per-protein
# prot_1: torch.Size([18, 1024])  # ProtBert, per-residue
```
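Per-residue embeddings can be reduced to a per-protein vector after the fact; mean pooling is one common choice. A sketch only: this is not necessarily how ProtEmbedder computes its own per-protein vectors, and the file name is just an example:

```python
from protembedder.io import load_embeddings

per_res = load_embeddings("embeddings.pt")  # a file saved with --per-residue

# Mean over the residue axis: (seq_len, embed_dim) -> (embed_dim,)
per_protein = {name: t.mean(dim=0) for name, t in per_res.items()}
```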
OOM Handling
If a batch causes an out-of-memory error on the GPU, the package automatically falls back to processing sequences one at a time. You can also reduce `--batch-size` manually.
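The fallback is the usual catch-and-retry pattern around a CUDA OOM. An illustrative sketch of the idea, not the package's actual internals (`embed_with_fallback` is a hypothetical helper):

```python
import torch

def embed_with_fallback(embedder, batch):
    """Try the whole batch; on CUDA OOM, retry one sequence at a time."""
    try:
        return embedder.embed_sequences(batch)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        out = {}
        for item in batch:
            out.update(embedder.embed_sequences([item]))
        return out
```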
References
Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130. https://doi.org/10.1126/science.ade2574
Elnaggar, A., et al. "ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127. https://doi.org/10.1109/TPAMI.2021.3095381
License
MIT