
ProtEmbedder

Extract protein embeddings from FASTA files using ESM-2 protein language models.

Installation

pip install protembedder

# Or, for a development install from a source checkout:
pip install -e .

Requirements: Python ≥ 3.8, PyTorch ≥ 1.12, fair-esm ≥ 2.0

Quick Start

CLI Usage

# Per-protein embeddings (default) — one vector per sequence
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt

# Per-residue embeddings — one vector per amino acid
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --per-residue

# GPU with custom batch size
protembedder -m esm2_t33_650M -i proteins.fasta -o embeddings.pt --device cuda --batch-size 16

# Small model for quick testing
protembedder -m esm2_t6_8M -i proteins.fasta -o embeddings.pt -v

CLI Flags

Flag           Short  Required  Default  Description
--model        -m     Yes       —        ESM-2 model name (see table below)
--input        -i     Yes       —        Input FASTA file path
--output       -o     Yes       —        Output .pt file path
--per-residue         No        False    Emit per-residue (one vector per amino acid) embeddings
--device              No        auto     cpu, cuda, cuda:0, etc.
--batch-size          No        8        Sequences per batch
--verbose      -v     No        False    Verbose logging

Available Models

Model           Parameters  Embedding Dim  Layers
esm2_t6_8M      8M          320            6
esm2_t12_35M    35M         480            12
esm2_t30_150M   150M        640            30
esm2_t33_650M   650M        1280           33
esm2_t36_3B     3B          2560           36
esm2_t48_15B    15B         5120           48
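When validating saved embeddings, it can help to check tensor shapes against the model that produced them. A minimal sketch: the dictionary below is transcribed from the table above, and the helper name is illustrative, not part of the package API.

```python
# Embedding dimension per ESM-2 model, transcribed from the table above.
ESM2_EMBED_DIMS = {
    "esm2_t6_8M": 320,
    "esm2_t12_35M": 480,
    "esm2_t30_150M": 640,
    "esm2_t33_650M": 1280,
    "esm2_t36_3B": 2560,
    "esm2_t48_15B": 5120,
}


def expected_dim(model_name: str) -> int:
    """Return the embedding dimension for a given ESM-2 model name."""
    try:
        return ESM2_EMBED_DIMS[model_name]
    except KeyError:
        raise ValueError(f"Unknown ESM-2 model: {model_name}") from None
```

For example, `expected_dim("esm2_t33_650M")` returns 1280, which should match the last dimension of every tensor in the output file.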

Python API

import torch
from protembedder import ProteinEmbedder

# Initialize
embedder = ProteinEmbedder("esm2_t33_650M", device="cuda")

# From FASTA file
embeddings = embedder.embed_fasta("proteins.fasta", per_residue=False)

# From sequence list
sequences = [
    ("protein_1", "MKTAYIAKQRQISFVKSH"),
    ("protein_2", "MDEVLQAELPAEG"),
]
embeddings = embedder.embed_sequences(sequences, per_residue=True, batch_size=4)

# Save / Load
torch.save(embeddings, "embeddings.pt")
loaded = torch.load("embeddings.pt")
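Because per-protein embeddings are plain tensors, downstream analysis needs only PyTorch. A hedged sketch of comparing two proteins by cosine similarity; the random tensors below are dummies standing in for real model output.

```python
import torch
import torch.nn.functional as F

# Dummy per-protein embeddings standing in for real model output;
# shape (1280,) matches the embedding dim of esm2_t33_650M.
emb_a = torch.randn(1280)
emb_b = torch.randn(1280)

# Cosine similarity over the single embedding dimension (dim=0 for 1-D tensors).
sim = F.cosine_similarity(emb_a, emb_b, dim=0).item()
print(f"cosine similarity: {sim:.3f}")
```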

Output Format

The output .pt file contains a Python dict: {header: tensor}.

  • Per-protein (default): tensor shape is (embed_dim,)
  • Per-residue (--per-residue): tensor shape is (seq_len, embed_dim)

emb = torch.load("embeddings.pt")
for name, tensor in emb.items():
    print(f"{name}: {tensor.shape}")
# protein_1: torch.Size([1280])        # per-protein
# protein_1: torch.Size([18, 1280])    # per-residue
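If only per-residue output was saved, a per-protein vector can still be derived afterwards by mean pooling over the residue axis. This is the convention ESM's reference scripts use for their "mean" representation; the sketch below is not guaranteed to match this package's own per-protein output exactly.

```python
import torch

# Dummy per-residue embedding: 18 residues x 1280 dims,
# as in the per-residue example above.
per_residue = torch.randn(18, 1280)

# Mean over the residue axis (dim=0) yields a single (embed_dim,) vector.
per_protein = per_residue.mean(dim=0)
print(per_protein.shape)  # torch.Size([1280])
```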

OOM Handling

If a batch causes an out-of-memory error on GPU, the package automatically falls back to processing sequences one at a time for that batch. You can also reduce --batch-size manually.
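The same fallback pattern can be applied in user code around any batched embedding call. A hedged sketch, not the package's internal implementation; `embed_fn` stands in for any callable that embeds a list of sequences.

```python
def embed_with_fallback(embed_fn, batch):
    """Try to embed a whole batch; on CUDA OOM, retry one item at a time.

    embed_fn is any callable taking a list of sequences and returning a
    list of results; it stands in for a batched embedding call.
    """
    try:
        return embed_fn(batch)
    except RuntimeError as err:
        if "out of memory" not in str(err).lower():
            raise  # not an OOM error; re-raise unchanged
        results = []
        for item in batch:
            results.extend(embed_fn([item]))
        return results
```

Shrinking the batch to size 1 trades throughput for the ability to finish the run instead of crashing partway through a large FASTA file.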

Reference

Lin, Z., et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130. https://doi.org/10.1126/science.ade2574

License

MIT
