
Extract protein embeddings from protein language models.


ProtEnc: generate protein embeddings the easy way

ProtEnc aims to simplify extraction of protein embeddings from various pre-trained models by providing simple APIs and bulk generation scripts for the ever-growing landscape of protein language models (pLMs). The currently supported models can be listed with protenc.list_models().

Usage

Installation

pip install protenc

Python API

import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Embeddings have shape [L, D], where L is the sequence length
  # and D is the embedding dimensionality.
  print(embed.shape)
  
  # Derive a single per-protein embedding vector by averaging along the sequence dimension
  embed.mean(0)
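
The mean-pooling step above extends naturally to a whole batch. The following is a minimal sketch and not part of ProtEnc: it uses random NumPy arrays as stand-ins for the per-residue embeddings the encoder yields, so it runs without loading a model. The helper names mean_pool and cosine_similarity, the sequence lengths, and the embedding width of 8 are all assumptions chosen for illustration.

```python
import numpy as np


def mean_pool(embeddings):
    """Average each [L, D] array over the sequence axis and stack into [N, D]."""
    return np.stack([embed.mean(axis=0) for embed in embeddings])


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for encoder output: two proteins of different lengths, D = 8.
rng = np.random.default_rng(0)
fake_embeddings = [rng.normal(size=(65, 8)), rng.normal(size=(71, 8))]

pooled = mean_pool(fake_embeddings)
print(pooled.shape)  # one fixed-size vector per protein
print(cosine_similarity(pooled[0], pooled[1]))
```

Because pooling removes the length dimension, proteins of different lengths end up as rows of one matrix, which is the form most downstream tools (clustering, nearest-neighbour search, classifiers) expect.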

Command-line interface

After installation, use the protenc shell command for bulk generation and export of protein embeddings.

protenc sequences.fasta embeddings.lmdb --model_name=<name-of-model>

By default, input and output formats are inferred from the file extensions.
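
To illustrate what inferring a format from the file extension can look like, here is a minimal sketch. It is not ProtEnc's actual implementation: the mapping table and the infer_format helper are hypothetical, and the real tool may recognise a different set of extensions.

```python
from pathlib import Path

# Hypothetical mapping for illustration; ProtEnc's real table may differ.
FORMATS_BY_EXTENSION = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".lmdb": "lmdb",
}


def infer_format(path):
    """Return a format name based on the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return FORMATS_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(
            f"Cannot infer format from extension {suffix!r}: {path}"
        ) from None


print(infer_format("sequences.fasta"))  # fasta
print(infer_format("embeddings.lmdb"))  # lmdb
```

Failing loudly on an unknown extension is a deliberate choice in this sketch: a silent fallback could write embeddings in a format the user did not expect.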

Run

protenc --help

for a detailed usage description.

Example

Generate protein embeddings with the ESM2 650M model for sequences provided in a FASTA file and write them to an LMDB database:

protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D
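
For readers unfamiliar with the input format: a FASTA file consists of header lines starting with ">" followed by one or more lines of sequence. The sketch below shows one way to read such a file in plain Python. It is independent of ProtEnc's own reader; read_fasta is a hypothetical helper, and the record names and sequences are made up.

```python
import io


def read_fasta(handle):
    """Yield (label, sequence) pairs from an open FASTA text stream."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:].strip(), []
        elif label is not None:
            # Sequence lines of one record may be wrapped over several lines.
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


example = io.StringIO(">protein_a\nMKTVRQ\nERLKSI\n>protein_b\nKALTAR\n")
for label, sequence in read_fasta(example):
    print(label, sequence)
# protein_a MKTVRQERLKSI
# protein_b KALTAR
```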

The generated embeddings are stored in an LMDB key-value store and can be read back with the read_from_lmdb utility function:

from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)

Features

Input formats:

Output format:

General:

  • Multi-GPU inference (--data_parallel)
  • FP16 inference (--amp)

Development

Clone the repository:

git clone https://github.com/kklemon/protenc.git

Install dependencies via Poetry:

poetry install

Contribution

Have a feature idea or found a bug? Would you like to see support for a new model? Feel free to create an issue.

Todo

  • Support for more input formats
    • CSV
    • Parquet
    • FASTA
    • JSON
  • Support for more output formats
    • LMDB
    • HDF5
    • DataFrame
    • Pickle
  • Support for large models
    • Model offloading
    • Sharding
    • FlashAttention (via Kernl?)
  • Support for more protein language models
    • Whole ProtTrans family
    • Whole ESM family
    • AlphaFold (?)
  • Implement all remaining TODOs in code
  • Evaluation
  • Demos
  • Distributed inference
  • Maybe support some sort of optimized inference such as quantization
    • This may be up to the model providers
  • Improve documentation
  • Support translation of gene sequences
  • Add tests. We need tests!!!
