
Extract protein embeddings from protein language models.


ProtEnc: generate protein embeddings the easy way

ProtEnc aims to simplify the extraction of protein embeddings from pre-trained models by providing simple APIs and bulk generation scripts for the ever-growing landscape of protein language models (pLMs). The currently supported models can be listed with protenc.list_models().

Usage

Installation

pip install protenc

Python API

import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Embeddings have shape [L, D], where L is the sequence length and D the embedding dimensionality.
  print(embed.shape)

  # Derive a single per-protein embedding vector of shape [D] by averaging along the sequence dimension
  protein_embed = embed.mean(0)
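
For downstream use it is often convenient to collect the averaged vectors into one matrix. The following is a minimal sketch of that step, not part of ProtEnc: `mean_pool` is a hypothetical helper, and random arrays stand in for the encoder output so the snippet runs without a model. The sequence lengths and D = 8 are arbitrary illustration values.

```python
import numpy as np

# Stand-ins for the per-residue arrays yielded by the encoder: one [L, D]
# array per protein. Lengths and D = 8 are arbitrary illustration values.
rng = np.random.default_rng(0)
per_residue = [rng.normal(size=(65, 8)), rng.normal(size=(71, 8))]


def mean_pool(embeds):
    """Average each [L, D] array over L and stack the results into [N, D]."""
    return np.stack([e.mean(axis=0) for e in embeds])


pooled = mean_pool(per_residue)
print(pooled.shape)
```

With real embeddings, `per_residue` would be replaced by the arrays produced in the loop above.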

Command-line interface

After installation, use the protenc shell command for bulk generation and export of protein embeddings.

protenc sequences.fasta embeddings.lmdb --model_name=<name-of-model>

By default, input and output formats are inferred from the file extensions.
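
If the sequences live in Python rather than in a file, a FASTA input can be written with the standard library alone. This is a minimal sketch: `write_fasta` is a hypothetical helper, not a ProtEnc function, and the record identifiers are made up for illustration.

```python
from pathlib import Path


def write_fasta(records, path, width=60):
    """Write (identifier, sequence) pairs to a FASTA file, wrapping at `width`."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        # Split the sequence into fixed-width chunks, one per line.
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    Path(path).write_text("\n".join(lines) + "\n")


records = [
    ("protein_1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein_2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
write_fasta(records, "sequences.fasta")
```

The resulting sequences.fasta can then be passed as the first argument of the protenc command.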

Run

protenc --help

for a detailed usage description.

Example

Generate protein embeddings using the ESM2 650M model for sequences provided in a FASTA file and write embeddings to an LMDB:

protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D

The generated embeddings are stored in an LMDB key-value store and can be read back with the read_from_lmdb utility function:

from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)

Features

Input formats:

Output format:

General:

  • Multi-GPU inference (--data_parallel)
  • FP16 inference (--amp)
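
When bulk generation is driven from a Python script, the command line can be assembled programmatically. The sketch below assumes that --data_parallel and --amp are plain boolean switches that can be appended to the invocation shown under Command-line interface; check protenc --help to confirm. `build_protenc_command` is a hypothetical helper, not part of ProtEnc.

```python
import shlex


def build_protenc_command(input_path, output_path, model_name,
                          data_parallel=False, amp=False):
    """Assemble the argument list for a protenc invocation (hypothetical helper)."""
    cmd = ["protenc", input_path, output_path, f"--model_name={model_name}"]
    if data_parallel:
        cmd.append("--data_parallel")  # assumed to be a boolean switch
    if amp:
        cmd.append("--amp")  # assumed to be a boolean switch
    return cmd


cmd = build_protenc_command(
    "proteins.fasta", "embeddings.lmdb", "esm2_t33_650M_UR50D",
    data_parallel=True, amp=True,
)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) once ProtEnc is installed.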

Development

Clone the repository:

git clone https://github.com/kklemon/protenc.git

Install dependencies via Poetry:

poetry install

Contribution

Have a feature idea or found a bug? Would you love to see support for a new model? Feel free to create an issue.

Todo

  • Support for more input formats
    • CSV
    • Parquet
    • FASTA
    • JSON
  • Support for more output formats
    • LMDB
    • HDF5
    • DataFrame
    • Pickle
  • Support for large models
    • Model offloading
    • Sharding
    • FlashAttention (via Kernl?)
  • Support for more protein language models
    • Whole ProtTrans family
    • Whole ESM family
    • AlphaFold (?)
  • Implement all remaining TODOs in code
  • Evaluation
  • Demos
  • Distributed inference
  • Maybe support some sort of optimized inference such as quantization
    • This may be up to the model providers
  • Improve documentation
  • Support translation of gene sequences
  • Add tests. We need tests!!!
