Extract protein embeddings from protein language models.

ProtEnc: generate protein embeddings the easy way

ProtEnc aims to simplify the extraction of protein embeddings from pre-trained models by providing a simple Python API and bulk generation scripts for the ever-growing landscape of protein language models (pLMs). The currently supported models are:

ProtTrans family
ESM
AlphaFold (coming soon™)
OmegaPLM (coming soon™)

Usage

Installation

pip install protenc

Python API

import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Embeddings have shape [L, D], where L is the sequence length and D is the embedding dimensionality.
  print(embed.shape)

  # Derive a single per-protein embedding vector by averaging along the sequence dimension
  protein_embed = embed.mean(0)
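
The averaging step above can be tried without loading a model. The following is a minimal sketch that uses random NumPy arrays as stand-ins for the per-residue embeddings; the helper name mean_pool and the array sizes are made up for illustration and are not part of ProtEnc.

```python
import numpy as np


def mean_pool(embeddings):
    """Average each [L, D] array over its sequence axis and stack the results into [N, D]."""
    return np.stack([e.mean(axis=0) for e in embeddings])


# Stand-in per-residue embeddings for two proteins of length 5 and 8 with D = 4.
# Real embeddings would come from the encoder; these values are random.
rng = np.random.default_rng(0)
fake_embeddings = [rng.normal(size=(5, 4)), rng.normal(size=(8, 4))]

pooled = mean_pool(fake_embeddings)
print(pooled.shape)  # (2, 4)
```

Because every protein is reduced to a vector of length D, sequences of different lengths end up in one fixed-size matrix that downstream models can consume directly.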

Command-line interface

After installation, use the protenc shell command for bulk generation and export of protein embeddings.

protenc sequences.fasta embeddings.lmdb --model_name=<name-of-model>

By default, input and output formats are inferred from the file extensions.
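
As an illustration of what extension-based inference can look like, here is a minimal sketch. The mapping table and the infer_format function are hypothetical and only mirror the formats named in this document; ProtEnc's actual lookup may differ.

```python
from pathlib import Path

# Hypothetical mapping, for illustration only; not taken from ProtEnc's source.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".csv": "csv",
    ".json": "json",
    ".lmdb": "lmdb",
}


def infer_format(path):
    """Return a format name based on the file extension, or raise if it is unknown."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Cannot infer format from extension {suffix!r}") from None


print(infer_format("sequences.fasta"))  # fasta
print(infer_format("embeddings.lmdb"))  # lmdb
```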

Run

protenc --help

for a detailed usage description.

Example

Generate protein embeddings using the ESM2 650M model for sequences provided in a FASTA file and write embeddings to an LMDB:

protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D

The generated embeddings are stored in an LMDB key-value store and can be read back with the read_from_lmdb utility function:

from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
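
Once the (label, embed) pairs are available, a common next step is to compare proteins. The sketch below assumes that each embed is a per-residue NumPy array of shape [L, D], which this document does not state for the stored format, and uses random stand-in pairs instead of a real LMDB file. The helper names pooled_by_label and cosine_similarity are illustrative, not part of ProtEnc.

```python
import numpy as np


def pooled_by_label(pairs):
    """Map each label to the mean of its [L, D] embedding over the sequence axis."""
    return {label: np.asarray(embed).mean(axis=0) for label, embed in pairs}


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in for the pairs yielded by read_from_lmdb (assumed shape [L, D], random values).
rng = np.random.default_rng(0)
fake_pairs = [
    ("protein_a", rng.normal(size=(5, 4))),
    ("protein_b", rng.normal(size=(8, 4))),
]

vectors = pooled_by_label(fake_pairs)
print(cosine_similarity(vectors["protein_a"], vectors["protein_b"]))
```

With a real store, the fake_pairs list would be replaced by the iterator returned from read_from_lmdb('embeddings.lmdb').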

Features

Input formats:

CSV
JSON
FASTA
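
For the FASTA case, an input file can be produced with the standard library alone. This is a minimal sketch; the write_fasta helper and the identifiers protein_a and protein_b are made up for illustration, and the sequences are the two used in the Python API example.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path, width=60):
    """Write {identifier: sequence} pairs as FASTA, wrapping sequence lines at `width`."""
    with open(path, "w") as handle:
        for identifier, sequence in records.items():
            handle.write(f">{identifier}\n")
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")


records = {
    "protein_a": "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "protein_b": "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
}

# Write to a temporary directory so the example does not touch the working directory.
fasta_path = Path(tempfile.mkdtemp()) / "proteins.fasta"
write_fasta(records, fasta_path)
print(fasta_path.read_text())
```

The resulting file can then be passed as the first argument to the protenc command.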

Output formats:

LMDB
HDF5 (coming soon)

General:

Multi-GPU inference (--data_parallel)
FP16 inference (--amp)

Development

Clone the repository:

git clone https://github.com/kklemon/protenc.git

Install dependencies via Poetry:

poetry install

Contribution

Have a feature idea or found a bug? Would you love to see support for a new model? Feel free to create an issue.

Todo

Support for more input formats
- CSV
- Parquet
- FASTA
- JSON
Support for more output formats
- LMDB
- HDF5
- DataFrame
- Pickle
Support for large models
- Model offloading
- Sharding
- FlashAttention (via Kernl?)
Support for more protein language models
- Whole ProtTrans family
- Whole ESM family
- AlphaFold (?)
Implement all remaining TODOs in code
Evaluation
Demos
Distributed inference
Maybe support some sort of optimized inference such as quantization
- This may be up to the model providers
Improve documentation
Support translation of gene sequences
Add tests. We need tests!!!

Release history

0.1.6: Oct 6, 2023 (this version)
0.1.5: Oct 1, 2023
0.1.4: Sep 29, 2023
0.1.3: Sep 29, 2023
0.1.2: Jul 8, 2023
0.1.1: Dec 1, 2022
0.1.0: Nov 28, 2022
