ProtEnc: generate protein embeddings the easy way
Extract protein embeddings from pre-trained protein language models.
ProtEnc aims to simplify the extraction of protein embeddings from various pre-trained models by providing simple APIs and bulk generation scripts for the ever-growing landscape of protein language models (pLMs). The currently supported models can be listed with protenc.list_models() (see the Python API below).
Usage
Installation
pip install protenc
Python API
import protenc
# List available models
print(protenc.list_models())
# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')
proteins = [
    'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
    'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]
for embed in encoder(proteins, return_format='numpy'):
    # Embeddings have shape [L, D], where L is the sequence length and D the embedding dimensionality
    print(embed.shape)

    # Derive a single per-protein embedding vector by averaging along the sequence dimension
    embed.mean(0)
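The pooled vectors can be compared directly. The following is a minimal sketch, not part of the ProtEnc API, that computes the cosine similarity between the two example proteins using mean-pooled embeddings (assumes numpy and the encoder and proteins defined above):

import numpy as np

# Mean-pool each [L, D] embedding into a single D-dimensional vector per protein
pooled = [embed.mean(0) for embed in encoder(proteins, return_format='numpy')]

# Cosine similarity between the two mean-pooled protein embeddings
a, b = pooled
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(similarity)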
Command-line interface
After installation, use the protenc shell command for bulk generation and export of protein embeddings.
protenc sequences.fasta embeddings.lmdb --model_name=<name-of-model>
By default, input and output formats are inferred from the file extensions.
Run protenc --help for a detailed usage description.
Example
Generate protein embeddings using the ESM2 650M model for sequences provided in a FASTA file and write embeddings to an LMDB:
protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D
The generated embeddings will be stored in an LMDB key-value store and can be easily accessed with the read_from_lmdb utility function:
from protenc.utils import read_from_lmdb
for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
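For downstream tasks it is often convenient to collect the stored embeddings into a single feature matrix. Below is a small illustrative sketch building on read_from_lmdb, assuming each stored embedding can be converted to a numpy array of shape [L, D] as described above:

import numpy as np
from protenc.utils import read_from_lmdb

labels, features = [], []

for label, embed in read_from_lmdb('embeddings.lmdb'):
    labels.append(label)
    # Mean-pool over the sequence dimension to obtain one vector per protein
    features.append(np.asarray(embed).mean(0))

# Feature matrix of shape [num_proteins, D], aligned with the labels list
X = np.stack(features)
print(X.shape)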
Features
Input formats:
- CSV
- JSON
- FASTA
Output formats:
- LMDB
General:
- Multi-GPU inference (--data_parallel)
- FP16 inference (--amp); an example combining both flags is shown below
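As an illustration, both options can be combined with the earlier FASTA workflow. This is only a sketch based on the flags listed above; exact behavior may depend on the installed version and available hardware:

protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D --data_parallel --amp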
Development
Clone the repository:
git clone https://github.com/kklemon/protenc.git
Install dependencies via Poetry:
poetry install
Contribution
Have a feature idea or found a bug? Would you like to see support for a new model? Feel free to create an issue.
Todo
- Support for more input formats
- CSV
- Parquet
- FASTA
- JSON
- Support for more output formats
- LMDB
- HDF5
- DataFrame
- Pickle
- Support for large models
- Model offloading
- Sharding
- FlashAttention (via Kernl?)
- Support for more protein language models
- Whole ProtTrans family
- Whole ESM family
- AlphaFold (?)
- Implement all remaining TODOs in code
- Evaluation
- Demos
- Distributed inference
- Maybe support some sort of optimized inference such as quantization
- This may be up to the model providers
- Improve documentation
- Support translation of gene sequences
- Add tests. We need tests!!!