Extract protein embeddings from pretrained models.
ProtEnc: generate protein embeddings the easy way
ProtEnc aims to simplify extraction of protein embeddings from various pre-trained models by providing simple APIs and bulk generation scripts for the ever-growing jungle of protein language models (pLMs). Currently, supported models are:
Note: the project is a work in progress.
Motivation
Usage
Installation
pip install protenc
Note: ProtEnc depends on PyTorch, but PyTorch is not listed as a formal dependency. There are many different PyTorch distributions, and a pinned one may not match the target environment, so install a suitable PyTorch build separately.
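Because PyTorch is not pulled in automatically, it can help to check for it before importing ProtEnc. The following is a minimal sketch using only the standard library; the helper name `torch_is_installed` is made up for illustration and is not part of ProtEnc.

```python
import importlib.util


def torch_is_installed() -> bool:
    """Return True if the `torch` package can be located in this environment."""
    return importlib.util.find_spec("torch") is not None


if not torch_is_installed():
    # Hypothetical message; pick the PyTorch build that matches your platform.
    print("PyTorch not found. Install it before using ProtEnc.")
```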
Python API
import protenc
import torch
# List available models
print(protenc.list_models())
# Instantiate a model
model = protenc.get_model('esm2_t33_650M_UR50D')
proteins = [
    'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
    'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]
batch = model.prepare_sequences(proteins)
# Use GPU if available
if torch.cuda.is_available():
    model = model.to('cuda')
    batch = protenc.utils.to_device(batch, 'cuda')
for embed in model(batch):
    # Embeddings have shape [L, D] where L is the sequence length and D the
    # embedding dimensionality.
    print(embed.shape)
    # Derive a single per-protein embedding vector by averaging along the
    # sequence dimension
    embed.mean(0)
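To make the averaging step concrete, here is the same arithmetic spelled out with plain Python lists, so it runs without PyTorch. The function `mean_pool` and the toy values are purely illustrative; in practice `embed.mean(0)` on the tensor does this for you.

```python
def mean_pool(residue_embeddings):
    """Average an [L, D] nested list over the sequence dimension, giving D values."""
    length = len(residue_embeddings)
    if length == 0:
        raise ValueError("cannot pool an empty sequence")
    # zip(*rows) walks the matrix column by column, i.e. one embedding
    # dimension at a time across all residues.
    return [sum(column) / length for column in zip(*residue_embeddings)]


# Toy input: L = 3 residues, D = 2 embedding dimensions.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy)
print(pooled)  # [3.0, 4.0]
```

The result has one value per embedding dimension regardless of sequence length, which is what makes it usable as a fixed-size per-protein vector.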
Command-line interface
After installation, use the protenc shell command for bulk generation and export of protein embeddings.
protenc <path-to-protein-sequences> <path-to-output> --model_name=<name-of-model>
By default, input and output formats are inferred from the file extensions.
Run protenc --help for a detailed usage description.
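Inferring a format from a file extension can be pictured roughly as follows. This is only a sketch of the idea: the mapping and the function `infer_format` are hypothetical and do not reflect ProtEnc's actual implementation or its full list of formats.

```python
from pathlib import Path

# Hypothetical mapping for illustration; ProtEnc's real table may differ.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".fasta": "fasta",
    ".lmdb": "lmdb",
}


def infer_format(path: str) -> str:
    """Map a file path to a format name based on its (case-insensitive) suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"cannot infer format from extension {suffix!r}") from None


print(infer_format("proteins.fasta"))   # fasta
print(infer_format("embeddings.lmdb"))  # lmdb
```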
Example
Generate protein embeddings using the ESM2 650M model for sequences provided in a FASTA file and write embeddings to an LMDB:
protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D
The generated embeddings are stored in an LMDB key-value store and can be accessed with the read_from_lmdb utility function:
from protenc.utils import read_from_lmdb
for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
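To try the example command, a small FASTA input file can be written from Python first. The sketch below uses only the standard library; the helper `write_fasta` and the labels `protein_1` and `protein_2` are made up for illustration, and how ProtEnc uses FASTA headers as labels is an assumption here.

```python
from pathlib import Path


def write_fasta(records, path):
    """Write (label, sequence) pairs as a minimal FASTA file."""
    lines = []
    for label, sequence in records:
        lines.append(f">{label}")  # header line
        lines.append(sequence)     # sequence on a single line
    Path(path).write_text("\n".join(lines) + "\n")


# Hypothetical labels; the sequences are the ones from the Python API example.
records = [
    ("protein_1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein_2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
write_fasta(records, "proteins.fasta")
```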
Features
Input formats:
- CSV
- JSON
- FASTA
Output format:
General:
- Multi-GPU inference (--data_parallel)
- FP16 inference (--amp)
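When driving the CLI from a script, the options above can be assembled into an argument list. This is a sketch under the assumption that --data_parallel and --amp are plain switches that take no value; the helper `build_protenc_command` is hypothetical and not part of ProtEnc.

```python
import shlex


def build_protenc_command(input_path, output_path, model_name,
                          data_parallel=False, amp=False):
    """Assemble a protenc invocation as an argv-style list."""
    command = ["protenc", input_path, output_path, f"--model_name={model_name}"]
    if data_parallel:
        command.append("--data_parallel")  # assumed to be a switch without a value
    if amp:
        command.append("--amp")            # assumed to be a switch without a value
    return command


command = build_protenc_command(
    "proteins.fasta",
    "embeddings.lmdb",
    "esm2_t33_650M_UR50D",
    data_parallel=True,
    amp=True,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to execute the job; consult protenc --help for the authoritative option syntax.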
Development
Clone the repository:
git clone https://github.com/kklemon/protenc.git
Install dependencies via Poetry:
poetry install
Contribution
Have a feature idea or found a bug? Would you love to see support for a new model? Feel free to create an issue.
Todo
- Support for more input formats
  - CSV
  - Parquet
  - FASTA
  - JSON
- Support for more output formats
  - LMDB
  - HDF5
  - DataFrame
  - Pickle
- Large models support
  - Model offloading
  - Sharding
- Support for more protein language models
  - Whole ProtTrans family
  - Whole ESM family
  - AlphaFold (?)
- Implement all remaining TODOs in code
- Distributed inference
- Maybe support some sort of optimized inference such as quantization
  - This may be up to the model providers
- Improve documentation
- Support translation of gene sequences
- Add tests. We need tests!!!