Wrapper on top of the ESM/Protbert models to easily work with protein embeddings
Bio-transformers
bio-transformers is a Python wrapper on top of the ESM and Protbert models, which are Transformer protein language models trained on millions of proteins and used to compute embeddings. The package also provides other functionality, such as computing the log-likelihood of a protein or computing embeddings on multiple GPUs.
You can find the original repo here :
Installation
It is recommended to work with conda environments in order to manage the specific dependencies of the package.
conda create --name bio-transformers python=3.7 -y
conda activate bio-transformers
pip install bio-transformers
Usage
Quick start
The main class BioTransformers allows the developer to use the Protbert and ESM backends.
>>> from biotransformers import BioTransformers
>>> BioTransformers.list_backend()
Use backend in this list :
* esm1_t34_670M_UR100
* esm1_t6_43M_UR50S
* esm1b_t33_650M_UR50S
* esm_msa1_t12_100M_UR50S
* protbert
* protbert_bfd
Embeddings
Choose a backend and pass a list of amino-acid sequences to compute the embeddings.
By default, the compute_embeddings function returns the <CLS> token embedding.
You can also pass a pooling_list to compute, in addition, the mean of the token embeddings.
from biotransformers import BioTransformers
sequences = [
"MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
"KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]
bio_trans = BioTransformers(backend="protbert")
embeddings = bio_trans.compute_embeddings(sequences, pooling_list=['mean'])
cls_emb = embeddings['cls']
mean_emb = embeddings['mean']
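To make the difference between the two pooling modes concrete, here is a minimal sketch in plain Python. It is not the package's implementation. It assumes the model output for one sequence is a list of per-token vectors whose first entry is the <CLS> token, and that the mean is taken over the remaining residue tokens only; both points are assumptions, and the helper names cls_pooling and mean_pooling are hypothetical.

```python
from typing import List

Vector = List[float]


def cls_pooling(token_embeddings: List[Vector]) -> Vector:
    """Return the embedding of the first token, assumed to be <CLS>."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings: List[Vector]) -> Vector:
    """Average the residue embeddings (every token after the assumed <CLS>)."""
    residues = token_embeddings[1:]
    dim = len(residues[0])
    return [sum(vec[i] for vec in residues) / len(residues) for i in range(dim)]


# Toy output for a 3-residue sequence with 2-dimensional embeddings.
toy = [
    [0.0, 1.0],  # assumed <CLS> token
    [1.0, 2.0],  # residue 1
    [3.0, 4.0],  # residue 2
    [5.0, 6.0],  # residue 3
]

cls_vec = cls_pooling(toy)
mean_vec = mean_pooling(toy)
```

The <CLS> vector is a single learned summary position, while the mean vector weights every residue equally, so the two can behave differently in downstream tasks.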
Pseudo-Loglikelihood
The protein log-likelihood is a metric that estimates the joint probability of observing a given sequence of amino acids. The idea behind such an estimator is to approximate the probability that a mutated protein is “natural” and can effectively be produced by a cell.
These metrics rely on Transformer language models. These models are trained to predict a “masked” amino acid in a sequence. As a consequence, they can provide an estimate of the probability of observing an amino acid given its “context” (the surrounding amino acids). By multiplying the individual probabilities computed for each amino acid given its context, we obtain a pseudo-likelihood, which is a candidate estimator of a sequence's stability.
from biotransformers import BioTransformers
sequences = [
"MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
"KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]
bio_trans = BioTransformers(backend="protbert", device="cuda:0")
loglikelihood = bio_trans.compute_loglikelihood(sequences)
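The product of per-position probabilities described above is usually computed as a sum of logarithms, which avoids numerical underflow on long sequences. The following is a minimal sketch of that computation, not the package's implementation. The function name pseudo_loglikelihood and the masked_probs argument are hypothetical: masked_probs stands in for the distribution a model would predict at position i when that position is masked.

```python
import math
from typing import Dict, List


def pseudo_loglikelihood(sequence: str, masked_probs: List[Dict[str, float]]) -> float:
    """Sum log P(residue_i | context) over all positions of the sequence.

    masked_probs[i] is a hypothetical stand-in for the model's predicted
    amino-acid distribution at position i when that position is masked.
    """
    if len(sequence) != len(masked_probs):
        raise ValueError("one probability distribution is needed per residue")
    return sum(math.log(probs[aa]) for aa, probs in zip(sequence, masked_probs))


# Toy example: a 3-residue sequence over a 3-letter alphabet.
toy_sequence = "MKT"
toy_probs = [
    {"M": 0.5, "K": 0.25, "T": 0.25},  # position 0 masked
    {"M": 0.25, "K": 0.5, "T": 0.25},  # position 1 masked
    {"M": 0.25, "K": 0.25, "T": 0.5},  # position 2 masked
]

score = pseudo_loglikelihood(toy_sequence, toy_probs)
```

In this toy case each observed residue has probability 0.5, so the score equals the logarithm of 0.5 × 0.5 × 0.5. Higher (less negative) scores indicate sequences the model considers more plausible.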
Roadmap:
- Support multi-GPU forward
- Support MSA transformers
- Add compute_accuracy functionality
- Support fine-tuning of models
Citations
License
This source code is licensed under the Apache 2 license found in the LICENSE file in the root directory.