
Wrapper on top of the ESM/ProtBert models to easily work with protein embeddings


Bio-transformers

bio-transformers is a Python wrapper on top of the ESM/ProtBert models, which are Transformer-based protein language models trained on millions of proteins and used to predict embeddings. This package provides a unified interface to use all these models - which we call backends. For instance, you'll be able to compute natural amino acid probabilities or embeddings, or easily finetune your model on multiple GPUs.

You can find the original repositories for the models here:

Installation

It is recommended to work with conda environments in order to manage the specific dependencies of this package.

  conda create --name bio-transformers python=3.7 -y
  conda activate bio-transformers
  pip install bio-transformers

Usage

Quick start

The main class BioTransformers allows developers to use the ProtBert and ESM backends.

>>> from biotransformers import BioTransformers
>>> BioTransformers.list_backend()
Use backend in this list :

    *   esm1_t34_670M_UR100
    *   esm1_t6_43M_UR50S
    *   esm1b_t33_650M_UR50S
    *   esm_msa1_t12_100M_UR50S
    *   protbert
    *   protbert_bfd

Embeddings

Choose a backend and pass a list of amino acid sequences to compute the embeddings. By default, the compute_embeddings function returns the <CLS> token embedding. You can also pass a pool_mode argument to compute, in addition, the mean of the token embeddings.

from biotransformers import BioTransformers

sequences = [
        "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
        "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
    ]

bio_trans = BioTransformers(backend="protbert")
embeddings = bio_trans.compute_embeddings(sequences, pool_mode=('cls','mean'))

cls_emb = embeddings['cls']
mean_emb = embeddings['mean']
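
As a quick follow-up, you can sanity-check what comes back; a minimal sketch, assuming each entry behaves like a numpy array with a .shape attribute:

# Hypothetical sanity check on the pooled embeddings computed above.
# Assumes each entry is an array of shape (number of sequences, embedding dimension).
print(cls_emb.shape)   # expected: (2, embedding_dim)
print(mean_emb.shape)  # expected: (2, embedding_dim)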

Multi-gpu

If you have access to multiple GPUs, you can activate the multi_gpu option to speed up inference. This option relies on torch.nn.DataParallel.

bio_trans = BioTransformers(backend="protbert",multi_gpu=True)
embeddings = bio_trans.compute_embeddings(sequences, pool_mode=('cls','mean'))
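
A small sketch for enabling this flag only when several GPUs are actually visible; torch.cuda.device_count() is standard PyTorch, while multi_gpu is the option shown above:

import torch
from biotransformers import BioTransformers

# Illustrative pattern: turn on DataParallel only if more than one GPU is available.
use_multi_gpu = torch.cuda.device_count() > 1
bio_trans = BioTransformers(backend="protbert", multi_gpu=use_multi_gpu)
embeddings = bio_trans.compute_embeddings(sequences, pool_mode=('cls', 'mean'))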

Pseudo-Loglikelihood

The protein loglikelihood is a metric that estimates the joint probability of observing a given sequence of amino acids. The idea behind such an estimator is to approximate the probability that a mutated protein will be “natural”, and can effectively be produced by a cell.

These metrics rely on transformer language models. These models are trained to predict a “masked” amino acid in a sequence. As a consequence, they can provide us with an estimate of the probability of observing an amino acid given the “context” (the surrounding amino acids). By multiplying the individual probabilities computed for each amino acid given its context, we obtain a pseudo-likelihood, which can serve as a candidate estimator for approximating sequence stability.
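
To make the computation concrete, here is a toy numpy sketch of that idea (an illustration only, not the library's internal implementation), using made-up per-position probabilities:

import numpy as np

# Hypothetical per-position probabilities p(x_i | context) produced by a masked
# language model for a short 5-residue sequence (made-up numbers).
per_position_probs = np.array([0.32, 0.11, 0.54, 0.27, 0.08])

# The pseudo-log-likelihood is the sum of the per-position log-probabilities,
# i.e. the log of the product of the individual probabilities.
pseudo_loglikelihood = np.sum(np.log(per_position_probs))
print(pseudo_loglikelihood)  # higher (less negative) values suggest a more "natural" sequence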

from biotransformers import BioTransformers

sequences = [
        "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
        "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
    ]

bio_trans = BioTransformers(backend="protbert",device="cuda:0")
loglikelihood = bio_trans.compute_loglikelihood(sequences)
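
A typical way to use these scores (a usage sketch based on the interpretation above, not an official recipe) is to compare a wild-type sequence with a mutated variant and see which one the model considers more "natural":

# Hypothetical comparison between a wild-type sequence and a single point mutant.
wild_type = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
mutant = wild_type[:10] + "W" + wild_type[11:]  # substitute the residue at index 10

scores = bio_trans.compute_loglikelihood([wild_type, mutant])
print(scores)  # the higher score is expected to correspond to the more "natural" sequence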

Finetune pre-trained transformers on your dataset

You can use the train_masked function to finetune your backend on your own dataset. The model is automatically scaled across the available GPUs. More information is available in the documentation.

import biodatasets
import numpy as np
from biotransformers import BioTransformers

data = biodatasets.load_dataset("swissProt")
X, y = data.to_npy_arrays(input_names=["sequence"])
X = X[0]

# Train on small sequences
length = np.array(list(map(len, X))) < 200
train_seq = X[length][:15000]
bio_trans = BioTransformers("esm1_t6_43M_UR50S", device="cuda")

bio_trans.train_masked(
    train_seq,
    lr=1.0e-5,
    warmup_init_lr=1e-7,
    toks_per_batch=2000,
    epochs=20,
    batch_size=16,
    acc_batch_size=256,
    warmup_updates=1024,
    accelerator="ddp",
    checkpoint=None,
    save_last_checkpoint=False,
)
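
If a previous run saved a checkpoint, the checkpoint argument shown above presumably accepts its path so that training can resume from it; this is an assumption based on the parameter name, so check the documentation for the exact semantics:

# Assumed usage: resume finetuning from a previously saved checkpoint.
# "path/to/last_checkpoint.ckpt" is a placeholder path, not a file created by this example.
bio_trans.train_masked(
    train_seq,
    lr=1.0e-5,
    epochs=5,
    checkpoint="path/to/last_checkpoint.ckpt",  # assumption: path to an existing checkpoint
    save_last_checkpoint=True,
)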

Roadmap

  • support MSA transformers

Citations

Here are some papers of interest on the subject.

The excellent ProtBert work can be found at (bioRxiv preprint):

@article{protTrans2021,
  author={Ahmed Elnaggar and Michael Heinzinger and Christian Dallago and Ghalia Rihawi and Yu Wang and Llion Jones and Tom Gibbs and Tamas Feher and Christoph Angerer and Debsindhu Bhowmik and Burkhard Rost},
  title={ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing},
  year={2020},
  doi={10.1101/2020.07.12.199554},
  url={https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3.full.pdf},
  journal={bioRxiv}
}

For the ESM model, see (bioRxiv preprint):

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v4},
  journal={bioRxiv}
}

For the self-attention contact prediction, see the following paper (bioRxiv preprint):

@article{rao2020transformer,
  author = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander},
  title={Transformer protein language models are unsupervised structure learners},
  year={2020},
  doi={10.1101/2020.12.15.422761},
  url={https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1},
  journal={bioRxiv}
}

For the MSA Transformer, see the following paper (bioRxiv preprint):

@article{rao2021msa,
  author = {Rao, Roshan and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John F. and Abbeel, Pieter and Sercu, Tom and Rives, Alexander},
  title={MSA Transformer},
  year={2021},
  doi={10.1101/2021.02.12.430858},
  url={https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1},
  journal={bioRxiv}
}

License

This source code is licensed under the Apache 2 license found in the LICENSE file in the root directory.

