A wrapper on top of the ESM and Protbert models for working easily with protein embeddings

Bio-transformers

bio-transformers is a Python wrapper on top of the ESM and Protbert models, which are Transformer protein language models trained on millions of proteins and used to compute embeddings. The package also provides other functionality, such as computing the loglikelihood of a protein or computing embeddings on multiple GPUs.

You can find the original repo here :

Installation

It is recommended to work with conda environments to manage the package's specific dependencies.

  conda create --name bio-transformers python=3.7 -y
  conda activate bio-transformers
  pip install bio-transformers

Usage

Quick start

The main class, BioTransformers, lets the developer use the Protbert and ESM backends:

>>> from biotransformers import BioTransformers
>>> BioTransformers.list_backend()
Use backend in this list :

  *   esm1_t34_670M_UR100
  *   esm1_t6_43M_UR50S
  *   esm1b_t33_650M_UR50S
  *   esm_msa1_t12_100M_UR50S
  *   protbert
  *   protbert_bfd

Embeddings

Choose a backend and pass a list of amino-acid sequences to compute the embeddings. By default, the compute_embeddings function returns the <CLS> token embedding. You can also pass a pooling_list to request additional pooled embeddings, such as the mean of the token embeddings.

from biotransformers import BioTransformers

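# Two example protein sequences, written as strings of amino-acid letters
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]

# Load the Protbert backend and request mean pooling in addition to the default
bio_trans = BioTransformers(backend="protbert")
embeddings = bio_trans.compute_embeddings(sequences, pooling_list=['mean'])

# The result is indexed by pooling name
cls_emb = embeddings['cls']
mean_emb = embeddings['mean']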
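
To illustrate the difference between the two pooling modes, here is a minimal pure-Python sketch on toy data. It is not the library's implementation: the helper names cls_pooling and mean_pooling are hypothetical, the first row is assumed to be the <CLS> token, and the mean is assumed to be taken over the remaining residue tokens only (whether special tokens are included is an implementation detail this sketch does not settle).

```python
from typing import List


def cls_pooling(token_embeddings: List[List[float]]) -> List[float]:
    """Return the embedding of the first token (assumed to be <CLS>)."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings: List[List[float]]) -> List[float]:
    """Average the residue-token embeddings, dimension by dimension.

    Assumption: row 0 is the <CLS> token and is excluded from the mean.
    """
    residues = token_embeddings[1:]
    dim = len(residues[0])
    return [sum(tok[d] for tok in residues) / len(residues) for d in range(dim)]


# Toy input: one <CLS> row followed by two residue rows, embedding size 2
toy_tokens = [[1.0, 0.0], [2.0, 4.0], [3.0, 2.0]]

cls_vec = cls_pooling(toy_tokens)    # first row only
mean_vec = mean_pooling(toy_tokens)  # average of the two residue rows
print(cls_vec, mean_vec)
```

Either mode yields one fixed-size vector per sequence, regardless of the sequence length.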

Pseudo-Loglikelihood

The protein loglikelihood is a metric that estimates the joint probability of observing a given sequence of amino acids. The idea behind such an estimator is to approximate the probability that a mutated protein is “natural” and can effectively be produced by a cell.

These metrics rely on Transformer language models, which are trained to predict a “masked” amino acid in a sequence. As a consequence, they can estimate the probability of observing an amino acid given its “context” (the surrounding amino acids). Multiplying the individual probabilities computed for each amino acid given its context yields a pseudo-likelihood, which is a candidate estimator of a sequence's stability.

from biotransformers import BioTransformers

sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]

# device="cuda:0" runs the model on the first GPU
bio_trans = BioTransformers(backend="protbert", device="cuda:0")
loglikelihood = bio_trans.compute_loglikelihood(sequences)
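
To make the pseudo-likelihood idea concrete, here is a minimal sketch on toy numbers. It is an illustration only, not the code behind compute_loglikelihood: the function pseudo_loglikelihood and the hand-written probability tables are hypothetical, and stand in for what a masked language model would predict at each position. Summing log-probabilities is equivalent to multiplying the probabilities, but avoids numerical underflow on long sequences.

```python
import math
from typing import Dict, List


def pseudo_loglikelihood(sequence: str, masked_probs: List[Dict[str, float]]) -> float:
    """Sum log p(residue_i | context) over every position of the sequence.

    masked_probs[i] is assumed to hold the model's predicted distribution
    over amino acids at position i when that position is masked.
    """
    if len(sequence) != len(masked_probs):
        raise ValueError("need one probability table per residue")
    return sum(math.log(probs[aa]) for aa, probs in zip(sequence, masked_probs))


# Toy 3-residue sequence over a 3-letter alphabet (hand-picked probabilities)
toy_sequence = "MKT"
toy_probs = [
    {"M": 0.5, "K": 0.25, "T": 0.25},
    {"M": 0.25, "K": 0.5, "T": 0.25},
    {"M": 0.25, "K": 0.25, "T": 0.5},
]

pll = pseudo_loglikelihood(toy_sequence, toy_probs)
# Exponentiating recovers the product of the per-position probabilities
print(pll, math.exp(pll))
```

A higher (less negative) value means the model finds each residue more plausible given the rest of the sequence.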

Roadmap

  • Support multi-GPU forward
  • Support MSA transformers
  • Add compute_accuracy functionality
  • Support fine-tuning of models

Citations

License

This source code is licensed under the Apache 2 license found in the LICENSE file in the root directory.
