
Directed evolution of proteins with fast gradient-based discrete MCMC.


EvoProtGrad


A Python package for directed evolution on a protein sequence with gradient-based discrete Markov chain Monte Carlo (MCMC). Users can compose custom models that map sequence to function with pretrained models, including protein language models (PLMs), to guide and constrain the search. Our package natively integrates with 🤗 HuggingFace and supports PLMs from transformers.

Our MCMC sampler identifies promising amino acids to mutate via model gradients taken with respect to the input (i.e., sensitivity analysis). We allow users to compose their own custom target function for MCMC by leveraging the Product of Experts MCMC paradigm: each model is an "expert" that contributes its own knowledge about the protein's fitness landscape to the overall target function. The sampler is designed to be more efficient and effective than brute-force and random search while retaining most of their generality and flexibility.
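As a toy illustration of the gradient signal (a minimal sketch, not EvoProtGrad's internals; score_model here is a stand-in linear model rather than a real expert):

import torch

seq_len, vocab = 50, 20
score_model = torch.nn.Linear(seq_len * vocab, 1)  # stand-in for a trained expert

# Random one-hot encoded protein sequence.
one_hot = torch.zeros(seq_len, vocab)
one_hot[torch.arange(seq_len), torch.randint(vocab, (seq_len,))] = 1.0
one_hot.requires_grad_(True)

score = score_model(one_hot.flatten())  # scalar fitness estimate
score.backward()                        # gradient w.r.t. the one-hot input

# Large-gradient entries mark (position, amino acid) substitutions predicted
# to most increase the score: the mutations the sampler prefers to propose.
top = torch.topk(one_hot.grad.flatten(), k=5).indices
candidates = [(int(i) // vocab, int(i) % vocab) for i in top]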

See our publication and our documentation for more details.

Installation

EvoProtGrad is available on PyPI and can be installed with pip:

pip install evo_prot_grad

For the bleeding-edge version, or if you wish to run tests or register a new expert model with EvoProtGrad, clone this repo and install it in editable mode:

git clone https://github.com/NREL/EvoProtGrad.git
cd EvoProtGrad
pip install -e .

Run tests

Test the code by running python3 -m unittest.

Basic Usage

See demo.ipynb to get started right away in a Jupyter notebook, or open it in Colab.

Create a ProtBERT expert from a pretrained HuggingFace protein language model (PLM) using evo_prot_grad.get_expert:

import evo_prot_grad

prot_bert_expert = evo_prot_grad.get_expert('bert', scoring_strategy = 'pseudolikelihood_ratio', temperature = 1.0)

The default BERT-style PLM in EvoProtGrad is Rostlab/prot_bert. Normally, we would also need to specify the model and tokenizer; when using a default PLM expert, these are pulled automatically from the HuggingFace Hub. The temperature parameter rescales the expert's scores and can be used to trade off the relative importance of different experts. The pseudolikelihood_ratio strategy computes the ratio of the "pseudo" log-likelihoods of the mutant and wild-type sequences (this isn't the exact log-likelihood when the protein language model is a masked language model).
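To use a different checkpoint (or any compatible PLM), the model and tokenizer can be passed explicitly. A minimal sketch, assuming get_expert accepts model and tokenizer keyword arguments (shown with the default checkpoint for concreteness):

from transformers import AutoModelForMaskedLM, AutoTokenizer
import evo_prot_grad

model = AutoModelForMaskedLM.from_pretrained('Rostlab/prot_bert')
tokenizer = AutoTokenizer.from_pretrained('Rostlab/prot_bert')

prot_bert_expert = evo_prot_grad.get_expert(
                   'bert',
                   scoring_strategy = 'pseudolikelihood_ratio',
                   temperature = 1.0,
                   model = model,
                   tokenizer = tokenizer)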

Then, create an instance of DirectedEvolution and run the search, which returns the best variant from each Markov chain (as scored by the prot_bert expert):

variants, scores = evo_prot_grad.DirectedEvolution(
                   wt_fasta = 'test/gfp.fasta',    # path to wild type fasta file
                   output = 'best',                # return 'best', 'last', or 'all' variants
                   experts = [prot_bert_expert],   # list of experts to compose
                   parallel_chains = 1,            # number of parallel chains to run
                   n_steps = 20,                   # number of MCMC steps per chain
                   max_mutations = 10,             # maximum number of mutations per variant
                   verbose = True                  # print debug info to command line
)()
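
The returned variants and scores are aligned lists (one entry per chain here), so a quick way to inspect the results is, for example:

# Print each chain's best variant alongside its expert score.
for variant, score in zip(variants, scores):
    print(f'{score:.3f}\t{variant}')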

We provide a few experts in evo_prot_grad/experts that you can use out of the box (a sketch composing two of them follows the list), such as:

Protein Language Models (PLMs)

  • bert, BERT-style PLMs, default: Rostlab/prot_bert
  • causallm, CausalLM-style PLMs, default: lightonai/RITA_s
  • esm, ESM-style PLMs, default: facebook/esm2_t6_8M_UR50D

Potts models

  • evcouplings

and a generic expert for supervised downstream regression models

  • onehot_downstream_regression
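
Experts can be combined by passing several of them to DirectedEvolution; each contributes its temperature-weighted score to the product-of-experts target. A minimal sketch, assuming the esm expert also supports the pseudolikelihood_ratio scoring strategy:

esm_expert = evo_prot_grad.get_expert('esm', scoring_strategy = 'pseudolikelihood_ratio', temperature = 1.0)

variants, scores = evo_prot_grad.DirectedEvolution(
                   wt_fasta = 'test/gfp.fasta',
                   output = 'best',
                   experts = [prot_bert_expert, esm_expert],  # product of two experts
                   parallel_chains = 4,
                   n_steps = 100,
                   max_mutations = 10
)()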

Citation

If you use EvoProtGrad in your research, please cite the following publication:

@article{emami2023plug,
  title={Plug \& play directed evolution of proteins with gradient-based discrete MCMC},
  author={Emami, Patrick and Perreault, Aidan and Law, Jeffrey and Biagioni, David and John, Peter St},
  journal={Machine Learning: Science and Technology},
  volume={4},
  number={2},
  pages={025014},
  year={2023},
  publisher={IOP Publishing}
}



Download files

Download the file for your platform.

Source Distribution

evo_prot_grad-0.2.tar.gz (25.1 kB)


Built Distribution

evo_prot_grad-0.2-py3-none-any.whl (28.3 kB)


File details

Details for the file evo_prot_grad-0.2.tar.gz.

File metadata

  • Download URL: evo_prot_grad-0.2.tar.gz
  • Size: 25.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for evo_prot_grad-0.2.tar.gz:

  • SHA256: 578443171b9368300b6c5361f36bdfdccff54e6085d7bd87efeb28de1f813229
  • MD5: 11d863495bd9e56a636d3f39ad95da57
  • BLAKE2b-256: 0bac6b066e5e709fa1f9d99d37fd5413449174316d2bcf68a4ded4836de47aed
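
To check a downloaded file against the published values, for example (a short sketch using Python's standard hashlib):

import hashlib

expected = '578443171b9368300b6c5361f36bdfdccff54e6085d7bd87efeb28de1f813229'
with open('evo_prot_grad-0.2.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, 'SHA256 mismatch: file may be corrupted'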


File details

Details for the file evo_prot_grad-0.2-py3-none-any.whl.

File metadata

  • Download URL: evo_prot_grad-0.2-py3-none-any.whl
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for evo_prot_grad-0.2-py3-none-any.whl:

  • SHA256: 01291775ba8b15e09ed2debdbaf6ca9c19dfeeb1ee5356033241969fd4089985
  • MD5: d9137e1a4394f6958af06e9a5338f68b
  • BLAKE2b-256: aa320e4aaee2adf048b4ebc13555f0c889c3733db1adcb4915a9b8727ef2c4b3

