Protein Redesign using Raygun
Raygun: template-based protein design tool
Raygun is a new approach to protein design. Unlike de novo design tools that generate a protein from scratch, Raygun lets users take an existing protein as a template and modify it by introducing insertions, deletions, and substitutions. Our analyses showed that the modified proteins largely retained the structural and functional properties of the original template protein.
Publication
Devkota, K., Shonai, D., Mao, J., Soderling, S. H., & Singh, R. (2024). Miniaturizing, Modifying, and Augmenting Nature's Proteins with Raygun. bioRxiv, 2024-08.
Introduction
Raygun is a novel protein design framework that allows for miniaturization, magnification, and modification of any template protein. It lets the user select any protein as a template and generates structurally (and therefore functionally) similar samples, while giving full control over the lengths of the generated sequences.
How to use Raygun: Provide a protein sequence, a target length, and a noise parameter. Raygun uses this information to generate samples efficiently (< 1 sec/sample on a GPU). The length of the target protein is entirely under the user's control.
How Raygun works: Raygun is an autoencoder-based design that represents any protein as a 64,000-dimensional multivariate normal distribution. The Raygun decoder can accurately map this fixed-length representation back to a variable-length sequence of the length the user specifies.
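The fixed-length idea can be illustrated with a toy sketch. This is not Raygun's actual encoder or decoder: it simply average-pools a per-residue embedding into a fixed number of blocks and then stretches those blocks to a requested length. The factorization of 64,000 into 50 blocks of 1,280 dimensions, and the function names `to_fixed_length` and `to_target_length`, are assumptions made for illustration only.

```python
import random

N_BLOCKS = 50   # assumed number of blocks (hypothetical factorization)
DIM = 1280      # assumed per-residue embedding size (50 * 1280 = 64,000)

def to_fixed_length(emb, n_blocks=N_BLOCKS):
    """Average-pool a (length x dim) embedding into (n_blocks x dim)."""
    length = len(emb)
    if length < n_blocks:
        raise ValueError("sequence is shorter than the number of blocks")
    dim = len(emb[0])
    fixed = []
    for b in range(n_blocks):
        # Near-equal contiguous chunks; each chunk has at least one residue.
        start = b * length // n_blocks
        end = (b + 1) * length // n_blocks
        chunk = emb[start:end]
        fixed.append([sum(row[d] for row in chunk) / len(chunk) for d in range(dim)])
    return fixed

def to_target_length(fixed, target_len):
    """Stretch (n_blocks x dim) to (target_len x dim) by repeating blocks."""
    n_blocks = len(fixed)
    # Position i of the output is taken from block i * n_blocks // target_len.
    return [fixed[i * n_blocks // target_len] for i in range(target_len)]

random.seed(0)
template = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(120)]
fixed = to_fixed_length(template)        # same shape for any input length
shorter = to_target_length(fixed, 90)    # "miniaturized" to 90 positions
print(len(fixed) * len(fixed[0]), len(shorter))
```

The point of the sketch is only that the middle representation has the same size regardless of the template length, which is what lets the target length be chosen freely. Raygun's learned decoder and its noise model are far richer than block repetition.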
Requirements
Raygun has few package requirements: numpy, pandas, fair-esm, pyyaml, h5py, einops and torch (the version suitable for your GPU). We verified that our model works on A100 and A6000 GPUs with the following specifications:
- fair-esm=2.0.0
- numpy=1.26.4
- pandas=2.1.4
- pytorch=2.1.1 (py3.11_cuda12.1_cudnn8.9.2_0)
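To compare an existing environment against the versions listed above, a small check along these lines may help. It is a sketch that uses only the standard library; the helper name `compare_versions` is hypothetical, and it assumes the pip distribution names `fair-esm` and `torch`.

```python
from importlib import metadata

# Versions reported above as verified on A100 and A6000 GPUs.
VERIFIED = {
    "fair-esm": "2.0.0",
    "numpy": "1.26.4",
    "pandas": "2.1.4",
    "torch": "2.1.1",
}

def compare_versions(expected):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # startswith tolerates local suffixes such as "2.1.1+cu121".
        matches = installed is not None and installed.startswith(wanted)
        report[name] = (installed, wanted, matches)
    return report

for pkg, (installed, wanted, ok) in compare_versions(VERIFIED).items():
    status = "ok" if ok else "differs"
    print(f"{pkg}: installed={installed} verified={wanted} [{status}]")
```

A mismatch does not necessarily mean Raygun will fail; the listed versions are only the ones that were verified.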
From source repository
Users can install Raygun directly from source by cloning the GitHub repository https://github.com/rohitsinghlab/raygun and installing the package with pip.
git clone https://github.com/rohitsinghlab/raygun
cd raygun
pip install . #note that the code will be copied to the environment's packages directory, so your localdir changes will not be reflected unless you reinstall
Using pip
Alternatively, users can install Raygun from PyPI:
pip install raygun
Quick start
Raygun provides two command-line programs: one for training the model, and one for fine-tuning the model and generating protein samples. Both are described below.
Generating samples
After the raygun package has been installed, we can use it to generate samples with the raygun-sample command. This command can also fine-tune the model.
We strongly recommend that the user first fine-tune the model on the target sequence or a set of related sequences.
raygun-sample can be invoked in bash as follows:
raygun-sample --config <YAML configuration file>
We have provided YAML configuration files for lacZ sampling in the GitHub repository folder example-configs/lacZ:
- Quick Start: generate-sample-lacZ-v1.yaml fine-tunes on just one lacZ template sequence, and then generates.
- Full Example: generate-sample-lacZ-v1.yaml fine-tunes on 20 lacZ sequences from the relevant PFAM domain, and then generates.
Below we show v1:
## This YAML file specifies all the parameters for using Raygun. ##
## At start, we suggest focusing only on parameters in Sections 1 and 2.
###### Section 1: GPU, INPUT and OUTPUT Locations ########
device: 0 # CUDA device
## template FASTA file
templatefasta: "example-configs/lacZ/lacZ-template.fasta"
## FINE-TUNING ##
## We strongly recommend starting from our pre-trained
## model and fine-tuning it for your sequences.
## First time fine-tuning (or to overwrite previous fine-tune). Comment these lines if reusing fine-tuned model.
finetune: true # will overwrite the existing models in model folder if it exists
finetunetrain: "example-configs/lacZ/lacZ-template.fasta" # a single fasta file containing 1 or more sequences you want to fine-tune the decoder on
finetuned_model_loc: "lacZ-finetuned" # folder where models are saved. Will be created if it doesn't exist
## Uncomment lines below to reuse fine-tuned model.
# finetune: false
# finetuned_model_checkpoint: "lacZ-model/epoch_50.sav"
## OUTPUT LOCATION ##
## output folder. Will be created if it does not exist. Files may be overwritten if names clash
output_file_identifier: "lacZ" # this will be a substring in all output files produced
sample_out_folder: "lacZ-samples"
###### Section 2: GENERATION PARAMETERS ########
# how many samples you want. This will be the count after the PLL filtering
num_raygun_samples_to_generate: 20
## how much pseudo log-likelihood filtering to do.
## value=0 means no filtering, 0.9 means keep best 10% of hits by PLL
## Raygun will actually generate <num_raygun_samples_to_generate>*(1/<filter_ratio_with_pll>) entries, storing them in a file with "unfiltered" in its name
## It'll then filter them by PLL and store the <num_raygun_samples_to_generate> sequences in a file with "filtered" in its name
filter_ratio_with_pll: 0.5
## target lengths: a json file containing the target length range you want for each template
##
## the format of the json file is: { "<fastaid>": [minlen, maxlen], ... }
## here's an example: { "sp|P00722|BGAL_ECOLI": [900, 950] }.
## In this case, a total of <num_raygun_samples_to_generate> will be generated across 900-950 length
## to specify a single target-length, you can set minlen=maxlen (e.g. [900,900])
lengthinfo: "example-configs/lacZ/leninfo-lacZ.json"
## noiseratio is a number >= 0. At 0, minimal substitutions will be introduced. If you go over 2.25, expect >50% substitution rate
## for most applications, we recommend noiseratio = 0.5 and randomize_noise = true
noiseratio: 0.5
randomize_noise: true # if true, the noise for each sample is drawn from uniform(0, <noiseratio>)
###### Section 3: OTHER PARAMETERS ########
## you can ignore these for now
finetune_epoch: 50
finetune_save_every: 50
finetune_lr: 0.0001
minallowedlength: 55 # minimum length of template protein
usereconstructionloss: true
usereplicateloss: true
usecrossentropyloss: true
reconstructionlossratio: 1
replicatelossratio: 1
crossentropylossratio: 1
maxlength: 1000
saveoptimizerstate: false
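The lengthinfo file referenced in Section 2 is a JSON map from FASTA id to [minlen, maxlen]. As a convenience, it could be generated from the template FASTA with a sketch like the one below. The helper names (`fasta_lengths`, `make_lengthinfo`, `write_lengthinfo`) and the choice of target lengths as fractions of the template length are hypothetical. The sketch also assumes that the FASTA id is the first whitespace-delimited token of the header line, as in the sp|P00722|BGAL_ECOLI example; check this against how Raygun reads ids before relying on it.

```python
import json

def fasta_lengths(path):
    """Return {fasta_id: sequence_length} for every record in a FASTA file."""
    lengths, current = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                parts = line[1:].split()
                # Assumption: the id is the first token of the header.
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif line and current is not None:
                lengths[current] += len(line)
    return lengths

def make_lengthinfo(fasta_path, min_frac=0.9, max_frac=1.0):
    """Build {fasta_id: [minlen, maxlen]} as fractions of each template length."""
    if min_frac > max_frac:
        raise ValueError("min_frac must not exceed max_frac")
    info = {}
    for seq_id, n in fasta_lengths(fasta_path).items():
        info[seq_id] = [max(1, int(n * min_frac)), max(1, int(n * max_frac))]
    return info

def write_lengthinfo(fasta_path, json_path, min_frac=0.9, max_frac=1.0):
    """Write the length map to json_path in the format described above."""
    with open(json_path, "w") as out:
        json.dump(make_lengthinfo(fasta_path, min_frac, max_frac), out, indent=2)
```

For example, write_lengthinfo("example-configs/lacZ/lacZ-template.fasta", "leninfo.json", 0.9, 0.9) would request a single target length per template, since minlen equals maxlen. The output filename here is illustrative.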
Note that raygun-sample gives users the option to fine-tune the pretrained model, so raygun-sample alone satisfies the majority of end-user requirements.
Training the model
If the goal is to pre-train the model from scratch, we suggest using the raygun-train command. It can be invoked as:
raygun-train --config <YAML configuration file>
We also provide the configuration file for training the lacZ-train model in the example-configs/lacZ folder:
## This YAML file specifies all the parameters for finetuning Raygun, or training it from scratch ##
## At start, we suggest focusing only on parameters in Sections 1 and 2.
###### Section 1: GPU, INPUT and OUTPUT Locations ########
# your cuda device. If you only have one, 0 is likely the default
device: 0
## INPUT ##
# the input sequence(s) for training or finetuning. The training fasta file is required. If the validation fasta is not specified, the training epoch will not perform the validation step
trainfasta: "example-configs/lacZ/lacZ-selected-family.fasta"
# validfasta: "example-configs/lacZ/lacZ-selected-family.fasta"
# embedding location. If specified, the ESM-2 outputs will be saved at this location. If set to null, the embeddings are not saved
esm2_embedding_saveloc: null
# folder where the output model is to be saved. REQUIRED
output_model_loc: "lacZ-trained"
# if the `checkpoint` is specified, the training/finetuning will begin from this checkpoint. If not provided, the program will use the pretrain model
# checkpoint: "bgal-model/epoch_5.sav"
# Set this to true if the goal is finetuning. Finetuning freezes the encoder parameters, updating only the Raygun decoder
# For training from scratch, set `finetune: false`. Default: false
finetune: false
## Total number of epochs to train/finetune, and the learning rate
epoch: 50
lr: 0.0001
# Default: 1. The model is saved at every multiple of this parameter
save_every: 50
###### Section 3: OTHER PARAMETERS ########
## you can ignore these for now
usereconstructionloss: true
usereplicateloss: true
usecrossentropyloss: true
reconstructionlossratio: 1
replicatelossratio: 1
crossentropylossratio: 1
maxlength: 1000
minallowedlength: 55
clip: 0.0001
saveoptimizerstate: false
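Because the validation step only runs when validfasta is set, it can be useful to hold out part of a sequence family before training. The following is a minimal sketch of such a split using only the standard library; the function names and the default 10% hold-out are hypothetical choices, not part of Raygun.

```python
import random

def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def write_fasta(records, path):
    """Write (header, sequence) tuples to a FASTA file, one line per sequence."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n{seq}\n")

def split_fasta(in_path, train_path, valid_path, valid_frac=0.1, seed=0):
    """Shuffle records reproducibly and write train/validation FASTA files."""
    records = read_fasta(in_path)
    random.Random(seed).shuffle(records)
    # Hold out at least one record whenever there is more than one to split.
    n_valid = max(1, round(len(records) * valid_frac)) if len(records) > 1 else 0
    write_fasta(records[n_valid:], train_path)
    write_fasta(records[:n_valid], valid_path)
    return len(records) - n_valid, n_valid
```

The two output paths can then be given as trainfasta and validfasta in the configuration above. A random split ignores sequence similarity, so closely related sequences may land on both sides; clustering before splitting may be preferable for a stricter validation.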