
Protein Redesign using Raygun


Raygun: template-based protein design tool

Raygun is a new approach to protein design. Unlike de novo design tools that generate a protein from scratch, Raygun lets users take an existing protein as a template and modify it by introducing insertions, deletions, and substitutions. Our analyses showed that the modified proteins largely retain the structural and functional properties of the original template protein.

Publication

Devkota, K., Shonai, D., Mao, J., Soderling, S. H., & Singh, R. (2024). Miniaturizing, Modifying, and Augmenting Nature's Proteins with Raygun. bioRxiv, 2024-08.

Raygun blasting a protein, shrinking its size.

Introduction

Raygun is a novel protein design framework that allows for miniaturization, magnification, and modification of any template protein. It lets the user select any protein as a template and generates structurally (and therefore functionally) similar samples, while giving full control over the lengths of the generated sequences.

How to use Raygun: input a protein sequence, specify a target length and a noise parameter. Raygun uses this information to efficiently generate samples (< 1 sec/sample on a GPU). Users have full control over the length of the generated protein.

How Raygun works: Raygun is an autoencoder-based design that represents any protein as a 64,000-dimensional multivariate normal distribution. The Raygun decoder can accurately map this fixed-length representation back to a variable-length sequence of the length the user specifies.

Requirements

Raygun has a few package requirements: numpy, pandas, fair-esm, pyyaml, h5py, einops, and torch (the version suitable for your GPU). We verified that our model works on A100 and A6000 GPUs with the following versions:

  • fair-esm=2.0.0
  • numpy=1.26.4
  • pandas=2.1.4
  • pytorch=2.1.1 (py3.11_cuda12.1_cudnn8.9.2_0)
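
For a fresh environment, these verified versions can be installed directly with pip. The commands below are a minimal sketch, assuming conda and a CUDA 12.1 setup (matching the torch build above); adjust the torch install to your own GPU and CUDA version.

conda create -n raygun python=3.11 -y
conda activate raygun
pip install fair-esm==2.0.0 numpy==1.26.4 pandas==2.1.4 pyyaml h5py einops
pip install torch==2.1.1  # choose the build that matches your CUDA version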

From source repository

Users can install Raygun directly from source by cloning the GitHub repo https://github.com/rohitsinghlab/raygun and installing the package through pip.

git clone https://github.com/rohitsinghlab/raygun
cd raygun
pip install .  # note: the code is copied to the environment's packages directory, so local changes will not be reflected unless you reinstall

Using pip

Alternatively, users can install Raygun from PyPI:

pip install raygun

Quick start

Raygun provides two command-line programs: one for training the model and one for fine-tuning the model and generating protein samples. Both are described below.

Generating samples

Once the raygun package is installed, you can generate samples with the raygun-sample command. This command can also fine-tune the model before sampling.

We strongly recommend that the user first fine-tune the model on the target sequence or a set of related sequences.

raygun-sample can be invoked in bash in the following way:

raygun-sample --config <YAML configuration file>

We provide YAML configuration files for lacZ sampling in the GitHub repository folder example-configs/lacZ:

  • Quick Start: generate-sample-lacZ-v1.yaml fine-tunes on just one lacZ template sequence, and then generates.
  • Full Example: generate-sample-lacZ-v2.yaml fine-tunes on 20 lacZ sequences from the relevant Pfam domain, and then generates.

Below, we show v1:

## This YAML file specifies all the parameters for using Raygun. ##
##  At start, we suggest focusing only on parameters in Sections 1 and 2.  

###### Section 1: GPU, INPUT and OUTPUT Locations ########
device: 0  # CUDA device
## template FASTA file
templatefasta: "example-configs/lacZ/lacZ-template.fasta" 


## FINE-TUNING ##

## We strongly recommend starting from our pre-trained
## model and fine-tune it for your sequences.

## First time fine-tuning (or to overwrite a previous fine-tune). Comment these lines out if reusing a fine-tuned model.
finetune: true  # will overwrite existing models in the model folder if it exists
finetunetrain: "example-configs/lacZ/lacZ-template.fasta" # a single fasta file containing 1 or more sequences to fine-tune the decoder on
finetuned_model_loc: "lacZ-finetuned"  # folder where models are saved. Will be created if it doesn't exist

## Uncomment lines below to reuse fine-tuned model. 
# finetune: false
# finetuned_model_checkpoint: "lacZ-model/epoch_50.sav" 


## OUTPUT LOCATION ##

## output folder. Will be created if it does not exist. Files may be overwritten if names clash
output_file_identifier: "lacZ"  # this will be a substring in all output files produced
sample_out_folder: "lacZ-samples"


###### Section 2: GENERATION PARAMETERS ########

# how many samples you want. This will be the count after the PLL filtering
num_raygun_samples_to_generate: 20

## how much pseudo log-likelihood (PLL) filtering to do.
## value=0 means no filtering, 0.9 means keep the best 10% of hits by PLL
## Raygun will actually generate <num_raygun_samples_to_generate> * 1/(1 - <filter_ratio_with_pll>) entries, storing them in a file with "unfiltered" in its name
## It'll then filter them by PLL and store the best <num_raygun_samples_to_generate> sequences in a file with "filtered" in its name
## e.g., with the values here, 20 * 1/(1 - 0.5) = 40 samples are generated and the best 20 by PLL are kept
filter_ratio_with_pll: 0.5

## target lengths: a json file containing the target length range you want for each template
##
##  the format of the json file is: { "<fastaid>": [minlen, maxlen], ... }
##   here's an example: {  "sp|P00722|BGAL_ECOLI": [900, 950] }.
##   In this case, a total of <num_raygun_samples_to_generate> samples will be generated, with lengths spread across 900-950.
##  to specify a single target length, you can set minlen=maxlen (e.g. [900,900])

lengthinfo: "example-configs/lacZ/leninfo-lacZ.json"

## noiseratio is a number >= 0. At 0, minimal substitutions will be introduced. If you go over 2.25, expect >50% substitution rate
## for most applications, we recommend noiseratio = 0.5 and randomize_noise = true 
noiseratio: 0.5
randomize_noise: true  # if true, the noise for each sample is drawn from uniform(0, <noiseratio>)


###### Section 3: OTHER PARAMETERS ########
## you can ignore these for now

finetune_epoch: 50
finetune_save_every: 50
finetune_lr: 0.0001
minallowedlength: 55 # minimum length of template protein
usereconstructionloss: true
usereplicateloss: true
usecrossentropyloss: true
reconstructionlossratio: 1
replicatelossratio: 1
crossentropylossratio: 1
maxlength: 1000 
saveoptimizerstate: false
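
The leninfo-lacZ.json file referenced in Section 2 is a standalone JSON file in the format described there; for the lacZ template it would contain, for example:

{ "sp|P00722|BGAL_ECOLI": [900, 950] }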

Note that raygun-sample gives users the option to fine-tune the pretrained model, so raygun-sample alone satisfies the majority of end-user requirements.
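
For example, to fine-tune on the lacZ template and then generate samples with the Quick Start configuration above, run the following from the repository root (assumed here, so that the relative paths in the YAML resolve):

raygun-sample --config example-configs/lacZ/generate-sample-lacZ-v1.yaml

This writes the unfiltered and PLL-filtered sample files to the lacZ-samples folder specified in the config.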

Training the model

If the goal is to pre-train the model from scratch, we suggest using the raygun-train command. It can be invoked as:

raygun-train --config <YAML configuration file>

We also provide the configuration file for training the lacZ model in the example-configs/lacZ folder:

## This YAML file specifies all the parameters for finetuning Raygun, or training it from scratch ##
##  At start, we suggest focusing only on parameters in Sections 1 and 2.  
###### Section 1: GPU, INPUT and OUTPUT Locations ########
# your cuda device. If you only have one, 0 is likely the default
device: 0

## INPUT ##

# the input sequence(s) for training or finetuning. The training fasta file is required. If a validation fasta is not specified, the training loop skips the validation step
trainfasta: "example-configs/lacZ/lacZ-selected-family.fasta"
# validfasta: "example-configs/lacZ/lacZ-selected-family.fasta"

# embedding location. If specified, the ESM-2 outputs will be saved at this location. If set to null, the embeddings are not saved
esm2_embedding_saveloc: null

# folder where the output model is to be saved. REQUIRED
output_model_loc: "lacZ-trained"
# if `checkpoint` is specified, training/finetuning will begin from this checkpoint. If not provided, the program will use the pretrained model
# checkpoint: "bgal-model/epoch_5.sav"

# Set this to true if the goal is finetuning. Finetuning freezes the encoder parameters, updating only the Raygun decoder.
# For training from scratch, set `finetune: false`. Default: false
finetune: false

## Total number of epochs to train/finetune, and the learning rate
epoch: 50
lr: 0.0001
# the model is saved at every multiple of this parameter. Default: 1
save_every: 50

###### Section 3: OTHER PARAMETERS ########
## you can ignore these for now
usereconstructionloss: true
usereplicateloss: true
usecrossentropyloss: true
reconstructionlossratio: 1
replicatelossratio: 1
crossentropylossratio: 1
maxlength: 1000
minallowedlength: 55
clip: 0.0001
saveoptimizerstate: false
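
To launch training with this configuration, pass the file to raygun-train. The filename below is a hypothetical placeholder; check example-configs/lacZ in the repository for the exact name.

raygun-train --config example-configs/lacZ/train-lacZ.yaml  # hypothetical filename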
