Sandbox (in progress) for Computational Protein Design

                          _____________________.___.____    .____     
                          \__    ___/\______   \   |    |   |    |    
                            |    |    |       _/   |    |   |    |    
                            |    |    |    |   \   |    |___|    |___ 
                            |____|    |____|_  /___|_______ \_______ \
                                             \/            \/       \/

TRILL

TRaining and Inference using the Language of Life

Set-Up

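I recommend installing TRILL in a virtual environment (conda, venv, etc.). Then run:

$ pip install trill-proteins torch pytorch-lightning
$ pip install pyg-lib torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.13.0+cu117.html

The second command pulls PyTorch Geometric wheels built for torch 1.13.0 with CUDA 11.7; adjust the URL if your torch/CUDA versions differ.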

Examples

Default (Fine-tuning ESM2)

By default, TRILL fine-tunes the base esm2_t12_35M_UR50D model from FAIR on the query sequences.

$ trill fine_tuning_ex 1 --query data/query.fasta

Embed with base esm2_t12_35M_UR50D model

You can also embed proteins with the base model from FAIR and skip fine-tuning entirely. The output is a CSV file in which each row corresponds to a single protein and the last column is the FASTA header.

$ trill base_embed 1 --query data/query.fasta --noTrain
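
The CSV layout described above (one protein per row, FASTA header in the last column) can be read with the Python standard library. This is a minimal sketch, assuming the file has no header row and that every column except the last is numeric; neither assumption is confirmed here, so inspect your own output first.

```python
import csv

def load_embeddings(csv_path):
    """Map each FASTA header to its embedding vector.

    Assumptions (not confirmed by the TRILL docs): the CSV has no header
    row, and every column except the last one is a number.
    """
    embeddings = {}
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            *values, header = row  # last column is the FASTA header
            embeddings[header] = [float(v) for v in values]
    return embeddings
```

For example, `load_embeddings("base_embed.csv")` would return a dict keyed by FASTA header; the file name is hypothetical and depends on the run name you chose.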

Embedding with a custom pre-trained model

If you have a pre-trained model, you can use it to embed sequences by passing the path to --preTrained_model.

$ trill pre_trained 1 --query data/query.fasta --preTrained_model /path/to/models/pre_trained_model.pt

Distributed Training/Inference

To scale up or speed up your analyses, you can distribute training/inference across multiple GPUs by adding a few flags to your command. With sharding, CPU offloading, etc., you can even fit models that would not normally fit on your GPUs. The list of available strategies can be found here (https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html). Below is an example Slurm batch submission file. It uses 16 GPUs in total (4 GPUs per node * 4 nodes, set with --nodes) with Fully Sharded Data Parallel (--strategy fsdp) and the 650M-parameter ESM2 model.

#!/bin/bash
#SBATCH --time=8:00:00   # walltime
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=4 # number of nodes
#SBATCH --gres=gpu:4 # number of GPUs per node
#SBATCH --mem-per-cpu=60G   # memory per CPU core
#SBATCH -J "tutorial"   # job name
#SBATCH --mail-user="" # change to your email
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --output=%x-%j.out
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
export MASTER_PORT=13579

srun trill distributed_example 4 --query data/query.fasta --nodes 4 --strategy fsdp --model esm2_t33_650M_UR50D

You can then submit this job with:

$ sbatch distributed_example.slurm
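
As a sanity check when sizing a job, the total GPU count is the per-node GPU count (the positional GPUs argument) multiplied by --nodes. The helper below is only an illustrative sketch of that arithmetic; `total_gpus` is a hypothetical name and is not part of TRILL.

```python
def total_gpus(gpus_per_node, nodes=1):
    """Total GPUs in a job = GPUs per node * number of nodes.

    Hypothetical helper for planning a job; not part of TRILL.
    The default of 1 node mirrors the documented default of --nodes.
    """
    if gpus_per_node < 1 or nodes < 1:
        raise ValueError("gpus_per_node and nodes must be positive")
    return gpus_per_node * nodes

# The Slurm example requests 4 GPUs on each of 4 nodes.
print(total_gpus(4, nodes=4))  # 16
```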

More examples for distributed training/inference without slurm coming soon!

Generating protein sequences using inverse folding with ESM-IF1

When provided a protein backbone structure (.pdb, .cif), the IF1 model predicts sequences that may fold into the input structure. The example input is the backbone coordinates of DWARF14, a rice hydrolase. The following command generates 3 sequences for every chain in the structure. Since 4ih9.pdb has 2 chains, 6 sequences are generated in total.

$ trill IF_Test 1 --query data/4ih9.pdb --if1 --genIters 3
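
To estimate how many sequences a run will produce for your own structure, you can count the chains and multiply by --genIters. The sketch below is an assumption-laden helper, not TRILL code: it handles only .pdb files (not .cif), reads the chain ID from column 22 of ATOM records as defined by the PDB format, and uses hypothetical function names.

```python
def count_chains(pdb_path):
    """Count distinct chain IDs among the ATOM records of a .pdb file."""
    chains = set()
    with open(pdb_path) as handle:
        for line in handle:
            # In the PDB format, the chain ID is column 22 (index 21).
            if line.startswith("ATOM") and len(line) > 21:
                chains.add(line[21])
    return len(chains)

def expected_sequences(pdb_path, gen_iters):
    """Sequences generated = number of chains * --genIters."""
    return count_chains(pdb_path) * gen_iters
```

For instance, `expected_sequences("data/4ih9.pdb", 3)` applies the same chains-times-iterations rule as the example above.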

Generating Proteins using ProtGPT2

You can also generate synthetic proteins using ProtGPT2. The command below generates 5 proteins with a max length of 100. The default seed sequence is "M", which you can change with --seed_seq. See the command-line arguments below for more details.

$ trill Gen_ProtGPT2 1 --protgpt2 --gen --max_length 100 --num_return_sequences 5

Fine-Tuning ProtGPT2

To generate particular "types" of proteins, you can first fine-tune ProtGPT2 and then generate proteins with the fine-tuned model, as in the example below.

$ trill FineTune 2 --protgpt2 --epochs 100

$ trill Gen_With_FineTuned 1 --protgpt2 --gen --preTrained_model FineTune_ProtGPT2_100.pt

Arguments

Positional Arguments:

name (Name of run)
GPUs (Total # of GPUs requested for each node)

Optional Arguments:

-h, --help (Show help message)
--query (Input file. Needs to be either protein fasta (.fa, .faa, .fasta) or structural coordinates (.pdb, .cif))
--nodes (Total number of computational nodes. Default is 1)
--lr (Learning rate for the Adam optimizer. Default is 0.0001)
--epochs (Number of epochs for fine-tuning transformer. Default is 20)
--noTrain (Skips the fine-tuning and embeds the query sequences with the base model)
--preTrained_model (Input path to your own pre-trained ESM model)
--batch_size (Change batch-size number for fine-tuning. Default is 1)
--model (Change ESM model. Default is esm2_t12_35M_UR50D. List of models can be found at https://github.com/facebookresearch/esm)
--strategy (Change training strategy. Default is None. List of strategies can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html)
--logger (Enable Tensorboard logger. Default is None)
--if1 (Utilize Inverse Folding model 'esm_if1_gvp4_t16_142M_UR50' to facilitate fixed backbone sequence design. Basically converts protein structure to possible sequences)
--temp (Choose sampling temperature. Higher temps will have more sequence diversity, but less recovery of the original sequence for ESM_IF1)
--genIters (Adjust number of sequences generated for each chain of the input structure for ESM_IF1)
--LEGGO (Use deepspeed_stage_3_offload with ESM. Will be removed soon...)
--profiler (Utilize PyTorchProfiler)
--protgpt2 (Utilize ProtGPT2. Can either fine-tune or generate sequences)
--gen (Generate protein sequences using ProtGPT2. Can either use base model or user-submitted fine-tuned model)
--seed_seq (Sequence to seed ProtGPT2 Generation)
--max_length (Max length of proteins generated from ProtGPT2)
--do_sample (Whether or not to use sampling; use greedy decoding otherwise)
--top_k (The number of highest probability vocabulary tokens to keep for top-k-filtering)
--repetition_penalty (The parameter for repetition penalty. 1.0 means no penalty)
--num_return_sequences (Number of sequences for ProtGPT2 to generate)
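
To make --top_k more concrete, here is a generic, pure-Python sketch of top-k filtering followed by sampling. It illustrates the general technique only; it is not the code TRILL or ProtGPT2 actually runs, and `top_k_sample` is a hypothetical name.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index after keeping only the k highest logits.

    Generic illustration of top-k filtering; not TRILL's implementation.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits; everything else is discarded.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for stability).
    top = max(logits[i] for i in kept)
    weights = [math.exp(logits[i] - top) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With k=1 this reduces to greedy decoding (always the highest-scoring token), which is the behavior described for runs without --do_sample.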

Misc. Tips

Make sure there are no "*" in the protein sequences
Don't run jobs on the login node, only submit jobs with sbatch or srun on the HPC
Caltech HPC Docs https://www.hpc.caltech.edu/documentation
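
The first tip can be automated before running TRILL. The sketch below copies a FASTA file while removing "*" characters from sequence lines and leaving header lines untouched; the function name and file names are hypothetical.

```python
def strip_stop_symbols(fasta_in, fasta_out):
    """Copy a FASTA file, removing '*' from sequence lines only.

    Returns the number of '*' characters removed.
    """
    removed = 0
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if not line.startswith(">"):  # header lines are left as-is
                removed += line.count("*")
                line = line.replace("*", "")
            dst.write(line)
    return removed
```

For example, `strip_stop_symbols("data/query.fasta", "data/query.clean.fasta")` writes a cleaned copy that can then be passed to --query.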

This version: 0.2.4 (Jan 20, 2023)
