
Sandbox (in progress) for Computational Protein Design


TRILL

TRaining and Inference using the Language of Life

Arguments

Positional Arguments:

  1. name (Name of run)
  2. query (Input file. Needs to be either a protein FASTA file (.fa, .faa, .fasta) or structural coordinates (.pdb, .cif))
  3. GPUs (Total # of GPUs requested for each node)

Optional Arguments:

  • -h, --help (Show help message)
  • --database (Input database to embed with --blast mode)
  • --nodes (Total number of computational nodes. Default is 1)
  • --lr (Learning rate for the Adam optimizer. Default is 0.0001)
  • --epochs (Number of epochs for fine-tuning transformer. Default is 20)
  • --noTrain (Skips the fine-tuning and embeds the query sequences with the base model)
  • --preTrained_model (Input path to your own pre-trained ESM model)
  • --batch_size (Change the batch size for fine-tuning. Default is 5)
  • --blast (Enables "BLAST" mode. --database argument is required)
  • --model (Change ESM model. Default is esm2_t12_35M_UR50D. List of models can be found at https://github.com/facebookresearch/esm)
  • --strategy (Change training strategy. Default is None. List of strategies can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html)
  • --logger (Enable the TensorBoard logger. Default is None)
  • --if1 (Utilize the Inverse Folding model 'esm_if1_gvp4_t16_142M_UR50' for fixed-backbone sequence design; it converts a protein structure into candidate sequences that may fold into it)
  • --chain (Don't use right now)
  • --temp (Choose the sampling temperature. Higher temperatures give more sequence diversity but lower recovery of the original sequence)
  • --genIters (Adjust number of sequences generated for each chain of the input structure)

Examples

Default (Fine-tuning)

  1. The default mode of the pipeline is to fine-tune the default ESM2 model (esm2_t12_35M_UR50D) from FAIR with the query input; a rough sketch of what this fine-tuning step involves follows the example.
python3 main.py fine_tuning_ex ../data/query.fasta 4
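
TRILL handles the training loop for you, but as a point of reference, the sketch below shows roughly what masked-language-model fine-tuning of the default ESM2 checkpoint looks like in PyTorch Lightning. This is an illustrative assumption about the approach, not TRILL's actual code; the class name, masking rate, and batch format are made up for the example.

import torch
import esm
import pytorch_lightning as pl

# Illustrative sketch only -- not TRILL's internal training code.
class ESMFineTuner(pl.LightningModule):
    def __init__(self, lr=0.0001):
        super().__init__()
        # Default model listed under --model
        self.model, self.alphabet = esm.pretrained.esm2_t12_35M_UR50D()
        self.lr = lr

    def training_step(self, batch, batch_idx):
        tokens = batch  # assumed: pre-tokenised sequences, shape (batch, length)
        masked = tokens.clone()
        # Mask roughly 15% of non-padding positions (standard masked-LM recipe)
        mask = (torch.rand(tokens.shape, device=tokens.device) < 0.15) & \
               (tokens != self.alphabet.padding_idx)
        masked[mask] = self.alphabet.mask_idx
        logits = self.model(masked)["logits"]
        return torch.nn.functional.cross_entropy(logits[mask], tokens[mask])

    def configure_optimizers(self):
        # Matches the documented defaults: Adam with lr 0.0001
        return torch.optim.Adam(self.parameters(), lr=self.lr)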

Embed with the base ESM2 model

  1. You can also embed proteins with just the base model from FAIR and skip fine-tuning entirely; a minimal fair-esm sketch of this embedding step follows the example.
python3 main.py raw_embed ../data/query.fasta 4 --noTrain
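
For context, embedding with the base model boils down to a forward pass through the ESM network; a minimal sketch using the fair-esm package directly is shown below. The toy sequence and the choice of layer 12 (the final layer of esm2_t12_35M_UR50D) are just for illustration, and TRILL's actual output format may differ.

import torch
import esm

# Load the default base model and its batch converter
model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("toy_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # illustrative sequence
labels, seqs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[12])

per_residue = out["representations"][12]   # shape (1, length + 2, 480) for the 35M model
# Mean-pool over residues (skipping BOS/EOS) for a per-protein embedding
per_protein = per_residue[0, 1:len(seqs[0]) + 1].mean(0)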

Embedding with a custom pre-trained model

  1. If you already have a pre-trained model, you can use it to embed sequences by passing its path to --preTrained_model; a sketch of the typical save/load pattern follows the example.
python3 main.py pre_trained ../data/query.fasta 4 --preTrained_model ../models/pre_trained_model.pt
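
The exact contents TRILL expects in the .pt file are not spelled out here, but a common PyTorch pattern for persisting and re-loading fine-tuned ESM weights looks like the sketch below (the file path is the one from the example; everything else is an assumption).

import torch
import esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()

# After fine-tuning, persist the weights...
torch.save(model.state_dict(), "../models/pre_trained_model.pt")

# ...and later restore them into a fresh copy of the same architecture
model.load_state_dict(torch.load("../models/pre_trained_model.pt"))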

BLAST-like (Fine-tune on query and embed query+database)

  1. To enable BLAST-like functionality, use the --blast flag in conjunction with a database fasta file passed to --database. The base model from FAIR is first fine-tuned with the query sequences, and then both the query and the database sequences are embedded; a sketch of a possible downstream comparison of those embeddings follows the example.
python3 main.py blast_search ../data/query.fasta 4 --blast --database ../data/database.fasta
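
TRILL produces the embeddings; comparing them is left to you. One simple, hypothetical downstream step is to rank database sequences by cosine similarity to the query embedding, as sketched below (the array shapes assume the 480-dimensional esm2_t12_35M_UR50D embeddings, and the random inputs are placeholders for whatever TRILL writes out).

import numpy as np

# Placeholder inputs: one embedding vector per sequence
query_emb = np.random.rand(480)        # a single query embedding
db_embs = np.random.rand(1000, 480)    # 1000 database embeddings

# Cosine similarity between the query and every database entry
sims = db_embs @ query_emb / (np.linalg.norm(db_embs, axis=1) * np.linalg.norm(query_emb))

# Indices of the 10 most similar database sequences
top_hits = np.argsort(sims)[::-1][:10]
print(top_hits)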

Distributed Training/Inference

  1. To scale up or speed up your analyses, you can distribute training/inference across many GPUs by adding a few extra flags to your command. You can even fit models that would not normally fit on your GPUs by using sharding and CPU offloading. The list of strategies can be found here (https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html). The example below uses 16 GPUs in total (4 GPUs per node x 4 nodes) with Fully Sharded Data Parallel and the 650M-parameter ESM2 model; the rough Lightning mapping of these flags is sketched after the example.
python3 main.py distributed_example ../data/query.fasta 4 --nodes 4 --strategy fsdp --model esm2_t33_650M_UR50D
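
Since TRILL is built on PyTorch Lightning, these flags map roughly onto a Lightning Trainer configuration like the one sketched below. This is only a sketch of the mapping, not TRILL's actual internals.

import pytorch_lightning as pl

# 4 GPUs per node x 4 nodes = 16 GPUs, sharded with Fully Sharded Data Parallel
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # the positional GPUs argument
    num_nodes=4,      # --nodes 4
    strategy="fsdp",  # --strategy fsdp
)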

Generating protein sequences using inverse folding with ESM-IF1

  1. Given a protein backbone structure (.pdb, .cif), the IF1 model predicts sequences that may fold into the input structure. The example input is the backbone coordinates of DWARF14, a rice hydrolase. For every chain in the structure (2 in 4ih9.pdb), the following command generates 3 sequences, so 6 sequences are generated in total; a fair-esm sketch of the same operation follows the example.
python3 main.py IF_Test ../data/4ih9.pdb 1 --if1 --genIters 3
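
The same inverse-folding model can also be driven directly through the fair-esm package; the sketch below samples one candidate sequence for one chain. The chain ID and temperature are illustrative, and TRILL's own handling of multiple chains and --genIters may differ.

import esm
import esm.inverse_folding  # requires the optional inverse-folding dependencies

model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model.eval()

# Load backbone coordinates for a single chain of the input structure
coords, native_seq = esm.inverse_folding.util.load_coords("../data/4ih9.pdb", "A")

# Sample one candidate sequence for that backbone
sampled_seq = model.sample(coords, temperature=1.0)
print(sampled_seq)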

Quick Tutorial (NOT CURRENT, DON'T USE):

  1. Type git clone https://github.com/martinez-zacharya/DistantHomologyDetection in your home directory on the HPC
  2. Download Miniconda with wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh and then install it with sh ./Miniconda3-latest-Linux-x86_64.sh.
  3. Run conda env create -f environment.yml in the home directory of the repo to set up the proper conda environment and then type conda activate RemoteHomologyTransformer to activate it.
  4. Shift your current working directory to the scripts folder with cd scripts.
  5. Type vi tutorial_slurm to open the slurm file and then hit i.
  6. Change the email in the tutorial_slurm file to your email (You can use https://s3-us-west-2.amazonaws.com/imss-hpc/index.html to make your own slurm files in the future).
  7. Save and exit by first hitting escape and then entering :x.
  8. You can view the arguments for the command line tool by typing python3 main.py -h.
  9. To run the tutorial analysis, make the tutorial slurm file executable with chmod +x tutorial_slurm.sh and then type sbatch tutorial_slurm.sh.
  10. You can now safely exit the SSH session to the HPC if you want.

Misc. Tips
