Sandbox (in progress) for Computational Protein Design
Project description
TRILL
TRaining and Inference using the Language of Life
Arguments
Positional Arguments:
- name (Name of run)
- query (Input file. Must be a protein FASTA file (.fa, .faa, .fasta) or structural coordinates (.pdb, .cif); see the example input below)
- GPUs (Total # of GPUs requested for each node)
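For reference, a minimal protein FASTA query could be created like this. The sequences and file path are illustrative placeholders, not real data:

```bash
# Create a toy two-sequence query file (hypothetical sequences, for illustration only).
# Note: sequence lines must not contain "*" characters (see Misc. Tips below).
cat > ../data/query.fasta <<'EOF'
>example_protein_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ
>example_protein_2
MNKIVRAAGELLSEAVGAVADSQWQLAFFDTEHTV
EOF
```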
Optional Arguments:
- -h, --help (Show help message)
- --database (Input database to embed with --blast mode)
- --nodes (Total number of computational nodes. Default is 1)
- --lr (Learning rate for the Adam optimizer. Default is 0.0001)
- --epochs (Number of epochs for fine-tuning transformer. Default is 20)
- --noTrain (Skips the fine-tuning and embeds the query sequences with the base model)
- --preTrained_model (Input path to your own pre-trained ESM model)
- --batch_size (Batch size for fine-tuning. Default is 5)
- --blast (Enables "BLAST" mode. --database argument is required)
- --model (Change ESM model. Default is esm2_t12_35M_UR50D. List of models can be found at https://github.com/facebookresearch/esm)
- --strategy (Change training strategy. Default is None. List of strategies can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html)
- --logger (Enable TensorBoard logger. Default is None)
- --if1 (Utilize the inverse folding model 'esm_if1_gvp4_t16_142M_UR50' for fixed-backbone sequence design; in short, it converts a protein structure into candidate sequences predicted to fold into that structure)
- --chain (Don't use right now)
- --temp (Choose sampling temperature. Higher temps will have more sequence diversity, but less recovery of the original sequence)
- --genIters (Number of sequences to generate for each chain of the input structure)
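Taken together, a hypothetical invocation combining several of these flags might look like the following; the run name and flag values are illustrative, not recommendations:

```bash
# Hypothetical run: fine-tune for 10 epochs with a larger batch size
# and an explicit model choice, on 4 GPUs (values are illustrative).
python3 main.py combined_run ../data/query.fasta 4 --epochs 10 --lr 0.0001 --batch_size 8 --model esm2_t12_35M_UR50D
```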
Examples
Default (Fine-tuning)
- The default mode for the pipeline is to fine-tune the base esm2_t12_35M_UR50D model from FAIR with the query input.
python3 main.py fine_tuning_ex ../data/query.fasta 4
Embed with the base esm2_t12 model
- You can also embed proteins with just the base model from FAIR and completely skip fine-tuning.
python3 main.py raw_embed ../data/query.fasta 4 --noTrain
Embedding with a custom pre-trained model
- If you have a pre-trained model, you can use it to embed sequences by passing the path to --preTrained_model.
python3 main.py pre_trained ../data/query.fasta 4 --preTrained_model ../models/pre_trained_model.pt
BLAST-like (Fine-tune on query and embed query+database)
- To enable BLAST-like functionality, use the --blast flag in conjunction with passing a database FASTA file to --database. The base model from FAIR is first fine-tuned with the query sequences, and then both the query and the database sequences are embedded.
python3 main.py blast_search ../data/query.fasta 4 --blast --database ../data/database.fasta
Distributed Training/Inference
- To scale up or speed up your analyses, you can distribute training/inference across many GPUs by adding a few extra flags to your command. You can even fit models that normally do not fit on your GPUs by using sharding and CPU offloading. The list of strategies can be found here (https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html). The example below uses 16 GPUs in total (4 GPUs × 4 nodes) with Fully Sharded Data Parallel and the 650M-parameter ESM2 model; a matching SLURM submission sketch follows the command.
python3 main.py distributed_example ../data/query.fasta 4 --nodes 4 --strategy fsdp --model esm2_t33_650M_UR50D
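On a SLURM cluster, the command above would typically be wrapped in a submission script. The sketch below is illustrative only: the partition name, walltime, and job name are hypothetical and depend on your cluster, and the --nodes and --gres values must agree with the flags passed to TRILL:

```bash
#!/bin/bash
#SBATCH --job-name=distributed_example
#SBATCH --nodes=4                  # must match TRILL's --nodes flag
#SBATCH --gres=gpu:4               # 4 GPUs per node -> 16 GPUs total
#SBATCH --ntasks-per-node=4        # one task per GPU, as PyTorch Lightning's SLURM launcher expects
#SBATCH --time=24:00:00            # hypothetical walltime; adjust for your cluster
#SBATCH --partition=gpu            # hypothetical partition name

srun python3 main.py distributed_example ../data/query.fasta 4 --nodes 4 --strategy fsdp --model esm2_t33_650M_UR50D
```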
Generating protein sequences using inverse folding with ESM-IF1
- When provided a protein backbone structure (.pdb, .cif), the IF1 model predicts sequences that might fold into the input structure. The example input is the backbone coordinates of DWARF14, a rice hydrolase (PDB 4IH9). For every chain in the structure (two in 4ih9.pdb), the following command generates 3 sequences, so 6 sequences are generated in total.
python3 main.py IF_Test ../data/4ih9.pdb 1 --if1 --genIters 3
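To reproduce this example, the 4IH9 structure can be fetched from the RCSB PDB before running the command; the output path is an assumption, so place the file wherever your ../data directory lives:

```bash
# Download the DWARF14 backbone structure (PDB ID 4IH9) from RCSB
wget https://files.rcsb.org/download/4IH9.pdb -O ../data/4ih9.pdb
```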
Quick Tutorial (NOT CURRENT, DON'T USE):
- Type `git clone https://github.com/martinez-zacharya/DistantHomologyDetection` in your home directory on the HPC.
- Download Miniconda by running `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh` and then `sh ./Miniconda3-latest-Linux-x86_64.sh`.
- Run `conda env create -f environment.yml` in the home directory of the repo to set up the proper conda environment, and then type `conda activate RemoteHomologyTransformer` to activate it.
- Shift your current working directory to the scripts folder with `cd scripts`.
- Type `vi tutorial_slurm` to open the slurm file and then hit `i`.
- Change the email in the tutorial_slurm file to your email (you can use https://s3-us-west-2.amazonaws.com/imss-hpc/index.html to make your own slurm files in the future).
- Save the file by first hitting escape and then entering `:x` to exit and save.
- You can view the arguments for the command-line tool by typing `python3 main.py -h`.
- To run the tutorial analysis, make the tutorial slurm file executable with `chmod +x tutorial_slurm.sh` and then type `sbatch tutorial_slurm.sh` (an illustrative sketch of such a file follows this list).
- You can now safely exit the ssh session to the HPC if you want.
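For orientation, a tutorial_slurm.sh file along these lines is what the steps above assume. This is a hedged sketch, not the file shipped with the repo: the job name, resource requests, and walltime are assumptions:

```bash
#!/bin/bash
#SBATCH --job-name=trill_tutorial
#SBATCH --mail-user=you@example.edu   # change this to your email, as instructed above
#SBATCH --mail-type=ALL               # email on job start/end/failure
#SBATCH --gres=gpu:1                  # hypothetical: one GPU for the tutorial run
#SBATCH --time=02:00:00               # hypothetical walltime

python3 main.py tutorial_run ../data/query.fasta 1
```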
Misc. Tips
- Make sure there are no "*" characters in the protein sequences (a one-liner for removing them follows this list)
- Don't run jobs on the login node; only submit jobs with sbatch or srun on the HPC
- Caltech HPC Docs: https://www.hpc.caltech.edu/documentation
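A quick way to strip stop-codon asterisks from a FASTA file while leaving header lines untouched (assumes GNU sed, as found on most Linux systems):

```bash
# Delete "*" on every non-header line of the FASTA file, in place
sed -i '/^>/!s/\*//g' ../data/query.fasta
```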