A tool for guided-generation-based protein design and engineering that scores candidates with FoldX, or with any other property added to a custom scoring function.
Protein Understanding and Design Using Guided Generation
Introduction
Guided generation in protein design and engineering uses external information to steer the output of a generative model toward specific biological or functional goals. Protein language models (PLMs) often struggle to generate sequences with specific, desired properties that are not strongly represented in their training data.
This repository contains a Python-based framework for computational protein design that uses ESM3's guided generation capabilities to refine protein sequences, optimizing for structural stability as predicted by the FoldX energy function.
The primary goal of this tool is to take a wild-type protein structure and redesign a user-defined portion of its sequence to discover novel variants with enhanced stability ($ΔΔG$). The code is customizable to include other protein properties in the guided-generation custom-scoring function.
Folder Structure
ESM3-Guided-Generation-Based-Protein-Engineering
│
└─── data <- folder to keep the respective protein data bank and crystallographic information file
│ │
│ └───cif
│ └───pdb
│
└───ESM_Cookbook <- experimental notebooks provided by ESM for testing purposes
│
└───result <- folder to store the plots and obtained results
│
└───foldx <- folder to store pdb files, foldx generated repaired files, foldx binary, and rotabase.txt
│
└───logs <- folder to store the generated log from experiments, it includes all information and processes.
│
└───src <- main source folder
│ │
│ └───notebooks <- includes file to convert cif to pdb, analyzing pdb files, and code to generate plots from log file
│ │ │
│ │ └───ciftopdb.ipynb
│ │ └───PDB_analysis.ipynb
│ │ └───plot.ipynb
│ │
│ └───esm_foldx_guidedgeneration
│ │
│ └───guided_generation.py <- derivative-free guided generation, parallel foldx run
│ └───main.py <- main python file
│ └───scoring_utils.py <- pdb parsing, foldx call, foldx scorer
│ └───guided_generation.sh <- sample batch script to run on an HPC cluster using Slurm
│
└───.gitignore
└───environment.yml
└───pyproject.toml
└───LICENSE
└───README.md
└───setup.py
Features
- Guided Design: Leverages the state-of-the-art ESM3 protein language model to intelligently generate new sequence variants.
- Stability Scoring: Uses the physically-grounded FoldX energy function to score the stability of each generated candidate.
- Proportional Unmasking: Employs an adaptive unmasking schedule that makes large changes initially and fine-tunes the sequence in later steps.
- Parallel Processing: Significantly accelerates the scoring of candidates by running multiple FoldX instances in parallel.
- Automated Workflow: A single script handles PDB repair, sequence masking, iterative generation, scoring, and results visualization.
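As an illustration of the proportional unmasking idea, the sketch below reveals a fixed fraction of the still-masked positions at each step, so early steps change many residues and later steps change only a few. The function name `proportional_unmask_counts` and the `fraction` parameter are hypothetical and are not taken from this package; the schedule actually implemented in guided_generation.py may differ.

```python
import math
from typing import List


def proportional_unmask_counts(num_masked: int, num_steps: int, fraction: float = 0.5) -> List[int]:
    """Return how many masked positions to reveal at each decoding step.

    Hypothetical schedule: each step reveals `fraction` of the positions that
    are still masked (at least one), and the final step reveals the rest.
    """
    counts = []
    remaining = num_masked
    for step in range(num_steps):
        if remaining == 0:
            counts.append(0)  # nothing left to reveal
            continue
        if step == num_steps - 1:
            n = remaining  # last step: reveal everything that is left
        else:
            n = min(remaining, max(1, math.ceil(remaining * fraction)))
        counts.append(n)
        remaining -= n
    return counts


if __name__ == "__main__":
    # 40 masked residues decoded over 8 steps
    print(proportional_unmask_counts(40, 8))
```

Whatever the exact rule, the counts always sum to the number of masked positions, so every masked residue is assigned by the final step.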
Methodology
The design process is an iterative, guided search that can be thought of as a Design-Build-Test cycle performed entirely in silico.
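The loop can be pictured with a minimal, self-contained sketch of a derivative-free search: propose several candidates, score each one, and keep the best. Everything here is an assumption made for illustration: random point mutation stands in for ESM3 sampling, `toy_score` stands in for a FoldX-based score, and lower scores are treated as better. None of these names come from the package.

```python
import random
from typing import Callable, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def guided_search(
    wild_type: str,
    design_positions: Sequence[int],
    score_fn: Callable[[str], float],
    num_steps: int = 8,
    num_samples_per_step: int = 4,
    seed: int = 0,
) -> Tuple[str, float]:
    """Toy derivative-free search: lower score is treated as better."""
    rng = random.Random(seed)
    best_seq, best_score = wild_type, score_fn(wild_type)
    for _ in range(num_steps):
        # "Design": propose candidates (random mutation stands in for ESM3).
        candidates = []
        for _ in range(num_samples_per_step):
            residues = list(best_seq)
            residues[rng.choice(design_positions)] = rng.choice(AMINO_ACIDS)
            candidates.append("".join(residues))
        # "Test": score every candidate and keep the best one of this step.
        step_score, step_seq = min((score_fn(c), c) for c in candidates)
        if step_score < best_score:
            best_seq, best_score = step_seq, step_score
    return best_seq, best_score


def toy_score(seq: str) -> float:
    """Placeholder score (NOT FoldX): counts non-alanine residues."""
    return float(sum(1 for r in seq if r != "A"))


if __name__ == "__main__":
    print(guided_search("MKTLLV", [2, 3], toy_score))
```

Only the scoring function needs to change to optimize a different property, which is the sense in which the custom scoring function is pluggable.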
Installation and Setup
This framework is designed for a Linux-based environment with CPU/GPU acceleration.
pip install esm_foldx_guidedgeneration
System Requirements
- Operating System: Linux (tested on NERSC Perlmutter Custom Linux-based kernel)
- Processor: Modern multi-core CPU (8+ cores recommended for parallel scoring)
- Memory (RAM): 64 GB or more recommended
- GPU: NVIDIA GPU with CUDA support (16GB+ VRAM recommended for the 15B ESM3 model)
Dependencies
This project relies on several key pieces of software.
- Python & Conda: Python 3.8+ is required. It is highly recommended to manage the environment using Conda.
- FoldX Modeling Suite: This package calls the FoldX executable to perform stability calculations.
  - Obtain a FoldX License: Request a free academic license from the FoldX website.
  - Download FoldX: After receiving your license, download the Linux version of the FoldX executable.
  - Set Up foldx Directory:
    - In the root of this repository, create a directory named foldx.
    - Place the foldx executable and the rotabase.txt file inside this foldx directory.
    - Place the pdb files inside the foldx directory.
- Python Libraries: All required Python libraries and their specific versions are listed in the environment.yml file. Key dependencies include:
  - pytorch
  - esm
  - pandas
  - matplotlib & seaborn
- Hugging Face Account: The ESM3 model is a gated model and requires a Hugging Face account for access.
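Before a long run it can help to confirm that the foldx directory is populated as described above. The following is a small sketch, not part of the package: the helper name `missing_foldx_files` is hypothetical, and it assumes the executable is named exactly foldx (a downloaded binary may carry a different name, in which case adjust `required`).

```python
from pathlib import Path
from typing import List, Sequence


def missing_foldx_files(
    foldx_dir: str = "foldx",
    required: Sequence[str] = ("foldx", "rotabase.txt"),
) -> List[str]:
    """Return the required file names that are absent from foldx_dir."""
    root = Path(foldx_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_foldx_files()
    if missing:
        print("Missing from foldx/:", ", ".join(missing))
    else:
        print("foldx/ directory looks complete.")
```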
Environment Setup
- Log in to Hugging Face: Before setting up the environment, you must authenticate with Hugging Face.
  - Go to the ESM3 model page and accept the terms of use.
  - Go to your Hugging Face tokens page, generate a new read token, and copy it.
  - In your terminal, run the login command and paste your token when prompted:
    huggingface-cli login
- Create the Conda Environment: You can recreate the necessary Conda environment using the provided file. This will install all the required Python packages.
    # Create the conda environment from the file
    conda env create -f environment.yml
    # Activate the new environment
    conda activate proteinenv
Local Package Installation
To make the scripts callable from anywhere, install the package in "editable" mode. From the root of the repository, run:
pip install -e .
Usage
The main script can be run from the command line. You must provide a PDB filename, chain ID, and masking percentage.
python src/esm_foldx_guidedgeneration/main.py --pdb_filename "1fbm.pdb" --chain_id "A" --masking_percentage 0.4 --num_decoding_steps 32 --num_samples_per_step 20 --num_workers 20
- Choose masking_percentage based on the size of the protein: for larger proteins use a smaller masking_percentage; for smaller proteins, 0.4-0.5 works well. Set num_decoding_steps and num_samples_per_step according to the number of optimization iterations you want. Set num_workers to the same value as num_samples_per_step so that all foldx calls for a step run in parallel.
- Results, including log files and plots of the $ΔΔG$ trajectory, will be saved in the logs/ and results/ directories.
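To show why num_workers is usually matched to num_samples_per_step, here is a minimal sketch of scoring one step's candidates concurrently. It assumes each score comes from an external process, so a thread pool is enough to overlap the calls; `score_in_parallel` and the stand-in scorer are hypothetical and do not reproduce the package's own parallel FoldX runner.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def score_in_parallel(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    num_workers: int,
) -> List[float]:
    """Score all candidates concurrently, returning scores in input order."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves the order of `candidates` in its results.
        return list(pool.map(score_fn, candidates))


if __name__ == "__main__":
    # Stand-in scorer (NOT FoldX): sequence length as a dummy score.
    samples = ["MKTLLV", "MKALLV", "MKAALV"]
    print(score_in_parallel(samples, len, num_workers=len(samples)))
```

With one worker per sample, a step takes roughly as long as its slowest single scoring call rather than the sum of all of them.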
To run on an HPC system such as Perlmutter, use the provided guided_generation.sh script:
sbatch guided_generation.sh
Make sure to change the #SBATCH --array=0-1 directive to match the number of PDB files submitted in the job.
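The relationship between the array range and the PDB files can be sketched as follows. Slurm sets the SLURM_ARRAY_TASK_ID environment variable for each array task; how guided_generation.sh actually maps that index to a file is not shown here, so the helpers `array_directive` and `pdb_for_task`, the sorted-order convention, and the file name other.pdb are assumptions for illustration.

```python
import os
from typing import Mapping, Sequence


def array_directive(num_pdb_files: int) -> str:
    """Build the zero-based array directive for a given number of PDB files."""
    if num_pdb_files < 1:
        raise ValueError("need at least one PDB file")
    return f"#SBATCH --array=0-{num_pdb_files - 1}"


def pdb_for_task(pdb_files: Sequence[str], env: Mapping[str, str] = os.environ) -> str:
    """Pick the PDB file for this array task (assumes sorted file order)."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    return sorted(pdb_files)[task_id]


if __name__ == "__main__":
    files = ["1fbm.pdb", "other.pdb"]  # "other.pdb" is a placeholder name
    print(array_directive(len(files)))
    print(pdb_for_task(files))
```

Two files therefore correspond to the range 0-1, three files to 0-2, and so on.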
License
This project is licensed under the Apache License.
Acknowledgments
This work was carried out during a summer internship at NERSC from June 2025 to September 2025.
- Perlmutter Supercomputer
- Lawrence Berkeley National Laboratory
- National Energy Research Scientific Computing Center
- ai4protein group
- University of California San Diego (Boolean Lab)