A de novo generation pipeline leveraging evolutionary information for broad-spectrum antimicrobial peptide design
AMPGen: A de novo Generation Pipeline Leveraging Evolutionary Information for Broad-Spectrum Antimicrobial Peptide Design
Overview
AMPGen is a pipeline for generating and evaluating novel antimicrobial peptide (AMP) sequences. It generates new AMP sequences with EvoDiff, a diffusion framework for protein design, and then applies machine learning models to classify the sequences and predict their antimicrobial efficacy. The pipeline achieved a high success rate: 29 of 34 peptides (85.3%) exhibited antimicrobial activity, defined as a MIC value below 200 µg/ml against at least one bacterium.
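The activity criterion and success rate quoted above can be expressed as a short check. This is a minimal sketch: the helper names `is_active` and `success_rate` and the example readings are hypothetical and are not taken from the AMPGen code or the study data.

```python
# Hypothetical helpers illustrating the activity criterion from the overview.
MIC_THRESHOLD_UG_PER_ML = 200.0  # threshold stated in the overview


def is_active(mic_by_strain):
    """Return True if any measured MIC (µg/ml) is below the threshold."""
    return any(mic < MIC_THRESHOLD_UG_PER_ML for mic in mic_by_strain.values())


def success_rate(n_active, n_tested):
    """Percentage of tested peptides that met the criterion."""
    return 100.0 * n_active / n_tested


# Made-up readings, for illustration only.
print(is_active({"E. coli": 64.0, "S. aureus": 512.0}))  # True
print(round(success_rate(29, 34), 1))                    # 85.3
```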
Methods
Dataset Preparation
We compiled AMP and non-AMP datasets from various public databases for use in our classification and MIC prediction models. The AMP dataset includes sequences from APD, DADP, DBAASP, DRAMP, YADAMP, and dbAMP, resulting in a final set of 10,249 unique sequences with antibacterial targets. The non-AMP dataset, sourced from UniProt, consists of 11,989 sequences filtered to exclude those associated with specific antimicrobial keywords.
De Novo AMP Generation
AMP sequences are generated using two pre-trained order-agnostic autoregressive diffusion models (OADM) from the EvoDiff framework (paper).
- Sequence-Based Generation:
  - The sequence-based model, Evodiff-OA_DM_640M, is pre-trained on the UniRef50 dataset, which contains 42 million protein sequences. This model unconditionally generates peptide sequences of length 15-35 aa.
- MSA-Based Generation:
  - The MSA-based model, Evodiff-MSA_OA_DM_MAXSUB, is trained on the OpenFold dataset and generates sequences in two ways:
    - Unconditional generation of peptide sequences of length 15-35 aa.
    - Conditional generation using MSAs with known AMP sequences as representative sequences.
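To make the idea of order-agnostic autoregressive decoding concrete, the toy sketch below starts from a fully masked sequence of length 15-35 and fills positions in a random order. It is not EvoDiff: `toy_conditional_sampler` is a hypothetical placeholder that samples residues uniformly and ignores context, whereas the trained model predicts each residue from the positions already filled.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"


def toy_conditional_sampler(partial, position, rng):
    """Placeholder for a trained network: uniform sampling, context ignored."""
    return rng.choice(AMINO_ACIDS)


def generate_order_agnostic(min_len=15, max_len=35, seed=None):
    """Fill a masked sequence one position at a time in a random order."""
    rng = random.Random(seed)
    length = rng.randint(min_len, max_len)   # inclusive on both ends
    sequence = [MASK] * length               # start fully masked
    order = list(range(length))
    rng.shuffle(order)                       # random decoding order
    for position in order:
        sequence[position] = toy_conditional_sampler(sequence, position, rng)
    return "".join(sequence)


print(generate_order_agnostic(seed=0))
```

Swapping the placeholder for a real model's conditional distribution is what turns this loop into an actual generator; the loop structure itself stays the same.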
Classification and Efficacy Prediction
- XGBoost-based AMP Classifier:
  - Dataset Preparation: Sequences in the AMP dataset were filtered by length, retaining those between 5 and 65 aa, which yielded 9,964 AMP-labeled peptide sequences as the positive dataset.
  - Feature Extraction: Features were primarily derived from the PseKRAAC encoding method and QSOrder encoding parameters, resulting in 14 categories of 1,311 features.
  - Model Training: The data were used to train an XGBoost model, with AMP sequences labeled 1 and non-AMP sequences labeled 0. The model was tuned on the F1 score and AUC using 10-fold cross-validation to limit overfitting.
- LSTM Regression-based MIC Predictor:
  - Dataset Preparation: All entries in the AMP dataset with MIC values were included. There were 7,100 AMP sequences targeting Escherichia coli and 6,482 targeting Staphylococcus aureus. For sequences with multiple MIC values against the same bacterium, the values were averaged. All values were converted to a uniform unit of μM and then log-transformed (log₁₀). Additionally, 7,193 sequences from the non-AMP dataset were labeled with a logMIC value of 4.
  - Model Training: Separate regression models were trained on the Escherichia coli and Staphylococcus aureus datasets using Long Short-Term Memory (LSTM) networks. The datasets were split into training, validation, and test sets in the ratio 72:18:10. Each model comprised two LSTM layers, a dropout layer with a dropout rate of 0.7, and a linear layer. The models were trained with a standard L2 loss and the Adam optimizer.
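The label preparation and data split for the MIC predictor can be sketched as follows. The function names are hypothetical, the use of an arithmetic mean and a shuffled split are assumptions (the text only says the values were "averaged" and gives the ratio), and the unit conversion uses the standard relation µM = µg/ml × 1000 / molecular weight (g/mol).

```python
import math
import random
from statistics import mean

NON_AMP_LOG_MIC = 4.0  # logMIC label the text assigns to non-AMP sequences


def ug_per_ml_to_um(mic_ug_per_ml, mol_weight_g_per_mol):
    """Convert a MIC from µg/ml to µM given the peptide's molecular weight."""
    return mic_ug_per_ml * 1000.0 / mol_weight_g_per_mol


def log_mic_label(mic_values_um):
    """Average repeated measurements (already in µM), then apply log10."""
    return math.log10(mean(mic_values_um))


def split_72_18_10(items, seed=0):
    """Shuffle and split into train/validation/test at roughly 72:18:10."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(0.10 * len(items))
    n_val = round(0.18 * len(items))
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


print(ug_per_ml_to_um(2.0, 2000.0))      # 1.0
print(log_mic_label([50.0, 150.0]))      # 2.0
train, val, test = split_72_18_10(range(100))
print(len(train), len(val), len(test))   # 72 18 10
```

A logMIC of 4 corresponds to 10,000 µM, so the non-AMP label sits well above the concentrations at which a peptide would be considered active.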
Project Structure
── AMPGen
├── AMP_discriminator
│   ├── Discriminator_model
│   │   ├── iFeature
│   │   │   ├── codes
│   │   │   ├── data
│   │   │   └── PseKRAAC
│   │   ├── discriminator.py
│   │   └── features.py
│   └── tools
│       ├── plt.ipynb
│       ├── RF_train.py
│       ├── split.py
│       └── XGboost_train.py
├── AMP_generator
│   ├── calculate_properties.py
│   ├── conditional_generation_msa.py
│   ├── unconditional_generation.py
│   └── unconditional_generation_msa.py
├── data
│   ├── Discriminator_training_data
│   │   ├── classify_all_data_v1.csv
│   │   ├── classify_amp_v1.csv
│   │   ├── classify_nonamp_v1.csv
│   │   └── top14Featured_all.csv
│   ├── example
│   │   ├── msa_files
│   │   │   └── example_1944.a3m
│   │   ├── output
│   │   │   ├── embeddings
│   │   │   └── sequences.fasta
│   │   ├── conditional_generated_sequences.csv
│   │   ├── generated_msa_sequences.csv
│   │   ├── generated_sequences.csv
│   │   └── sequence_properties.csv
│   ├── Scorer_training_data
│   │   ├── regression_ecoli_all.csv
│   │   └── regression_stpa_all.csv
│   ├── combined_database_filtered_v2(1).xlsx
│   └── combined_database_v2(1).xlsx
├── MIC_scorer
│   ├── Scorer_model
│   │   ├── 1stpa_best_model_checkpoint.pth
│   │   ├── 2ecoli_best_model_checkpoint.pth
│   │   ├── ecoliscaler.pkl
│   │   ├── regression.py
│   │   └── stpascaler.pkl
│   ├── tools
│   │   ├── embeddingload.py
│   │   ├── extract.py
│   │   ├── lstm_train.py
│   │   ├── pltlstm.ipynb
│   │   └── tofasta.py
│   └── scorer.py
├── results
│   ├── example_classified_sequences.csv
│   └── example_results.csv
├── .DS_Store
├── .gitattributes
├── .gitignore
├── LICENSE
├── print_directory_tree.py
├── README.md
├── requirements.txt
├── setup.py
└── test.py
Getting Started
Installation Guide
Welcome to the AMPGen project! This guide will walk you through the steps required to install and set up the necessary environment and dependencies to run AMPGen. Before getting started, please ensure that you have Anaconda installed on your system.
Prerequisites
To use the AMPGen system, you need Python 3.11 and a few essential libraries. We'll guide you through setting up a clean conda environment, installing EvoDiff, and then the necessary dependencies. Required Python libraries: numpy, pandas, tqdm, scikit-learn, xgboost.
Setting Up the Environment
- Clone the AMPGen Repository
  Begin by cloning the AMPGen repository to your local machine:
  git clone https://github.com/xiyanxiongnico/AMPGen.git
  cd AMPGen
- Create a Conda Environment
  Next, create a new conda environment with Python 3.11, the version this project requires:
  conda create --name AMPGen python=3.11
  conda activate AMPGen
- Install Dependencies
  With the new environment activated, install all packages:
  pip install -r requirements.txt
- Install PyTorch and Related Packages
  EvoDiff requires PyTorch along with additional libraries. The following command installs a CPU-only build of PyTorch. For better performance, choose the PyTorch build that matches your system's specifications:
  conda install pytorch torchvision torchaudio cpuonly -c pytorch
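After the steps above, a quick check such as the following can confirm that the libraries named in the prerequisites are importable in the active environment. It is a sketch using only the standard library; the module list is an assumption based on the packages this guide mentions (scikit-learn is imported as `sklearn`, PyTorch as `torch`).

```python
import importlib.util
import sys

# Import names assumed from the packages listed in this guide.
REQUIRED_MODULES = ["numpy", "pandas", "tqdm", "sklearn", "xgboost", "torch"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version:", sys.version.split()[0])
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```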
Usage Guide
1. Generate New AMP Sequences
You can generate AMP sequences using the following commands:
Unconditional Generation of AMP Sequences
You can generate antimicrobial peptide (AMP) sequences with EvoDiff's unconditional generation model by running unconditional_generation.py from the AMPGen/AMP_generator directory:
python unconditional_generation.py --total_sequences <total_sequences> --batch_size <batch_size> --output_file <path_to_output_file> --to_device <cuda or cpu>
Arguments:
- --total_sequences (int, required): The total number of sequences to generate.
- --batch_size (int, optional): The batch size for sequence generation (default=1).
- --output_file (str, required): The path to the output CSV file where the generated sequences will be saved.
- --to_device (str, optional): The device on which to run the model, cuda or cpu (default=cuda).
Example:
python unconditional_generation.py --total_sequences 10 --output_file ../data/example/generated_sequences.csv --to_device cpu
This command will generate 10 sequences and save them to generated_sequences.csv.
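For readers who want to see how the documented arguments fit together, the sketch below rebuilds them with `argparse`. It mirrors only the names, types, and defaults listed above; the parser inside unconditional_generation.py may be written differently.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the documented arguments (not the real one)."""
    parser = argparse.ArgumentParser(description="Unconditional AMP generation (sketch)")
    parser.add_argument("--total_sequences", type=int, required=True)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--output_file", type=str, required=True)
    parser.add_argument("--to_device", type=str, default="cuda")
    return parser


# Parse the example command's arguments from an explicit list.
args = build_parser().parse_args(
    ["--total_sequences", "10",
     "--output_file", "../data/example/generated_sequences.csv",
     "--to_device", "cpu"]
)
print(args.total_sequences, args.batch_size, args.to_device)  # 10 1 cpu
```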
Unconditional Generation of AMP Sequences with MSA
You can generate antimicrobial peptide (AMP) sequences with EvoDiff's unconditional MSA-based generation model by running unconditional_generation_msa.py from the AMPGen/AMP_generator directory:
python unconditional_generation_msa.py --total_sequences <total_sequences> --batch_size <batch_size> --n_sequences <n_sequences> --output_csv_file <path_to_output_file> --to_device <cuda or cpu>
Arguments:
- --total_sequences (int, required): The total number of sequences to generate.
- --batch_size (int, optional): The batch size for sequence generation (default=1).
- --n_sequences (int, optional): The number of sequences in the MSA to subsample (default=64).
- --output_csv_file (str, required): The path to the output CSV file where the generated sequences will be saved.
- --to_device (str, optional): The device on which to run the model, cuda or cpu (default=cuda).
Example:
python unconditional_generation_msa.py --total_sequences 10 --output_csv_file ../data/example/generated_msa_sequences.csv
This command will generate 10 sequences with the default batch size of 1 and save them to generated_msa_sequences.csv. This model requires significant computational power, so we recommend running it on a GPU.
Conditional Generation of AMP Sequences with MSA
You can generate antimicrobial peptide (AMP) sequences with EvoDiff's conditional MSA-based generation model by running conditional_generation_msa.py from the AMPGen/AMP_generator directory:
python conditional_generation_msa.py --directory_path <path_to_msa_directory> --output_csv_file <path_to_output_file> --max_retries <max_retries> --to_device <cuda or cpu> --total_sequences <total_sequences>
Arguments:
- --directory_path (str, required): Path to the directory containing the MSA files (in .a3m format).
- --output_csv_file (str, required): The path to the output CSV file where the generated sequences will be saved.
- --max_retries (int, optional): Maximum number of retries for processing each file (default: 5).
- --to_device (str, optional): The device on which to run the model, cuda or cpu (default=cuda).
- --total_sequences (int, required): The total number of sequences to generate.
Example:
python conditional_generation_msa.py --directory_path ../data/example/msa_files/ --output_csv_file ../data/example/conditional_generated_sequences.csv --to_device cpu --total_sequences 10
This command will process the MSA file example_1944.a3m in the msa_files directory and generate sequences, saving the results to conditional_generated_sequences.csv.
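The per-file retry behaviour implied by --max_retries can be pictured with the sketch below. The helper names and the `generate_fn` callback are hypothetical stand-ins for whatever the script does with each .a3m file; only the directory scan and the retry loop are illustrated.

```python
from pathlib import Path


def process_with_retries(a3m_path, generate_fn, max_retries=5):
    """Call generate_fn(path) up to max_retries times; return None if all fail."""
    for attempt in range(1, max_retries + 1):
        try:
            return generate_fn(a3m_path)
        except Exception as error:
            print(f"{a3m_path}: attempt {attempt} failed ({error})")
    return None


def process_directory(directory_path, generate_fn, max_retries=5):
    """Apply generate_fn to every .a3m file in a directory, with retries."""
    results = {}
    for a3m_path in sorted(Path(directory_path).glob("*.a3m")):
        results[a3m_path.name] = process_with_retries(a3m_path, generate_fn, max_retries)
    return results
```

For example, `process_directory("../data/example/msa_files/", my_generate_fn)` would return a dictionary keyed by file name, with `None` for any file that failed on every attempt (`my_generate_fn` being a function you supply).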
2. Calculate Properties of Generated Sequences
Calculate Properties of Generated AMP Sequences
You can calculate the physical and chemical properties of the generated AMP sequences, including molecular weight, net charge, and hydrophobicity, by running calculate_properties.py from the AMPGen/AMP_generator directory:
python calculate_properties.py --input_csv_file <path_to_input_file> --output_csv_file <path_to_output_file>
Arguments:
- --input_csv_file (str, required): The path to the input CSV file containing sequences.
- --output_csv_file (str, required): The path to the output CSV file where the calculated properties will be saved.
Example:
python calculate_properties.py --input_csv_file ../data/example/generated_sequences.csv --output_csv_file ../data/example/sequence_properties.csv
This command will calculate the properties of the sequences in generated_sequences.csv and save the results to sequence_properties.csv.
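As a rough illustration of the three properties named above, the sketch below uses common textbook definitions: average residue masses plus one water for molecular weight, a simple count of K/R versus D/E for net charge near neutral pH, and the mean Kyte-Doolittle hydropathy (GRAVY) for hydrophobicity. These choices are assumptions; calculate_properties.py may use different scales or a library implementation.

```python
# Average residue masses in Da (amino acid minus water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(sequence):
    """Sum of average residue masses plus one water for the free termini."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def net_charge(sequence):
    """Crude net charge: +1 per K/R, -1 per D/E (His and termini ignored)."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive - negative


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


peptide = "KRKLLAAG"  # made-up sequence, for illustration only
print(round(molecular_weight(peptide), 2), net_charge(peptide), round(gravy(peptide), 2))
```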
3. Identify AMP Candidates
AMP Discriminator
To classify sequences as antimicrobial peptides (AMPs) using the AMP Discriminator, run discriminator.py from the AMPGen/AMP_discriminator/Discriminator_model directory:
python discriminator.py --train_path <path_to_training_csv> --pre_path <path_to_input_csv> --out_path <path_to_output_csv>
Arguments:
- --train_path or -tp (str, required): The path to the CSV file containing the training data.
- --pre_path or -pp (str, required): The path to the input CSV file containing the sequences to classify.
- --out_path or -op (str, required): The path to the output CSV file where the classification results will be saved.
- --to_device (str, optional): The device on which to run the model, cuda or cpu (default=cuda).
Example:
python discriminator.py --train_path ../../data/Discriminator_training_data/top14Featured_all.csv --pre_path ../../data/example/sequence_properties.csv --out_path ../../results/example_classified_sequences.csv --to_device cpu
This command will classify sequences in sequence_properties.csv using the model trained on top14Featured_all.csv, and save the results to example_classified_sequences.csv.
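The Methods section describes a 5 to 65 aa length filter for the classifier's positive set and tuning with 10-fold cross-validation on the F1 score. The sketch below shows only that bookkeeping in plain Python. The function names are hypothetical, the deduplication step and the unstratified shuffle are assumptions, and the actual classifier is an XGBoost model that is not reproduced here.

```python
import random


def filter_by_length(sequences, min_len=5, max_len=65):
    """Keep unique sequences whose length lies within [min_len, max_len]."""
    seen, kept = set(), []
    for seq in sequences:
        seq = seq.strip().upper()
        if min_len <= len(seq) <= max_len and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, folds[i]


def f1_score(y_true, y_pred):
    """F1 for binary labels, with AMP = 1 and non-AMP = 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

In a real run, each fold's training indices would be used to fit the model and the validation indices to compute F1 and AUC, with the scores averaged across the ten folds.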
4. Run the MIC Scorer
This section provides a step-by-step guide to using the MIC scorer. The scorer predicts the MIC (Minimum Inhibitory Concentration) values using a pre-trained LSTM model based on protein embeddings generated by an ESM model.
Step 1: Convert Sequences to FASTA Format
Use the to_fasta function to convert your input CSV file (which contains sequences) into a FASTA file:
python mic_scorer.py --from_csv_path path/to/sequences.csv --to_fasta_path path/to/output/sequences.fasta
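The CSV-to-FASTA conversion in this step amounts to writing one header line and one sequence line per record. The sketch below is not the project's to_fasta function: the column name `Sequence` and the `seq_N` identifiers are assumptions, and it works on in-memory text so that it stays self-contained.

```python
import csv
import io


def csv_to_fasta(csv_text, sequence_column="Sequence"):
    """Convert CSV text with a sequence column into FASTA-formatted text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for index, row in enumerate(reader, start=1):
        sequence = row[sequence_column].strip()
        if sequence:  # skip rows with an empty sequence field
            records.append(f">seq_{index}\n{sequence}\n")
    return "".join(records)


example_csv = "Sequence\nKKLLKKLL\nGGAAGGAA\n"  # made-up sequences
print(csv_to_fasta(example_csv), end="")
```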
Step 2: Generate Embeddings with ESM Model
Once the sequences are in FASTA format, generate their embeddings using the get_embedding function and a pre-trained ESM model:
python mic_scorer.py --from_csv_path path/to/sequences.csv --esm_model_location esm_model_name --output_dir path/to/output/embeddings/
- Replace esm_model_name with the location of your ESM model (e.g., esm2_t36_3B_UR50D).
- The embeddings will be saved to the specified output directory.
Step 3: Load Embeddings
Use the load_embeding function to load the generated embeddings and merge them with the input sequence data:
python mic_scorer.py --from_csv_path path/to/sequences.csv --output_dir path/to/output/embeddings/
Step 4: Predict MIC Values
Finally, predict MIC values using a pre-trained LSTM model. The get_predicted_mic function handles this task:
python mic_scorer.py --from_csv_path path/to/sequences.csv --scaler_data_path path/to/scaler.pkl --model_path path/to/model.pth --result_path path/to/save/results.csv
Full Command Example
Run the entire MIC scoring process from sequence conversion to MIC prediction in AMPGen/MIC_scorer/scorer.py:
python scorer.py --from_csv_path ../results/example_classified_sequences.csv --to_fasta_path ../data/example/output/sequences.fasta --output_dir ../data/example/output/embeddings/ --scaler_data_path ./Scorer_model/stpascaler.pkl --model_path ./Scorer_model/1stpa_best_model_checkpoint.pth --result_path ../results/example_results.csv --to_device cpu
This command:
- Converts the sequences to FASTA format.
- Generates embeddings using the specified ESM model.
- Loads the embeddings and prepares the data.
- Predicts MIC values using the pre-trained LSTM model.
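The example commands in this guide can be chained into one end-to-end run. The sketch below copies the documented example commands and their working directories into a list and, by default, only prints them (a dry run). It assumes it is started from the repository root, and `run_pipeline` is a hypothetical helper rather than part of AMPGen.

```python
import shlex
import subprocess

# (working directory, command) pairs copied from the examples in this guide.
PIPELINE_STEPS = [
    ("AMP_generator",
     ["python", "unconditional_generation.py", "--total_sequences", "10",
      "--output_file", "../data/example/generated_sequences.csv",
      "--to_device", "cpu"]),
    ("AMP_generator",
     ["python", "calculate_properties.py",
      "--input_csv_file", "../data/example/generated_sequences.csv",
      "--output_csv_file", "../data/example/sequence_properties.csv"]),
    ("AMP_discriminator/Discriminator_model",
     ["python", "discriminator.py",
      "--train_path", "../../data/Discriminator_training_data/top14Featured_all.csv",
      "--pre_path", "../../data/example/sequence_properties.csv",
      "--out_path", "../../results/example_classified_sequences.csv",
      "--to_device", "cpu"]),
    ("MIC_scorer",
     ["python", "scorer.py",
      "--from_csv_path", "../results/example_classified_sequences.csv",
      "--to_fasta_path", "../data/example/output/sequences.fasta",
      "--output_dir", "../data/example/output/embeddings/",
      "--scaler_data_path", "./Scorer_model/stpascaler.pkl",
      "--model_path", "./Scorer_model/1stpa_best_model_checkpoint.pth",
      "--result_path", "../results/example_results.csv",
      "--to_device", "cpu"]),
]


def run_pipeline(steps, dry_run=True):
    """Print each command; also execute it in its directory if dry_run is False."""
    shown = []
    for workdir, command in steps:
        line = f"(cd {workdir} && {shlex.join(command)})"
        print(line)
        shown.append(line)
        if not dry_run:
            subprocess.run(command, cwd=workdir, check=True)
    return shown


run_pipeline(PIPELINE_STEPS)  # dry run: prints the four commands only
```

Passing `dry_run=False` would execute the steps in order and stop at the first failing command because of `check=True`.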
Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
License
This project is licensed under the MIT License - see the LICENSE file for details.