A de novo generation pipeline leveraging evolutionary information for broad-spectrum antimicrobial peptide design

Project description

AMPGen: A de novo Generation Pipeline Leveraging Evolutionary Information for Broad-Spectrum Antimicrobial Peptide Design

Overview

AMPGen is a pipeline for generating and evaluating novel antimicrobial peptide (AMP) sequences. It generates candidate sequences with EvoDiff, a diffusion framework for protein design, and then uses machine learning models to classify the candidates and predict their antimicrobial efficacy. Of 34 peptides tested, 29 (85.3%) exhibited antimicrobial activity, defined as an MIC value below 200 µg/ml against at least one bacterium.

Methods

Dataset Preparation

We compiled AMP and non-AMP datasets from various public databases for use in our classification and MIC prediction models. The AMP dataset includes sequences from APD, DADP, DBAASP, DRAMP, YADAMP, and dbAMP, resulting in a final set of 10,249 unique sequences with antibacterial targets. The non-AMP dataset, sourced from UniProt, consists of 11,989 sequences filtered to exclude those associated with specific antimicrobial keywords.
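
The merging and filtering described above could look roughly like the following sketch. It assumes each database has already been exported as plain sequences and that each UniProt entry carries a free-text annotation. The keyword list, function names, and record layout are hypothetical and are not taken from the AMPGen code.

```python
from typing import Iterable, List, Tuple

# Hypothetical keyword list; the keywords actually used by AMPGen are not stated here.
ANTIMICROBIAL_KEYWORDS = ("antimicrobial", "antibiotic", "antibacterial")


def merge_unique(*sources: Iterable[str]) -> List[str]:
    """Merge sequences from several databases, keeping the first copy of each."""
    seen = set()
    merged = []
    for source in sources:
        for raw in source:
            seq = raw.strip().upper()
            # Skip empty entries and sequences already collected from another source.
            if not seq or seq in seen:
                continue
            seen.add(seq)
            merged.append(seq)
    return merged


def filter_non_amp(records: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep sequences whose annotation mentions none of the keywords."""
    kept = []
    for seq, annotation in records:
        text = annotation.lower()
        if not any(keyword in text for keyword in ANTIMICROBIAL_KEYWORDS):
            kept.append(seq)
    return kept
```

Normalizing case and whitespace before deduplication avoids counting the same peptide twice when two databases format it differently.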

De Novo AMP Generation

AMP sequences are generated using two pre-trained order-agnostic autoregressive diffusion models (OADM) from the EvoDiff framework (paper).

  1. Sequence-Based Generation:

    • The sequence-based model, Evodiff-OA_DM_640M, is pre-trained on the UniRef50 dataset, which contains 42 million protein sequences. This model unconditionally generates peptide sequences of length 15-35 aa.
  2. MSA-Based Generation:

    • The MSA-based model, Evodiff-MSA_OA_DM_MAXSUB, is trained on the OpenFold dataset and generates sequences in two ways:
      • Unconditional generation of peptide sequences of length 15-35 aa.
      • Conditional generation using MSAs with known AMP sequences as representative sequences.
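
As a rough illustration of unconditional generation over the 15-35 aa range, the sketch below samples a target length for each peptide and hands it to a generator callable. Uniform length sampling is an assumption, and `generate_fn` is a hypothetical stand-in for a call into an EvoDiff model. It is not part of the EvoDiff or AMPGen API.

```python
import random
from typing import Callable, List, Optional

MIN_LEN, MAX_LEN = 15, 35  # peptide length range stated above


def sample_lengths(n: int, seed: Optional[int] = None) -> List[int]:
    """Draw n target lengths from [MIN_LEN, MAX_LEN]; uniform sampling is an assumption."""
    rng = random.Random(seed)
    return [rng.randint(MIN_LEN, MAX_LEN) for _ in range(n)]


def generate_peptides(
    total: int,
    generate_fn: Callable[[int], str],
    seed: Optional[int] = None,
) -> List[str]:
    """Call generate_fn once per sampled length; generate_fn stands in for a model call."""
    return [generate_fn(length) for length in sample_lengths(total, seed)]


# Dummy generator used only to show the calling convention.
peptides = generate_peptides(5, lambda length: "K" * length, seed=0)
```

Passing the generator in as a callable keeps the length logic testable without loading a model.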

Classification and Efficacy Prediction

  1. XGBoost-based AMP Classifier:

    • Dataset Preparation: Sequences in the AMP dataset were filtered based on length, retaining those within the range of 5 to 65 aa, resulting in a total of 9,964 AMP-labeled peptide sequences as the positive dataset.
    • Feature Extraction: Features were derived primarily from the PseKRAAC and QSOrder encodings, yielding 1,311 features in 14 categories.
    • Model Training: An XGBoost model was trained on these data, with AMP sequences labeled 1 and non-AMP sequences labeled 0. The model was tuned on the F1 score and AUC using 10-fold cross-validation to limit overfitting.
  2. LSTM Regression-based MIC Predictor:

    • Dataset Preparation: All entries in the AMP dataset with MIC values were included. There were 7,100 AMP sequences targeting Escherichia coli and 6,482 targeting Staphylococcus aureus. For sequences with multiple MIC values against the same bacterium, the values were averaged and converted to a uniform unit of μM. These values were then log-transformed (log₁₀). Additionally, 7,193 sequences from the non-AMP dataset were labeled with a logMIC value of 4.
    • Model Training: Separate regression training was conducted on the Escherichia coli and Staphylococcus aureus datasets using Long Short-Term Memory (LSTM) models. The datasets were split into training, validation, and test sets in the ratio of 72:18:10. Each model comprised two LSTM layers, a dropout layer with a dropout rate of 0.7, and a linear layer. The models were compiled using standard L2 loss and optimized with the Adam optimizer.
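
The label preparation and the 72:18:10 split for the MIC predictor can be sketched as follows. The conversion uses the standard relation µM = (µg/ml) × 1000 / molecular weight (g/mol). The function names are hypothetical, and the rounding used for the split is an assumption rather than AMPGen's actual behavior.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

NON_AMP_LOG_MIC = 4.0  # logMIC label given to non-AMP sequences, as stated above


def ug_per_ml_to_um(mic_ug_per_ml: float, mol_weight_da: float) -> float:
    """Convert an MIC from µg/ml to µM using the peptide's molecular weight."""
    return mic_ug_per_ml * 1000.0 / mol_weight_da


def log_mic_label(mic_values_um: List[float]) -> float:
    """Average repeated MIC measurements (in µM), then take log10."""
    return math.log10(mean(mic_values_um))


def split_counts(n: int, ratios: Sequence[int] = (72, 18, 10)) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes; the remainder goes to the test set."""
    train = n * ratios[0] // 100
    val = n * ratios[1] // 100
    return train, val, n - train - val
```

For instance, a peptide of 2,000 Da with an MIC of 200 µg/ml corresponds to 100 µM, or a logMIC of 2, well below the non-AMP label of 4.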

Project Structure

── AMPGen
    ├── AMP_discriminator
    │   ├── Discriminator_model
    │   │   ├── iFeature
    │   │   │   ├── codes
    │   │   │   ├── data
    │   │   │   └── PseKRAAC
    │   │   ├── discriminator.py
    │   │   └── features.py
    │   └── tools
    │       ├── plt.ipynb
    │       ├── RF_train.py
    │       ├── split.py
    │       └── XGboost_train.py
    ├── AMP_generator
    │   ├── calculate_properties.py
    │   ├── conditional_generation_msa.py
    │   ├── unconditional_generation.py
    │   └── unconditional_generation_msa.py
    ├── data
    │   ├── Discriminator_training_data
    │   │   ├── classify_all_data_v1.csv
    │   │   ├── classify_amp_v1.csv
    │   │   ├── classify_nonamp_v1.csv
    │   │   └── top14Featured_all.csv
    │   ├── example
    │   │   ├── msa_files
    │   │   │   └── example_1944.a3m
    │   │   ├── output
    │   │   │   ├── embeddings
    │   │   │   └── sequences.fasta
    │   │   ├── conditional_generated_sequences.csv
    │   │   ├── generated_msa_sequences.csv
    │   │   ├── generated_sequences.csv
    │   │   └── sequence_properties.csv
    │   ├── Scorer_training_data
    │   │   ├── regression_ecoli_all.csv
    │   │   └── regression_stpa_all.csv
    │   ├── combined_database_filtered_v2(1).xlsx
    │   └── combined_database_v2(1).xlsx
    ├── MIC_scorer
    │   ├── Scorer_model
    │   │   ├── 1stpa_best_model_checkpoint.pth
    │   │   ├── 2ecoli_best_model_checkpoint.pth
    │   │   ├── ecoliscaler.pkl
    │   │   ├── regression.py
    │   │   └── stpascaler.pkl
    │   ├── tools
    │   │   ├── embeddingload.py
    │   │   ├── extract.py
    │   │   ├── lstm_train.py
    │   │   ├── pltlstm.ipynb
    │   │   └── tofasta.py
    │   └── scorer.py
    ├── results
    │   ├── example_classified_sequences.csv
    │   └── example_results.csv    
    ├── .DS_Store
    ├── .gitattributes
    ├── .gitignore
    ├── LICENSE
    ├── print_directory_tree.py
    ├── README.md
    ├── requirements.txt
    ├── setup.py
    └── test.py

Getting Started

Installation Guide

Welcome to the AMPGen project! This guide will walk you through the steps required to install and set up the necessary environment and dependencies to run AMPGen. Before getting started, please ensure that you have Anaconda installed on your system.

Prerequisites

To use the AMPGen system, you need Python 3.11 and a few essential libraries. We'll guide you through setting up a clean conda environment, installing EvoDiff, and then the necessary dependencies. Required Python libraries: numpy, pandas, tqdm, scikit-learn, xgboost.

Setting Up the Environment

  1. Clone the AMPGen Repository
    Begin by cloning the AMPGen repository to your local machine:

    git clone https://github.com/xiyanxiongnico/AMPGen.git
    cd AMPGen
    
  2. Create a Conda Environment
    Next, create a new conda environment with Python 3.11, which is the recommended version for this project:

    conda create --name AMPGen python=3.11
    conda activate AMPGen
    
  3. Install Environment
    With the new environment activated, install all packages:

    pip install -r requirements.txt
    
  4. Install PyTorch and Related Packages
    EvoDiff requires PyTorch along with additional libraries. The following example installs a CPU-only build of PyTorch. For better performance, choose the PyTorch build that matches your system’s hardware (for example, a CUDA build on a machine with an NVIDIA GPU). Install the packages using the following command:

    conda install pytorch torchvision torchaudio cpuonly -c pytorch
    

Usage Guide

1. Generate New AMP Sequences

You can generate AMP sequences using the following commands:

Unconditional Generation of AMP Sequences

You can generate antimicrobial peptide (AMP) sequences using EvoDiff's unconditional generation model by running the following command in AMPGen/AMP_generator/unconditional_generation.py:

python unconditional_generation.py --total_sequences <total_sequences> --batch_size <batch_size> --output_file <path_to_output_file> --to_device <cuda or cpu>

Arguments:

  • --total_sequences (int, required): The total number of sequences to generate.
  • --batch_size (int, optional): The batch size for sequence generation (default=1).
  • --output_file (str, required): The path to the output CSV file where the generated sequences will be saved.
  • --to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).

Example:

python unconditional_generation.py --total_sequences 10 --output_file ../data/example/generated_sequences.csv --to_device cpu

This command will generate 10 sequences and save them to generated_sequences.csv.

Unconditional Generation of AMP Sequences with MSA

You can generate antimicrobial peptide (AMP) sequences using EvoDiff's unconditional generation model with MSA by running the following command in AMPGen/AMP_generator/unconditional_generation_msa.py:

python unconditional_generation_msa.py --total_sequences <total_sequences> --batch_size <batch_size> --n_sequences <n_sequences> --output_csv_file <path_to_output_file> --to_device <cuda or cpu>

Arguments:

  • --total_sequences (int, required): The total number of sequences to generate.
  • --batch_size (int, optional): The batch size for sequence generation (default=1).
  • --n_sequences (int, optional): The number of sequences in MSA to subsample (default=64).
  • --output_csv_file (str, required): The path to the output CSV file where the generated sequences will be saved.
  • --to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).

Example:

python unconditional_generation_msa.py --total_sequences 10 --output_csv_file ../data/example/generated_msa_sequences.csv

This command will generate 10 sequences with the default batch size of 1 and save them to generated_msa_sequences.csv. This model requires significant computational power, so we recommend running it on a GPU.

Conditional Generation of AMP Sequences with MSA

You can generate antimicrobial peptide (AMP) sequences using EvoDiff's conditional generation model with MSA by running the following command in AMPGen/AMP_generator/conditional_generation_msa.py:

python conditional_generation_msa.py --directory_path <path_to_msa_directory> --output_csv_file <path_to_output_file> --max_retries <max_retries> --to_device <cuda or cpu> --total_sequences <total_sequences>

Arguments:

  • --directory_path (str, required): Path to the directory containing the MSA files (in .a3m format).
  • --output_csv_file (str, required): The path to the output CSV file where the generated sequences will be saved.
  • --max_retries (int, optional): Maximum number of retries for processing each file (default: 5).
  • --to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).
  • --total_sequences (int, required): The total number of sequences to generate.

Example:

python conditional_generation_msa.py --directory_path ../data/example/msa_files/ --output_csv_file ../data/example/conditional_generated_sequences.csv --to_device cpu --total_sequences 10

This command will process the MSA file example_1944.a3m from the msa_files directory and generate 10 sequences, saving the results to conditional_generated_sequences.csv.
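
For readers unfamiliar with the .a3m format, here is a minimal reader sketch. A3M is FASTA-like: lowercase letters mark insertions relative to the first (query) sequence and `-` marks deletions, so removing lowercase letters leaves rows of equal length. The function names are hypothetical, and this is not the parser AMPGen or EvoDiff uses.

```python
from typing import List, Tuple


def read_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse A3M text into (header, sequence) pairs, preserving file order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines that some tools write at the top.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_insertions(seq: str) -> str:
    """Drop lowercase insertion columns so every row matches the query length."""
    return "".join(ch for ch in seq if not ch.islower())


example = ">query\nGLFDK\n>hit1\nGL-DaK\n"
aligned = [strip_insertions(seq) for _, seq in read_a3m(example)]
```

Here `aligned` holds two rows of length 5, with the first row being the representative (query) sequence.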

2. Calculate Properties of Generated Sequences

Calculate Properties of Generated AMP Sequences

You can calculate the physical and chemical properties of the generated AMP sequences, including molecular weight, net charge, and hydrophobicity, by running the following command in AMPGen/AMP_generator/calculate_properties.py:

python calculate_properties.py --input_csv_file <path_to_input_file> --output_csv_file <path_to_output_file>

Arguments:

  • --input_csv_file (str, required): The path to the input CSV file containing sequences.
  • --output_csv_file (str, required): The path to the output CSV file where the calculated properties will be saved.

Example:

python calculate_properties.py --input_csv_file ../data/example/generated_sequences.csv --output_csv_file ../data/example/sequence_properties.csv

This command will calculate the properties of the sequences in generated_sequences.csv and save the results to sequence_properties.csv.
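
To make the three properties concrete, the sketch below computes them in plain Python. It is not the code in calculate_properties.py. It assumes average residue masses for molecular weight, a simple count of K/R minus D/E for net charge (ignoring histidine and the termini), and the mean Kyte-Doolittle value for hydrophobicity. All function names are hypothetical.

```python
# Average residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free N- and C-termini

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(seq: str) -> float:
    """Sum of average residue masses plus one water molecule."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def net_charge(seq: str) -> int:
    """Crude charge estimate: +1 per K/R, -1 per D/E (an assumption of this sketch)."""
    return sum((aa in "KR") - (aa in "DE") for aa in seq)


def mean_hydrophobicity(seq: str) -> float:
    """Mean Kyte-Doolittle hydropathy over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

A sequence containing a non-standard residue raises a KeyError in this sketch, which makes unexpected input visible instead of silently skewing the result.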

3. Identify AMP Candidates

AMP Discriminator

To classify sequences as antimicrobial peptides (AMPs) using the AMP Discriminator, run the following command in AMPGen/AMP_discriminator/Discriminator_model/discriminator.py:

python discriminator.py --train_path <path_to_training_csv> --pre_path <path_to_input_csv> --out_path <path_to_output_csv>

Arguments:

  • --train_path or -tp (str, required): The path to the CSV file containing the training data.
  • --pre_path or -pp (str, required): The path to the input CSV file containing the sequences to classify.
  • --out_path or -op (str, required): The path to the output CSV file where the classification results will be saved.
  • --to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).

Example:

python discriminator.py --train_path ../../data/Discriminator_training_data/top14Featured_all.csv --pre_path ../../data/example/sequence_properties.csv --out_path ../../results/example_classified_sequences.csv --to_device cpu

This command will classify sequences in sequence_properties.csv using the model trained on top14Featured_all.csv, and save the results to example_classified_sequences.csv.

4. Run the MIC Scorer

This section provides a step-by-step guide to using the MIC scorer. The scorer predicts the MIC (Minimum Inhibitory Concentration) values using a pre-trained LSTM model based on protein embeddings generated by an ESM model.

Step 1: Convert Sequences to FASTA Format

Use the to_fasta function to convert your input CSV file (which contains sequences) into a FASTA file:

python scorer.py --from_csv_path path/to/sequences.csv --to_fasta_path path/to/output/sequences.fasta
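
Conceptually, the CSV-to-FASTA conversion amounts to something like the sketch below. It is not the project's to_fasta implementation. The column name `Sequence` and the `seq_N` record identifiers are assumptions made for illustration.

```python
import csv
import io
from typing import TextIO


def csv_to_fasta(csv_file: TextIO, fasta_file: TextIO, column: str = "Sequence") -> int:
    """Write one FASTA record per non-empty sequence and return the record count."""
    reader = csv.DictReader(csv_file)
    count = 0
    for row in reader:
        seq = (row.get(column) or "").strip()
        if not seq:
            continue  # skip rows without a sequence
        count += 1
        fasta_file.write(f">seq_{count}\n{seq}\n")
    return count


# In-memory demonstration; real use would pass open file handles instead.
source = io.StringIO("Sequence,Score\nGLFDIVKK,0.9\n,0.1\nKKLLKK,0.8\n")
target = io.StringIO()
written = csv_to_fasta(source, target)
```

Working on file-like objects rather than paths lets the same function serve both files on disk and in-memory buffers.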

Step 2: Generate Embeddings with ESM Model

Once the sequences are in FASTA format, generate their embeddings using the get_embedding function and a pre-trained ESM model:

python scorer.py --from_csv_path path/to/sequences.csv --esm_model_location esm_model_name --output_dir path/to/output/embeddings/

  • Replace esm_model_name with the location of your ESM model (e.g., esm2_t36_3B_UR50D).
  • The embeddings will be saved to the specified output directory.

Step 3: Load Embeddings

Use the load_embeding function to load the generated embeddings and merge them with the input sequence data:

python scorer.py --from_csv_path path/to/sequences.csv --output_dir path/to/output/embeddings/

Step 4: Predict MIC Values

Finally, predict MIC values using a pre-trained LSTM model. The get_predicted_mic function handles this task:

python scorer.py --from_csv_path path/to/sequences.csv --scaler_data_path path/to/scaler.pkl --model_path path/to/model.pth --result_path path/to/save/results.csv

Full Command Example

Run the entire MIC scoring process from sequence conversion to MIC prediction in AMPGen/MIC_scorer/scorer.py:

python scorer.py --from_csv_path ../results/example_classified_sequences.csv --to_fasta_path ../data/example/output/sequences.fasta --output_dir ../data/example/output/embeddings/ --scaler_data_path ./Scorer_model/stpascaler.pkl --model_path ./Scorer_model/1stpa_best_model_checkpoint.pth --result_path ../results/example_results.csv --to_device cpu

This command:

  1. Converts the sequences to FASTA format.
  2. Generates embeddings using the specified ESM model.
  3. Loads the embeddings and prepares the data.
  4. Predicts MIC values using the pre-trained LSTM model.
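
The four usage stages can also be chained from a small driver script. The sketch below only assembles the example commands shown in this guide, each paired with the directory it is meant to be run from. The helper names `build_pipeline` and `run_pipeline` are hypothetical and not part of AMPGen.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

Step = Tuple[Path, List[str]]  # (working directory, argument vector)


def build_pipeline(repo_root: str, total_sequences: int = 10, device: str = "cpu") -> List[Step]:
    """Assemble the example commands from this guide in pipeline order."""
    root = Path(repo_root)
    gen_dir = root / "AMP_generator"
    disc_dir = root / "AMP_discriminator" / "Discriminator_model"
    mic_dir = root / "MIC_scorer"
    return [
        (gen_dir, ["python", "unconditional_generation.py",
                   "--total_sequences", str(total_sequences),
                   "--output_file", "../data/example/generated_sequences.csv",
                   "--to_device", device]),
        (gen_dir, ["python", "calculate_properties.py",
                   "--input_csv_file", "../data/example/generated_sequences.csv",
                   "--output_csv_file", "../data/example/sequence_properties.csv"]),
        (disc_dir, ["python", "discriminator.py",
                    "--train_path", "../../data/Discriminator_training_data/top14Featured_all.csv",
                    "--pre_path", "../../data/example/sequence_properties.csv",
                    "--out_path", "../../results/example_classified_sequences.csv",
                    "--to_device", device]),
        (mic_dir, ["python", "scorer.py",
                   "--from_csv_path", "../results/example_classified_sequences.csv",
                   "--to_fasta_path", "../data/example/output/sequences.fasta",
                   "--output_dir", "../data/example/output/embeddings/",
                   "--scaler_data_path", "./Scorer_model/stpascaler.pkl",
                   "--model_path", "./Scorer_model/1stpa_best_model_checkpoint.pth",
                   "--result_path", "../results/example_results.csv",
                   "--to_device", device]),
    ]


def run_pipeline(steps: List[Step]) -> None:
    """Run each step in its own directory, stopping at the first failure."""
    for cwd, argv in steps:
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_pipeline(build_pipeline("AMPGen"))` from the directory that contains the cloned repository would run the stages in order. Keeping command construction separate from execution makes it easy to inspect the commands before anything runs.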

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

License

This project is licensed under the MIT License - see the LICENSE file for details.

