ampgen

A de novo generation pipeline leveraging evolutionary information for broad-spectrum antimicrobial peptide design

AMPGen: A de novo Generation Pipeline Leveraging Evolutionary Information for Broad-Spectrum Antimicrobial Peptide Design

Overview

AMPGen is a pipeline for generating and evaluating novel antimicrobial peptide (AMP) sequences. It generates new AMP sequences with EvoDiff, a diffusion framework for protein design, and then uses machine learning models to classify the candidates and predict their antimicrobial efficacy. The pipeline achieved a high success rate: 29 of 34 peptides (85.3%) exhibited antimicrobial activity, defined as a minimum inhibitory concentration (MIC) below 200 µg/ml against at least one bacterium.

Methods

Datasets Preparation

We compiled AMP and non-AMP datasets from various public databases for use in our classification and MIC prediction models. The AMP dataset includes sequences from APD, DADP, DBAASP, DRAMP, YADAMP, and dbAMP, resulting in a final set of 10,249 unique sequences with antibacterial targets. The non-AMP dataset, sourced from UniProt, consists of 11,989 sequences filtered to exclude those associated with specific antimicrobial keywords.

De Novo AMP Generation

AMP sequences are generated using two pre-trained order-agnostic autoregressive diffusion models (OADM) from the EvoDiff framework.

Sequence-Based Generation:
- The sequence-based model, Evodiff-OA_DM_640M, is pre-trained on the Uniref50 dataset, containing 42 million protein sequences. This model unconditionally generates peptide sequences of length 15-35 aa.
MSA-Based Generation:
- The MSA-based model, Evodiff-MSA_OA_DM_MAXSUB, is trained on the OpenFold dataset and generates sequences in two ways:
  - Unconditional generation of peptide sequences of length 15-35 aa.
  - Conditional generation using MSAs with known AMP sequences as representative sequences.
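
The idea behind order-agnostic autoregressive decoding can be sketched with a toy example: start from a fully masked sequence and fill in one position at a time in a random order. This is only an illustration, not EvoDiff's implementation. The `toy_predictor` function is a hypothetical stand-in that samples residues uniformly, whereas the real models predict each residue from the positions already decoded.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"

def toy_predictor(partial, position, rng):
    """Hypothetical stand-in for the network: picks a residue uniformly at random."""
    return rng.choice(AMINO_ACIDS)

def sample_order_agnostic(length, rng, predictor=toy_predictor):
    """Fill a fully masked sequence one position at a time, in random order."""
    seq = [MASK] * length
    order = list(range(length))
    rng.shuffle(order)  # decoding order is random rather than left-to-right
    for pos in order:
        # A real model would condition on the residues already placed in seq.
        seq[pos] = predictor(seq, pos, rng)
    return "".join(seq)

def generate_peptides(n, rng, min_len=15, max_len=35):
    """Draw n peptides with lengths in the 15-35 aa range used by AMPGen."""
    return [sample_order_agnostic(rng.randint(min_len, max_len), rng) for _ in range(n)]

rng = random.Random(0)
peptides = generate_peptides(5, rng)
for p in peptides:
    print(len(p), p)
```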

Classification and Efficacy Prediction

XGBoost-based AMP Classifier:
- Dataset Preparation: Sequences in the AMP dataset were filtered based on length, retaining those within the range of 5 to 65 aa, resulting in a total of 9,964 AMP-labeled peptide sequences as the positive dataset.
- Feature Extraction: Features were primarily derived from the PseKRAAC encoding method and QSOrder encoding parameters, resulting in 14 categories of 1,311 features.
- Model Training: An XGBoost model was trained on this data, with AMP sequences labeled 1 and non-AMP sequences labeled 0. Hyperparameters were tuned on the F1 score and AUC using 10-fold cross-validation to limit overfitting.
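
The length filter, the 10-fold split, and the F1 metric described above can be sketched in plain Python. This is a minimal illustration under assumptions: the function names are hypothetical, the folds are stratified by label (the README does not say whether stratification was used), and the XGBoost model itself is omitted.

```python
import random

def filter_by_length(seqs, min_len=5, max_len=65):
    """Keep sequences whose length lies in the 5-65 aa range."""
    return [s for s in seqs if min_len <= len(s) <= max_len]

def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs with each class spread across k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val

def f1_score(y_true, y_pred):
    """F1 for the positive class (AMP = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

labels = [1] * 50 + [0] * 50  # toy labels: 1 = AMP, 0 = non-AMP
splits = list(stratified_kfold(labels, k=10))
print(len(splits), "folds; first validation fold has", len(splits[0][1]), "items")
```

In a real run, each fold's training indices would be used to fit the classifier and the validation indices to compute F1 and AUC, with the scores averaged over the 10 folds.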
LSTM Regression-based MIC Predictor:
- Dataset Preparation: All entries in the AMP dataset with MIC values were included. There were 7,100 AMP sequences targeting Escherichia coli and 6,482 targeting Staphylococcus aureus. For sequences with multiple MIC values against the same bacterium, the values were averaged and converted to a uniform unit of μM, then log-transformed (log₁₀). Additionally, 7,193 sequences from the non-AMP dataset were labeled with a logMIC value of 4.
- Model Training: Separate regression training was conducted on the Escherichia coli and Staphylococcus aureus datasets using Long Short-Term Memory (LSTM) models. The datasets were split into training, validation, and test sets in the ratio of 72:18:10. Each model comprised two LSTM layers, a dropout layer with a dropout rate of 0.7, and a linear layer. The models were compiled using standard L2 loss and optimized with the Adam optimizer.
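
The target preparation and the 72:18:10 split can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, MIC values reported in µg/mL are assumed to be converted with the peptide's molecular weight, and repeated values are averaged arithmetically before the log₁₀ transform.

```python
import math
import random
from collections import defaultdict

NON_AMP_LOG_MIC = 4.0  # label assigned to non-AMP sequences (10**4 µM)

def ug_per_ml_to_um(mic_ug_per_ml, mol_weight):
    """Convert µg/mL to µM given the molecular weight in g/mol."""
    return mic_ug_per_ml * 1000.0 / mol_weight

def log_mic_targets(records):
    """records: (sequence, MIC in µM) pairs. Average repeats, then take log10."""
    grouped = defaultdict(list)
    for seq, mic_um in records:
        grouped[seq].append(mic_um)
    return {seq: math.log10(sum(v) / len(v)) for seq, v in grouped.items()}

def split_72_18_10(items, seed=0):
    """Shuffle and split into train/validation/test in the ratio 72:18:10."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(0.10 * len(items))
    n_val = round(0.18 * len(items))
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

targets = log_mic_targets([("KK", 10.0), ("KK", 1000.0), ("AA", 100.0)])
train, val, test = split_72_18_10(range(100))
print(targets, len(train), len(val), len(test))
```

The 72:18:10 ratio is what results from holding out 10% for testing and then splitting the remaining 90% into 80% training and 20% validation.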

Project Structure

── AMPGen
    ├── AMP_discriminator
    │   ├── Discriminator_model
    │   │   ├── iFeature
    │   │   │   ├── codes
    │   │   │   ├── data
    │   │   │   └── PseKRAAC
    │   │   ├── discriminator.py
    │   │   └── features.py
    │   └── tools
    │       ├── plt.ipynb
    │       ├── RF_train.py
    │       ├── split.py
    │       └── XGboost_train.py
    ├── AMP_generator
    │   ├── calculate_properties.py
    │   ├── conditional_generation_msa.py
    │   ├── unconditional_generation.py
    │   └── unconditional_generation_msa.py
    ├── data
    │   ├── Discriminator_training_data
    │   │   ├── classify_all_data_v1.csv
    │   │   ├── classify_amp_v1.csv
    │   │   ├── classify_nonamp_v1.csv
    │   │   └── top14Featured_all.csv
    │   ├── example
    │   │   ├── msa_files
    │   │   │   └── example_1944.a3m
    │   │   ├── output
    │   │   │   ├── embeddings
    │   │   │   └── sequences.fasta
    │   │   ├── conditional_generated_sequences.csv
    │   │   ├── generated_msa_sequences.csv
    │   │   ├── generated_sequences.csv
    │   │   └── sequence_properties.csv
    │   ├── Scorer_training_data
    │   │   ├── regression_ecoli_all.csv
    │   │   └── regression_stpa_all.csv
    │   ├── combined_database_filtered_v2(1).xlsx
    │   └── combined_database_v2(1).xlsx
    ├── MIC_scorer
    │   ├── Scorer_model
    │   │   ├── 1stpa_best_model_checkpoint.pth
    │   │   ├── 2ecoli_best_model_checkpoint.pth
    │   │   ├── ecoliscaler.pkl
    │   │   ├── regression.py
    │   │   └── stpascaler.pkl
    │   ├── tools
    │   │   ├── embeddingload.py
    │   │   ├── extract.py
    │   │   ├── lstm_train.py
    │   │   ├── pltlstm.ipynb
    │   │   └── tofasta.py
    │   └── scorer.py
    ├── results
    │   ├── example_classified_sequences.csv
    │   └── example_results.csv    
    ├── .DS_Store
    ├── .gitattributes
    ├── .gitignore
    ├── LICENSE
    ├── print_directory_tree.py
    ├── README.md
    ├── requirements.txt
    ├── setup.py
    └── test.py

Getting Started

Installation Guide

Welcome to the AMPGen project! This guide will walk you through the steps required to install and set up the necessary environment and dependencies to run AMPGen. Before getting started, please ensure that you have Anaconda installed on your system.

Prerequisites

To use the AMPGen system, you need Python 3.11 and a few essential libraries. We'll guide you through setting up a clean conda environment, installing EvoDiff, and then the necessary dependencies. Required Python libraries: numpy, pandas, tqdm, scikit-learn, xgboost.

Setting Up the Environment

Clone the AMPGen Repository
Begin by cloning the AMPGen repository to your local machine:
```
git clone https://github.com/xiyanxiongnico/AMPGen.git
cd AMPGen
```
Create a Conda Environment
Next, create a new conda environment with Python 3.11, the version required by this project:
```
conda create --name AMPGen python=3.11
conda activate AMPGen
```
Install Dependencies
With the new environment activated, install the required packages:
```
pip install -r requirements.txt
```
Install PyTorch and Related Packages
EvoDiff requires PyTorch along with additional libraries. The following example installs a CPU-only build of PyTorch. For better performance, choose the PyTorch build that matches your system's hardware:
```
conda install pytorch torchvision torchaudio cpuonly -c pytorch
```
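
After installation, a quick check like the one below can confirm that the interpreter version and the required libraries are in place. This is a hypothetical helper, not part of the AMPGen repository. The module list follows the libraries named in the prerequisites plus PyTorch, using their import names (`sklearn` for scikit-learn, `torch` for PyTorch).

```python
import importlib.util
import sys

# Import names of the libraries listed in the prerequisites, plus PyTorch.
REQUIRED = ["numpy", "pandas", "tqdm", "sklearn", "xgboost", "torch"]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

def check_environment(min_version=(3, 11)):
    """Return (python_ok, missing) for the current interpreter."""
    python_ok = sys.version_info[:2] >= min_version
    return python_ok, missing_modules(REQUIRED)

if __name__ == "__main__":
    python_ok, missing = check_environment()
    print("Python version OK:", python_ok)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```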

Usage Guide

1. Generate New AMP Sequences

You can generate AMP sequences using the following commands:

Unconditional Generation of AMP Sequences

You can generate antimicrobial peptide (AMP) sequences using EvoDiff's unconditional generation model by running unconditional_generation.py from the AMPGen/AMP_generator directory:

python unconditional_generation.py --total_sequences <total_sequences> --batch_size <batch_size> --output_file <path_to_output_file> --to_device <cuda or cpu>

Arguments:

--total_sequences (int, required): The total number of sequences to generate.
--batch_size (int, optional): The batch size for sequence generation (default=1).
--output_file (str, required): The path to the output CSV file where the generated sequences will be saved.
--to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).

Example:

python unconditional_generation.py --total_sequences 10 --output_file ../data/example/generated_sequences.csv --to_device cpu

This command will generate 10 sequences and save them to generated_sequences.csv.
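
The documented interface maps naturally onto an `argparse` parser. The sketch below mirrors the arguments listed above to show which ones are required and what the defaults are. It is a reconstruction for illustration, not the script's actual source, and restricting `--to_device` to `cuda` or `cpu` via `choices` is an assumption.

```python
import argparse

def build_parser():
    """Parser mirroring the documented options of unconditional_generation.py."""
    parser = argparse.ArgumentParser(description="Unconditional AMP generation (sketch)")
    parser.add_argument("--total_sequences", type=int, required=True,
                        help="Total number of sequences to generate")
    parser.add_argument("--batch_size", type=int, default=1,
                        help="Batch size for sequence generation")
    parser.add_argument("--output_file", type=str, required=True,
                        help="Path to the output CSV file")
    parser.add_argument("--to_device", type=str, default="cuda",
                        choices=["cuda", "cpu"],  # assumption: only these two values
                        help="Device to run the model on")
    return parser

# Parse the same options as the example command above.
args = build_parser().parse_args([
    "--total_sequences", "10",
    "--output_file", "../data/example/generated_sequences.csv",
    "--to_device", "cpu",
])
print(args.total_sequences, args.batch_size, args.to_device)
```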

Unconditional Generation of AMP Sequences with MSA

You can generate antimicrobial peptide (AMP) sequences using EvoDiff's unconditional generation model with MSA by running unconditional_generation_msa.py from the AMPGen/AMP_generator directory:

python unconditional_generation_msa.py --total_sequences <total_sequences> --batch_size <batch_size> --n_sequences <n_sequences> --output_csv_file <path_to_output_file> --to_device <cuda or cpu>

Arguments:

--total_sequences (int, required): The total number of sequences to generate.
--batch_size (int, optional): The batch size for sequence generation (default=1).
--n_sequences (int, optional): The number of sequences in MSA to subsample (default=64).
--output_csv_file (str, required): The path to the output CSV file where the generated sequences will be saved.
--to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).

Example:

python unconditional_generation_msa.py --total_sequences 10 --output_csv_file ../data/example/generated_msa_sequences.csv

This command will generate 10 sequences with the default batch size of 1 and save them to generated_msa_sequences.csv. This model requires significant computational power, so we recommend running it on a GPU.
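
How `--total_sequences` and `--batch_size` might interact can be sketched with a small helper that lists the size of each batch. The function name is hypothetical, and the handling of a final partial batch is an assumption. The actual script may round or loop differently.

```python
def batch_plan(total_sequences, batch_size=1):
    """Return the size of each generation batch needed to reach the total."""
    if total_sequences < 0 or batch_size < 1:
        raise ValueError("total_sequences must be >= 0 and batch_size >= 1")
    full, rest = divmod(total_sequences, batch_size)
    # Full batches first, then one smaller batch for any remainder.
    return [batch_size] * full + ([rest] if rest else [])

print(batch_plan(10))        # ten batches of one sequence each
print(batch_plan(100, 10))   # ten batches of ten sequences each
print(batch_plan(25, 10))    # two full batches and one batch of five
```

Larger batches reduce the number of model calls but need more memory, which matters most for the MSA-based model.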

Conditional Generation of AMP Sequences with MSA

You can generate antimicrobial peptide (AMP) sequences using EvoDiff's conditional generation model with MSA by running conditional_generation_msa.py from the AMPGen/AMP_generator directory:

python conditional_generation_msa.py --directory_path <path_to_msa_directory> --output_csv_file <path_to_output_file> --max_retries <max_retries> --to_device <cuda or cpu> --total_sequences <total_sequences>

Arguments:

--directory_path (str, required): Path to the directory containing the MSA files (in .a3m format).
--output_csv_file (str, required): The path to the output CSV file where the generated sequences will be saved.
--max_retries (int, optional): Maximum number of retries for processing each file (default: 5).
--to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).
--total_sequences (int, required): The total number of sequences to generate.

Example:

python conditional_generation_msa.py --directory_path ../data/example/msa_files/ --output_csv_file ../data/example/conditional_generated_sequences.csv --to_device cpu --total_sequences 10

This command will process the MSA file example_1944.a3m from the msa_files directory and generate 10 sequences, saving the results to conditional_generated_sequences.csv.
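
For readers unfamiliar with the .a3m format, the sketch below parses A3M text and subsamples it. In A3M, lowercase letters mark insertions relative to the query (the first sequence), so dropping them leaves every sequence aligned to the query's columns. The function names are hypothetical, and the random subsampling shown here is an assumption for illustration rather than the selection scheme the model uses.

```python
import random

def read_a3m(text):
    """Parse A3M text into (header, aligned_sequence) pairs."""
    entries, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    # Lowercase letters are insertions relative to the query; remove them.
    return [(h, "".join(c for c in s if not c.islower())) for h, s in entries]

def subsample(entries, n_sequences, seed=0):
    """Keep the query (first entry) plus a random subset of the other rows."""
    query, rest = entries[0], entries[1:]
    k = min(max(n_sequences - 1, 0), len(rest))
    return [query] + random.Random(seed).sample(rest, k)

example = ">query\nGLFK\n>hit1\nG-FK\n>hit2\nGLaFK\n"
msa = read_a3m(example)
print(msa)
print(subsample(msa, 2))
```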

2. Calculate Properties of Generated Sequences

Calculate Properties of Generated AMP Sequences

You can calculate the physicochemical properties of the generated AMP sequences, including molecular weight, net charge, and hydrophobicity, by running calculate_properties.py from the AMPGen/AMP_generator directory:

python calculate_properties.py --input_csv_file <path_to_input_file> --output_csv_file <path_to_output_file>

Arguments:

--input_csv_file (str, required): The path to the input CSV file containing sequences.
--output_csv_file (str, required): The path to the output CSV file where the calculated properties will be saved.

Example:

python calculate_properties.py --input_csv_file ../data/example/generated_sequences.csv --output_csv_file ../data/example/sequence_properties.csv

This command will calculate the properties of the sequences in generated_sequences.csv and save the results to sequence_properties.csv.
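
The three properties named above can be approximated with a few lines of standard-library Python. This is a sketch under assumptions and may differ from what calculate_properties.py does: molecular weight uses average residue masses plus one water, net charge is a simple count of K/R minus D/E (ignoring histidine and the termini), and hydrophobicity is the mean Kyte-Doolittle value (GRAVY).

```python
# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def molecular_weight(seq):
    """Average molecular weight in Da: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def net_charge(seq):
    """Crude net charge near neutral pH: (K + R) minus (D + E)."""
    return sum(seq.count(aa) for aa in "KR") - sum(seq.count(aa) for aa in "DE")

def gravy(seq):
    """Mean Kyte-Doolittle hydropathy; higher values are more hydrophobic."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

peptide = "GLFKKLLKKA"  # made-up example sequence
print(round(molecular_weight(peptide), 2), net_charge(peptide), round(gravy(peptide), 3))
```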

3. Identify AMP Candidates

AMP Discriminator

To classify sequences as antimicrobial peptides (AMPs) using the AMP Discriminator, run discriminator.py from the AMPGen/AMP_discriminator/Discriminator_model directory:

python discriminator.py --train_path <path_to_training_csv> --pre_path <path_to_input_csv> --out_path <path_to_output_csv>

Arguments:

--train_path or -tp (str, required): The path to the CSV file containing the training data.
--pre_path or -pp (str, required): The path to the input CSV file containing the sequences to classify.
--out_path or -op (str, required): The path to the output CSV file where the classification results will be saved.
--to_device (str, optional): Device to run the model on, cuda or cpu (default=cuda).

Example:

python discriminator.py --train_path ../../data/Discriminator_training_data/top14Featured_all.csv --pre_path ../../data/example/sequence_properties.csv --out_path ../../results/example_classified_sequences.csv --to_device cpu

This command will classify sequences in sequence_properties.csv using the model trained on top14Featured_all.csv, and save the results to example_classified_sequences.csv.

4. Run the MIC Scorer

This section provides a step-by-step guide to using the MIC scorer. The scorer predicts the MIC (Minimum Inhibitory Concentration) values using a pre-trained LSTM model based on protein embeddings generated by an ESM model.

Step 1: Convert Sequences to FASTA Format

Use the to_fasta function to convert your input CSV file (which contains sequences) into a FASTA file:

python scorer.py --from_csv_path path/to/sequences.csv --to_fasta_path path/to/output/sequences.fasta
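
The CSV-to-FASTA conversion is simple enough to sketch with the standard library. This is an illustration, not the project's to_fasta function: the column name `sequence` and the `seq_N` record identifiers are assumptions.

```python
import csv
import io

def csv_to_fasta(csv_text, sequence_column="sequence"):
    """Turn CSV text with one sequence per row into FASTA text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for i, row in enumerate(reader, start=1):
        seq = row[sequence_column].strip()
        if seq:  # skip rows without a sequence
            lines.append(f">seq_{i}")
            lines.append(seq)
    return "\n".join(lines) + "\n"

fasta = csv_to_fasta("sequence\nGLFKKLLKKA\nKKLLKKLLKK\n")
print(fasta, end="")
```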

Step 2: Generate Embeddings with ESM Model

Once the sequences are in FASTA format, generate their embeddings using the get_embedding function and a pre-trained ESM model:

python scorer.py --from_csv_path path/to/sequences.csv --esm_model_location esm_model_name --output_dir path/to/output/embeddings/

Replace esm_model_name with the location of your ESM model (e.g., esm2_t36_3B_UR50D).
The embeddings will be saved to the specified output directory.

Step 3: Load Embeddings

Use the load_embeding function to load the generated embeddings and merge them with the input sequence data:

python scorer.py --from_csv_path path/to/sequences.csv --output_dir path/to/output/embeddings/

Step 4: Predict MIC Values

Finally, predict MIC values using a pre-trained LSTM model. The get_predicted_mic function handles this task:

python scorer.py --from_csv_path path/to/sequences.csv --scaler_data_path path/to/scaler.pkl --model_path path/to/model.pth --result_path path/to/save/results.csv

Full Command Example

Run the entire MIC scoring process from sequence conversion to MIC prediction in AMPGen/MIC_scorer/scorer.py:

python scorer.py --from_csv_path ../results/example_classified_sequences.csv --to_fasta_path ../data/example/output/sequences.fasta --output_dir ../data/example/output/embeddings/ --scaler_data_path ./Scorer_model/stpascaler.pkl --model_path ./Scorer_model/1stpa_best_model_checkpoint.pth --result_path ../results/example_results.csv --to_device cpu

This command:

- Converts the sequences to FASTA format.
- Generates embeddings using the specified ESM model.
- Loads the embeddings and prepares the data.
- Predicts MIC values using the pre-trained LSTM model.
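
To relate a prediction to the 200 µg/ml activity threshold from the Overview, a predicted value can be converted back to mass concentration. The sketch below assumes the prediction is a log₁₀ MIC in µM, matching how the training targets were prepared. The README does not state the units of the scorer's output, so check the results file before relying on this. The function names are hypothetical.

```python
def log_mic_to_ug_per_ml(log10_mic_um, mol_weight):
    """Convert a log10 MIC in µM to µg/mL given the molecular weight in g/mol."""
    mic_um = 10 ** log10_mic_um
    return mic_um * mol_weight / 1000.0

def is_active(log10_mic_um, mol_weight, threshold_ug_per_ml=200.0):
    """True if the MIC falls below the activity threshold."""
    return log_mic_to_ug_per_ml(log10_mic_um, mol_weight) < threshold_ug_per_ml

# A 2,000 g/mol peptide with a predicted log10 MIC of 1.0 (10 µM) is 20 µg/mL.
print(log_mic_to_ug_per_ml(1.0, 2000.0), is_active(1.0, 2000.0))
```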

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

- 0.1.14 (Aug 1, 2025)
- 0.1.13 (Aug 1, 2025), this version
- 0.1.12 (Aug 1, 2025)
- 0.1.11 (Aug 1, 2025)

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

ampgen-0.1.13-py3-none-any.whl (177.9 kB view details)

Uploaded Aug 1, 2025 Python 3

File details

Details for the file ampgen-0.1.13-py3-none-any.whl.

File metadata

Download URL: ampgen-0.1.13-py3-none-any.whl
Upload date: Aug 1, 2025
Size: 177.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for ampgen-0.1.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`86bf532c9c18f6803d16914a79124fcc051cdd2fdc1fd4156f8b27de5185c2b4`
MD5	`b9bf0afcbcdeebf56ddf7eda2235da34`
BLAKE2b-256	`0f9c553d3ac42d4b436c5d30d1ad4a84d303c8ea2a8de9364b992a2d3f513ae2`
