A Python toolkit for Antimicrobial Peptide (AMP) prediction using ensemble machine learning
AMPidentifier
A Python toolkit for Antimicrobial Peptide (AMP) prediction and physicochemical assessment
////////////////////////////////////////////////////////////////////////
// //
// _ __ __ ____ _ _ _ _ __ _ //
// / \ | \/ | _ \(_) __| | ___ _ __ | |_(_)/ _(_) ___ _ __ //
// / _ \ | |\/| | |_) | |/ _` |/ _ \ '_ \| __| | |_| |/ _ \ '__| //
// / ___ \| | | | __/| | (_| | __/ | | | |_| | _| | __/ | //
// /_/ \_\_| |_|_| |_|\__,_|\___|_| |_|\__|_|_| |_|\___|_| //
// //
////////////////////////////////////////////////////////////////////////
About
AMPidentifier is an open-source, modular Python toolkit for predicting antimicrobial peptides (AMPs) from amino acid sequences. It combines three pre-trained machine learning models (Random Forest, SVM, Gradient Boosting) with a majority-vote ensemble, and computes dozens of physicochemical descriptors via modlamp.
Users can run predictions with the built-in models, combine them in ensemble mode, or integrate external .pkl models for side-by-side comparison.
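As a rough illustration of what a physicochemical descriptor is, the sketch below computes three simple sequence-level values in plain Python. It is not the modlamp implementation and does not reproduce the toolkit's feature set: the function name `simple_descriptors`, the hydrophobic residue set (A, C, F, I, L, M, V), and the integer charge estimate (K/R minus D/E, ignoring histidine, termini, and pH) are all assumptions made for this example.

```python
# Illustrative descriptors only; NOT the modlamp implementation used by AMPidentifier.
HYDROPHOBIC = set("ACFILMV")  # assumed hydrophobic residue set for this sketch
POSITIVE = set("KR")          # residues counted as +1
NEGATIVE = set("DE")          # residues counted as -1


def simple_descriptors(sequence: str) -> dict:
    """Return length, hydrophobic residue ratio, and a crude net charge estimate."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    hydrophobic = sum(1 for aa in seq if aa in HYDROPHOBIC)
    positive = sum(1 for aa in seq if aa in POSITIVE)
    negative = sum(1 for aa in seq if aa in NEGATIVE)
    return {
        "length": len(seq),
        "hydrophobic_count": hydrophobic,
        "hydrophobic_ratio": hydrophobic / len(seq),
        "net_charge_estimate": positive - negative,
    }


# Magainin-2, taken from the sample FASTA file in this README
magainin = simple_descriptors("GIGKFLHSAKKFGKAFVGEIMNS")
print(magainin)
```

Cationic, moderately hydrophobic sequences such as Magainin-2 are typical of known AMPs, which is why descriptors of this kind are useful model inputs.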
Related Projects
| Project | Description | Link |
|---|---|---|
| AMPidentifier CLI | Full command-line version with training scripts, benchmarking, and extended documentation | github.com/madsondeluna/AMPidentifier |
| AMPidentifier Web Server | Browser-based interface for AMP prediction (no installation required) | github.com/madsondeluna/AMPidentifierServerBETA |
Installation
pip install ampidentifier
We recommend using a virtual environment:
python3 -m venv venv
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
pip install ampidentifier
Available on PyPI: https://pypi.org/project/ampidentifier/
Quick Start
# Single model (Random Forest, default)
ampidentifier --input my_sequences.fasta --output_dir ./results
# Ensemble voting (recommended)
ampidentifier --input my_sequences.fasta --output_dir ./results --ensemble
# Compare SVM with an external model
ampidentifier --input my_sequences.fasta --output_dir ./results --model svm --external_models /path/to/my_model.pkl
Usage Examples
The examples below use this sample FASTA file (test_peptides.fasta), which contains known AMPs and non-AMP peptides for demonstration:
>Magainin-2|Xenopus_laevis|Cationic_amphipathic_helix
GIGKFLHSAKKFGKAFVGEIMNS
>LL-37|Homo_sapiens|Cathelicidin_family
LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES
>Melittin|Apis_mellifera|Venom_peptide
GIGAVLKVLTTGLPALISWIKRKRQQ
>Insulin_Chain_B|Homo_sapiens|Peptide_hormone
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
>Glucagon|Homo_sapiens|Peptide_hormone
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT
>Vasoactive_intestinal_peptide|Homo_sapiens|Neuropeptide
HSDAVFTDNYTRLRKQMAVKKYLNSILN
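The headers in the sample file pack three fields separated by `|` (name, organism, annotation). The following is a minimal sketch of how such a file could be read in plain Python. It is not the package's own reader in data_io.py, and the function name `read_fasta` is hypothetical.

```python
from typing import Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


sample = """>Magainin-2|Xenopus_laevis|Cationic_amphipathic_helix
GIGKFLHSAKKFGKAFVGEIMNS
>Glucagon|Homo_sapiens|Peptide_hormone
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT
"""

records = list(read_fasta(sample))
for header, seq in records:
    name, organism, annotation = header.split("|")
    print(f"{name} ({organism}): {len(seq)} residues")
```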
Usage Example — Google Colab / Jupyter Notebook
Run the cells below in any Colab or Jupyter notebook:
# Cell 1: Install
!pip install ampidentifier
# Cell 2: Create the example FASTA file
fasta_content = """>Magainin-2|Xenopus_laevis|Cationic_amphipathic_helix
GIGKFLHSAKKFGKAFVGEIMNS
>LL-37|Homo_sapiens|Cathelicidin_family
LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES
>Melittin|Apis_mellifera|Venom_peptide
GIGAVLKVLTTGLPALISWIKRKRQQ
>Insulin_Chain_B|Homo_sapiens|Peptide_hormone
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
>Glucagon|Homo_sapiens|Peptide_hormone
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT
>Vasoactive_intestinal_peptide|Homo_sapiens|Neuropeptide
HSDAVFTDNYTRLRKQMAVKKYLNSILN
"""
with open("test_peptides.fasta", "w") as f:
    f.write(fasta_content)
print("FASTA file created with 6 sequences (3 known AMPs + 3 non-AMPs)")
# Cell 3: Run with default model (Random Forest)
# Import the pipeline function directly from the package
import os
from amp_identifier.core import run_prediction_pipeline
os.makedirs("./results_rf", exist_ok=True)
run_prediction_pipeline(
    input_file="test_peptides.fasta",
    output_dir="./results_rf",
    internal_model_type="rf",  # Random Forest: best single-model AUC-ROC (0.9503)
    use_ensemble=False,
    external_model_paths=[],
)
# Cell 4: Run with ensemble mode (recommended)
# Combines RF + SVM + GB via majority voting for maximum robustness
import os
from amp_identifier.core import run_prediction_pipeline
os.makedirs("./results_ensemble", exist_ok=True)
run_prediction_pipeline(
    input_file="test_peptides.fasta",
    output_dir="./results_ensemble",
    internal_model_type="rf",  # ignored when use_ensemble=True
    use_ensemble=True,         # activates majority vote across all three models
    external_model_paths=[],
)
# Cell 5: Inspect results
# Runs ensemble first if output does not exist yet, then displays results
import os
import pandas as pd
from amp_identifier.core import run_prediction_pipeline
report_path = "./results_ensemble/prediction_comparison_report.csv"
features_path = "./results_ensemble/physicochemical_features.csv"
if not os.path.exists(report_path):
    os.makedirs("./results_ensemble", exist_ok=True)
    run_prediction_pipeline(
        input_file="test_peptides.fasta",
        output_dir="./results_ensemble",
        internal_model_type="rf",
        use_ensemble=True,
        external_model_paths=[],
    )
report = pd.read_csv(report_path)
print("=== Ensemble Prediction Report ===")
print(report.to_string(index=False))
features = pd.read_csv(features_path)
print("\n=== Physicochemical Features ===")
print(f"Shape: {features.shape[0]} sequences x {features.shape[1]} descriptors")
print(features[['ID', 'Length', 'Charge', 'HydrophRatio']].to_string(index=False))
# Cell 6: Compare all three internal models individually
import os
import pandas as pd
from amp_identifier.core import run_prediction_pipeline
for model in ["rf", "svm", "gb"]:
    os.makedirs(f"./results_{model}", exist_ok=True)
    run_prediction_pipeline(
        input_file="test_peptides.fasta",
        output_dir=f"./results_{model}",
        internal_model_type=model,
        use_ensemble=False,
        external_model_paths=[],
    )
    report = pd.read_csv(f"./results_{model}/prediction_comparison_report.csv")
    # Use the first prediction column written by the pipeline
    pred_col = [c for c in report.columns if c.startswith("pred_")][0]
    amp_count = int(report[pred_col].sum())
    print(f"[{model.upper()}] Predicted AMPs: {amp_count}/{len(report)}")
Arguments
| Argument | Description | Required | Default |
|---|---|---|---|
| -i, --input | Path to the input FASTA file | Yes | - |
| -o, --output_dir | Path to the output directory | Yes | - |
| -m, --model | Internal model to use: rf, svm, gb | No | rf |
| --ensemble | Enable majority-vote ensemble across all internal models | No | Flag |
| -e, --external_models | One or more paths to external .pkl models for comparison (comma-separated) | No | - |
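When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below does that using only the flags listed in the table above. The helper name `build_command` is hypothetical and not part of the package, and omitting --model in ensemble mode is a choice made here on the assumption that the model selection is ignored when the ensemble is active.

```python
import shlex
from typing import List, Sequence


def build_command(
    input_fasta: str,
    output_dir: str,
    model: str = "rf",
    ensemble: bool = False,
    external_models: Sequence[str] = (),
) -> List[str]:
    """Assemble an ampidentifier argument list (hypothetical helper)."""
    if model not in {"rf", "svm", "gb"}:
        raise ValueError(f"unknown internal model: {model!r}")
    cmd = ["ampidentifier", "--input", input_fasta, "--output_dir", output_dir]
    if ensemble:
        cmd.append("--ensemble")  # model choice is left out in ensemble mode
    else:
        cmd += ["--model", model]
    if external_models:
        # The table describes this value as comma-separated
        cmd += ["--external_models", ",".join(external_models)]
    return cmd


cmd = build_command(
    "my_sequences.fasta", "./results", model="svm",
    external_models=["/path/to/my_model.pkl"],
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the package is installed.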
Key Features
- Three pre-trained ML models: Random Forest, Gradient Boosting, SVM
- Ensemble voting: Majority vote across all models for improved robustness
- External model support: Load custom .pkl models for comparison
- Physicochemical descriptors: Compute and export an extensive set of sequence features via modlamp
- Fully open-source and modular: Each component can be used independently
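Majority voting across three binary classifiers can be sketched in a few lines. The example below is a generic illustration, not the package's internal code. The function name `majority_vote` and the per-model predictions are made up for demonstration (1 = AMP, 0 = non-AMP).

```python
from typing import Dict, List


def majority_vote(predictions: Dict[str, List[int]]) -> List[int]:
    """Return the per-sequence consensus label from several 0/1 prediction lists."""
    models = list(predictions.values())
    if not models:
        raise ValueError("at least one model is required")
    n_sequences = len(models[0])
    if any(len(m) != n_sequences for m in models):
        raise ValueError("all models must predict the same number of sequences")
    consensus = []
    for votes in zip(*models):
        # A sequence is called AMP when strictly more than half the models agree
        consensus.append(1 if sum(votes) * 2 > len(votes) else 0)
    return consensus


# Hypothetical predictions for four sequences, not real tool output
votes = {
    "rf":  [1, 1, 0, 0],
    "svm": [1, 0, 0, 1],
    "gb":  [1, 1, 0, 0],
}
consensus = majority_vote(votes)
print(consensus)  # [1, 1, 0, 0]
```

With an odd number of models, as here, ties cannot occur, so a single dissenting model never changes the consensus.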
Pre-Trained Model Performance
Best values per metric in bold.
| Metric | Random Forest (RF) | SVM | Gradient Boosting (GB) |
|---|---|---|---|
| Accuracy | **0.8845** | 0.8740 | 0.8585 |
| Precision | **0.8910** | 0.8880 | 0.8665 |
| Recall | **0.8762** | 0.8558 | 0.8475 |
| F1-Score | **0.8836** | 0.8716 | 0.8569 |
| MCC | **0.7692** | 0.7484 | 0.7172 |
| AUC-ROC | **0.9503** | 0.9356 | 0.9289 |
Recommended: use --ensemble for the most robust predictions (Accuracy: 87.47%, Sensitivity: 85.96%, Specificity: 88.98%).
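The reported figures can be cross-checked with basic arithmetic. F1 is the harmonic mean of precision and recall, so each F1 entry should follow from the two rows above it, up to rounding. The ensemble accuracy also equals the mean of its sensitivity and specificity, which is what one would expect on a class-balanced test set. That balance is an inference from the numbers, not something stated here. The sketch below runs both checks.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the performance table
reported = {
    "RF":  (0.8910, 0.8762, 0.8836),
    "SVM": (0.8880, 0.8558, 0.8716),
    "GB":  (0.8665, 0.8475, 0.8569),
}

for name, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{name}: F1 from P/R = {computed:.4f}, reported = {f1:.4f}")

# Mean of ensemble sensitivity and specificity (equals accuracy if classes are balanced)
ensemble_mean = (0.8596 + 0.8898) / 2
print(f"Mean of sensitivity and specificity: {ensemble_mean:.4f}")
```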
Outputs
| File | Description |
|---|---|
| physicochemical_features.csv | Computed physicochemical descriptors for each input sequence |
| prediction_comparison_report.csv | AMP/non-AMP predictions with confidence scores per model and consensus |
Project Structure
amp_identifier/
├── __init__.py
├── core.py # Main prediction workflow
├── data_io.py # FASTA input reader
├── feature_extraction.py # Physicochemical descriptor computation
├── prediction.py # Model loading and inference
└── reporting.py # CSV report generation
Contributors
| Name | Role | Affiliation |
|---|---|---|
| Madson A. de Luna-Aragão, MSc | Lead developer; architecture; ML; docs | UFMG |
| Rafael L. da Silva, BSc | Collaborator; preprocessing; pipeline testing | UFPE |
| Ana M. Benko-Iseppon, PhD | Advisor; study design; biological validation | UFPE |
| João Pacífico, PhD | Co-Advisor; computational review; evaluation | UPE |
| Carlos A. dos Santos-Silva, PhD | Co-Advisor; pipeline testing; review | CESMAC |
Funding & Acknowledgments
- Officially registered under UFPE - Universidade Federal de Pernambuco, Brazil
- Supported by FACEPE - Fundação de Amparo à Pesquisa do Estado de Pernambuco
- INPI Registration: BR 51 2025 005859-4
How to Cite
Luna-Aragão, M. A., da Silva, R. L., Pacífico, J., Santos-Silva, C. A. & Benko-Iseppon, A. M.
(2025). AMPidentifier: A Python toolkit for predicting antimicrobial peptides using ensemble
machine learning and physicochemical descriptors.
https://github.com/madsondeluna/AMPidentifier
License
This project is licensed under the terms specified in the repository. All rights reserved. © Madson A. de Luna Aragão et al., 2025.