WASP: A computational pipeline for protein functional annotation using structural information.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

WASP: Protein Functional Annotation using AlphaFold structures

Welcome to the official repository for the paper WASP: A pipeline for functional annotation based on AlphaFold structural models!

WASP (Whole-proteome Annotation via Structural-homology Prediction) is Python-based software designed for comprehensive organism annotation at the whole-proteome level based on structural homology.

WASP is a user-friendly command-line tool that requires only the NCBI taxonomy ID of the organism of interest as input. Leveraging the speed of Foldseek [1], WASP generates a graph of reciprocal hits between the query proteins of the organism and the AlphaFold database [2, 3], enabling robust downstream functional enrichment and statistical testing. WASP annotates uncharacterised proteins using multiple functional descriptors: GO terms, Pfam domains, PANTHER family classifications, CATH superfamilies, Rhea IDs and EC numbers. Additionally, WASP provides a module to map native proteins to orphan reactions in genome-scale models based on structural homology.
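To make the idea of a reciprocal-hit graph concrete, here is a minimal sketch. It is not WASP's implementation: the function names, the input format (plain ID pairs) and the example IDs are all assumptions made for illustration. An edge is kept only when a query protein hits a database protein and that database protein hits the query back.

```python
from collections import defaultdict


def reciprocal_hits(forward, reverse):
    """Return sorted (query, target) pairs that are supported in both directions.

    forward: (query_id, target_id) pairs from the query -> database search
    reverse: (target_id, query_id) pairs from the database -> query search
    """
    reverse_set = set(reverse)
    return sorted({(q, t) for q, t in forward if (t, q) in reverse_set})


def to_adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = defaultdict(set)
    for q, t in edges:
        graph[q].add(t)
        graph[t].add(q)
    return dict(graph)


# Hypothetical IDs: Q* are query proteins, A* are database entries.
forward = [("Q1", "A1"), ("Q1", "A2"), ("Q2", "A3")]
reverse = [("A1", "Q1"), ("A3", "Q2"), ("A2", "Q9")]

edges = reciprocal_hits(forward, reverse)
graph = to_adjacency(edges)
print(edges)
```

In this toy input, the pair Q1/A2 is dropped because A2's best match in the reverse direction is a different protein.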


Installation
- External requirements
- Quickstart
Usage
- Whole-proteome annotation
- GEM gap-filling module
Additional utils
- Download custom proteins
- Predict custom structures
References

1. Installation

To ensure full reproducibility and easy setup, WASP packages all external dependencies (Foldseek, gsutil) via Conda.

✅ Recommended: Install via Conda

Ensure you have Conda or Miniconda installed. This will automatically set up Python 3.10, Foldseek 8, gsutil, and the WASP pipeline in an isolated environment.

# Clone the repository
git clone https://github.com/gioodm/wasp-proteins-annotation.git
cd wasp-proteins-annotation

# Create the environment and install all dependencies
conda env create -f environment.yml

# Activate the environment
conda activate wasp-env

The manuscript results were obtained using Python 3.10.14 and Foldseek 8-ef4e960.

2. Usage

2.1 Whole-proteome annotation

Usage

First, identify the NCBI taxonomy ID (taxid) of your organism of interest. You can find the ID at NCBI Taxonomy. To check how many structures are linked to that ID, enter it in the AlphaFold DB (AFDB) search bar.

Once you have identified the taxid, run the pipeline as follows (e.g., for organism S. cerevisiae S288c, taxid: 559292):

wasp-run 559292

This requires gsutil to be installed.
Note: On the first run, the AlphaFold DB clustered at 50% must be downloaded using Foldseek. This can take some time depending on your machine's performance and internet speed. Ensure you have sufficient storage space to host the database. After the initial setup, WASP annotation takes up to 3 hours per iteration for a proteome of approximately 6000 proteins.

Additional parameters can be customised, including:

-e evalue_threshold: set the evalue threshold (default: 10e-10)
-b bitscore_threshold: set the bitscore threshold (default: 50)
-n max_neighbours: set the max number of neighbours (default: 10)
-s step: set step to add to max neighbours (n) in additional iterations (default: 10)
-i iterations: set number of iterations to perform (default: 3)

Usage examples:

wasp-run -e 1e-50 -b 200 -n 5 -i 5 559292
wasp-run -s 5 559292
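The interaction between -n, -s and -i can be sketched as follows. This assumes that the neighbour cap starts at n and grows by the step s in each additional iteration, which is one reading of the option descriptions above; the function name is hypothetical and the code is not taken from WASP.

```python
def neighbour_schedule(max_neighbours=10, step=10, iterations=3):
    """List the neighbour cap assumed for each iteration: n, n + s, n + 2s, ..."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return [max_neighbours + i * step for i in range(iterations)]


# Documented defaults: -n 10 -s 10 -i 3
print(neighbour_schedule())  # [10, 20, 30]

# Equivalent of "-n 5 -i 5" with the default step of 10
print(neighbour_schedule(max_neighbours=5, iterations=5))  # [5, 15, 25, 35, 45]
```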

To use a custom dataset (e.g., a newly sequenced genome or a set of proteins from different organisms), create a tar archive of a folder containing the protein structures (.cif.gz or .pdb.gz format) and place it in a folder called proteomes/ within the WASP folder. An example archive is provided at example_files/price.tar. Then run WASP with:

wasp-run your_custom.tar

An example of the final output can be found at example_files/price_annotated.xlsx
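A custom archive could be assembled with Python's standard tarfile module, as in the sketch below. The helper name is hypothetical, and the internal layout (members stored under a folder named after the archive) is an assumption; compare with example_files/price.tar before relying on it.

```python
import tarfile
from pathlib import Path

STRUCTURE_SUFFIXES = (".cif.gz", ".pdb.gz")


def make_proteome_tar(structures_dir, tar_path):
    """Pack every .cif.gz / .pdb.gz file in structures_dir into tar_path."""
    structures_dir = Path(structures_dir)
    tar_path = Path(tar_path)
    files = sorted(
        p for p in structures_dir.iterdir()
        if p.is_file() and p.name.endswith(STRUCTURE_SUFFIXES)
    )
    if not files:
        raise ValueError(f"no structure files found in {structures_dir}")
    tar_path.parent.mkdir(parents=True, exist_ok=True)
    # Plain (uncompressed) tar: the members are already gzip-compressed.
    with tarfile.open(tar_path, "w") as tar:
        for path in files:
            # Assumption: members live under a folder named after the archive.
            tar.add(path, arcname=f"{tar_path.stem}/{path.name}")
    return [p.name for p in files]


# Example call (paths are placeholders):
# make_proteome_tar("my_structures", "proteomes/your_custom.tar")
```

Files with other extensions are skipped, so notes or logs left in the source folder do not end up in the archive.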

2.2 GEM gap-filling module

Pre-processing

Some pre-processing steps are required to obtain a standardized input file for the WASP pipeline. Potential modifications to the Python scripts might be needed depending on the GEM format.

find-orphans: identifies orphan reactions in the GEM (accepted extensions: .xml, .sbml, .json, and .mat) using the Python3 cobrapy module. Annotation in different formats (accepted: BiGG, Rhea, EC number, KEGG, PubMed, MetaNetX) present in the model is retrieved - input example at: example_files/gap_filling/iYLI649.xml, output example at: example_files/gap_filling/iYLI649_orphans.txt.
run-rxn2code: each reaction annotation is mapped to the corresponding Rhea reaction ID and/or EC number when available. The output file includes: 1. reaction id, 2. reaction extended name, 3. Rhea IDs, 4. EC numbers. If MetaNetX codes are present, the reac_xref.tsv file (retrieved from MetaNetX) is needed - input example at: example_files/gap_filling/iYLI649_orphans.txt, output example at: example_files/gap_filling/iYLI649_gaps.txt.
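As a rough illustration of the first pre-processing step, the sketch below lists reactions without any associated gene. It is not the find-orphans code: the definition of "orphan" used here (no gene association, boundary reactions skipped) and both function names are assumptions. The cobrapy attributes it relies on (Reaction.id, Reaction.genes, Reaction.boundary) and cobra.io.read_sbml_model are part of cobrapy.

```python
def orphan_reaction_ids(reactions):
    """Return IDs of reactions with no associated gene.

    Assumption: boundary (exchange/sink/demand) reactions are skipped,
    since they normally carry no gene association by design.
    """
    return [
        rxn.id
        for rxn in reactions
        if not rxn.genes and not getattr(rxn, "boundary", False)
    ]


def load_orphans_from_sbml(path):
    """Load an SBML model with cobrapy and list its orphan reactions."""
    import cobra  # imported lazily so the helper above needs no third-party package

    model = cobra.io.read_sbml_model(path)
    return orphan_reaction_ids(model.reactions)


# Example call (requires cobrapy):
# load_orphans_from_sbml("example_files/gap_filling/iYLI649.xml")
```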

The final gaps_file.txt must contain all 4 columns; if the reaction extended name is not present in the model, an empty column should be present. If no Rhea/EC IDs are identified, empty columns should be present instead.
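The four-column requirement can be checked or enforced with a few lines of Python. This sketch assumes the file is tab-separated, which is not stated above; the function names and the example reaction ID are hypothetical. Compare with example_files/gap_filling/iYLI649_gaps.txt for the actual format.

```python
def format_gap_row(reaction_id, name="", rhea_ids="", ec_numbers=""):
    """Build one row with exactly 4 columns; missing values become empty columns."""
    if not reaction_id:
        raise ValueError("reaction_id is required")
    fields = [reaction_id, name or "", rhea_ids or "", ec_numbers or ""]
    if any("\t" in field or "\n" in field for field in fields):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


def has_four_columns(line):
    """Check a row, keeping trailing empty columns (only the newline is stripped)."""
    return len(line.rstrip("\n").split("\t")) == 4


# A reaction with an EC number but no extended name and no Rhea ID.
row = format_gap_row("R_EXAMPLE", ec_numbers="1.1.1.1")
print(repr(row))
print(has_four_columns(row))
```

Note that stripping all trailing whitespace before splitting would silently drop empty trailing columns, which is why only the newline is removed.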

Usage

wasp-gem [-h] [-e evalue_threshold] [-b bitscore_threshold] [-t tmscore] taxid gaps_file.txt

An example of the final output can be found at example_files/gap_filling/iYLI649_hits.txt

3. Additional utils

3.1 Downloading custom protein structures from AFDB

The afdb-download utility allows you to fetch protein structures from the AlphaFold Database (AFDB) using either a FASTA file containing protein sequences or a list of UniProt IDs.

Usage:

afdb-download [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR [--num_cores NUM_CORES]
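Since the input can be either a FASTA file or a plain ID list, it may help to see how the two could be reduced to the same list of accessions. The sketch below is an illustration only and does not reflect how afdb-download parses its input; the function name and the placeholder accessions are assumptions. It relies on the UniProt FASTA header convention `>db|ACCESSION|ENTRY_NAME`.

```python
def read_accessions(text):
    """Extract UniProt accessions from FASTA text or from a one-ID-per-line list."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if not headers:
        accessions = lines  # plain list: one accession per line
    else:
        accessions = []
        for header in headers:
            body = header[1:].strip()
            if not body:
                continue  # skip empty headers
            first = body.split()[0]
            parts = first.split("|")
            # UniProt-style header: >sp|P12345|NAME_SPECIES description
            accessions.append(parts[1] if len(parts) >= 3 else first)
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(accessions))


fasta_text = ">sp|P12345|NAME_SPECIES example protein\nMSIPE\n>Q67890 custom header\nMKT\n"
id_list_text = "P12345\nQ67890\nP12345\n"

print(read_accessions(fasta_text))
print(read_accessions(id_list_text))
```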

3.2 Predicting structures with ColabFold

A Python script (predict_structures.py) is provided to run ColabFold predictions on a list of UniProt IDs.

Prerequisites:

ColabFold must be installed and accessible in your PATH.
Ensure your system meets hardware requirements (GPU recommended for large predictions).

Usage:

python3 predict_structures.py -i <uniprot_ids.txt> -o <output_dir> [-c <cores>]

Best practices & recommendations

For large-scale predictions, combine proteins into a single FASTA file to reduce overhead, but split into smaller batches (50-100 proteins) to avoid memory issues. Use parallel processing (-c 8 for 8 cores) and submit as a batch job (e.g., SLURM) for heavy workloads.
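Splitting a combined FASTA file into batches of 50-100 proteins can be done with a short helper like the one below. This is a sketch with hypothetical function names, independent of predict_structures.py; it only handles the text splitting, not the ColabFold runs.

```python
def split_fasta(text):
    """Split FASTA text into records, each starting with its '>' header line."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def batches(records, size=50):
    """Group records into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [records[i:i + size] for i in range(0, len(records), size)]


# Toy input with 5 placeholder records, batched in groups of 2.
fasta_text = "".join(f">prot{i}\nMKT\n" for i in range(5))
groups = batches(split_fasta(fasta_text), size=2)
print([len(group) for group in groups])  # [2, 2, 1]
```

Each batch can then be written to its own FASTA file and submitted as a separate job.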

Pre-download FASTA sequences to avoid failures. Test with one protein first to verify setup. If predictions fail, reduce --num-recycle (default: 3) or check GPU memory.

For custom workflows, modify the script (e.g., add --amber or --templates). See the ColabFold docs for advanced options.

4. References

[1] van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nature Biotechnology. 2023. doi: https://doi.org/10.1038/s41587-023-01773-0

[2] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583-9. doi: https://doi.org/10.1038/s41586-021-03819-2

[3] Varadi M, Bertoni D, Magana P, Paramval U, Pidruchna I, Radhakrishnan M, et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research. 2023;52(D1):D368-D375. doi: https://doi.org/10.1093/nar/gkad1011

Release history

- 2.0 (Apr 29, 2026), this version
- 1.1.1 (Mar 30, 2026)
- 1.1.0 (Mar 30, 2026)
- 1.0.7 (Jun 13, 2025)
- 1.0.6 (Jun 13, 2025)
- 1.0.5 (Jun 13, 2025)
- 1.0.4 (Jun 13, 2025)
- 1.0.3 (Jun 10, 2025)
- 1.0.2 (Jun 10, 2025)
- 1.0.1 (Jun 10, 2025)
- 1.0.0 (Jun 10, 2025)

Download files

Source Distribution
- wasp_proteins_annotation-2.0.tar.gz (16.6 MB), uploaded Apr 29, 2026

Built Distribution
- wasp_proteins_annotation-2.0-py3-none-any.whl (5.4 MB, Python 3), uploaded Apr 29, 2026

File details

Details for the file wasp_proteins_annotation-2.0.tar.gz.

File metadata

Download URL: wasp_proteins_annotation-2.0.tar.gz
Upload date: Apr 29, 2026
Size: 16.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.6

File hashes

Hashes for wasp_proteins_annotation-2.0.tar.gz
Algorithm	Hash digest
SHA256	`964d2fa1f446049d37fead44a771cb1a50f9336be3e8c566ccdd17cb88c3065c`
MD5	`865d1085980533c5f29c6358823a8130`
BLAKE2b-256	`f7966b8103045390f198b262376d0a4aa06ae7c10cb78362497cedc27dc1de91`


File details

Details for the file wasp_proteins_annotation-2.0-py3-none-any.whl.

File metadata

Download URL: wasp_proteins_annotation-2.0-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 5.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.6

File hashes

Hashes for wasp_proteins_annotation-2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fee4817e7a7648cb003a78c549c7d2dff0239c94a75634037fd728042e9838b9`
MD5	`8e427d407583848a891194dd2c6b9100`
BLAKE2b-256	`a966af19d8b2dba8102bedc52efa8d3e164e4978d09db348ed6b941f676c5a9f`

