ProtPen: Protein function prediction pipeline using sequence and structure-based tools
ProtPen: Protein Function Prediction Pipeline
This repository contains a pipeline for predicting and analyzing protein function using structure- and sequence-based methods. ProtPen integrates EggNOG-mapper, AlphaFold structure retrieval, Foldseek structural searches, and result enrichment to investigate differentially abundant proteins of unknown function in a proteomic dataset.
Overview
ProtPen takes a FASTA file containing UniProt protein identifiers as input, retrieves AlphaFold PDB structures, performs EggNOG-mapper and Foldseek searches, filters and consolidates Foldseek results, enriches them with UniProt annotations, and merges the outputs for downstream functional interpretation.
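As an illustration of the input stage, the sketch below pulls UniProt accessions out of FASTA headers using only the standard library. It is not ProtPen's own parser: the function name `read_uniprot_ids` is hypothetical, and it assumes headers are either UniProt-style (`>sp|ACCESSION|NAME ...`) or a bare accession.

```python
import io


def read_uniprot_ids(handle):
    """Yield one accession per FASTA record, in file order (hypothetical helper)."""
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence lines carry no identifier
        header = line[1:].strip()
        if not header:
            continue  # ignore empty headers
        first_token = header.split()[0]
        parts = first_token.split("|")
        # UniProt-style headers look like "sp|P69905|HBA_HUMAN";
        # otherwise assume the first token already is the accession.
        yield parts[1] if len(parts) >= 3 else first_token


example = io.StringIO(
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha\n"
    "MVLSPADKTN\n"
    ">Q8N158\n"
    "MSALRPLLLL\n"
)
ids = list(read_uniprot_ids(example))
print(ids)
```

The same accession list would then drive both the structure download and the later filtering of Foldseek hits.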
Installation
The pipeline is intended to run on a Unix-based HPC system using SLURM and virtual environments.
Prerequisites
- Python ≥ 3.10
- SLURM scheduler
- EggNOG-mapper v2
- Foldseek
- Python packages: pandas, requests, biopython, pytest, psutil ≥ 6.0
Clone the Repository
git clone https://github.com/ProtPen/ProtPen.git
cd ProtPen
Install
EggNOG-mapper and Foldseek are external tools and must be installed separately (see the links above); everything else is installed via pip.
It's recommended to install into a virtual environment:
python -m venv venv
source venv/bin/activate
Install ProtPen and its Python dependencies (pandas, requests, biopython, psutil):
pip install -e .
To also install the test/dev dependencies (pytest, black):
pip install -e ".[test]"
Pipeline Workflow
1. Run EggNOG-mapper (cli_eggnog.py): Annotates proteins using the EggNOG orthology database.
2. Download AlphaFold PDB files (cli_download.py): Retrieves AlphaFold-predicted structures from UniProt based on FASTA input.
3. Run Foldseek search (cli_foldseek.py): Compares AlphaFold structures to a preprocessed Foldseek database.
4. Filter and consolidate Foldseek results (cli_consolidate_foldseek.py): Keeps top hits per query and filters based on the input FASTA file.
5. Enrich Foldseek results (cli_enrich.py): Adds UniProt annotations to Foldseek hits using PDB-to-UniProt mapping.
6. Merge EggNOG and Foldseek results (cli_merge.py): Joins both result tables on query ID for final annotation output.
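The consolidation step (keep top hits per query, restricted to the input identifiers) can be sketched as follows. This is a minimal standard-library illustration, not the code in cli_consolidate_foldseek.py: the column names `query`, `target`, and `bits`, the header row, and ranking by bit score are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def top_hits(rows, allowed_queries, n=1):
    """Keep the n highest-scoring hits for each query found in allowed_queries."""
    by_query = defaultdict(list)
    for row in rows:
        if row["query"] in allowed_queries:  # drop hits for IDs not in the input FASTA
            by_query[row["query"]].append(row)
    kept = []
    for query in sorted(by_query):
        # Rank by bit score, highest first (assumed ranking criterion).
        ranked = sorted(by_query[query], key=lambda r: float(r["bits"]), reverse=True)
        kept.extend(ranked[:n])
    return kept


raw = io.StringIO(
    "query\ttarget\tbits\n"
    "P1\tA\t100\n"
    "P1\tB\t250\n"
    "P2\tC\t80\n"
    "P9\tD\t300\n"
)
rows = list(csv.DictReader(raw, delimiter="\t"))
kept = top_hits(rows, allowed_queries={"P1", "P2"})
for row in kept:
    print(row["query"], row["target"], row["bits"])
```

Raising `n` keeps more candidates per query, at the cost of a larger table to enrich in the next step.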
Scripts
| Script | Description |
|---|---|
| cli_eggnog.py | Runs EggNOG-mapper with UniProt ID input |
| cli_download.py | Downloads AlphaFold PDB files from UniProt |
| cli_foldseek.py | Runs Foldseek structural search on PDBs |
| cli_consolidate_foldseek.py | Filters and consolidates Foldseek search results |
| cli_enrich.py | Enriches Foldseek results with UniProt metadata |
| cli_merge.py | Merges enriched Foldseek and EggNOG-mapper results |
Usage
All example SLURM scripts below live in sample_search/; run sbatch from that directory (or adjust paths accordingly).
A complete, ready-to-run pipeline for a sample dataset is provided as a single script, sample_run_pipeline.sh, which runs all six steps below in the correct order (with EggNOG-mapper running in the background alongside the Foldseek branch):
sbatch sample_run_pipeline.sh
Alternatively, each step can be run as its own SLURM job, e.g. to rerun a single step or swap in a different tool:
# Step 1: EggNOG-mapper
sbatch run_eggnog_mapper.sh
# Step 2: Download AlphaFold PDBs
sbatch run_download_alphafold.sh
# Step 3: Run Foldseek search
sbatch run_foldseek.sh
# Step 4: Consolidate Foldseek results
sbatch run_consolidate.sh
# Step 5: Enrich Foldseek results
sbatch run_enrich.sh
# Step 6: Merge Foldseek and EggNOG-mapper results
sbatch run_merge.sh
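When the steps are submitted as separate jobs, their order can be enforced with SLURM job dependencies. The sketch below builds `sbatch --parsable --dependency=afterok:...` commands from a small dependency table. The table itself is an assumption inferred from the step descriptions (the merge waits for both EggNOG-mapper and enrichment), and the helper names are hypothetical. The demo uses a fake submitter, so nothing is actually sent to the scheduler.

```python
import subprocess

# Assumed dependency table: script -> scripts that must finish successfully first.
STEPS = {
    "run_eggnog_mapper.sh": [],
    "run_download_alphafold.sh": [],
    "run_foldseek.sh": ["run_download_alphafold.sh"],
    "run_consolidate.sh": ["run_foldseek.sh"],
    "run_enrich.sh": ["run_consolidate.sh"],
    "run_merge.sh": ["run_eggnog_mapper.sh", "run_enrich.sh"],
}


def sbatch_command(script, upstream_job_ids):
    """Build the sbatch argument list for one script."""
    cmd = ["sbatch", "--parsable"]
    if upstream_job_ids:
        cmd.append("--dependency=afterok:" + ":".join(upstream_job_ids))
    cmd.append(script)
    return cmd


def slurm_submit(cmd):
    """Run sbatch and return the job ID that --parsable prints."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]  # output may be "jobid;cluster"


def submit_all(steps, submit=slurm_submit):
    """Submit scripts in table order, wiring each one to its upstream job IDs."""
    job_ids = {}
    for script, upstream in steps.items():  # dicts preserve insertion order
        cmd = sbatch_command(script, [job_ids[u] for u in upstream])
        job_ids[script] = submit(cmd)
    return job_ids


# Dry run: record the commands instead of calling sbatch.
recorded = []


def fake_submit(cmd):
    recorded.append(cmd)
    return str(1000 + len(recorded))


job_ids = submit_all(STEPS, submit=fake_submit)
for cmd in recorded:
    print(" ".join(cmd))
```

On a real cluster, calling `submit_all(STEPS)` from sample_search/ would use `slurm_submit` and return the actual job IDs.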
Output
- merged_annotations.tsv: Final merged annotations from EggNOG and Foldseek.
- enriched_foldseek_results.tsv: Foldseek results with added UniProt metadata.
- consolidated_foldseek_results.tsv: Filtered and top Foldseek matches.
- eggnog_results.tsv: Raw output from EggNOG-mapper.
- AlphaFold .pdb files downloaded for input UniProt IDs.
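To show how a table like merged_annotations.tsv can arise from the two result tables, here is a minimal join on query ID written with the standard library. It is a sketch under stated assumptions rather than the logic of cli_merge.py: the column names are invented, each table is assumed to hold at most one row per query, and queries present in only one table are kept with empty cells.

```python
import csv
import io


def merge_on_query(left_rows, right_rows, key="query"):
    """Combine rows that share the same key; keep keys found in either table."""
    merged = {}
    for row in list(left_rows) + list(right_rows):
        merged.setdefault(row[key], {}).update(row)
    return [merged[q] for q in sorted(merged)]


# Hypothetical column names, chosen so the two tables do not collide.
eggnog = [
    {"query": "P1", "eggnog_description": "kinase"},
    {"query": "P2", "eggnog_description": "transporter"},
]
foldseek = [
    {"query": "P1", "foldseek_target": "1abc_A"},
    {"query": "P3", "foldseek_target": "2xyz_B"},
]

merged = merge_on_query(eggnog, foldseek)

out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["query", "eggnog_description", "foldseek_target"],
    delimiter="\t",
    restval="",  # fill cells missing from one side with an empty string
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(merged)
print(out.getvalue())
```

With pandas, which ProtPen already depends on, the equivalent would be a merge of two data frames on the query column.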
Modular Design and Extensibility
ProtPen is designed as a modular pipeline, where each analysis step is implemented as an independent module that can be added, removed, or replaced without modifying the core codebase. Modules communicate exclusively through file-based inputs and outputs (primarily TSV and FASTA files), allowing users to customize the workflow for different datasets or analysis goals.
What Is a Module?
Each ProtPen module:
- Performs a single logical task (e.g., annotation, structure search, enrichment)
- Is implemented as a Python module within the protpen/ package
- Exposes a command-line interface (CLI) via a cli_*.py wrapper
- Accepts standardized input files and produces standardized output files
The overall workflow is orchestrated by a shell script (e.g., run_pipeline.sh) that sequentially calls these CLI modules. There are no hard-coded dependencies between modules beyond expected input/output formats.
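A module that follows this contract can be quite small. The sketch below shows one possible shape for a wrapper that reads a TSV and writes a TSV. The `-i`/`-o` flags mirror the cli_enrich call shown in this README, while the `run` function, the added `note` column, and the file names are hypothetical placeholders rather than part of ProtPen.

```python
import argparse
import csv
import os
import tempfile


def run(input_path, output_path):
    """Placeholder task: copy every row and append a 'note' column."""
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = list(reader.fieldnames or []) + ["note"]
        writer = csv.DictWriter(
            dst, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            row["note"] = "seen_by_example_module"
            writer.writerow(row)


def main(argv=None):
    """CLI entry point: parse -i/-o and delegate to run()."""
    parser = argparse.ArgumentParser(description="Hypothetical ProtPen-style module")
    parser.add_argument("-i", dest="input_path", required=True, help="input TSV file")
    parser.add_argument("-o", dest="output_path", required=True, help="output TSV file")
    args = parser.parse_args(argv)
    run(args.input_path, args.output_path)
    return 0


# Demo: exercise the CLI against a temporary file instead of real pipeline data.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in.tsv")
    out_path = os.path.join(tmp, "out.tsv")
    with open(in_path, "w") as fh:
        fh.write("query\ttarget\nP1\tA\n")
    exit_code = main(["-i", in_path, "-o", out_path])
    with open(out_path) as fh:
        result = fh.read()
print(result)
```

Keeping the task in `run` and the argument handling in `main` lets the same function be called from tests or from another Python module without going through the command line.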
Removing a Module
To remove a module from the pipeline, simply delete or comment out the corresponding CLI call in your pipeline script. Ensure that downstream steps do not require the removed module’s output.
Example: Skipping Foldseek enrichment
Remove the following line from your pipeline script:
python -m protpen.cli_enrich -i consolidated_foldseek_results.tsv -o enriched_foldseek_results.tsv
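Before dropping a step, it can help to confirm that no later step still expects its output. The check below walks a list of steps and reports any input file that nothing upstream produces. The input and output assignments are assumptions based on the file names in this README (foldseek_results.tsv in particular is a made-up name for the raw search output), and `missing_inputs` is a hypothetical helper.

```python
def missing_inputs(steps, initial_files):
    """Return (step, file) pairs for inputs that no earlier step has produced."""
    available = set(initial_files)
    problems = []
    for name, inputs, outputs in steps:
        for path in inputs:
            if path not in available:
                problems.append((name, path))
        available.update(outputs)
    return problems


# Assumed wiring of the last three steps: (module, inputs, outputs).
FULL = [
    (
        "cli_consolidate_foldseek",
        ["foldseek_results.tsv"],
        ["consolidated_foldseek_results.tsv"],
    ),
    (
        "cli_enrich",
        ["consolidated_foldseek_results.tsv"],
        ["enriched_foldseek_results.tsv"],
    ),
    (
        "cli_merge",
        ["eggnog_results.tsv", "enriched_foldseek_results.tsv"],
        ["merged_annotations.tsv"],
    ),
]
initial = ["foldseek_results.tsv", "eggnog_results.tsv"]

without_enrich = [step for step in FULL if step[0] != "cli_enrich"]

print(missing_inputs(FULL, initial))
print(missing_inputs(without_enrich, initial))
```

Under these assumptions, skipping enrichment leaves cli_merge without enriched_foldseek_results.tsv, so the merge call would need to be pointed at the consolidated table instead.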
Contributing
Contributions are welcome! See CONTRIBUTING.md for how to report issues, propose new tools or modules, and submit pull requests.
Citation
If you use ProtPen in your work, please cite it. While the publication is pending, please cite this GitHub repository.