ProtPen: Protein function prediction pipeline using sequence- and structure-based tools

ProtPen: Protein Function Prediction Pipeline

This repository contains a pipeline for predicting and analyzing protein function using structure- and sequence-based methods. ProtPen integrates EggNOG-mapper, AlphaFold structure retrieval, Foldseek structural searches, and result enrichment to investigate differentially abundant proteins of unknown function in a proteomic dataset.

Overview

ProtPen takes a FASTA file containing UniProt protein identifiers as input, retrieves AlphaFold PDB structures, performs EggNOG-mapper and Foldseek searches, filters and consolidates Foldseek results, enriches them with UniProt annotations, and merges the outputs for downstream functional interpretation.
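As an illustration of the first part of that flow, the sketch below pulls UniProt accessions out of FASTA header lines. It is not ProtPen's own parser: the function name `uniprot_ids_from_fasta` and the accepted header shapes (`>sp|ACC|NAME`, `>tr|ACC|NAME`, or a bare `>ACC`) are assumptions made for the example.

```python
import re

# Assumed header shapes: ">sp|P69905|HBA_HUMAN ...", ">tr|...|..." or a bare ">P69905".
_HEADER = re.compile(r"^>(?:(?:sp|tr)\|)?([A-Za-z0-9]+)")


def uniprot_ids_from_fasta(fasta_text: str) -> list[str]:
    """Return accessions from FASTA header lines, in file order, without duplicates."""
    seen: dict[str, None] = {}
    for line in fasta_text.splitlines():
        match = _HEADER.match(line)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


example = """>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha
MVLSPADKTN
>tr|A0A024R161|A0A024R161_HUMAN Some protein
MSDQ
>Q9Y6K9
MNRH
"""
ids = uniprot_ids_from_fasta(example)
print(ids)
```

The resulting list is what a download step would iterate over when fetching one structure per accession.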

Installation

The pipeline is intended to run on a Unix-based HPC system using SLURM and virtual environments.

Prerequisites

  • Python ≥ 3.10
  • SLURM scheduler
  • EggNOG-mapper v2
  • Foldseek
  • Python packages: pandas, requests, biopython, psutil ≥ 6.0
  • pytest (only needed to run the tests)

Clone the Repository

git clone https://github.com/ProtPen/ProtPen.git
cd ProtPen

Install

EggNOG-mapper and Foldseek are external tools and must be installed separately (see Prerequisites above); everything else is installed via pip.

It's recommended to install into a virtual environment:

python -m venv venv
source venv/bin/activate

Install ProtPen and its Python dependencies (pandas, requests, biopython, psutil):

pip install -e .

To also install the test/dev dependencies (pytest, black):

pip install -e ".[test]"
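Because the external tools are installed outside pip, a quick pre-flight check can confirm they are visible on PATH before any job is submitted. The sketch below is not part of ProtPen; the executable names `emapper.py` and `foldseek` are assumptions based on how those tools are usually installed and may differ on your cluster.

```python
import shutil

# Assumed executable names; adjust them to match your installation or modules.
EXTERNAL_TOOLS = ("emapper.py", "foldseek")


def missing_tools(names=EXTERNAL_TOOLS):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```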

Pipeline Workflow

  1. Run EggNOG-mapper (cli_eggnog.py)
    Annotates proteins using the EggNOG orthology database.

  2. Download AlphaFold PDB files (cli_download.py)
    Retrieves AlphaFold-predicted structures for the UniProt IDs in the input FASTA file.

  3. Run Foldseek search (cli_foldseek.py)
    Compares AlphaFold structures to a preprocessed Foldseek database.

  4. Filter and consolidate Foldseek results (cli_consolidate_foldseek.py)
    Keeps top hits per query and filters based on the input FASTA file.

  5. Enrich Foldseek results (cli_enrich.py)
    Adds UniProt annotations to Foldseek hits using PDB-to-UniProt mapping.

  6. Merge EggNOG and Foldseek results (cli_merge.py)
    Joins both result tables on query ID for final annotation output.
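Step 4 can be pictured with the following minimal sketch, which keeps the best hits per query and drops queries that are not in an allowed set. It uses Foldseek's default tabular column order; ranking by bit score, the cutoff `n`, and matching queries by exact ID are assumptions for the example rather than ProtPen's documented behavior.

```python
import csv
import io
from collections import defaultdict

# Foldseek's default tabular columns (the file itself has no header row).
COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def top_hits(rows, allowed_queries, n=5):
    """Keep the n highest-bit-score hits for each query in allowed_queries."""
    by_query = defaultdict(list)
    for row in rows:
        if row["query"] in allowed_queries:
            by_query[row["query"]].append(row)
    kept = []
    for hits in by_query.values():
        hits.sort(key=lambda r: float(r["bits"]), reverse=True)
        kept.extend(hits[:n])
    return kept


# Hypothetical hits for two queries, P1 and P2.
raw = "\n".join("\t".join(fields) for fields in [
    ["P1", "tA", "0.9", "100", "10", "0", "1", "100", "1", "100", "1e-30", "250"],
    ["P1", "tB", "0.5", "90", "45", "1", "1", "90", "5", "95", "1e-10", "120"],
    ["P1", "tC", "0.7", "95", "28", "0", "1", "95", "1", "95", "1e-20", "180"],
    ["P2", "tD", "0.8", "80", "16", "0", "1", "80", "1", "80", "1e-25", "200"],
])
rows = list(csv.DictReader(io.StringIO(raw), fieldnames=COLUMNS, delimiter="\t"))
kept = top_hits(rows, allowed_queries={"P1"}, n=2)
print([(r["query"], r["target"]) for r in kept])
```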

Scripts

Script                        Description
cli_eggnog.py                 Runs EggNOG-mapper with UniProt ID input
cli_download.py               Downloads AlphaFold PDB files for UniProt IDs
cli_foldseek.py               Runs Foldseek structural search on PDBs
cli_consolidate_foldseek.py   Filters and consolidates Foldseek search results
cli_enrich.py                 Enriches Foldseek results with UniProt metadata
cli_merge.py                  Merges enriched Foldseek and EggNOG-mapper results

Usage

All example SLURM scripts below live in sample_search/; run sbatch from that directory (or adjust paths accordingly).

A complete, ready-to-run pipeline for a sample dataset is provided as a single script, sample_run_pipeline.sh, which runs all six steps below in the correct order (with EggNOG-mapper running in the background alongside the Foldseek branch):

sbatch sample_run_pipeline.sh

Alternatively, each step can be run as its own SLURM job, e.g. to rerun a single step or swap in a different tool:

# Step 1: EggNOG-mapper
sbatch run_eggnog_mapper.sh

# Step 2: Download AlphaFold PDBs
sbatch run_download_alphafold.sh

# Step 3: Run Foldseek search
sbatch run_foldseek.sh

# Step 4: Consolidate Foldseek results
sbatch run_consolidate.sh

# Step 5: Enrich Foldseek results
sbatch run_enrich.sh

# Step 6: Merge Foldseek and EggNOG-mapper results
sbatch run_merge.sh

Output

  • merged_annotations.tsv: Final merged annotations from EggNOG and Foldseek.
  • enriched_foldseek_results.tsv: Foldseek results with added UniProt metadata.
  • consolidated_foldseek_results.tsv: Filtered Foldseek results, keeping only the top hits per query.
  • eggnog_results.tsv: Raw output from EggNOG-mapper.
  • AlphaFold .pdb files downloaded for input UniProt IDs.
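To make the relationship between these tables concrete, here is a small sketch of joining a Foldseek table and an EggNOG table on a shared query ID. The column names (`query`, `target`, `uniprot_name`, `description`) and the choice of an outer join are assumptions for illustration; the real column layout of merged_annotations.tsv may differ.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on_query(foldseek_rows, eggnog_rows, key="query"):
    """Outer-join the two tables on `key`; cells with no data become ''.

    Assumes at most one EggNOG row per query and no shared column names
    other than the key.
    """
    eggnog_by_id = {row[key]: row for row in eggnog_rows}
    fold_cols = list(foldseek_rows[0]) if foldseek_rows else [key]
    egg_cols = list(eggnog_rows[0]) if eggnog_rows else []
    columns = fold_cols + [c for c in egg_cols if c not in fold_cols]
    blank = dict.fromkeys(columns, "")
    merged, seen = [], set()
    for row in foldseek_rows:
        seen.add(row[key])
        merged.append({**blank, **row, **eggnog_by_id.get(row[key], {})})
    for query_id, row in eggnog_by_id.items():
        if query_id not in seen:  # annotated by EggNOG only
            merged.append({**blank, **row})
    return columns, merged


foldseek = "query\ttarget\tuniprot_name\nP1\ttA\tKinase\nP1\ttC\tHydrolase\n"
eggnog = "query\tdescription\nP1\tATP-binding protein\nP3\tTransporter\n"
columns, merged = merge_on_query(read_tsv(foldseek), read_tsv(eggnog))
for row in merged:
    print("\t".join(row[c] for c in columns))
```

An outer join keeps proteins that only one of the two tools could annotate, which is often the interesting case for proteins of unknown function.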

Modular Design and Extensibility

ProtPen is designed as a modular pipeline, where each analysis step is implemented as an independent module that can be added, removed, or replaced without modifying the core codebase. Modules communicate exclusively through file-based inputs and outputs (primarily TSV and FASTA files), allowing users to customize the workflow for different datasets or analysis goals.

What Is a Module?

Each ProtPen module:

  • Performs a single logical task (e.g., annotation, structure search, enrichment)
  • Is implemented as a Python module within the protpen/ package
  • Exposes a command-line interface (CLI) via a cli_*.py wrapper
  • Accepts standardized input files and produces standardized output files

The overall workflow is orchestrated by a shell script (e.g., run_pipeline.sh) that sequentially calls these CLI modules. There are no hard-coded dependencies between modules beyond expected input/output formats.
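The shape of such a module can be sketched as below: one function that does the work on files, plus a thin argparse wrapper. Everything here is hypothetical (the `filter_by_evalue` task, the `evalue` column, the `--max-evalue` option); only the `-i`/`-o` flag style mirrors the cli_enrich call shown in this README.

```python
import argparse
import csv
import tempfile
from pathlib import Path


def filter_by_evalue(in_path, out_path, max_evalue=1e-3):
    """Copy TSV rows whose 'evalue' is <= max_evalue; return the rows written."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        written = 0
        for row in reader:
            if float(row["evalue"]) <= max_evalue:
                writer.writerow(row)
                written += 1
    return written


def main(argv=None):
    """CLI wrapper; a real cli_*.py file would call this under a __main__ guard."""
    parser = argparse.ArgumentParser(description="Hypothetical ProtPen-style module")
    parser.add_argument("-i", "--input", required=True, help="input TSV")
    parser.add_argument("-o", "--output", required=True, help="output TSV")
    parser.add_argument("--max-evalue", type=float, default=1e-3)
    args = parser.parse_args(argv)
    count = filter_by_evalue(args.input, args.output, args.max_evalue)
    print(f"wrote {count} rows to {args.output}")
    return 0


# Demonstration on a temporary file instead of real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "in.tsv"
    dst_path = Path(tmp) / "out.tsv"
    src_path.write_text("query\ttarget\tevalue\nP1\ttA\t1e-30\nP1\ttB\t0.5\n")
    exit_code = main(["-i", str(src_path), "-o", str(dst_path)])
    result = dst_path.read_text()
print(result)
```

Keeping the file-to-file function separate from the argument parsing means the same logic can be unit-tested directly and called from a pipeline script.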

Removing a Module

To remove a module from the pipeline, simply delete or comment out the corresponding CLI call in your pipeline script. Ensure that downstream steps do not require the removed module’s output.

Example: Skipping Foldseek enrichment

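The merge step (cli_merge.py) is described as consuming the enriched Foldseek results, so it must be adjusted or removed as well. Then remove the following line from your pipeline script: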

python -m protpen.cli_enrich -i consolidated_foldseek_results.tsv -o enriched_foldseek_results.tsv
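One way to check that downstream steps do not depend on a removed module is to write down what each step reads and writes and walk the list. The sketch below does this; the input and output names per step are assumptions inferred from the workflow and output lists in this README (`input.fasta`, `pdb_dir`, and `foldseek_raw.tsv` are placeholders), not taken from the ProtPen source.

```python
# Assumed inputs ("needs") and outputs ("makes") for each step, in run order.
STEPS = {
    "eggnog": {"needs": ["input.fasta"], "makes": ["eggnog_results.tsv"]},
    "download": {"needs": ["input.fasta"], "makes": ["pdb_dir"]},
    "foldseek": {"needs": ["pdb_dir"], "makes": ["foldseek_raw.tsv"]},
    "consolidate": {"needs": ["foldseek_raw.tsv", "input.fasta"],
                    "makes": ["consolidated_foldseek_results.tsv"]},
    "enrich": {"needs": ["consolidated_foldseek_results.tsv"],
               "makes": ["enriched_foldseek_results.tsv"]},
    "merge": {"needs": ["enriched_foldseek_results.tsv", "eggnog_results.tsv"],
              "makes": ["merged_annotations.tsv"]},
}


def broken_steps(steps, skipped, initial=("input.fasta",)):
    """Return {step: [missing inputs]} for steps that would lack an input file."""
    available = set(initial)
    problems = {}
    for name, spec in steps.items():
        if name in skipped:
            continue
        missing = [f for f in spec["needs"] if f not in available]
        if missing:
            problems[name] = missing  # this step cannot run, so it makes nothing
        else:
            available.update(spec["makes"])
    return problems


print(broken_steps(STEPS, skipped={"enrich"}))
```

Under these assumptions, skipping enrichment flags the merge step, which would then need a different Foldseek input or to be removed too.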

Contributing

Contributions are welcome! See CONTRIBUTING.md for how to report issues, propose new tools or modules, and submit pull requests.

Citation

If you use ProtPen in your work, please cite it. Until the publication is available, please cite this GitHub repository.
