Kinase-substrate network prediction: PostgreSQL, Numba, pynetphorest, live data, false-negative recovery

PyNetworKIN

PyNetworKIN is a Bayesian kinase–substrate prediction pipeline for phosphoproteomics. It integrates sequence-motif scoring (via pynetphorest) with protein-interaction context (via the STRING network) to predict which kinases, phosphatases, or phospho-binding domains are responsible for observed phosphorylation events.

This repository is a modernised Python 3 port of the original NetworKIN 3.0 tool (Linding, Jensen, Horn & Kim, 2005–2013), extended to support STRING v12 protein interaction data.


Features

  • Predicts kinase/phosphatase/phospho-binding domain substrates from FASTA + phosphosite input.
  • Supports human (9606) and yeast (4932) proteomes.
  • Accepts multiple phosphosite input formats: NetworKIN TSV, ProteomeDiscoverer, MaxQuant, and custom formats.
  • Integrates sequence motif posterior probabilities with STRING network proximity scores using pre-calibrated Bayesian likelihood-ratio tables.
  • Outputs per-site predictions as a TSV file in the results/ directory.
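
The integration of motif posteriors with STRING proximity scores can be pictured as a table lookup followed by a product of likelihood ratios. The following is a minimal sketch, assuming the two evidence sources are treated as independent (naive Bayes) so that their ratios multiply. The bin edges and ratio values are invented for illustration and are not the package's calibrated tables; `lookup_ratio` and `combined_score` are hypothetical names, not part of the pynetworkin API.

```python
from bisect import bisect_right

# Hypothetical calibration tables: (lower bin edge, likelihood ratio).
# The real tables ship with the package; these numbers are made up.
MOTIF_TABLE = [(0.0, 0.2), (0.1, 1.0), (0.3, 4.0), (0.6, 12.0)]
STRING_TABLE = [(0.0, 0.5), (0.4, 2.0), (0.7, 6.0), (0.9, 15.0)]


def lookup_ratio(table, score):
    """Return the likelihood ratio of the bin whose lower edge is <= score."""
    edges = [edge for edge, _ in table]
    index = bisect_right(edges, score) - 1
    if index < 0:
        raise ValueError(f"score {score} is below the first bin edge")
    return table[index][1]


def combined_score(motif_posterior, string_score):
    """Multiply the two likelihood ratios (independence assumption)."""
    motif_ratio = lookup_ratio(MOTIF_TABLE, motif_posterior)
    string_ratio = lookup_ratio(STRING_TABLE, string_score)
    return motif_ratio * string_ratio
```

Under these made-up tables, a site with a strong motif match but a distant kinase in the network is down-weighted, because a ratio below 1 from either source shrinks the product.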

Requirements

| Dependency | Version | Notes |
|---|---|---|
| Python | ≥ 3.10 | |
| NumPy | ≥ 1.26 | |
| Pandas | ≥ 2.2 | |
| pynetphorest | ≥ 0.1.1 | Motif scoring atlas |
| NCBI BLAST+ | ≥ 2.9 | blastp must be on PATH or supplied via --blast-dir |

Installation

From source

pip install -e .

Docker (GHCR)

docker pull ghcr.io/bibymaths/pynetworkin:latest
docker run --rm -v "$(pwd):/work" ghcr.io/bibymaths/pynetworkin:latest predict /work/input.fasta

Or use the provided Compose file:

docker compose up -d
docker compose exec networkin pynetworkin predict /work/input.fasta

Usage

CLI

pynetworkin predict <FASTA-file> [options]

| Argument / Option | Default | Description |
|---|---|---|
| FASTA-file | (required) | Input FASTA or phosphosite file |
| --output / -o | <input>.networkin.tsv | Output file path |
| --format / -f | tsv | Output format: tsv or sif |
| --species | 9606 | NCBI taxonomy ID (9606 = human, 4932 = yeast) |
| --refresh / -r | off | Force re-fetch of cached network data |
| --verbose / -v | off | Enable verbose logging |

Example

pynetworkin predict data_MaxQuant_sample/test.fasta --output results/test.networkin.tsv

With --output set as above, results are written to results/test.networkin.tsv. If --output is omitted, the default path is <input>.networkin.tsv.
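
For scripted or batch runs, the documented options can be assembled into an argument list and passed to `subprocess.run`. This is a minimal sketch: `build_predict_command` and `run_predict` are hypothetical helpers, only flags from the options table are used, and `--refresh` / `--verbose` are assumed to be plain on/off switches.

```python
import subprocess


def build_predict_command(fasta, output=None, fmt="tsv", species="9606",
                          refresh=False, verbose=False):
    """Assemble argv for `pynetworkin predict` from the documented options."""
    if fmt not in ("tsv", "sif"):
        raise ValueError("format must be 'tsv' or 'sif'")
    command = ["pynetworkin", "predict", str(fasta),
               "--format", fmt, "--species", str(species)]
    if output is not None:
        command += ["--output", str(output)]
    if refresh:
        command.append("--refresh")  # assumed to be a boolean switch
    if verbose:
        command.append("--verbose")  # assumed to be a boolean switch
    return command


def run_predict(fasta, **options):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_predict_command(fasta, **options), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file paths that contain spaces.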

Other commands

pynetworkin info       # Show runtime/package information
pynetworkin cache      # Show cache contents
pynetworkin cache --clear  # Clear cached network data

Python API

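from pynetworkin import AppConfig, run_pipeline

# Mirror the CLI example: human proteome, FASTA input, no separate sites file.
config = AppConfig(
    organism="9606",  # NCBI taxonomy ID (9606 = human, 4932 = yeast)
    fasta_path="data_MaxQuant_sample/test.fasta",
    sites_path=None,  # no phosphosite file supplied
    datadir="data",
    blast_dir="",
)
results = run_pipeline(config)
# The returned mapping reports the number of predictions and the output path.
print(results["prediction_count"], "predictions written to", results["output_path"])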

Input formats

FASTA file

Standard FASTA format. Protein IDs are taken as everything between > and the first _ on the header line.
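
The header rule can be sketched as follows. This is an illustrative parser, not the package's own reader: `protein_id_from_header` and `read_fasta` are hypothetical names, and the behaviour for headers without an underscore (the whole header text after `>` is used) is an assumption.

```python
def protein_id_from_header(header):
    """Return the text between '>' and the first '_' of a FASTA header line."""
    if not header.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    # Assumption: with no '_' present, the full header text becomes the ID.
    return header[1:].rstrip("\r\n").split("_", 1)[0]


def read_fasta(lines):
    """Parse an iterable of FASTA lines into {protein_id: sequence}."""
    chunks_by_id = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = protein_id_from_header(line)
            chunks_by_id[current_id] = []  # a repeated ID replaces the earlier record
        elif current_id is not None:
            chunks_by_id[current_id].append(line)
    return {pid: "".join(chunks) for pid, chunks in chunks_by_id.items()}
```

One consequence of this rule is that two headers sharing the same prefix before the first underscore map to the same protein ID, so IDs in the input should be unique up to that point.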

Sites file (auto-detected)

| Format | Detection | Description |
|---|---|---|
| NetworKIN TSV | 3-column TSV | protein_id \t position \t residue |
| ProteomeDiscoverer | 2-column | protein_id \t phosphopeptide (phosphosites in lowercase) |
| MaxQuant | Column header Proteins + Leading | Direct MaxQuant phosphosite output |
| Space-separated | column 2 = phospho | Space-separated with residue+position in col 2 |
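
The detection rules can be approximated by inspecting the first line of the file. The sketch below is a hypothetical heuristic (`guess_site_format` is not the package's `detect_site_file_type`, whose exact logic may differ); the returned labels are also invented for illustration.

```python
def guess_site_format(first_line):
    """Classify a sites file from its first line (illustrative heuristic)."""
    line = first_line.rstrip("\r\n")
    tab_fields = line.split("\t")
    # MaxQuant exports carry a header row with "Proteins" and "Leading ..." columns,
    # so this check must come before the column-count checks.
    if "Proteins" in tab_fields and any(f.startswith("Leading") for f in tab_fields):
        return "maxquant"
    if len(tab_fields) == 3:
        return "networkin_tsv"        # protein_id, position, residue
    if len(tab_fields) == 2:
        return "proteome_discoverer"  # protein_id, phosphopeptide
    if len(tab_fields) == 1 and len(line.split()) >= 2:
        return "space_separated"      # no tabs, whitespace-delimited fields
    return "unknown"
```

For example, `guess_site_format("P04637\t15\tS")` yields `"networkin_tsv"`, while a line with no tabs such as `"P04637 S15"` falls through to `"space_separated"`.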

Output format

Results TSV columns:

| Column | Description |
|---|---|
| Name | Target protein ID |
| Position | Phosphosite position in the protein |
| Tree | NetPhorest tree (KIN, SH2, PTP, 1433, …) |
| Motif Group | NetPhorest classifier group |
| Kinase/Phosphatase/Phospho-binding domain | Predicted enzyme |
| NetworKIN score | Integrated Bayesian score (≥ 0.02 reported) |
| Motif probability | Raw NetPhorest posterior |
| STRING score | STRING best-path proximity score |
| Target STRING ID | Ensembl protein ID of the substrate |
| Kinase STRING ID | Ensembl protein ID of the enzyme |
| Target Name | Human-readable substrate name |
| Kinase Name | Human-readable enzyme name |
| Target description | STRING functional description of substrate |
| Kinase description | STRING functional description of enzyme |
| Peptide sequence window | ±7 aa window around the phosphosite |
| Intermediate nodes | Best-path intermediate proteins in STRING |
| recovered | True if recovered by the false-negative recovery step |
| recovery_method | Method used for recovery (e.g. context_proximity) |
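
A results file can be post-processed with the standard library alone. The sketch below assumes the TSV has a header row whose names match the column names listed above; `top_predictions` is a hypothetical helper, and the example rows are made up and use only a subset of the columns.

```python
import csv
import io


def top_predictions(handle, min_score=1.0, limit=10):
    """Return up to `limit` rows with NetworKIN score >= min_score, best first."""
    rows = [
        row for row in csv.DictReader(handle, delimiter="\t")
        if float(row["NetworKIN score"]) >= min_score
    ]
    rows.sort(key=lambda row: float(row["NetworKIN score"]), reverse=True)
    return rows[:limit]


# Tiny in-memory example with invented values.
example = io.StringIO(
    "Name\tPosition\tKinase Name\tNetworKIN score\n"
    "SUB1\t15\tKIN_A\t3.2\n"
    "SUB1\t15\tKIN_B\t0.4\n"
    "SUB2\t88\tKIN_C\t7.9\n"
)
best = top_predictions(example, min_score=1.0)
```

For a file on disk, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.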

Repository structure

src/
  pynetworkin/          # Core pipeline package
    __init__.py         # Public API (AppConfig, run_pipeline)
    networkin.py        # Main pipeline: AppConfig, run_pipeline, detect_site_file_type, …
    motif_scoring.py    # pynetphorest batch scorer wrapper
    graph_scoring.py    # STRING network context scoring & prediction ranking
    likelihood.py       # Bayesian likelihood conversion tables
    logger.py           # Loguru/Rich logging wrapper
    output.py           # TSV / Cytoscape SIF output writers
    recovery.py         # False-negative recovery via network proximity
    cli.py              # Typer CLI entry-point
    inputs/
      phosphosites.py   # OmniPath / PhosphoSitePlus / fallback fetcher
      string_network.py # STRING flat-file / REST API / fallback fetcher
scripts/
  backup.py                  # Legacy NetworKIN 3.0 reference script (Python 3 port)
  cleanup_HGNC_mapping.py    # HGNC symbol–Ensembl ID reconciliation utility
  generate_sample_data.py    # Generate offline fallback data files
  migrate_to_parquet.py      # Migrate legacy .txt conversion tables → Parquet
data/
  conversion_direct.parquet   # Pre-built likelihood tables (direct STRING paths)
  conversion_indirect.parquet # Pre-built likelihood tables (indirect STRING paths)
  fallback/                   # Bundled offline sample data
  string_data/                # STRING interaction flat files
tests/
  conftest.py             # pytest path setup (adds src/ to sys.path)
  test_motif_scoring.py
  test_output.py
  test_recovery.py
  test_networkin.py       # Tests for load_conversion_tables, detect_site_file_type, run_pipeline

See ARCHITECTURE.md for a detailed description of the execution flow.


Data sources

  • pynetphorest: kinase-group motif models (Python package).
  • STRING v12: human protein interactions and sequences. Downloaded from string-db.org.
  • OmniPath: phosphorylation site reference data (fetched live, cached locally).
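
The "fetched live, cached locally" behaviour, together with a forced refresh, follows a common cache-or-fetch pattern. The sketch below is a generic illustration under stated assumptions: `load_cached` is a hypothetical helper, the JSON cache format is chosen for simplicity, and none of this reflects how pynetworkin actually stores its cache.

```python
import json
from pathlib import Path


def load_cached(cache_path, fetch, refresh=False):
    """Return cached JSON data, calling `fetch()` only when needed.

    `fetch` is any zero-argument callable returning JSON-serialisable data.
    Setting `refresh=True` ignores an existing cache file and rewrites it.
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not refresh:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With this pattern, the network is contacted only on the first call or when a refresh is requested, which keeps repeated runs fast and usable offline.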

This repository provides a modern reimplementation of the NetworKIN framework.

  • Original NetworKIN was described in: Linding et al., Cell 2007

  • This implementation:

    • Does NOT reuse original NetworKIN source code
    • Replaces NetPhorest with pynetphorest
    • Uses a rewritten likelihood model
    • Implements a new modular pipeline

License: MIT
