Python-native implementation of selected PhosR-style phosphoproteomics workflows.

PhosPy

PhosPy 1.0.0 is an unofficial Python implementation of selected PhosR-style workflows for phosphoproteomics.

It is designed for people who want a small, Python-native way to:

  • preprocess phosphoproteomics tables
  • analyse kinase activity from predMat
  • run a native kinase workflow from scoring through prediction

PhosPy is deliberately narrow. It is not a full replacement for the R PhosR package.

Install

PhosPy supports Python 3.10 and newer.

Install the supported Python API and the phospy CLI:

pip install phospy

The file-path examples below use examples/data/..., so they assume you are working from a repository checkout. If you installed from PyPI, use the same code with paths to your own input files instead.

What You Can Do With PhosPy

Preprocess Phosphoproteomics Data

Start from total and phospho input tables and produce corrected phosphosite matrices for downstream use.

Analyse Kinase Activity From predMat

Generate weighted activity scores, KSEA-style summaries, and target counts from predicted kinase–substrate relationships.

Run a Native Kinase Workflow

Construct substrate profiles, score motifs, combine evidence, select candidates, and perform adaptive SVM-based kinase prediction.

Supported Public API for 1.0.0

The stable root-level API for 1.0.0 is intentionally small:

  • PhosphoDataset
  • PhosRPipeline
  • analyze_kinase_activity
  • KinaseWorkflow

Returned result dataclasses:

  • CoreProcessingResult
  • SiteMatrixResult
  • CoreOutputs
  • KinaseActivityResult
  • KinasePredictionResult
  • KinaseWorkflowResult

The examples below use only those imports.

For a compact guide to the supported classes, methods, and result objects, see docs/api.md.

Input Tables at a Glance

PhosPy expects a small, fixed set of input shapes.

Total-proteome table

Required columns:

  • genes
  • group1 to group6

Phosphoproteome table

Required columns:

  • uid
  • gene_names
  • gene_p_site
  • localization_prob
  • centralized_sequence
  • p_group1 to p_group6

gene_p_site must look like GENE_SITE, for example PRKACA_S339.
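If you want to check gene_p_site values before handing a table to PhosPy, a small pre-check along these lines may help. This is a sketch, not PhosPy's own validator: the helper name split_gene_p_site is hypothetical, and restricting the residue letter to S, T, or Y is an assumption based on the PRKACA_S339 example.

```python
import re

# Assumed shape: gene symbol, underscore, residue letter (S/T/Y), position.
# The S/T/Y restriction is an assumption, not a documented PhosPy rule.
GENE_P_SITE = re.compile(r"^(?P<gene>[^_\s]+)_(?P<residue>[STY])(?P<position>\d+)$")


def split_gene_p_site(value: str) -> tuple[str, str, int]:
    """Split a GENE_SITE string into (gene, residue, position)."""
    match = GENE_P_SITE.match(value)
    if match is None:
        raise ValueError(f"{value!r} does not look like GENE_SITE")
    return match["gene"], match["residue"], int(match["position"])


gene, residue, position = split_gene_p_site("PRKACA_S339")
```

Running a check like this over the gene_p_site column surfaces malformed rows early, before they reach preprocessing.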

predMat

predMat must be a numeric matrix with:

  • phosphosite IDs as the index, for example BTK;Y551;
  • kinase names as columns
  • scores in the range [0, 1]
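The three requirements above can be pre-checked with plain pandas. The sketch below is illustrative only: check_pred_mat is a hypothetical helper rather than part of the PhosPy API, and the toy scores and the second site ID are made up for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_pred_mat(pred_mat: pd.DataFrame) -> None:
    """Raise ValueError if pred_mat is not numeric or leaves [0, 1]."""
    non_numeric = [c for c in pred_mat.columns if not is_numeric_dtype(pred_mat[c])]
    if non_numeric:
        raise ValueError(f"non-numeric kinase columns: {non_numeric}")
    values = pred_mat.to_numpy(dtype=float)
    # NaN compares as False on both sides, so missing scores are not
    # flagged here; handle them separately if that matters to you.
    if ((values < 0.0) | (values > 1.0)).any():
        raise ValueError("predMat scores must lie in [0, 1]")


# Toy matrix: phosphosite IDs as the index, kinase names as columns.
toy = pd.DataFrame(
    {"PRKACA": [0.10, 0.92], "BTK": [0.85, 0.05]},
    index=["BTK;Y551;", "PRKACA;S339;"],
)
check_pred_mat(toy)  # passes silently
```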

When you load tables from files, PhosPy normalises input headers to lowercase snake case before validation. For example, Gene Names and gene-names both become gene_names. That makes file input a little more forgiving, but it also means loading fails if two raw headers collapse to the same cleaned name.

If you build PhosphoDataset from in-memory pandas data frames instead, those column names are validated as provided.
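To make the header-cleaning behaviour concrete, here is a minimal sketch of lowercase snake-case cleaning with collision detection. It is an approximation written for this README, not the code PhosPy runs: clean_header and clean_headers are hypothetical names, and details such as how camelCase headers are treated are not covered.

```python
import re


def clean_header(raw: str) -> str:
    """Collapse runs of non-alphanumeric characters to '_' and lowercase."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", raw.strip())
    return cleaned.strip("_").lower()


def clean_headers(raw_headers: list[str]) -> list[str]:
    """Clean every header and fail if two raw headers collapse together."""
    seen: dict[str, str] = {}
    cleaned = []
    for raw in raw_headers:
        name = clean_header(raw)
        if name in seen:
            raise ValueError(
                f"{seen[name]!r} and {raw!r} both clean to {name!r}"
            )
        seen[name] = raw
        cleaned.append(name)
    return cleaned


columns = clean_headers(["Gene Names", "UID", "Localization Prob"])
```

Passing both Gene Names and gene-names in the same list raises ValueError, which mirrors the loading failure described above.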

Quick Start

The quickest way to get started from a source checkout is to use the bundled example data in examples/data/.

Core Preprocessing

from phospy import PhosphoDataset

dataset = PhosphoDataset.from_files(
    "examples/data/total.tsv",
    "examples/data/phospho.tsv",
    phospho_encoding="utf-16le",
)
core = dataset.process_core(max_unmatched_fraction=0.1)

site_matrix = core.site_matrix.matrix
corrected = core.phospho_corrected

For the bundled example data, site_matrix.index.tolist() is ['BTK;Y551;'].

process_core() returns a CoreProcessingResult with:

  • total_unique
  • total_filtered
  • phospho_filtered
  • phospho_corrected
  • site_matrix

If your analysis needs explicit pairwise comparisons, pass them when you build the dataset:

from phospy import PhosphoDataset

dataset = PhosphoDataset.from_files(
    "examples/data/total.tsv",
    "examples/data/phospho.tsv",
    phospho_encoding="utf-16le",
    comparisons=[("group1", "group4"), ("group2", "group5")],
)
core = dataset.process_core(max_unmatched_fraction=0.1)

If you do not pass comparisons, preprocessing still runs normally and no extra pairwise columns are added.

Downstream Kinase Analysis From predMat

from phospy import PhosphoDataset, analyze_kinase_activity
import pandas as pd

dataset = PhosphoDataset.from_files(
    "examples/data/total.tsv",
    "examples/data/phospho.tsv",
    phospho_encoding="utf-16le",
)
core = dataset.process_core(max_unmatched_fraction=0.1)
pred_mat = pd.read_csv("examples/data/predMat.csv", index_col=0)

kinase = analyze_kinase_activity(
    pred_mat=pred_mat,
    phospho_matrix=core.site_matrix.matrix,
    threshold=0.6,
    min_substrates=1,
    top_n_substrates=1,
)

target_counts = kinase.target_counts
ksea_scores = kinase.ksea_scores

The bundled example uses min_substrates=1 and top_n_substrates=1 because the example matrix is intentionally tiny. For larger real datasets, the defaults (min_substrates=3, top_n_substrates=20) are usually the better starting point.

For the bundled example data, target_counts.to_dict() is {'PRKACA': 3, 'BTK': 2}.

analyze_kinase_activity(...) returns a KinaseActivityResult with:

  • weighted_activity
  • ksea_scores
  • ksea_counts
  • target_counts
  • target_table
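One plausible reading of how threshold and min_substrates interact is sketched below: a site counts as a target of a kinase when its predMat score reaches the threshold, and kinases with too few targets are dropped. This is an assumption for illustration, not a description of PhosPy internals. count_targets is a hypothetical helper, and the toy values are invented and are not the bundled predMat.

```python
import pandas as pd


def count_targets(
    pred_mat: pd.DataFrame, threshold: float, min_substrates: int = 3
) -> pd.Series:
    """Count sites scoring >= threshold per kinase, then drop sparse kinases."""
    counts = (pred_mat >= threshold).sum(axis=0)
    return counts[counts >= min_substrates].sort_values(ascending=False)


# Invented scores for four hypothetical sites and two kinases.
toy = pd.DataFrame(
    {"PRKACA": [0.90, 0.70, 0.65, 0.20], "BTK": [0.10, 0.80, 0.61, 0.30]},
    index=["site1", "site2", "site3", "site4"],
)

lenient = count_targets(toy, threshold=0.6, min_substrates=1)
strict = count_targets(toy, threshold=0.6, min_substrates=3)
```

With min_substrates=1 both kinases survive; with the default of 3 only the kinase with three qualifying sites remains, which is why a tiny example matrix needs the lower setting.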

End-to-End Pipeline

from phospy import PhosRPipeline

pipeline = PhosRPipeline.from_files(
    total_path="examples/data/total.tsv",
    phospho_path="examples/data/phospho.tsv",
    pred_mat_path="examples/data/predMat.csv",
    phospho_encoding="utf-16le",
    max_unmatched_fraction=0.1,
)
outputs = pipeline.run(outdir="examples/output")

outputs is a CoreOutputs object with:

  • outputs.core
  • outputs.kinase_activity

This writes the core CSV outputs together with downstream kinase-analysis tables, including:

  • df_total_unique.csv
  • df_total_filtered.csv
  • df_phospho_filtered.csv
  • df_phospho_corrected.csv
  • phosr_input.csv
  • mat_phospho_corrected.csv
  • site_sequences.csv
  • kinase_activity_matrix.csv
  • ksea_scores.csv
  • ksea_counts.csv
  • kinase_target_counts.csv
  • kinase_target_table.csv

If you omit pred_mat_path, the pipeline still runs the core preprocessing path and simply skips the downstream kinase-analysis outputs.
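After a run, you may want to confirm which of the listed files actually landed in the output directory. The sketch below uses only the standard library. The split into core and kinase file groups is an assumption inferred from the file names, the list above is introduced with "including" and so may not be exhaustive, and missing_outputs is a hypothetical helper.

```python
from pathlib import Path

# Assumed grouping, inferred from the file names listed above.
CORE_FILES = [
    "df_total_unique.csv",
    "df_total_filtered.csv",
    "df_phospho_filtered.csv",
    "df_phospho_corrected.csv",
    "phosr_input.csv",
    "mat_phospho_corrected.csv",
    "site_sequences.csv",
]
KINASE_FILES = [
    "kinase_activity_matrix.csv",
    "ksea_scores.csv",
    "ksea_counts.csv",
    "kinase_target_counts.csv",
    "kinase_target_table.csv",
]


def missing_outputs(outdir: str, with_kinase: bool = True) -> list[str]:
    """Return the expected file names that are absent from outdir."""
    expected = CORE_FILES + (KINASE_FILES if with_kinase else [])
    return [name for name in expected if not (Path(outdir) / name).is_file()]
```

For a run without pred_mat_path, call missing_outputs("examples/output", with_kinase=False) so the skipped kinase tables are not reported as missing.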

Native End-to-End Kinase Workflow

A complete runnable native-workflow example is included at examples/native_workflow_demo.py.

If PhosPy is installed in the environment, for example with pip install phospy or pip install -e . from a local checkout, you can run it directly:

python examples/native_workflow_demo.py

From a local checkout, there is also a Make target that runs the example with the repository src/ path configured for that shell session:

make native-workflow-demo

That example uses only the supported 1.0.0 root API and prints a small prediction matrix for a synthetic two-kinase setup.

The native workflow expects:

  • a phosphosite matrix
  • a substrate_map
  • site_sequences keyed by phosphosite ID when motif scoring is used
  • motif_sequences for end-to-end motif-aware prediction

site_sequences can be passed as either a mapping keyed by phosphosite ID or a pandas Series with a phosphosite index. If you want profile-only prediction, pass allow_profile_only_fallback=True and omit motif_sequences.
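The two accepted site_sequences shapes carry the same information. The sketch below shows one way to coerce either into a plain dict keyed by phosphosite ID in your own code. as_sequence_dict is a hypothetical helper, not a PhosPy function, and the sequence string is a placeholder rather than real data.

```python
from collections.abc import Mapping

import pandas as pd


def as_sequence_dict(site_sequences) -> dict[str, str]:
    """Coerce a mapping or a pandas Series into {phosphosite_id: sequence}."""
    if isinstance(site_sequences, (pd.Series, Mapping)):
        # Both types expose .items() yielding (phosphosite_id, sequence).
        return {str(site): str(seq) for site, seq in site_sequences.items()}
    raise TypeError("site_sequences must be a mapping or a pandas Series")


placeholder = "XXXXXXYXXXXXX"  # placeholder window, not a real sequence
from_mapping = as_sequence_dict({"BTK;Y551;": placeholder})
from_series = as_sequence_dict(pd.Series({"BTK;Y551;": placeholder}))
```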

Command-Line Demo

After installation, you can run the CLI on your own files. The example below uses the bundled tables from a source checkout:

phospy \
  --total examples/data/total.tsv \
  --phospho examples/data/phospho.tsv \
  --pred-mat examples/data/predMat.csv \
  --phospho-encoding utf-16le \
  --max-unmatched-fraction 0.1 \
  --outdir examples/output

The example output directory in examples/output/ shows the generated CSV files.

The CLI currently supports these options:

  • --total and --phospho are required input files
  • --phospho-encoding optionally overrides the default utf-8 reader encoding
  • --outdir is the required output directory
  • --pred-mat is optional
  • --localization-threshold defaults to 0.75
  • --min-observed defaults to 4
  • --total-sentinel defaults to 10.0
  • --phospho-sentinel defaults to 12.0
  • --max-unmatched-fraction defaults to 0.0

--max-unmatched-fraction=0.0 means protein correction fails if the inner join would silently drop any phosphosite rows. Raise it only when you want to allow a small, bounded amount of row loss.
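The idea behind the bound can be sketched with pandas: measure the share of phosphosite rows whose gene has no partner in the total table, and refuse to continue if that share exceeds the limit. This is a simplified illustration, not PhosPy's implementation. The helper names are hypothetical, matching gene_names against genes is an assumption, and the gene-identifier normalisation that PhosPy applies before matching is left out.

```python
import pandas as pd


def unmatched_fraction(phospho: pd.DataFrame, total: pd.DataFrame) -> float:
    """Share of phospho rows whose gene is absent from the total table."""
    if len(phospho) == 0:
        return 0.0
    matched = phospho["gene_names"].isin(total["genes"])
    return float((~matched).mean())


def check_unmatched(
    phospho: pd.DataFrame, total: pd.DataFrame, max_unmatched_fraction: float = 0.0
) -> float:
    """Raise if an inner join would drop more rows than allowed."""
    fraction = unmatched_fraction(phospho, total)
    if fraction > max_unmatched_fraction:
        raise ValueError(
            f"{fraction:.0%} of phosphosite rows have no total-proteome match"
        )
    return fraction


# Toy tables: one of four phospho rows has no matching gene.
total = pd.DataFrame({"genes": ["BTK", "PRKACA"]})
phospho = pd.DataFrame({"gene_names": ["BTK", "BTK", "PRKACA", "UNKNOWN"]})
fraction = check_unmatched(phospho, total, max_unmatched_fraction=0.25)
```

With the default of 0.0 the same toy tables would raise, because one row would be lost in the join.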

The CLI is intentionally small in 1.0.0. It does not currently expose pairwise comparison generation or the native KinaseWorkflow path.

Validation Rules Worth Knowing

A few checks are especially useful to know up front:

  • localization_prob must stay within [0, 1].
  • predMat values must stay within [0, 1].
  • Headers of file-loaded total and phospho tables are cleaned to lowercase snake case before validation, so raw headers that collapse to the same cleaned name are rejected.
  • predMat and the phosphosite matrix must overlap by at least one phosphosite row, and that overlap must cover at least 10% of the phosphosite matrix.
  • Protein correction normalises gene identifiers before matching and, by default, refuses to drop unmatched phosphosite rows.
  • Site-matrix construction drops rows with missing sequences or incomplete corrected values, then deduplicates repeated phosphosites by keeping the row with the highest mean corrected signal.
  • In the native workflow, motif_sequences require matching site_sequences. If you omit motif data entirely, set allow_profile_only_fallback=True.
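Two of the checks above, the predMat overlap rule and the deduplication of repeated phosphosites, can be sketched with pandas as follows. Both functions are hypothetical illustrations under simplifying assumptions, not PhosPy's code: the dedup sketch only drops rows with incomplete values (the missing-sequence filter is omitted) and returns rows ordered by mean signal.

```python
import pandas as pd


def check_overlap(
    pred_mat: pd.DataFrame, site_matrix: pd.DataFrame, min_fraction: float = 0.10
) -> float:
    """Require shared phosphosite rows covering >= min_fraction of site_matrix."""
    shared = site_matrix.index.intersection(pred_mat.index)
    if len(shared) == 0:
        raise ValueError("predMat and the phosphosite matrix share no rows")
    fraction = len(shared) / len(site_matrix)
    if fraction < min_fraction:
        raise ValueError(f"overlap {fraction:.0%} is below {min_fraction:.0%}")
    return fraction


def keep_strongest(corrected: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, then keep the highest-mean row per phosphosite."""
    complete = corrected.dropna()
    ranked = complete.assign(_mean=complete.mean(axis=1).to_numpy())
    ranked = ranked.sort_values("_mean", ascending=False)
    # After sorting, the first occurrence of each ID has the highest mean.
    first = ~ranked.index.duplicated(keep="first")
    return ranked[first].drop(columns="_mean")


# Toy corrected values with a repeated site and one incomplete row.
corrected = pd.DataFrame(
    {"p_group1": [1.0, 3.0, 2.0, 5.0], "p_group2": [1.0, 3.0, 2.0, None]},
    index=["A;S1;", "A;S1;", "B;T2;", "C;Y3;"],
)
deduped = keep_strongest(corrected)
```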

Where to Go Next

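If you want more detail, the follow-on docs referenced in this README are:

  • docs/api.md for a compact guide to the supported classes, methods, and result objects
  • docs/validation-and-parity.md and docs/parity.md for the fixture-backed validation seams and parity output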

If you want to contribute or work from a local checkout, see CONTRIBUTING.md.

Known Limitations

A few boundaries are worth knowing up front:

  • Selective scope only. PhosPy 1.0.0 covers the workflows documented above and nothing broader.
  • Parity is seam-level, not package-wide. Validation claims are limited to the committed fixture-backed seams described in docs/validation-and-parity.md and docs/parity.md.
  • KinaseWorkflow is native first. It includes an svm_mode="r_parity" option for narrower learner-seam comparison, but the default mode is the preferred Python-native path and is not claimed to numerically match every PhosR result.
  • The CLI is intentionally small. It covers the core preprocessing and predMat-driven downstream path. The native kinase workflow is currently exposed through the Python API and example script.
  • R is only required for fixture regeneration. You do not need R to install PhosPy or run the committed Python test suite.

For Contributors

Most users can ignore this section.

To work from a local checkout:

pip install -e .

To run tests:

pip install -e ".[test]"
pytest -m "not parity"
pytest -m parity

If you want the parity suite to print its optional comparison metrics while you debug a seam, these environment variables are available:

  • PHOSPY_SHOW_PARITY: master switch for parity metrics output
  • PHOSPY_SHOW_PROFILE_CONSTRUCTION: also print the optional profile-construction metrics
  • PHOSPY_SHOW_PREDICTION_MODE_COMPARISON: also print default-versus-r_parity prediction comparison metrics
  • PHOSPY_SHOW_REPLAYED_PREDICTION_MODE_COMPARISON: also print replayed prediction comparison metrics

The three more specific flags only do anything when PHOSPY_SHOW_PARITY is enabled first. Truthy values are case-insensitive and include 1, true, yes, and on.

To see the printed summaries in the terminal, run pytest with -s (or --capture=no). If you enable all four flags and run the full parity suite, PhosPy prints every available metrics block reached by those tests.

Linux or macOS quick example:

PHOSPY_SHOW_PARITY=1 \
PHOSPY_SHOW_PROFILE_CONSTRUCTION=1 \
PHOSPY_SHOW_PREDICTION_MODE_COMPARISON=1 \
PHOSPY_SHOW_REPLAYED_PREDICTION_MODE_COMPARISON=1 \
  pytest -m parity -s

For Linux/macOS, PowerShell, and Command Prompt examples together with a sample of the bundled parity output, see docs/parity.md.

To run the usual contributor checks:

pip install -e ".[dev]"
pre-commit install
pre-commit run --all-files

R Requirements for Fixture Regeneration

The committed parity fixtures are already included in the repository. You only need R if you want to regenerate or extend them.
