Python-native implementation of selected PhosR-style phosphoproteomics workflows.
PhosPy
PhosPy is an unofficial Python implementation of selected PhosR-style workflows for phosphoproteomics.
It is designed for people who want a small, Python-native way to:
- preprocess phosphoproteomics tables
- analyse kinase activity from predMat
- run a native kinase workflow from scoring through prediction
PhosPy is deliberately narrow. It is not a full replacement for the R PhosR package.
Install
PhosPy supports Python 3.10 and newer.
Install the supported Python API and the phospy CLI:
pip install phospy
The file-path examples below use examples/data/..., so they assume you are working from a
repository checkout. If you installed from PyPI, use the same code with paths to your own input files.
What You Can Do With PhosPy
Preprocess Phosphoproteomics Data
Start from total and phospho input tables and produce corrected phosphosite matrices for downstream use.
Analyse Kinase Activity From predMat
Generate weighted activity scores, KSEA-style summaries, and target counts from predicted kinase–substrate relationships.
Run a Native Kinase Workflow
Construct substrate profiles, score motifs, combine evidence, select candidates, and perform adaptive SVM-based kinase prediction.
Supported Public API
The stable API is intentionally small:
- PhosphoDataset
- PhosRPipeline
- KinaseActivityAnalyzer
- KinaseWorkflow
Returned result dataclasses:
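- CoreProcessingResult
- SiteMatrixResult
- CoreOutputs
- KinaseActivityResult
- KinasePredictionResult
- KinaseWorkflowResult
The examples below import these classes from the top-level phospy package, together with CoreOutputWriter and the helpers in phospy.preprocessing where noted.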
For a compact guide to the supported classes, methods, and result objects, see docs/api.md.
Input Tables at a Glance
PhosPy expects a small, fixed set of input shapes.
Total Proteome Table
Required columns:
- genes
- group1 to group6
Phosphoproteome Table
Required columns:
- uid
- gene_names
- gene_p_site
- localization_prob
- centralized_sequence
- p_group1 to p_group6
gene_p_site must look like GENE_SITE, for example PRKACA_S339.
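As an illustration of the GENE_SITE shape, here is a minimal sketch that splits an identifier into its parts. The regular expression, including the restriction to S, T, or Y residues, and the function name parse_gene_p_site are assumptions made for this example; they are not PhosPy's own validation code.

```python
import re

# Assumed pattern: gene symbol, underscore, one residue letter (S, T or Y), position.
SITE_PATTERN = re.compile(
    r"^(?P<gene>[A-Za-z0-9.\-]+)_(?P<residue>[STY])(?P<position>\d+)$"
)


def parse_gene_p_site(value: str) -> tuple[str, str, int]:
    """Split a GENE_SITE identifier into (gene, residue, position)."""
    match = SITE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not in GENE_SITE form: {value!r}")
    return match["gene"], match["residue"], int(match["position"])


print(parse_gene_p_site("PRKACA_S339"))
```

A check like this can be run over the gene_p_site column before loading to find malformed identifiers early.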
predMat
predMat must be a numeric matrix with:
- phosphosite IDs as the index, for example BTK;Y551;
- kinase names as columns
- scores in the range [0, 1]
On disk, PhosphoDataset.from_files(...), PhosRPipeline.from_files(...), and the CLI read the total and phospho
inputs as tab-delimited text tables. predMat is read separately as CSV with the first column used as the phosphosite
index.
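To make the predMat layout concrete, the sketch below reads a CSV whose first column is the phosphosite ID and checks that every score lies in [0, 1]. It uses only the standard library; read_pred_mat is a hypothetical helper written for this example, and the two scores in the toy input are made up.

```python
import csv
import io


def read_pred_mat(handle) -> dict[str, dict[str, float]]:
    """Read a predMat-style CSV: first column is the site ID, the rest are kinases."""
    reader = csv.reader(handle)
    kinases = next(reader)[1:]  # header row; the first cell labels the index
    matrix: dict[str, dict[str, float]] = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        site, values = row[0], [float(cell) for cell in row[1:]]
        if len(values) != len(kinases):
            raise ValueError(f"row {site!r} has {len(values)} scores, expected {len(kinases)}")
        # NaN fails this comparison too, so non-numeric gaps are rejected as well.
        if any(not 0.0 <= value <= 1.0 for value in values):
            raise ValueError(f"row {site!r} has a score outside [0, 1]")
        matrix[site] = dict(zip(kinases, values))
    return matrix


# Toy input: one phosphosite row, two kinase columns.
example = io.StringIO(",PRKACA,BTK\nBTK;Y551;,0.2,0.9\n")
pred_mat = read_pred_mat(example)
print(pred_mat["BTK;Y551;"]["BTK"])
```

With pandas, the same layout is usually loaded with pandas.read_csv(path, index_col=0).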
When you load tables from files, PhosPy normalises input headers to lowercase snake case before validation. For example,
Gene Names and gene-names both become gene_names. That makes file input a little more forgiving, but it also
means loading fails if two raw headers collapse to the same cleaned name.
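The header clean-up and the collision failure can be sketched as follows. This is one plausible reading of "lowercase snake case", written for illustration; the function names and the exact regular expression are assumptions, not PhosPy's implementation.

```python
import re
from collections import defaultdict


def clean_header(raw: str) -> str:
    """Lowercase a header and collapse runs of non-alphanumerics into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", raw.strip().lower()).strip("_")


def clean_headers(raw_headers: list[str]) -> list[str]:
    """Clean every header, raising if two raw names collapse to the same cleaned name."""
    seen: dict[str, list[str]] = defaultdict(list)
    for raw in raw_headers:
        seen[clean_header(raw)].append(raw)
    collisions = {name: raws for name, raws in seen.items() if len(raws) > 1}
    if collisions:
        raise ValueError(f"headers collide after cleaning: {collisions}")
    return [clean_header(raw) for raw in raw_headers]


print(clean_headers(["UID", "Gene Names", "Localization Prob"]))
```

Passing both "Gene Names" and "gene-names" in the same list raises, because each cleans to gene_names.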
If you build PhosphoDataset from in-memory pandas data frames instead, those column names are validated as provided.
Quick Start
The quickest way to get started from a source checkout is to use the bundled example data in examples/data/.
Core Preprocessing
from phospy import CoreOutputWriter, PhosphoDataset

dataset = PhosphoDataset.from_files(
    "examples/data/total.tsv",
    "examples/data/phospho.tsv",
    phospho_encoding="utf-16le",
)
core = dataset.preprocessing.run(max_unmatched_fraction=0.1)

writer = CoreOutputWriter()
writer.write(core, outdir="examples/output", format="csv")
# Use format="tsv" or format="parquet" for alternative core output bundles.

site_matrix = core.site_matrix.matrix
corrected = core.phospho_corrected
For the bundled example data, site_matrix.index.tolist() returns ['BTK;Y551;'].
dataset.preprocessing is the bound preprocessing facade for the dataset and the preferred public entrypoint for core preprocessing. Use dataset.preprocessing.run(...) as the routine API. CoreOutputWriter is the canonical public API for persisting core preprocessing outputs.
dataset.preprocessing.run() returns a CoreProcessingResult with:
- total_unique
- total_filtered
- phospho_filtered
- phospho_corrected
- site_matrix
If your analysis needs explicit pairwise comparisons, pass them when you build the dataset:
from phospy import PhosphoDataset

dataset = PhosphoDataset.from_files(
    "examples/data/total.tsv",
    "examples/data/phospho.tsv",
    phospho_encoding="utf-16le",
    comparisons=[("group1", "group4"), ("group2", "group5")],
)
core = dataset.preprocessing.run(max_unmatched_fraction=0.1)
If you do not pass comparisons, preprocessing still runs normally and no extra pairwise columns are added.
If you only want the phosphosite localisation filter as a standalone preprocessing step, use the public helper in
phospy.preprocessing:
from phospy.preprocessing import filter_localized_sites

filtered = filter_localized_sites(phospho_df, threshold=0.75)

summary_result = filter_localized_sites(
    phospho_df,
    threshold=0.75,
    return_summary=True,
)
summary_result.filtered contains the retained rows and summary_result.summary reports how many rows were kept or
removed.
If you want to filter by observed data coverage before the broader workflow, use the standalone coverage helper:
from phospy.preprocessing import filter_sites_by_coverage

coverage_result = filter_sites_by_coverage(
    phospho_df,
    columns=["p_group1", "p_group2", "p_group3", "p_group4", "p_group5", "p_group6"],
    min_coverage=0.5,
    return_summary=True,
)
filter_localized_sites(...) removes sites with weak localisation evidence, while
filter_sites_by_coverage(...) removes sites with too many missing sample values. These standalone helpers are for
targeted advanced use; the preferred end-to-end preprocessing path remains dataset.preprocessing.run(...). Coverage filtering currently
operates across the sample columns you provide rather than from a separate group metadata model.
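To show what the two filters mean, here is a standard-library sketch over plain dictionaries. It is not the phospy.preprocessing code: the function names are hypothetical, and treating both thresholds as inclusive (>=) is an assumption of this example.

```python
import math


def is_missing(value) -> bool:
    """Treat None and NaN as missing sample values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_localized(rows, threshold=0.75):
    """Keep rows whose localization_prob is at least the threshold (assumed inclusive)."""
    return [row for row in rows if row["localization_prob"] >= threshold]


def keep_covered(rows, columns, min_coverage=0.5):
    """Keep rows where the observed fraction across `columns` is at least min_coverage."""
    kept = []
    for row in rows:
        observed = sum(not is_missing(row.get(column)) for column in columns)
        if observed / len(columns) >= min_coverage:
            kept.append(row)
    return kept


# Toy rows with four sample columns; the values are illustrative only.
columns = ["p_group1", "p_group2", "p_group3", "p_group4"]
rows = [
    {"uid": "a", "localization_prob": 0.99, "p_group1": 1.0, "p_group2": None, "p_group3": 2.0, "p_group4": 2.5},
    {"uid": "b", "localization_prob": 0.60, "p_group1": 1.0, "p_group2": 1.1, "p_group3": 0.9, "p_group4": 1.2},
    {"uid": "c", "localization_prob": 0.80, "p_group1": None, "p_group2": None, "p_group3": None, "p_group4": 1.0},
]
localized = keep_localized(rows)
covered = keep_covered(localized, columns)
print([row["uid"] for row in covered])
```

Row b fails the localisation threshold and row c has only one of four sample values observed, so only row a survives both steps.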
Downstream Kinase Analysis From predMat
KinaseActivityAnalyzer is the public orchestration layer for standalone downstream kinase analysis. Use it when you
already have a phosphosite matrix and a predMat and want the downstream kinase summary tables without going through
PhosRPipeline.
from phospy import KinaseActivityAnalyzer, PhosphoDataset

dataset = PhosphoDataset.from_files(
    "examples/data/total.tsv",
    "examples/data/phospho.tsv",
    phospho_encoding="utf-16le",
)
core = dataset.preprocessing.run(max_unmatched_fraction=0.1)

analyzer = KinaseActivityAnalyzer()
kinase = analyzer.load_and_analyze(
    pred_mat_path="examples/data/predMat.csv",
    phospho_matrix=core.site_matrix.matrix,
    threshold=0.6,
    min_substrates=1,
    top_n_substrates=1,
)
analyzer.write_outputs(kinase, outdir="examples/output")

target_counts = kinase.target_counts
ksea_scores = kinase.ksea_scores
The bundled example uses min_substrates=1 and top_n_substrates=1 because the example matrix is intentionally tiny.
For larger real datasets, the defaults (min_substrates=3, top_n_substrates=20) are usually the better starting
point.
For the bundled example data, target_counts.to_dict() is {'PRKACA': 3, 'BTK': 2}.
KinaseActivityAnalyzer.load_and_analyze(...) returns a KinaseActivityResult with:
- weighted_activity
- ksea_scores
- ksea_counts
- target_counts
- target_table
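For readers new to these summaries, the sketch below shows one common way to derive target counts and a KSEA-style score from a thresholded predMat. The z-score is the textbook KSEA form, (mean of substrates minus mean of all sites) times sqrt(m) divided by the standard deviation of all sites. Whether PhosPy computes its tables exactly this way is not stated here, so treat the formulas, function names, and toy numbers as assumptions.

```python
import math
from statistics import mean, stdev


def count_targets(pred_mat, threshold):
    """Count, per kinase, the sites whose predicted score is at least the threshold."""
    counts: dict[str, int] = {}
    for scores in pred_mat.values():
        for kinase, score in scores.items():
            if score >= threshold:
                counts[kinase] = counts.get(kinase, 0) + 1
    return counts


def ksea_z(site_values, substrates):
    """Textbook KSEA z-score for one kinase over one sample or contrast."""
    values = list(site_values.values())
    matched = [site_values[site] for site in substrates if site in site_values]
    if len(values) < 2 or not matched:
        return float("nan")
    spread = stdev(values)  # sample standard deviation across all sites
    if spread == 0:
        return float("nan")
    return (mean(matched) - mean(values)) * math.sqrt(len(matched)) / spread


# Toy inputs: three sites scored against two hypothetical kinases K1 and K2.
toy_pred_mat = {
    "s1": {"K1": 0.9, "K2": 0.1},
    "s2": {"K1": 0.7, "K2": 0.8},
    "s3": {"K1": 0.2, "K2": 0.65},
}
toy_values = {"s1": 1.0, "s2": -1.0, "s3": 0.0, "s4": 2.0}
print(count_targets(toy_pred_mat, threshold=0.6))
print(ksea_z(toy_values, substrates=["s1", "s4"]))
```

Parameters such as min_substrates would act as a guard before scoring: kinases with too few thresholded substrates are skipped rather than given an unstable score.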
End-to-End Pipeline
from phospy import PhosRPipeline

pipeline = PhosRPipeline.from_files(
    total_path="examples/data/total.tsv",
    phospho_path="examples/data/phospho.tsv",
    pred_mat_path="examples/data/predMat.csv",
    phospho_encoding="utf-16le",
    max_unmatched_fraction=0.1,
)
outputs = pipeline.run(outdir="examples/output")
outputs is a CoreOutputs object with:
- outputs.core
- outputs.kinase_activity
This writes the default core CSV outputs together with downstream kinase-analysis tables, including:
- df_total_unique.csv
- df_total_filtered.csv
- df_phospho_filtered.csv
- df_phospho_corrected.csv
- phosr_input.csv
- mat_phospho_corrected.csv
- site_sequences.csv
- kinase_activity_matrix.csv
- ksea_scores.csv
- ksea_counts.csv
- kinase_target_counts.csv
- kinase_target_table.csv
- run_manifest.json
run_manifest.json records a small summary of the run, including whether kinase activity outputs were produced, row
counts for the core tables, the preprocessing configuration, and the installed package version.
If you omit pred_mat_path, the pipeline still runs the core preprocessing path and simply skips the downstream
kinase-analysis outputs. For explicit non-CSV core persistence outside the pipeline, use CoreOutputWriter directly with format="tsv", format="csv", or format="parquet". Parquet output requires an installed pandas parquet engine such as pyarrow; the package now exposes this as the optional phospy[parquet] extra.
Native End-to-End Kinase Workflow
A complete runnable native-workflow example is included at
examples/native_workflow_demo.py.
If PhosPy is installed in the environment, for example with pip install phospy or pip install -e . from a local
checkout, you can run it directly:
python examples/native_workflow_demo.py
From a local checkout, there is also a Make target that runs the example with the repository src/ path configured for
that shell session:
make native-workflow-demo
That example uses only the supported API and prints a small prediction matrix for a synthetic two-kinase setup.
The native workflow expects:
- a phosphosite matrix
- a substrate_map
- site_sequences keyed by phosphosite ID when motif scoring is used
- motif_sequences for end-to-end motif-aware prediction
site_sequences can be passed as either a mapping keyed by phosphosite ID or a pandas Series with a phosphosite index.
If you want profile-only prediction, pass allow_profile_only_fallback=True and omit motif_sequences.
Command-Line Demo
After installation, you can run the CLI on your own files. The example below uses the bundled tables from a source checkout:
phospy \
    --total examples/data/total.tsv \
    --phospho examples/data/phospho.tsv \
    --pred-mat examples/data/predMat.csv \
    --phospho-encoding utf-16le \
    --max-unmatched-fraction 0.1 \
    --outdir examples/output
The checked-in example output directory in examples/output/ shows the generated CSV tables. A fresh CLI or pipeline
run also writes run_manifest.json to the chosen output directory.
The CLI currently supports these options:
- --total and --phospho are required tab-delimited input files
- --phospho-encoding optionally overrides the default utf-8 reader encoding
- --outdir is the required output directory
- --pred-mat is optional
- --localization-threshold defaults to 0.75
- --min-observed defaults to 4
- --total-sentinel defaults to 10.0
- --phospho-sentinel defaults to 12.0
- --max-unmatched-fraction defaults to 0.0
--max-unmatched-fraction=0.0 means protein correction fails if the inner join would silently drop any phosphosite
rows. Raise it only when you want to allow a small, bounded amount of row loss.
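The guard can be pictured as a check on the join before any rows are dropped. The sketch below is a simplified illustration, not PhosPy's correction code: match_with_guard is a hypothetical name, matching is reduced to exact gene-name lookup, and the assumption is that the run fails only when the unmatched fraction is strictly greater than the limit.

```python
def match_with_guard(phospho_genes, total_genes, max_unmatched_fraction=0.0):
    """Return phospho genes that have a total-proteome match, or fail if too many do not."""
    known = set(total_genes)
    matched = [gene for gene in phospho_genes if gene in known]
    unmatched = len(phospho_genes) - len(matched)
    fraction = unmatched / len(phospho_genes) if phospho_genes else 0.0
    if fraction > max_unmatched_fraction:
        raise ValueError(
            f"{unmatched} of {len(phospho_genes)} phosphosite rows have no protein match "
            f"({fraction:.1%} exceeds the allowed {max_unmatched_fraction:.1%})"
        )
    return matched


# Toy data: ten phosphosite rows, one of which has no total-proteome entry.
total_genes = [f"GENE{i}" for i in range(9)]
phospho_genes = total_genes + ["ORPHAN"]
print(len(match_with_guard(phospho_genes, total_genes, max_unmatched_fraction=0.1)))
```

With the default limit of 0.0, the same toy input raises instead of quietly returning nine rows.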
The CLI is intentionally small. It does not currently expose pairwise comparison generation or the native
KinaseWorkflow path.
Validation Rules Worth Knowing
A few checks are especially useful to know up front:
- localization_prob must stay within [0, 1].
- predMat values must stay within [0, 1].
- File-loaded total and phospho headers are cleaned to lowercase snake case before validation, so duplicate cleaned names are rejected.
- predMat and the phosphosite matrix must overlap by at least one phosphosite row, and that overlap must cover at least 10% of the phosphosite matrix.
- Protein correction normalises gene identifiers before matching and, by default, refuses to drop unmatched phosphosite rows.
- Site-matrix construction drops rows with missing sequences or incomplete corrected values, then deduplicates repeated phosphosites by keeping the row with the highest mean corrected signal.
- In the native workflow, motif_sequences require matching site_sequences. If you omit motif data entirely, set allow_profile_only_fallback=True.
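Two of these rules are easy to misread, so here is a small standard-library sketch of the predMat overlap check and the keep-highest-mean deduplication. The function names and data shapes are invented for illustration, and the exact comparisons PhosPy uses (for example how ties are broken) are assumptions.

```python
from statistics import mean


def check_overlap(matrix_sites, pred_mat_sites, min_fraction=0.10):
    """Require at least one shared site covering at least min_fraction of the matrix."""
    unique_sites = set(matrix_sites)
    shared = unique_sites & set(pred_mat_sites)
    fraction = len(shared) / len(unique_sites) if unique_sites else 0.0
    if not shared or fraction < min_fraction:
        raise ValueError(f"predMat covers {fraction:.1%} of the phosphosite matrix")
    return shared


def dedupe_sites(rows):
    """Keep, for each site ID, the row with the highest mean corrected signal."""
    best: dict[str, list[float]] = {}
    for site, values in rows:
        # On a tie the first row seen is kept (an assumption of this sketch).
        if site not in best or mean(values) > mean(best[site]):
            best[site] = values
    return best


# Toy rows: the site BTK;Y551; appears twice with different signal levels.
rows = [
    ("BTK;Y551;", [1.0, 1.2, 0.8]),
    ("BTK;Y551;", [2.0, 2.1, 1.9]),
    ("PRKACA;S339;", [0.5, 0.4, 0.6]),
]
site_matrix = dedupe_sites(rows)
shared = check_overlap(site_matrix.keys(), ["BTK;Y551;", "OTHER;T1;"])
print(site_matrix["BTK;Y551;"], sorted(shared))
```

Here one of two matrix sites is shared, which is 50% coverage and passes. A predMat with no shared sites, or one covering under 10% of the matrix, would raise.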
Where to Go Next
If you want more detail, these are the most useful follow-on docs:
- docs/api.md for the public API reference
- docs/validation-and-parity.md for the short validation and PhosR parity guide
- docs/parity.md for the detailed parity guide against the R PhosR package
- docs/fixtures.md for the committed fixture and trace layout
- docs/roadmap.md for likely next steps
If you want to contribute or work from a local checkout, see CONTRIBUTING.md.
Known Limitations
A few boundaries are worth knowing up front:
- Selective scope only. PhosPy covers the workflows documented above and nothing broader.
- Parity claims are narrow. When PhosPy says there is parity, it means seam-level parity to the R PhosR package backed by committed fixtures and tests. See docs/parity.md.
- KinaseWorkflow is native first. svm_mode="r_parity" narrows one learner comparison seam. It does not make the whole workflow equivalent to PhosR.
- The CLI is intentionally small. It covers the core preprocessing and predMat-driven downstream path. The native kinase workflow is exposed through the Python API and example script.
- R is only required for fixture regeneration. You do not need R to install PhosPy or run the committed Python test suite.
For Contributors to PhosPy
To work from a local checkout:
pip install -e .
To run tests:
pip install -e ".[test]"
pytest -m "not parity"
pytest -m parity
For the short validation and parity guide, see docs/validation-and-parity.md.
For the detailed guide to parity against the R PhosR package, see docs/parity.md.
To run the usual contributor checks:
pip install -e ".[dev]"
pre-commit install
pre-commit run --all-files
R Requirements for Fixture Regeneration
The committed parity fixtures are already included in the repository. You only need R if you want to regenerate or extend them.