Directed, signed, provenance-annotated phospho-signaling graph builder and query CLI.
phosphograph
What phosphograph is
phosphograph is a Python library that builds and queries a directed, signed, provenance-annotated graph of phospho-signaling relationships among human proteins. Nodes are proteins and individual phosphosites; edges are kinase, phosphatase, autophosphorylation, and protein-protein binding relationships drawn from manually curated public databases (SIGNOR and OmniPath by default; pass --sources signor for a SIGNOR-only build, or --sources signor,omnipath,psp to opt into PhosphoSitePlus for site-level signed coverage — see the PhosphoSitePlus section for the CC BY-NC-SA 3.0 license caveat).
Why it exists
Spatial proteomics with phospho-specific stainings (e.g. p-ERK1/2, p-c-Jun, p-AKT, p-STAT3) reports on the activity state of signaling pathways at single-cell, in-tissue resolution. A single observed phospho-state is rarely interpretable on its own: the relevant questions are always what upstream input produced it and what downstream events it predicts. Designing a multiplexed-IF panel that resolves these questions requires knowing, for any given phospho-target, which other phosphorylation events are mechanistically coupled to it and could be co-stained to corroborate or refute the inferred pathway state.
Existing pathway resources address parts of this problem but require manual cross-referencing. KEGG encodes topology but not consistent effect direction at the phospho level. PhosphoSitePlus has site-level kinase-substrate data but no network view. SIGNOR has signed phospho-edges. OmniPath integrates many sources but exposes them as a general signaling network rather than as a phospho-measurable subgraph. None of them directly answer "given p-ERK1/2 T202/Y204 is elevated in this region, which other antibodies would test or extend my inference of MAPK pathway state in the same section?"
phosphograph exists to make that question scriptable and reproducible.
What phosphograph does
- Ingests phospho-relevant edges from SIGNOR (default), OmniPath (default), and PhosphoSitePlus (opt-in, CC BY-NC-SA 3.0).
- Harmonizes identifiers to UniProt canonical accessions and normalizes site nomenclature (residue letter + 1-based position on UniProt canonical).
- Merges edges across sources with consensus-effect resolution, conflict logging, and factual per-edge provenance counts.
- Detects autophosphorylation, synthesizes site-to-host "consequence" propagation edges, and assembles a `networkx.MultiDiGraph` of `(protein, phosphosite)` nodes.
- Resolves free-text protein names ("p-ERK", "phospho-c-Jun S63") to ranked UniProt candidates.
- Runs bidirectional k-hop walks from a query node with best-first source-count pruning and returns the induced subgraph, enumerated paths, and per-path signed predictions.
- Optionally collapses the result into a protein-only view with aggregated effect counts ("3 activating, 1 inhibiting").
- Exports to GraphML, Cytoscape JSON, GEXF, parquet edge lists, and Graphviz-rendered SVG/PDF/PNG.
- Exposes all functionality through a `click` CLI (with an interactive `walkthrough` wizard), a Python API, and a FastMCP server (`phosphograph mcp`) that surfaces the walks as MCP tools with an inline Cytoscape viewer for LLM clients.
What phosphograph is not
- Not an image analysis tool for spatial proteomics data.
- Not a predictor of phospho-state magnitude or kinetics.
- Not a panel optimizer in v0; walks inform manual panel decisions but do not solve set-cover automatically.
- Not a quantitative or mechanistic model of signaling.
- Not a substitute for experimental validation of any kinase-substrate relationship.
Intended users
Bioinformaticians and computational biologists designing multiplexed-IF panels for spatial proteomics, who already work with phospho-target stainings and want a scriptable, license-clean, reproducible way to retrieve the mechanistic neighborhood around a phospho-target as a queryable graph.
Algorithmic pipeline
These are the steps from raw curated data to a query result. Each is implemented in one small module and documented inline.
1. Ingest (ingest/signor_src.py, ingest/omnipath_src.py, ingest/psp_src.py)
- SIGNOR: bulk TSV download parsed row-by-row. Each row becomes one or more `PhosphoEdge`s. Filtered to human (`TAX_ID==9606`) and to mechanisms we model (`phosphorylation`, `dephosphorylation`, `binding`). The `EFFECT` column collapses to `activates|inhibits|unknown`. The `DIRECT` column ("t" = directly observed, "f" = inferred) flows through to `SourceRef.direct` as real per-row provenance.
- OmniPath: lazy import of `pypath-omnipath`, loaded only when OmniPath is among the selected build sources (it is by default). Adds enzyme-substrate coverage. OmniPath's aggregated `enz_sub` table carries no per-row effect direction, so OmniPath-only edges are `effect="unknown"` by construction.
- PhosphoSitePlus (opt-in): lazy import of `pypath.inputs.phosphosite`. Joins PSP's `Kinase_Substrate_Dataset` (kinase → substrate site, unsigned) with the `Regulatory_sites` table (site-level effect direction) on `(substrate_ac, residue, position, 'phosphorylation')`. Sites with matching regsite annotations carry signed effects derived from PSP's `ON_FUNCTION` keywords (`positive=True` → `activates`, `negative=True` → `inhibits`, contradictory or absent → `unknown`); unmatched K-S rows still ingest as `effect="unknown"` for structural coverage. PSP is opt-in because of license restrictions; see PhosphoSitePlus (opt-in) for details.
2. Resolve (harmonize/resolver.py, harmonize/phospho_parser.py)
Free-text input like "p-ERK" or "phospho-c-Jun S63" is normalized:
- A regex strips phospho prefixes/suffixes and extracts an optional `(residue, position)`.
- The cleaned symbol is sent to `mygene.info` (human only, cached).
- Candidates are ranked by mygene's Lucene score. Both the normalized score (top hit = 1.0) AND the raw score are returned so the caller can distinguish "top of a strong field" from "top of nothing."
- `low_confidence=True` when the top hit's raw score is below a threshold; `ambiguous=True` when the gap between top-1 and top-2 normalized scores is below `AMBIGUITY_THRESHOLD`. Never auto-pick: the caller decides.
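The first step above (prefix stripping plus optional site extraction) can be sketched roughly as follows. The regexes and the function name `parse_phospho_query` are illustrative assumptions, not the contents of `harmonize/phospho_parser.py`; the real parser may accept more spellings.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration; the real parser may differ.
_PREFIX = re.compile(r"^(?:phospho|p)[-\s]+", re.IGNORECASE)
_SITE = re.compile(r"(?:^|[\s-])p?([STYH])(\d+)$")


def parse_phospho_query(text: str) -> tuple[str, Optional[tuple[str, int]], bool]:
    """Return (cleaned_name, (residue, position) or None, had_phospho_prefix)."""
    s = text.strip()
    had_prefix = bool(_PREFIX.match(s))
    s = _PREFIX.sub("", s)          # drop a leading "p-" / "phospho-"
    site = None
    m = _SITE.search(s)             # trailing site token such as "S63" or "pT185"
    if m:
        site = (m.group(1), int(m.group(2)))
        s = s[: m.start()].strip()
    return s, site, had_prefix


print(parse_phospho_query("phospho-c-Jun S63"))
```

Note that "p38 alpha" is left untouched because the prefix pattern requires a separator after the "p"; the cleaned name would then go to the mygene lookup.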
3. Merge (harmonize/merge.py)
For each (source_id, target_id, mechanism) triple seen across sources:
- Union the `references` from contributing edges.
- Effect consensus: all agree → that effect; one says X and the rest say `unknown` → X (silence is not contradiction); two distinct signed effects → `unknown` and the disagreement is logged to `conflicts.tsv`.
- Factual provenance counts (no synthetic confidence): `n_sources` = distinct curated databases; `n_references` = distinct PMIDs. These drive the `--min-sources` and `--require-signed` walk filters directly.
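The effect-consensus rule above fits in a few lines. This is a minimal sketch of the rule as stated, not the code in `harmonize/merge.py`; the function name `consensus_effect` and the returned conflict flag are assumptions for illustration.

```python
def consensus_effect(effects: list[str]) -> tuple[str, bool]:
    """Return (effect, is_conflict) for one (source, target, mechanism) triple."""
    signed = {e for e in effects if e != "unknown"}  # "unknown" is silence, not a vote
    if len(signed) == 1:
        return next(iter(signed)), False   # all signed sources agree
    if not signed:
        return "unknown", False            # nobody asserted a direction
    return "unknown", True                 # genuine disagreement: caller logs it


print(consensus_effect(["activates", "unknown"]))    # ('activates', False)
print(consensus_effect(["activates", "inhibits"]))   # ('unknown', True)
```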
4. Build the graph (graph/build.py)
- Autophosphorylation detection: any phosphorylation edge whose kinase and substrate share a UniProt AC is re-tagged `mechanism="autophosphorylation"`. Source/target stay `protein:X → site:X:Y` so the graph never grows a self-loop at the protein level.
- Consequence edges (site → host protein): for every site with at least one phos/dephos parent, emit one synthetic edge that lets walks traverse from a phospho-event to "the host protein is now active/inactive." Effect is the consensus across phosphorylation/autophosphorylation parents only; dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. References are unioned across phos parents; `n_sources`/`n_references` recomputed from that union.
- Add to MultiDiGraph: nodes are created on demand; every site node gets its host protein materialized if not already present (invariant 2).
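As a rough illustration of the autophosphorylation re-tagging rule, the sketch below works directly on node-ID strings in the `protein:X` / `site:X:Y` form. The helper name `retag_mechanism` is hypothetical; `graph/build.py` may operate on `PhosphoEdge` objects instead.

```python
def retag_mechanism(source_id: str, target_id: str, mechanism: str) -> str:
    """Re-tag a phosphorylation edge when kinase and substrate share a UniProt AC."""
    if mechanism != "phosphorylation":
        return mechanism
    source_parts = source_id.split(":")
    target_parts = target_id.split(":")
    same_host = (
        source_parts[0] == "protein"
        and target_parts[0] == "site"
        and len(source_parts) == 2
        and len(target_parts) == 3
        and source_parts[1] == target_parts[1]
    )
    # Source and target IDs are left unchanged, so no protein-level self-loop appears.
    return "autophosphorylation" if same_host else mechanism


print(retag_mechanism("protein:P28482", "site:P28482:T185", "phosphorylation"))
```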
5. K-hop neighborhood walk (walk/neighborhood.py)
Best-first expansion using a heap keyed by -n_sources of the next edge. Edges supported by more curated databases are explored first, so when max_nodes is hit we have kept the strongest edges. Filters happen during expansion (min_sources, allow_dephosphorylation, allow_binding, require_signed), never post-hoc. When the cap fires, a MaxNodesPruned warning is emitted with the visited count and remaining-frontier size so the CLI can surface "you hit the cap; raise --max-nodes to see more."
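The best-first expansion can be sketched with the standard-library `heapq`. To stay self-contained, the sketch uses a plain adjacency dict in place of the real `MultiDiGraph`, walks in one direction only, and returns the leftover heap size instead of emitting a `MaxNodesPruned` warning; the function name and data shape are assumptions.

```python
import heapq
import itertools

# node -> list of (neighbor, n_sources); a stand-in for the real graph
Adjacency = dict[str, list[tuple[str, int]]]


def best_first_walk(adj: Adjacency, start: str, k: int, max_nodes: int,
                    min_sources: int = 1) -> tuple[set[str], int]:
    """Return (visited nodes, heap entries left unexplored when the cap fired)."""
    visited = {start}
    tie = itertools.count()  # tie-breaker so equal priorities pop in insertion order
    heap: list[tuple[int, int, str, int]] = []

    def push_edges(node: str, depth: int) -> None:
        if depth >= k:
            return
        for neighbor, n_sources in adj.get(node, []):
            if n_sources >= min_sources:  # filter during expansion, not post-hoc
                heapq.heappush(heap, (-n_sources, next(tie), neighbor, depth + 1))

    push_edges(start, 0)
    while heap:
        if len(visited) >= max_nodes:
            return visited, len(heap)  # cap fired: report what was left behind
        _, _, node, depth = heapq.heappop(heap)
        if node not in visited:
            visited.add(node)
            push_edges(node, depth)
    return visited, 0


adj = {"a": [("b", 1), ("c", 3)], "c": [("d", 2)]}
print(best_first_walk(adj, "a", k=2, max_nodes=2))
```

With a cap of 2, the walk keeps `c` (three sources) over `b` (one source), which is the behavior the paragraph above describes. The leftover count here may include entries for already-visited nodes, so it is an upper bound on the true frontier.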
6. Path enumeration and sign propagation (walk/paths.py, walk/sign.py)
- The caller builds one `filtered_subgraph(induced_subgraph(g, visited), ...)` and passes it to both path enumeration AND sign reading, so the two cannot disagree about which parallel edge "exists."
- `all_simple_paths_up_to(g, source, cutoff)` runs a single DFS via `nx`'s container-target overload and yields each simple path once.
- For each path, the sign is the product of per-step effects (`activates=+1`, `inhibits=-1`). Any `unknown` step makes the whole path's sign `None`. Per-path only: the same node can sit on `+` and `−` paths from different starting points, so we never collapse to per-node sign.
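The sign rule is a plain product with an early exit on `unknown`. A minimal sketch, assuming the per-step effects of one path have already been read off the filtered subgraph (the name `propagate_sign` is illustrative, not the function in `walk/sign.py`):

```python
from typing import Optional

STEP_SIGN = {"activates": 1, "inhibits": -1}


def propagate_sign(effects: list[str]) -> Optional[int]:
    """Product of per-step signs along one path; None if any step is unknown."""
    sign = 1
    for effect in effects:
        step = STEP_SIGN.get(effect)
        if step is None:      # "unknown" (or anything unrecognized) poisons the path
            return None
        sign *= step
    return sign


print(propagate_sign(["activates", "inhibits"]))   # -1
print(propagate_sign(["inhibits", "inhibits"]))    # 1
print(propagate_sign(["activates", "unknown"]))    # None
```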
7. Protein-collapsed view (graph/collapse.py)
A high-level overview for visualization. Rules:
- `protein → site:X:Y` routes through to `protein:X` (host materialized if missing).
- `site → protein` (consequence) is dropped; already accounted for via the kinase→site that produced it.
- `protein → protein` (e.g. binding) kept as-is.
Per (source, target, mechanism) bucket: effect counts {activates, inhibits, unknown}, aggregated effect ∈ {activates, inhibits, mixed, unknown}, n_underlying_edges, and a summary_label like "3 activating, 2 inhibiting". References are intentionally dropped in the collapsed view — switch back to the full graph if you need PMIDs.
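The per-bucket aggregation can be sketched as below. Two details are assumptions not stated above: `mixed` is taken to mean "at least one activating and one inhibiting edge", and unknown edges are appended to the label as "N unknown". The function name `aggregate_bucket` is hypothetical.

```python
from collections import Counter


def aggregate_bucket(effects: list[str]) -> dict:
    """Summarize the effects of all edges in one (source, target, mechanism) bucket."""
    counts = Counter(effects)
    n_act, n_inh, n_unk = counts["activates"], counts["inhibits"], counts["unknown"]
    if n_act and n_inh:
        aggregated = "mixed"          # assumption: any signed disagreement is "mixed"
    elif n_act:
        aggregated = "activates"
    elif n_inh:
        aggregated = "inhibits"
    else:
        aggregated = "unknown"
    parts = []
    if n_act:
        parts.append(f"{n_act} activating")
    if n_inh:
        parts.append(f"{n_inh} inhibiting")
    if n_unk:
        parts.append(f"{n_unk} unknown")  # assumption: label wording for unknowns
    return {
        "effect_counts": {"activates": n_act, "inhibits": n_inh, "unknown": n_unk},
        "effect": aggregated,
        "n_underlying_edges": len(effects),
        "summary_label": ", ".join(parts),
    }


print(aggregate_bucket(["activates"] * 3 + ["inhibits"] * 2)["summary_label"])
```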
8. Invariants (graph/invariants.py)
Checked after every build:
- Every phosphosite has an incoming kinase/phosphatase edge from a protein, OR an autophosphorylation edge from its own host protein.
- Every `site:X:Y` has a matching `protein:X` node.
- All node IDs validate (structural regex + canonical UniProt AC).
- Post-merge: no two parallel edges with the same `(source, target, mechanism)` carry disagreeing signed effects. Different mechanisms between the same protein pair are not flagged (phos can activate while binding inhibits; these are two distinct biological events, not a contradiction).
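A sketch of how the first two invariants might be checked, using plain sets and tuples so it runs without the real graph. The function name `find_site_violations` and the `(source_id, target_id, mechanism)` edge shape are assumptions, not the interface of `graph/invariants.py`.

```python
def find_site_violations(nodes: set[str],
                         edges: list[tuple[str, str, str]]) -> list[str]:
    """Check invariants 1 and 2 for every site node; return human-readable violations."""
    violations = []
    for node in sorted(nodes):
        if not node.startswith("site:"):
            continue
        host = "protein:" + node.split(":")[1]
        if host not in nodes:                       # invariant 2: host must exist
            violations.append(f"{node}: missing host node {host}")
        incoming = [(s, m) for s, t, m in edges if t == node]
        has_parent = any(
            (m in ("phosphorylation", "dephosphorylation") and s.startswith("protein:"))
            or (m == "autophosphorylation" and s == host)
            for s, m in incoming
        )
        if not has_parent:                          # invariant 1: no orphan sites
            violations.append(f"{node}: no incoming kinase/phosphatase edge")
    return violations


nodes = {"protein:P28482", "protein:Q02750", "site:P28482:T185"}
edges = [("protein:Q02750", "site:P28482:T185", "phosphorylation")]
print(find_site_violations(nodes, edges))   # []
```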
Scope (v0)
| Item | Decision |
|---|---|
| Species | Human only (taxid 9606). Mouse deferred to v0.1; cross-species inheritance has biological caveats around residue translation that v0 does not solve. |
| Node resolution | Protein-level required; phosphosite-level where annotated |
| Antibody filter | None in v0 |
| Use case | Academic |
| Secondary scope | Autophosphorylation detection |
| Deliverable | Python package + click CLI + graph export (GraphML, Cytoscape JSON, GEXF, SVG/PDF via Graphviz, parquet) |
Data sources
SIGNOR and OmniPath are both included by default. Pass --sources signor for a SIGNOR-only build (smaller, higher signed share, no OmniPath unsigned edges). PhosphoSitePlus is opt-in: pass --sources signor,omnipath,psp (or --sources signor,psp) to include it; see PhosphoSitePlus (opt-in) below for the license-and-mirror caveat. CollecTRI is not used.
| Source | Default? | Role | Access |
|---|---|---|---|
| SIGNOR | yes | Manually-curated, signed phospho/dephospho edges with explicit mechanism, effect direction, PMID, and SIGNOR record ID. The large majority of edges carry a signed effect. | TSV bulk dump via https://signor.uniroma2.it/releases/getLatestRelease.php |
| OmniPath | yes | Aggregated enzyme-substrate (PTM) network from many underlying resources. Adds broader site and kinase coverage but contributes zero signed edges in v0, because OmniPath's aggregated `enz_sub` table doesn't expose per-row effect direction. | `pypath-omnipath` Python client (heavy; downloads on first use) |
| PhosphoSitePlus | opt-in | Site-level kinase-substrate dataset joined with PSP's `Regulatory_sites` annotations to recover signed effect direction for the annotated subset. Substantially expands site-level coverage and adds signed edges beyond the SIGNOR baseline. Licensed CC BY-NC-SA 3.0 (academic / non-commercial only). | `pypath.inputs.phosphosite` (fetches from the OmniPath team's mirror at `rescued.omnipathdb.org`; see caveat below) |
Default tradeoff: SIGNOR + OmniPath maximizes structural coverage but dilutes the signed share. Pass --sources signor for a smaller, more-signed graph. Add psp to grow site-level coverage with a meaningful signed contribution from PSP's regulatory-sites annotations — at the cost of accepting the CC BY-NC-SA 3.0 terms and depending on the rescued mirror. KEGG, Reactome, iPTMnet, DEPOD, INDRA, and CollecTRI are not used.
PhosphoSitePlus (opt-in)
The PSP source is implemented in ingest/psp_src.py and is disabled by default for two reasons that users should be aware of before opting in:
- License: PhosphoSitePlus is distributed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0). This means:
  - Non-commercial use only. If you are using phosphograph in any commercial context (industry, contract research, fee-for-service analysis), do not enable PSP. Building with `--sources ...,psp` causes PSP data to be downloaded onto your machine; that download itself is subject to PSP's terms.
  - ShareAlike propagates to derivative datasets. If you redistribute a phosphograph-built graph that incorporates PSP edges (e.g., as a parquet file, GraphML export, or downstream model), you must license the redistribution under the same CC BY-NC-SA 3.0 terms.
  - Attribution required. Cite the canonical PSP reference (Hornbeck PV et al., Nucleic Acids Res. 2015, `doi:10.1093/nar/gku1267`) in any work that uses a PSP-enabled phosphograph build.
- Access via a third-party mirror: pypath's PSP downloaders point at `https://rescued.omnipathdb.org/phosphosite/...` rather than the official `phosphosite.org` endpoint (which requires registration and a manual web download). This mirror is maintained by the OmniPath team as a courtesy and is not endorsed by PhosphoSitePlus. The redistribution itself sits in a gray area of PSP's terms; by opting into PSP via phosphograph, you accept that:
  - the mirror may disappear at any time, in which case `--sources ...,psp` builds will start failing with a clear error;
  - you are choosing to obtain PSP data through this informal channel rather than the canonical one;
  - the PSP edges in your graph carry `database="psp"` and a derivation trail back to the rescued-mirror files (the cache is at `~/.cache/phosphograph/raw/psp/edges.parquet`).
The --sources ...,psp opt-in encodes informed consent to both points; phosphograph does not silently pull PSP under any default configuration.
Opting in via environment variable
For environments where you want PSP to be on for every invocation without having to remember --sources signor,omnipath,psp each time, set PHOSPHOGRAPH_ENABLE_PSP=1 (or true, yes, on — case-insensitive). When that variable is set, psp is automatically appended to DEFAULT_SOURCES, so both phosphograph build (no --sources flag) and the MCP server's auto-build on first boot will include PSP. An explicit --sources flag still overrides whatever the default resolves to.
# one-shot
PHOSPHOGRAPH_ENABLE_PSP=1 phosphograph build
# permanent for this shell
export PHOSPHOGRAPH_ENABLE_PSP=1
phosphograph mcp --transport stdio
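For readers wondering how the truthy values are interpreted, the resolution of the default source list might look roughly like this. Only the variable name `PHOSPHOGRAPH_ENABLE_PSP` and the accepted values come from the text above; `default_sources`, `BASE_SOURCES`, and `TRUTHY` are hypothetical names, and the real logic in `config.py` may differ.

```python
import os
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}        # compared case-insensitively
BASE_SOURCES = ("signor", "omnipath")      # defaults when the variable is unset


def default_sources(env: Optional[Mapping[str, str]] = None) -> tuple[str, ...]:
    """Append "psp" to the default sources when the opt-in variable is truthy."""
    env = os.environ if env is None else env
    flag = env.get("PHOSPHOGRAPH_ENABLE_PSP", "").strip().lower()
    if flag in TRUTHY:
        return BASE_SOURCES + ("psp",)
    return BASE_SOURCES


print(default_sources({"PHOSPHOGRAPH_ENABLE_PSP": "Yes"}))
```

An explicit `--sources` value would bypass this function entirely, matching the override behavior described above.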
For Claude Desktop, put the env var into claude_desktop_config.json (the standard MCP server spec supports env per-server):
{
  "mcpServers": {
    "phosphograph": {
      "command": "phosphograph",
      "args": ["mcp", "--transport", "stdio"],
      "env": { "PHOSPHOGRAPH_ENABLE_PSP": "1" }
    }
  }
}
For an .mcpb bundle (Claude Desktop's MCP bundle format) the same env block goes inside server.mcp_config in manifest.json. The bundled manifest.json in this repository ships with PHOSPHOGRAPH_ENABLE_PSP=1 pre-set, which means installing the bundled MCP server implicitly accepts PSP's CC BY-NC-SA 3.0 terms on behalf of whoever runs it. If you redistribute the bundle to others, ensure they understand the license implication or remove the env block before redistribution.
For a remote HTTP deployment, set the env var on the server process (systemd unit, container env, etc.) — the operator is the party accepting PSP's terms, not the end user calling the MCP tools.
Schema
Strongly typed via pydantic>=2. Node IDs are deterministic strings; v0 is human-only so the taxid is implicit.
from pydantic import BaseModel, Field
from typing import Literal, Optional

Residue = Literal["S", "T", "Y", "H"]
TAXID_HUMAN = 9606

class ProteinNode(BaseModel):
    kind: Literal["protein"] = "protein"
    uniprot_ac: str
    protein_symbol: str

class PhosphoSiteNode(BaseModel):
    kind: Literal["phosphosite"] = "phosphosite"
    uniprot_ac: str
    protein_symbol: str
    residue: Residue
    position: int = Field(ge=1)            # 1-based, UniProt canonical

Mechanism = Literal[
    "phosphorylation",
    "dephosphorylation",
    "autophosphorylation",
    "binding",                             # protein-protein, no site coordinate
]
Effect = Literal["activates", "inhibits", "unknown"]

class SourceRef(BaseModel):
    database: Literal["omnipath", "signor", "psp"]  # "psp" appears only in PSP-enabled builds
    record_id: Optional[str] = None
    pmid: Optional[str] = None
    direct: Optional[bool] = None          # SIGNOR DIRECT column

class PhosphoEdge(BaseModel):
    source_id: str
    target_id: str
    mechanism: Mechanism
    effect: Effect
    references: list[SourceRef]
    n_sources: int = Field(ge=1)           # distinct curated databases
    n_references: int = Field(ge=0)        # distinct PMIDs across references
Node ID conventions:
- `protein:P28482`
- `site:P28482:T185`
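Helpers for building and validating these IDs could look like the sketch below. The accession pattern is the one UniProt publishes for accession numbers; the helper names (`protein_id`, `site_id`, `host_of`) are illustrative assumptions and may not match `util/node_id.py`.

```python
import re

# UniProt's published accession-number pattern (6- and 10-character forms).
UNIPROT_AC = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
PROTEIN_ID = re.compile(rf"^protein:{UNIPROT_AC}$")
SITE_ID = re.compile(rf"^site:({UNIPROT_AC}):([STYH])([1-9][0-9]*)$")


def protein_id(uniprot_ac: str) -> str:
    return f"protein:{uniprot_ac}"


def site_id(uniprot_ac: str, residue: str, position: int) -> str:
    return f"site:{uniprot_ac}:{residue}{position}"


def host_of(site: str) -> str:
    """Return the protein node ID that hosts a site node ID."""
    m = SITE_ID.match(site)
    if m is None:
        raise ValueError(f"not a valid site node ID: {site!r}")
    return protein_id(m.group(1))


print(site_id("P28482", "T", 185))        # site:P28482:T185
print(host_of("site:P28482:T185"))        # protein:P28482
```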
Natural-language resolver
harmonize/resolver.py converts free-text input ("p-ERK", "phospho-c-Jun S63", "p38 alpha") to UniProt entries. Pipeline:
- `phospho_parser.py`: regex strips `phospho-`, `p-`, `pS\d+`, `pT\d+`, `pY\d+`; returns cleaned name and optional `(residue, position)`.
- Query `mygene.info` (Python `mygene` client) for `species="human"`. Matches on official symbol, alias, previous symbol, name.
- Rank by mygene Lucene score. Both the normalized score (top hit = 1.0) and the raw score are returned. The normalization makes the top hit always 1.0 even when it is actually a poor match, so the raw score (and the `low_confidence` flag derived from it) is what tells you whether to trust the top pick at all.
class ResolutionCandidate(BaseModel):
    uniprot_ac: str
    protein_symbol: str
    matched_via: Literal["symbol", "alias", "previous_symbol", "name"]
    score: float = Field(ge=0.0, le=1.0)   # normalized within this query
    raw_score: float = 0.0                 # mygene Lucene score verbatim

class ResolutionResult(BaseModel):
    query: str
    parsed_site: Optional[tuple[Residue, int]] = None
    parsed_phospho_prefix: bool = False
    candidates: list[ResolutionCandidate]  # sorted by score desc
    ambiguous: bool = False                # top1 - top2 < AMBIGUITY_THRESHOLD
    low_confidence: bool = False           # top raw_score < LOW_CONFIDENCE_RAW_SCORE
Never auto-pick. Caller decides.
Graph model
networkx.MultiDiGraph. Edge conventions:
- Kinase to substrate site: `protein` (kinase) → `phosphosite` (substrate), `mechanism="phosphorylation"`.
- Phosphatase to substrate site: same shape, `mechanism="dephosphorylation"`.
- Autophosphorylation: `protein:X` → `site:X:Y`, re-tagged at build time when source AC equals target AC. Self-loops at the protein level are avoided.
- Site-to-host "consequence" edge: `phosphosite` → `protein` of the same UniProt AC. Synthesized at build time as a structural propagation hop so walks can traverse from a phospho-event to the host protein's activity. Effect is the consensus across the site's phosphorylation/autophosphorylation parents (dephos parents excluded; see Algorithmic pipeline / Build).
- Binding: `protein` → `protein` (no site coordinate). From SIGNOR's binding mechanism rows.
Walks
Two primary entry points:
upstream(target: str, k: int = 2, *,
         include_phosphatases: bool = True,
         include_binding: bool = True,
         min_sources: int = 1,
         require_signed: bool = False,
         max_nodes: int | None = None,
         sources: Iterable[str] | None = None) -> Walk

downstream(source: str, k: int = 2, *,
           include_binding: bool = True,
           min_sources: int = 1,
           require_signed: bool = False,
           max_nodes: int | None = None,
           sources: Iterable[str] | None = None) -> Walk
Walk returns the (filtered) induced subgraph, the enumerated simple paths up to length k, and a per-path propagated sign.
Filters use factual provenance: min_sources=N keeps only edges asserted by at least N curated databases (min_sources=2 is the "consensus only" view the interactive walkthrough prompts for); require_signed=True drops effect="unknown" edges. There is no synthetic confidence score.
Query-time source filter: sources={"signor", "psp"} (a subset of SUPPORTED_SOURCES) restricts the walk to edges whose references include at least one SourceRef with database in the allowed set. This is distinct from build-time --sources — it carves a per-call subset out of the already-built graph.pkl without rebuilding, and pairs naturally with min_sources (e.g. "edges with PSP AND SIGNOR both asserting" = sources={"signor","psp"}, min_sources=2).
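The query-time rule ("at least one reference from an allowed database") is a simple membership test. A minimal sketch, assuming each edge exposes the `database` values of its references as strings; the function name `edge_passes_sources` is hypothetical:

```python
from typing import Iterable, Optional


def edge_passes_sources(reference_dbs: Iterable[str],
                        sources: Optional[set[str]] = None) -> bool:
    """True if the edge has at least one reference from an allowed database."""
    if sources is None:          # no filter requested: every edge passes
        return True
    return any(db in sources for db in reference_dbs)


print(edge_passes_sources(["signor", "omnipath"], {"signor", "psp"}))  # True
print(edge_passes_sources(["omnipath"], {"signor", "psp"}))            # False
```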
Sign propagation: product of edge effects along the path. activates=+1, inhibits=-1, unknown sets propagated_sign=None for that path. Never aggregated to a single per-node sign — the same node can sit on + and − paths from different starting points.
Hub blow-up: around hubs (AKT, ERK, MTOR) k≥2 neighborhoods can easily exceed the default max_nodes cap. max_nodes triggers best-first expansion ordered by edge n_sources, so when the cap fires the strongest edges are kept. A MaxNodesPruned warning carries the cap, visited count, and remaining-frontier size so the CLI can suggest raising --max-nodes.
Conflict resolution and provenance
harmonize/merge.py. For the same (source_id, target_id, mechanism) triple from multiple databases or multiple rows:
- Union references into one `PhosphoEdge`.
- Effect resolution:
  - All sources agree → that effect.
  - One says `X`, the rest say `unknown` → `X` (silence is not contradiction).
  - Genuine disagreement → `effect = "unknown"`, conflict logged to `conflicts.tsv`.
- Provenance counts (factual, not heuristic):
  - `n_sources` = number of distinct databases asserting the edge.
  - `n_references` = number of distinct PMIDs across all unioned references.
No synthetic "confidence score" is produced. Walk filters use n_sources directly (--min-sources N) and the boolean --require-signed flag for effect direction.
Source precedence (for downstream consumers picking a representative reference): SIGNOR > PSP > OmniPath. SIGNOR ranks highest because its rows carry explicit per-edge signed effect; PSP is next because regulatory-site annotations recover signed effect for a meaningful subset of K-S edges; OmniPath's enz_sub aggregation carries no per-row direction and ranks last.
Orthology
Not in v0. v0 is human only. Mouse would require sequence-aligned site coordinate translation between orthologs, which v0 does not implement honestly; the prior "copy residue+position verbatim" inheritance was biologically unreliable and has been removed. Mouse may return in v0.1 with proper alignment-aware site translation.
Output formats
graph/io.py:
to_graphml(g, path) # interchange, Cytoscape desktop, yEd
to_gexf(g, path) # Gephi
to_cytoscape_json(g, path) # web viewers, .cyjs
to_graphviz(g, path, layout="dot") # SVG/PDF/PNG via system Graphviz
to_pickle(g, path) # full round-trip with typed attributes
to_parquet_edges(g, path) # pandas-friendly edge list
Format inferred from file extension unless explicit. Graphviz requires the system binary; layouts: dot for hierarchical (upstream/downstream views), sfdp for large neighborhoods.
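Extension-based format inference might be implemented along these lines. The extension table is an assumption pieced together from the formats named in this document (`.cyjs` and `.pkl` in particular); the actual mapping in `graph/io.py` may differ.

```python
from pathlib import Path
from typing import Optional

# Assumed extension -> format table; not copied from graph/io.py.
EXTENSION_FORMATS = {
    ".graphml": "graphml",
    ".gexf": "gexf",
    ".cyjs": "cyjs",
    ".svg": "svg",
    ".pdf": "pdf",
    ".png": "png",
    ".parquet": "parquet",
    ".pkl": "pickle",
}


def infer_format(path: str, explicit: Optional[str] = None) -> str:
    """An explicit format wins; otherwise fall back to the file extension."""
    if explicit:
        return explicit
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"cannot infer export format from {path!r}") from None


print(infer_format("erk_upstream.graphml"))   # graphml
```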
Module layout
phosphograph/
    __init__.py
    config.py              # paths, species toggles, source toggles, cache dir, weights
    models.py              # pydantic schemas above
    util/
        node_id.py         # deterministic node-ID helpers
    ingest/
        base.py            # Ingestor protocol -> Iterator[PhosphoEdge]
        omnipath_src.py    # enz_sub via pypath (default source, lazy import, human only)
        psp_src.py         # PhosphoSitePlus via pypath (opt-in, CC BY-NC-SA 3.0)
        signor_src.py      # SIGNOR bulk TSV
    harmonize/
        ids.py             # UniProt canonical resolution
        sites.py           # residue+position normalization, isoform handling
        merge.py           # consensus-effect merge + conflict logging
        resolver.py        # mygene-backed free-text -> UniProt resolver
        phospho_parser.py  # regex parser for "p-X S123"-style input
    graph/
        build.py           # MultiDiGraph assembly
        io.py              # all exports
        invariants.py      # property tests (no orphan sites, valid node IDs, etc.)
        collapse.py        # protein-only collapsed view for high-level overview
    walk/
        neighborhood.py    # bidirectional k-hop BFS
        paths.py           # all simple paths up to length k
        sign.py            # per-path sign accumulation
    query/
        upstream.py
        downstream.py
    mcp/
        server.py          # FastMCP server: tools, resources, prompts, run()
        resolution.py      # free-text -> node ID with MCP elicitation
        payload.py         # Walk / paths -> MCP wire payload (cytoscape + summary + structured)
        view.py            # ui://phosphograph/view.html Cytoscape app
    cli.py                 # click entry point (including `phosphograph mcp`)
tests/                     # pytest + hypothesis
Dependencies
Required: pypath-omnipath, httpx, pandas, pydantic>=2, networkx>=3, click>=8, mygene, graphviz (Python wrapper), pyarrow, fastmcp (powers the MCP server), pytest, hypothesis.
System: Graphviz binaries (apt install graphviz or equivalent).
Optional extras: pyvis (interactive HTML preview).
CLI
phosphograph build [--sources signor[,omnipath][,psp]] [--force]
phosphograph resolve <query> [--top-k 5]
phosphograph upstream <gene_or_ac> [--depth 2] [--include-phosphatases] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph downstream <gene_or_ac> [--depth 2] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph neighborhood <gene_or_ac> [--upstream-depth N] [--downstream-depth N] [--upstream-max-nodes N] [--downstream-max-nodes N] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph paths <source> <target> [--max-length 4] [--sources signor,psp] [--output FILE]
phosphograph export [--format graphml|gexf|cyjs|svg|pdf|parquet] <output>
phosphograph info <gene_or_ac>
phosphograph conflicts [--output conflicts.tsv]
phosphograph walkthrough
Output format is inferred from the file extension unless --format is set. --orientation horizontal|vertical controls Graphviz layout direction (LR vs TB) for upstream, downstream, neighborhood, paths, export; ignored for non-Graphviz formats.
Note on --sources semantics. On build, --sources is a build-time directive that decides which curated databases get merged into the cached graph.pkl. On the walk subcommands (upstream, downstream, neighborhood, paths), --sources is a query-time filter that carves a per-call subset out of the existing cached graph: an edge passes the filter iff at least one of its references has database in the allowed set. The walk filter never triggers a rebuild.
MCP server
phosphograph mcp runs a FastMCP-based Model Context Protocol server so the walks are callable directly from LLM agents (Claude.ai, Claude Desktop, custom hosts). The same query semantics as the CLI, but with an inline interactive Cytoscape viewer rendered in the chat window and MCP elicitation for ambiguous protein names.
phosphograph mcp # streamable HTTP on 127.0.0.1:8765/mcp (default)
phosphograph mcp --transport stdio # Claude Desktop / subprocess hosts
phosphograph mcp --host 0.0.0.0 --port 8765 # autodeploy / container
Transports. http (alias streamable-http) is the default and the modern MCP HTTP transport — use it for Claude.ai and most autodeploy setups. stdio is for hosts that spawn the server as a subprocess (Claude Desktop).
Auto-build on first boot. If the cached graph is missing, the server runs the build step automatically before accepting tool calls, so a freshly deployed container is usable without a manual phosphograph build. Disable with --no-auto-build; override sources on cache miss with --sources signor,omnipath (the default — includes both), --sources signor for the signed-only subset, or --sources signor,omnipath,psp to opt into PhosphoSitePlus (CC BY-NC-SA 3.0; see PhosphoSitePlus (opt-in)). Note that PSP must be opted in by whoever runs the server — for a hosted MCP deployment that means the server operator, not the end user, accepts PSP's license terms.
Tools (all read-only, annotated for hosts):
| Tool | Purpose |
|---|---|
| `upstream` | Walk upstream from a query (gene symbol / UniProt AC / `SYMBOL:T185`). |
| `downstream` | Walk downstream from a query. |
| `neighborhood` | Bidirectional neighborhood with independent up/down depth and node caps. |
| `paths` | Enumerate signed simple paths between two proteins. |
| `resolve_protein` | Free-text → ranked UniProt candidates (fallback when the client doesn't support elicitation). |
| `node_info` | Attributes + in/out degrees for a single node. |
Walk-tool parameters at parity with the CLI walkthrough. Every walk tool (upstream, downstream, neighborhood) takes the full filter set the interactive walkthrough prompts for: depth, max_nodes (and per-direction variants for neighborhood), include_phosphatases, include_binding, min_sources (a.k.a. the consensus knob — min_sources=2 keeps only edges asserted by ≥2 curated DBs), require_signed, plus two flags unique to v0:
- `sources: list[str] | None`: query-time database filter (`["signor", "psp"]` etc.). Restricts to edges with at least one reference from the named databases. Does not trigger a rebuild; it pairs with build-time `--sources` (the latter decides what lands in the cache; this one carves a subset out at query time).
- `collapse: bool`: return the protein-only aggregated view (phosphosites hidden, parallel edges merged by `(source, target, mechanism)`). Path enumeration is omitted under `collapse=True` because paths reference site nodes that the protein-only view hides.

`paths` does not take `collapse` (path enumeration is inherently node-level) but does take `sources`.
The MCP tool surface for filters matches the CLI walkthrough 1:1, so anything a user can do interactively is also reachable from an LLM agent.
Resources. ui://phosphograph/view.html — the Cytoscape viewer (loaded into a sandboxed iframe by the host). phosphograph://stats — graph statistics as JSON.
Prompts. Canonical query templates the LLM (and slash-command UIs) can discover and invoke: kinase_network, regulators_of, path_between.
Cytoscape rendering. Each walk tool returns three things in its result: a short text summary (for the LLM), a Cytoscape elements JSON blob (picked up by the bundled viewer via app.ontoolresult and rendered as an interactive graph in the chat window), and a structured payload (focus node, counts, full path list, prune warnings) for programmatic consumption. The viewer styles activating edges green, inhibitory red, and binding edges dashed; protein nodes are ellipses, phosphosite nodes are boxes. Toolbar buttons: fit, re-layout, toggle phosphosites, PNG export.
Interactive disambiguation. When a free-text query maps to multiple candidates in the graph, the tool issues an MCP elicitation so the user picks one inline. If the client does not support elicitation, the tool raises a ToolError pointing the agent at resolve_protein to do an explicit candidate listing first.
Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
{
  "mcpServers": {
    "phosphograph": {
      "command": "phosphograph",
      "args": ["mcp", "--transport", "stdio"]
    }
  }
}
Claude.ai or other streamable-HTTP clients: point them at http://<host>:<port>/mcp.
For programmatic use:
from phosphograph.mcp import build_server, run
run(transport="http", host="0.0.0.0", port=8765) # autodeploy entry point
mcp = build_server(graph=g) # inject a pre-loaded graph (tests/scripts)
Caching
~/.cache/phosphograph/ (override via PHOSPHOGRAPH_CACHE_DIR env var):
- `raw/`: source-version-stamped JSON/TSV downloads
- `edges/`: parquet edge lists per source
- `graph/`: built `MultiDiGraph` pickle keyed by (sources, species, build-timestamp)
Idempotent rebuild: phosphograph build --force.
Implementation invariants (enforced by graph/invariants.py)
Asserted after every build_graph(..., merge=True):
- Every phosphosite node has an incoming phosphorylation/dephosphorylation edge from a protein node, OR an autophosphorylation edge whose source is exactly its own host protein.
- Every site:X:Y node has a matching protein:X node in the graph.
- All node IDs validate (structural regex + canonical UniProt AC).
- Post-merge: no two parallel edges sharing the same (source, target, mechanism) carry disagreeing signed effects. Different mechanisms between the same protein pair (e.g. phos:activates + binding:inhibits) are not flagged: they describe distinct biological events, not a contradiction.
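The post-merge invariant can be illustrated with a small standalone check. This is a sketch under stated assumptions, not the code in graph/invariants.py: edges are plain tuples rather than MultiDiGraph edges, the function name find_sign_conflicts is hypothetical, and unknown effects are assumed not to count as a disagreement.

```python
from collections import defaultdict

# Only these two effects are treated as "signed" in this sketch (assumption).
SIGNED = {"activates", "inhibits"}


def find_sign_conflicts(edges):
    """Return (source, target, mechanism) keys whose parallel edges disagree.

    edges: iterable of (source, target, mechanism, effect) tuples.
    A key is reported when both 'activates' and 'inhibits' appear for it.
    """
    effects = defaultdict(set)
    for source, target, mechanism, effect in edges:
        if effect in SIGNED:
            effects[(source, target, mechanism)].add(effect)
    return sorted(key for key, seen in effects.items() if len(seen) > 1)


example = [
    ("protein:A", "site:B:T185", "phosphorylation", "activates"),
    ("protein:A", "site:B:T185", "phosphorylation", "inhibits"),   # conflict
    ("protein:A", "protein:B", "phosphorylation", "activates"),
    ("protein:A", "protein:B", "binding", "inhibits"),             # distinct mechanism
]
conflicts = find_sign_conflicts(example)
```

Here only the first pair is reported, because the phosphorylation and binding edges between protein:A and protein:B are keyed separately.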
Out of scope for v0
- CollecTRI / transcription factor regulatory edges (TFs are gene-level, off-mission for a phospho-signaling tool)
- Antibody catalog / Antibody Registry integration
- INDRA / text-mined statements
- Panel optimization / set-cover suggestions
- Kinetic or quantitative modeling
- KEGG, Reactome, iPTMnet, DEPOD as separate ingestors
Project state and continuation notes
This appendix documents the actual runtime quirks future contributors (or future sessions) need to know — things not derivable from the code alone.
Locked dependency pins (do not bump without testing pypath end-to-end)
| Pin | Reason |
|---|---|
| paramiko<3 | paramiko 3.x removed DSSKey; the unmaintained pysftp (which pypath-omnipath imports unconditionally in pypath/share/curl.py) crashes on import. Pinning to 2.x is the cleanest workaround. |
| pandas>=2.2,<3 | pypath.inputs.uniprot_idmapping.idtypes() calls groups.fillna(-1.0, inplace=True) on a string column. Pandas 3.x uses Arrow-backed string arrays that reject float fill values. |
If pypath upstream fixes either issue, the corresponding pin can be relaxed. Verify with uv run python -c "from pypath import omnipath; omnipath.db.get_db('enz_sub').make_df(tax_id=True)" after any bump.
pypath API surface actually used
from pypath import omnipath
es = omnipath.db.get_db('enz_sub') # EnzymeSubstrateAggregator
es.make_df(tax_id=True) # populates es.df
df = es.df # pd.DataFrame
DataFrame columns (verified against the pinned pypath-omnipath):
enzyme, enzyme_genesymbol, substrate, substrate_genesymbol, isoforms, residue_type, residue_offset, modification, sources, references, curation_effort, ncbi_tax_id.
We rename residue_type → residue_letter in phosphograph/ingest/omnipath_src.py:_fetch_enz_sub_live so the rest of the pipeline keeps a single column contract.
There is no to_dataframe() method — earlier docs hinted at one but the supported API is make_df() + .df.
SIGNOR API
SIGNOR ships its full corpus as a single TSV at https://signor.uniroma2.it/releases/getLatestRelease.php.
Columns we use: IDA, IDB, DATABASEA, DATABASEB, EFFECT, MECHANISM, RESIDUE, TAX_ID, PMID, DIRECT, SIGNOR_ID.
- EFFECT collapses to activates|inhibits|unknown via the up-regulates*/down-regulates* prefixes (see signor_src._effect_to_enum).
- MECHANISM is kept iff one of phosphorylation, dephosphorylation, binding. Phos/dephos rows go protein → site:residue:position; binding rows go protein → protein (no site coordinate).
- DIRECT is propagated to SourceRef.direct (True for t, False for f, None when blank). This is a real per-row signal distinguishing directly observed interactions from inferred ones.
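The column handling above can be sketched in a few lines. This is an illustration of the described rules, not the contents of signor_src: the function names are hypothetical, and the whitespace stripping and case-insensitive matching are assumptions.

```python
# Mechanisms retained from the SIGNOR TSV, as listed in the text above.
KEPT_MECHANISMS = {"phosphorylation", "dephosphorylation", "binding"}


def effect_to_enum(raw):
    """Collapse a SIGNOR EFFECT string to 'activates', 'inhibits' or 'unknown'."""
    text = (raw or "").strip().lower()
    if text.startswith("up-regulates"):
        return "activates"
    if text.startswith("down-regulates"):
        return "inhibits"
    return "unknown"


def parse_direct(raw):
    """Map the DIRECT column to True ('t'), False ('f') or None (blank/other)."""
    text = (raw or "").strip().lower()
    if text == "t":
        return True
    if text == "f":
        return False
    return None


def keep_row(mechanism):
    """Keep a row only if its MECHANISM is one of the three retained values."""
    return (mechanism or "").strip().lower() in KEPT_MECHANISMS
```

Prefix matching means that any EFFECT value beginning with up-regulates or down-regulates collapses to the same sign, and everything else falls through to unknown.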
SIGNOR trust score is NOT in the bulk TSV. The published per-edge score combines several features (PMID count, pathway co-occurrence, Reactome cross-reference, UniProt co-mention) but the bulk download omits it. Recomputing locally would require pulling Reactome and UniProt sidecars. v0 uses the readily-available signals (n_sources, n_references, direct) instead.
OmniPath REST endpoint naming
The query type for enzyme-substrate is /enz_sub (with underscore), not /enzsub. The /ptms alias also works. Metadata at /queries/enz_sub returns the parameter dictionary. Both /enzsub and /enz-sub return HTTP 502. (phosphograph itself uses pypath-omnipath rather than the REST endpoint directly; this note is for orientation if you ever need to verify column shapes against the web service.)
Build pipeline non-obvious behaviors
- Protein-symbol plumbing: ingestors do not contribute labels. After build_graph assembles the graph, a single post-build pass in phosphograph.harmonize.symbols.apply_protein_symbols(g) collects every UniProt AC on the graph and asks pypath.utils.mapping.label(ac, id_type='uniprot', ncbi_tax_id=9606) for the HGNC primary symbol, writing it onto each node as protein_symbol. This runs unconditionally when the CLI / MCP autobuild constructs the graph (build_graph(..., enrich_symbols=True)) and is opt-in elsewhere; tests that pre-set protein_symbol on synthetic graphs leave the default enrich_symbols=False so the pypath touch stays out of unit tests. The node_label(data) helper in graph/io.py reads protein_symbol first, falling back to uniprot_ac when missing, and produces the human-readable label used everywhere (MAPK1, MAPK1:T185).
- Consequence edges: synthesize_consequence_edges emits one site → host_protein edge per site with at least one kinase/phosphatase parent. These are structural propagation hops required for upstream walks (gating them on known effect would break reachability). Effect is the consensus across phosphorylation/autophosphorylation parents only; dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. Consensus rule mirrors merge_edges. References are unioned across phos parents; n_sources and n_references are recomputed from the union.
- Autophosphorylation: detected at build time by AC equality between source protein and target site (detect_autophosphorylation). Re-tags the mechanism but leaves source/target IDs as protein:X → site:X:Y so self-loops at the protein level are avoided. Invariant 1 verifies that any autophosphorylation edge originates at the site's host protein.
- Merge produces factual provenance counts (n_sources = distinct curated databases; n_references = distinct PMIDs). No synthetic confidence score is computed; earlier heuristic weights (manual-curation flag, LTP boost) were not grounded in real evidence quality and have been removed. Walk filters operate directly on n_sources and on whether the effect is signed.
- Walks and signs share one filtered view: query/{downstream,upstream}.py build a single filtered_subgraph from the induced subgraph and pass it to BOTH path enumeration and sign reading. The two cannot disagree about which parallel edge "exists." Path enumeration runs a single DFS via nx's container-target overload.
- K-hop pruning is best-first, ordered by -n_sources of the next edge. When max_nodes fires, a MaxNodesPruned warning is emitted with {max_nodes, visited_count, remaining_candidates, direction, source} for the CLI to surface.
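To make the best-first pruning concrete, here is a rough standalone sketch. It assumes a plain adjacency dict instead of the real graph, returns the warning as a dict rather than emitting a MaxNodesPruned warning, and uses a hypothetical function name; the actual traversal in phosphograph may differ in tie-breaking and depth handling.

```python
import heapq
import itertools


def best_first_khop(adjacency, source, k, max_nodes, direction="downstream"):
    """Visit nodes within k hops of source, best-supported edges first.

    adjacency: dict mapping node -> list of (neighbor, n_sources).
    Returns (visited_in_order, warning_payload_or_None).
    Simplification: a node is expanded at the depth it is first popped at,
    even if a shallower route to it exists.
    """
    visited, seen = [source], {source}
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = []

    def push_neighbors(node, depth):
        if depth >= k:
            return
        for neighbor, n_sources in adjacency.get(node, []):
            # Negated n_sources turns the min-heap into "most sources first".
            heapq.heappush(frontier, (-n_sources, next(tie), depth + 1, neighbor))

    push_neighbors(source, 0)
    while frontier:
        if len(visited) >= max_nodes:
            remaining = len({n for _, _, _, n in frontier if n not in seen})
            if remaining:
                return visited, {
                    "max_nodes": max_nodes,
                    "visited_count": len(visited),
                    "remaining_candidates": remaining,
                    "direction": direction,
                    "source": source,
                }
            break
        _, _, depth, node = heapq.heappop(frontier)
        if node in seen:
            continue
        seen.add(node)
        visited.append(node)
        push_neighbors(node, depth)
    return visited, None
```

The payload keys mirror the fields listed for the warning above, so a caller can see how many candidates were dropped when the node cap fires.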
Tests
- pytest + hypothesis, organized under tests/. Run with uv run pytest tests/.
- No test hits the network: OmniPath / SIGNOR / mygene are mocked or fixtured; the MCP layer uses FastMCP's in-process client (fastmcp.utilities.tests.run_server_async for HTTP roundtrips).
- The pypath_log/ directory at the repo root is created by pypath the first time it runs from any cwd. Gitignored.
Cache layout
~/.cache/phosphograph/ (override via PHOSPHOGRAPH_CACHE_DIR):
raw/omnipath/enz_sub.parquet # cached pypath dataframe (only if opted in)
raw/signor/all_data.tsv # SIGNOR bulk dump
raw/psp/edges.parquet # cached PSP K-S × regsites join (only if --sources ...,psp)
raw/mygene/ # resolver responses
edges/ # per-source parquet (currently unused)
graph/graph.pkl # merged MultiDiGraph pickle (consumed by CLI)
graph/graph.meta.json # sidecar recording which sources produced the pickle
conflicts.tsv # merge conflict log
phosphograph build always rebuilds and replaces graph/graph.pkl so that changes to --sources (or to PHOSPHOGRAPH_ENABLE_PSP) take effect immediately. The raw per-source caches under raw/ are reused on rebuild unless you pass --force, which re-fetches every source from upstream. The MCP server's auto-build (ensure_graph_cached) keeps cache-hit behavior idempotent, but invalidates the pickle when the requested source set differs from the one recorded in graph.meta.json — so restarting the server after flipping PHOSPHOGRAPH_ENABLE_PSP=1 triggers a clean rebuild on first tool call.
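The invalidation rule for the MCP auto-build can be sketched as follows. This is not the body of ensure_graph_cached: the function name needs_rebuild is hypothetical, and the assumption that graph.meta.json stores the source list under a "sources" key is mine, since the sidecar's schema is not documented here.

```python
import json
from pathlib import Path


def needs_rebuild(cache_root, requested_sources):
    """Return True when the cached pickle cannot serve the requested sources.

    Assumes graph.meta.json looks like {"sources": ["signor", "omnipath"]}.
    """
    graph_dir = Path(cache_root) / "graph"
    pickle_path = graph_dir / "graph.pkl"
    meta_path = graph_dir / "graph.meta.json"
    if not pickle_path.exists() or not meta_path.exists():
        return True  # nothing usable cached yet
    try:
        recorded = json.loads(meta_path.read_text()).get("sources", [])
    except (json.JSONDecodeError, AttributeError):
        return True  # unreadable or unexpected sidecar: rebuild to be safe
    # Order does not matter; only the set of sources does.
    return set(recorded) != set(requested_sources)
```

Comparing sets rather than lists means that signor,omnipath and omnipath,signor hit the same cached pickle, while adding psp forces a rebuild.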
Live build expectations
| Build | Signed coverage | First-build cost |
|---|---|---|
| SIGNOR + OmniPath (default) | lower: OmniPath dilutes the signed share because its enz_sub rows are unsigned | slower: pypath downloads and caches OmniPath on first use |
| SIGNOR only (--sources signor) | high: the large majority of edges are signed | fast: no pypath import |
| SIGNOR + OmniPath + PSP (--sources signor,omnipath,psp) | meaningfully higher signed-edge count than the default: PSP regulatory-sites annotations recover signed direction for an annotated subset of K-S edges; the rest land as unknown | slowest on first build: phosphosite_regsites_one_organism pulls multiple PSP files plus SwissProt and runs orthology translation. The joined edge frame is then cached at raw/psp/edges.parquet so subsequent builds skip the pypath roundtrip |
Run phosphograph build once to see the actual node/edge counts and signed percentage for your build (printed on stdout). Every invocation rebuilds the merged pickle from the raw caches; subsequent builds take seconds when the raw caches are warm. Pass --force to also re-fetch the raw upstream downloads (equivalent to deleting ~/.cache/phosphograph/raw/ first).
CLI surface
Subcommands: build, resolve, upstream, downstream, neighborhood, paths, export, info, conflicts, walkthrough, mcp. Run phosphograph --help for the canonical list. walkthrough is an interactive wizard that auto-builds when no cache exists, then loops a menu — the recommended entry point for new users. mcp starts the FastMCP server (see MCP server).
--orientation horizontal|vertical controls Graphviz layout (rankdir=LR vs rankdir=TB) on upstream, downstream, neighborhood, paths, export. Ignored for non-Graphviz formats.
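The orientation mapping is small enough to show in full. The sketch below emits a minimal DOT string with the corresponding rankdir; to_dot is a hypothetical helper for illustration and is not phosphograph's exporter.

```python
# Mapping taken from the text above: horizontal -> LR, vertical -> TB.
RANKDIR = {"horizontal": "LR", "vertical": "TB"}


def to_dot(edges, orientation):
    """Render (source, target) pairs as a DOT digraph with the requested rankdir."""
    if orientation not in RANKDIR:
        raise ValueError(f"orientation must be one of {sorted(RANKDIR)}")
    lines = ["digraph G {", f"  rankdir={RANKDIR[orientation]};"]
    for source, target in edges:
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)
```

For example, to_dot([("MAP2K1", "MAPK1")], "horizontal") yields a graph laid out left to right, while "vertical" lays it out top to bottom.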
Things that bit us, briefly
- omnipath (lighter REST-only client at https://github.com/saezlab/omnipath) is a different package from pypath-omnipath. The lock-in is on pypath because of its richer database building.
- The omnipath PyPI package on RTD docs (omnipath.readthedocs.io) is the other one; don't follow that for pypath's API.
- The parquet PyPI package is unmaintained; we use pyarrow for parquet I/O. The parquet entry in pyproject is a vestige and can be removed.
- pyproject originally declared pandas>=3.0.3 and pypath-omnipath>=0.16.20 without pinning paramiko or pandas upper bounds. Both turned out to be wrong; fixed.
License
phosphograph is released under the GNU General Public License v3.0 or later (GPL-3.0-or-later). The full license text is in LICENSE.
The GPL choice is dictated by a runtime dependency: pypath-omnipath is GPL-3.0, and importing it as a library makes the combined work a derivative work under GPL terms. Anyone redistributing phosphograph — or a program that imports it — must therefore comply with the GPL (source availability, same-license redistribution, no additional restrictions). Internal academic use and modification are unrestricted; the obligations only kick in on distribution.