Directed, signed, provenance-annotated phospho-signaling graph builder and query CLI.

phosphograph

What phosphograph is

phosphograph is a Python library that builds and queries a directed, signed, provenance-annotated graph of phospho-signaling relationships among human proteins. Nodes are proteins and individual phosphosites; edges are kinase, phosphatase, autophosphorylation, and protein-protein binding relationships drawn from manually curated public databases (SIGNOR and OmniPath by default; pass --sources signor for a SIGNOR-only build, or --sources signor,omnipath,psp to opt into PhosphoSitePlus for site-level signed coverage — see the PhosphoSitePlus section for the CC BY-NC-SA 3.0 license caveat).

Why it exists

Spatial proteomics with phospho-specific stainings (e.g. p-ERK1/2, p-c-Jun, p-AKT, p-STAT3) reports on the activity state of signaling pathways at single-cell, in-tissue resolution. A single observed phospho-state is rarely interpretable on its own: the relevant questions are always what upstream input produced it and what downstream events it predicts. Designing a multiplexed-IF panel that resolves these questions requires knowing, for any given phospho-target, which other phosphorylation events are mechanistically coupled to it and could be co-stained to corroborate or refute the inferred pathway state.

Existing pathway resources address parts of this problem but require manual cross-referencing. KEGG encodes topology but not consistent effect direction at the phospho level. PhosphoSitePlus has site-level kinase-substrate data but no network view. SIGNOR has signed phospho-edges. OmniPath integrates many sources but exposes them as a general signaling network rather than as a phospho-measurable subgraph. None of them directly answer "given p-ERK1/2 T202/Y204 is elevated in this region, which other antibodies would test or extend my inference of MAPK pathway state in the same section?"

phosphograph exists to make that question scriptable and reproducible.

What phosphograph does

  1. Ingests phospho-relevant edges from SIGNOR (default), OmniPath (default), and PhosphoSitePlus (opt-in, CC BY-NC-SA 3.0).
  2. Harmonizes identifiers to UniProt canonical accessions and normalizes site nomenclature (residue letter + 1-based position on UniProt canonical).
  3. Merges edges across sources with consensus-effect resolution, conflict logging, and factual per-edge provenance counts.
  4. Detects autophosphorylation, synthesizes site-to-host "consequence" propagation edges, and assembles a networkx.MultiDiGraph of (protein, phosphosite) nodes.
  5. Resolves free-text protein names ("p-ERK", "phospho-c-Jun S63") to ranked UniProt candidates.
  6. Runs bidirectional k-hop walks from a query node with best-first source-count pruning and returns the induced subgraph, enumerated paths, and per-path signed predictions.
  7. Optionally collapses the result into a protein-only view with aggregated effect counts ("3 activating, 1 inhibiting").
  8. Exports to GraphML, Cytoscape JSON, GEXF, parquet edge lists, and Graphviz-rendered SVG/PDF/PNG.
  9. Exposes all functionality through a click CLI (with an interactive walkthrough wizard), a Python API, and a FastMCP server (phosphograph mcp) that surfaces the walks as MCP tools with an inline Cytoscape viewer for LLM clients.

What phosphograph is not

  • Not an image analysis tool for spatial proteomics data.
  • Not a predictor of phospho-state magnitude or kinetics.
  • Not a panel optimizer in v0; walks inform manual panel decisions but do not solve set-cover automatically.
  • Not a quantitative or mechanistic model of signaling.
  • Not a substitute for experimental validation of any kinase-substrate relationship.

Intended users

Bioinformaticians and computational biologists designing multiplexed-IF panels for spatial proteomics, who already work with phospho-target stainings and want a scriptable, license-clean, reproducible way to retrieve the mechanistic neighborhood around a phospho-target as a queryable graph.

Algorithmic pipeline

These are the steps from raw curated data to a query result. Each is implemented in one small module and documented inline.

1. Ingest (ingest/signor_src.py, ingest/omnipath_src.py, ingest/psp_src.py)

  • SIGNOR: bulk TSV download parsed row-by-row. Each row becomes one or more PhosphoEdges. Filtered to human (TAX_ID==9606) and to mechanisms we model (phosphorylation, dephosphorylation, binding). The EFFECT column collapses to activates|inhibits|unknown. The DIRECT column ("t" = directly observed, "f" = inferred) flows through to SourceRef.direct as real per-row provenance.
  • OmniPath: lazy import of pypath-omnipath, pulled only when omnipath is among the selected sources (it is by default). Adds enzyme-substrate coverage. OmniPath's aggregated enz_sub table carries no per-row effect direction, so OmniPath-only edges are effect="unknown" by construction.
  • PhosphoSitePlus (opt-in): lazy import of pypath.inputs.phosphosite. Joins PSP's Kinase_Substrate_Dataset (kinase → substrate site, unsigned) with the Regulatory_sites table (site-level effect direction) on (substrate_ac, residue, position, 'phosphorylation'). Sites with matching regsite annotations carry signed effects derived from PSP's ON_FUNCTION keywords (positive=True → activates, negative=True → inhibits, contradictory or absent → unknown); unmatched K-S rows still ingest as effect="unknown" for structural coverage. PSP is opt-in because of license restrictions; see PhosphoSitePlus (opt-in) for details.

2. Resolve (harmonize/resolver.py, harmonize/phospho_parser.py)

Free-text input like "p-ERK" or "phospho-c-Jun S63" is normalized:

  1. A regex strips phospho prefixes/suffixes and extracts an optional (residue, position).
  2. The cleaned symbol is sent to mygene.info (human only, cached).
  3. Candidates are ranked by mygene's Lucene score. Both the normalized score (top hit = 1.0) AND the raw score are returned so the caller can distinguish "top of a strong field" from "top of nothing."
  4. low_confidence=True when the top hit's raw score is below a threshold; ambiguous=True when the gap between top-1 and top-2 normalized scores is below AMBIGUITY_THRESHOLD. Never auto-pick — the caller decides.
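
The parsing step can be illustrated with a minimal sketch. The two regexes and the function name parse_phospho below are assumptions made for illustration; the actual patterns in harmonize/phospho_parser.py are not shown in this document and may cover more notations.

```python
import re
from typing import Optional

# Hypothetical patterns, not the ones shipped in phospho_parser.py.
_PREFIX = re.compile(r"^(?:phospho|p)[-\s]+", re.IGNORECASE)
_SITE = re.compile(r"\s+p?([STYH])(\d+)$", re.IGNORECASE)


def parse_phospho(query: str) -> tuple[str, Optional[tuple[str, int]], bool]:
    """Return (cleaned_name, optional (residue, position), had_phospho_prefix)."""
    text = query.strip()
    # Strip a leading "phospho-" or "p-"; "p38" is untouched because no
    # separator follows the "p".
    text, n_prefix = _PREFIX.subn("", text, count=1)
    site = None
    match = _SITE.search(text)
    if match:
        # Trailing site token such as "S63" or "pS473".
        site = (match.group(1).upper(), int(match.group(2)))
        text = text[: match.start()]
    return text.strip(), site, bool(n_prefix)
```

Under these assumed patterns, "phospho-c-Jun S63" yields the name "c-Jun" with site ("S", 63), while "p38 alpha" passes through unchanged. The cleaned name is what would then be sent to mygene.info.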

3. Merge (harmonize/merge.py)

For each (source_id, target_id, mechanism) triple seen across sources:

  • Union the references from contributing edges.
  • Effect consensus: all agree → that effect; one says X and the rest say unknown → X (silence is not contradiction); two distinct signed effects → unknown and the disagreement is logged to conflicts.tsv.
  • Factual provenance counts (no synthetic confidence): n_sources = distinct curated databases; n_references = distinct PMIDs. These drive the --min-sources and --require-signed walk filters directly.
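
A minimal sketch of the consensus rule and the provenance counts, assuming effects arrive as plain strings and references as (database, pmid) pairs. The function names are hypothetical and do not come from harmonize/merge.py.

```python
from typing import Iterable, Optional


def consensus_effect(effects: Iterable[str]) -> tuple[str, bool]:
    """Return (effect, conflict) for one (source_id, target_id, mechanism) triple."""
    signed = {e for e in effects if e != "unknown"}
    if len(signed) == 1:
        # Everyone agrees, or the rest are silent ("unknown").
        return next(iter(signed)), False
    if not signed:
        return "unknown", False
    # Two distinct signed effects: fall back to unknown and flag the conflict,
    # which the real merge step would log to conflicts.tsv.
    return "unknown", True


def provenance_counts(refs: Iterable[tuple[str, Optional[str]]]) -> tuple[int, int]:
    """Return (n_sources, n_references) from (database, pmid) pairs."""
    refs = list(refs)
    n_sources = len({db for db, _ in refs})
    n_references = len({pmid for _, pmid in refs if pmid})
    return n_sources, n_references
```

Treating "unknown" as silence rather than as a vote is what lets a single signed source determine the merged effect.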

4. Build the graph (graph/build.py)

  • Autophosphorylation detection: any phosphorylation edge whose kinase and substrate share a UniProt AC is re-tagged mechanism="autophosphorylation". Source/target stay protein:X → site:X:Y so the graph never grows a self-loop at the protein level.
  • Consequence edges (site → host protein): for every site with at least one phos/dephos parent, emit one synthetic edge that lets walks traverse from a phospho-event to "the host protein is now active/inactive." Effect is the consensus across phosphorylation/autophosphorylation parents only — dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. References are unioned across phos parents; n_sources / n_references recomputed from that union.
  • Add to MultiDiGraph: nodes are created on demand; every site node gets its host protein materialized if not already present (invariant 2).
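
The re-tagging and consequence-effect rules can be sketched on plain dictionaries. This is an illustration under assumptions: edges are dicts with source_id, target_id, mechanism, and effect keys, the helper names are hypothetical, and a site with only dephosphorylation parents is assumed to yield "unknown".

```python
def host_ac(node_id: str) -> str:
    """Extract the UniProt AC from 'protein:AC' or 'site:AC:RESPOS'."""
    return node_id.split(":")[1]


def retag_autophosphorylation(edge: dict) -> dict:
    """Re-tag a phosphorylation edge whose kinase and substrate share an AC."""
    if (
        edge["mechanism"] == "phosphorylation"
        and edge["source_id"].startswith("protein:")
        and edge["target_id"].startswith("site:")
        and host_ac(edge["source_id"]) == host_ac(edge["target_id"])
    ):
        # Source and target IDs are kept, so no protein-level self-loop appears.
        return {**edge, "mechanism": "autophosphorylation"}
    return edge


def consequence_effect(parent_edges: list) -> str:
    """Consensus effect for the synthetic site -> host edge."""
    # Dephosphorylation parents are skipped: their effect annotation is
    # inverted relative to the phospho-state.
    signed = {
        e["effect"]
        for e in parent_edges
        if e["mechanism"] in ("phosphorylation", "autophosphorylation")
        and e["effect"] != "unknown"
    }
    return signed.pop() if len(signed) == 1 else "unknown"
```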

5. K-hop neighborhood walk (walk/neighborhood.py)

Best-first expansion using a heap keyed by -n_sources of the next edge. Edges supported by more curated databases are explored first, so when max_nodes is hit we have kept the strongest edges. Filters happen during expansion (min_sources, allow_dephosphorylation, allow_binding, require_signed), never post-hoc. When the cap fires, a MaxNodesPruned warning is emitted with the visited count and remaining-frontier size so the CLI can surface "you hit the cap; raise --max-nodes to see more."
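
A simplified sketch of best-first expansion with a node cap. It walks one direction only, takes a plain adjacency dict instead of a MultiDiGraph, and returns the remaining-frontier size instead of emitting a MaxNodesPruned warning; all names here are assumptions. A node is expanded at the depth at which it is first popped, which is one of the simplifications of this sketch.

```python
import heapq
import itertools


def best_first_walk(out_edges, start, k, max_nodes=None, min_sources=1):
    """out_edges maps node -> list of (neighbor, n_sources).

    Returns (visited_nodes, remaining_frontier_size).
    """
    visited = {start}
    tie = itertools.count()  # stable tie-breaker for equal n_sources
    heap = []

    def push(node, depth):
        if depth >= k:
            return
        for neighbor, n_sources in out_edges.get(node, []):
            if n_sources >= min_sources:  # filter during expansion, not post-hoc
                heapq.heappush(heap, (-n_sources, next(tie), neighbor, depth + 1))

    push(start, 0)
    while heap:
        if max_nodes is not None and len(visited) >= max_nodes:
            break  # cap reached; the best-supported edges were taken first
        _, _, node, depth = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        push(node, depth)

    frontier = {n for _, _, n, _ in heap if n not in visited}
    return visited, len(frontier)
```

A non-zero frontier size after the loop is the condition under which a caller would surface the "raise --max-nodes" hint.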

6. Path enumeration and sign propagation (walk/paths.py, walk/sign.py)

  • The caller builds one filtered_subgraph(induced_subgraph(g, visited), ...) and passes it to both path enumeration AND sign reading — so the two cannot disagree about which parallel edge "exists."
  • all_simple_paths_up_to(g, source, cutoff) runs a single DFS via nx's container-target overload and yields each simple path once.
  • For each path, the sign is the product of per-step effects (activates=+1, inhibits=-1). Any unknown step makes the whole path's sign None. Per-path only: the same node can sit on + and − paths from different starting points, so we never collapse to per-node sign.
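
The per-path sign rule is small enough to sketch directly. The function name path_sign is hypothetical; the mapping of effects to +1, -1, and None follows the description above.

```python
from typing import Optional, Sequence

_STEP_SIGN = {"activates": 1, "inhibits": -1}


def path_sign(effects: Sequence[str]) -> Optional[int]:
    """Product of per-step signs; None as soon as any step is unknown."""
    sign = 1
    for effect in effects:
        step = _STEP_SIGN.get(effect)
        if step is None:
            return None
        sign *= step
    return sign
```

Two inhibitory steps in a row give +1 (a double negative), while a single unknown step anywhere makes the whole path unsigned.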

7. Protein-collapsed view (graph/collapse.py)

A high-level overview for visualization. Rules:

  • protein → site:X:Y routes through to protein:X (host materialized if missing).
  • site → protein (consequence) is dropped; already accounted for via the kinase→site that produced it.
  • protein → protein (e.g. binding) kept as-is.

Per (source, target, mechanism) bucket: effect counts {activates, inhibits, unknown}, aggregated effect ∈ {activates, inhibits, mixed, unknown}, n_underlying_edges, and a summary_label like "3 activating, 2 inhibiting". References are intentionally dropped in the collapsed view — switch back to the full graph if you need PMIDs.
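
A sketch of how one bucket might be aggregated. It assumes "mixed" means that both activating and inhibiting edges are present, and that unknown edges are appended to the label as "N unknown"; neither detail is spelled out above, and the function name is hypothetical.

```python
from collections import Counter


def aggregate_bucket(effects):
    """Summarize the effects of all edges in one (source, target, mechanism) bucket."""
    counts = Counter(effects)
    n_act, n_inh, n_unk = counts["activates"], counts["inhibits"], counts["unknown"]
    if n_act and n_inh:
        aggregated = "mixed"
    elif n_act:
        aggregated = "activates"
    elif n_inh:
        aggregated = "inhibits"
    else:
        aggregated = "unknown"
    parts = [
        f"{n} {word}"
        for n, word in ((n_act, "activating"), (n_inh, "inhibiting"), (n_unk, "unknown"))
        if n
    ]
    return {
        "effect_counts": {"activates": n_act, "inhibits": n_inh, "unknown": n_unk},
        "effect": aggregated,
        "n_underlying_edges": n_act + n_inh + n_unk,
        "summary_label": ", ".join(parts),
    }
```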

8. Invariants (graph/invariants.py)

Checked after every build:

  1. Every phosphosite has an incoming kinase/phosphatase edge from a protein, OR an autophosphorylation edge from its own host protein.
  2. Every site:X:Y has a matching protein:X node.
  3. All node IDs validate (structural regex + canonical UniProt AC).
  4. Post-merge: no two parallel edges with the same (source, target, mechanism) carry disagreeing signed effects. Different mechanisms between the same protein pair are not flagged (phos can activate while binding inhibits — these are two distinct biological events, not a contradiction).
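
Invariants 1 and 2 can be sketched as a check over a set of node IDs and a list of (source, target, mechanism) triples. This is an illustration only; the function name and the flat data layout are assumptions, and graph/invariants.py operates on the built graph.

```python
def check_site_invariants(nodes, edges):
    """Return a list of human-readable violations of invariants 1 and 2."""
    problems = []
    for node in sorted(nodes):
        if not node.startswith("site:"):
            continue
        host = "protein:" + node.split(":")[1]
        # Invariant 2: every site has its host protein materialized.
        if host not in nodes:
            problems.append(f"{node}: missing host {host}")
        # Invariant 1: an incoming phos/dephos edge from a protein, or an
        # autophosphorylation edge from the site's own host.
        supported = any(
            target == node
            and source.startswith("protein:")
            and (
                mechanism in ("phosphorylation", "dephosphorylation")
                or (mechanism == "autophosphorylation" and source == host)
            )
            for source, target, mechanism in edges
        )
        if not supported:
            problems.append(f"{node}: no incoming kinase/phosphatase edge")
    return problems
```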

Scope (v0)

| Item | Decision |
|---|---|
| Species | Human only (taxid 9606). Mouse deferred to v0.1; cross-species inheritance has biological caveats around residue translation that v0 does not solve. |
| Node resolution | Protein-level required; phosphosite-level where annotated |
| Antibody filter | None in v0 |
| Use case | Academic |
| Secondary scope | Autophosphorylation detection |
| Deliverable | Python package + click CLI + graph export (GraphML, Cytoscape JSON, GEXF, SVG/PDF via Graphviz, parquet) |

Data sources

SIGNOR and OmniPath are both included by default. Pass --sources signor for a SIGNOR-only build (smaller, higher signed share, no OmniPath unsigned edges). PhosphoSitePlus is opt-in: pass --sources signor,omnipath,psp (or --sources signor,psp) to include it; see PhosphoSitePlus (opt-in) below for the license-and-mirror caveat. CollecTRI is not used.

| Source | Default? | Role | Access |
|---|---|---|---|
| SIGNOR | yes | Manually-curated, signed phospho/dephospho edges with explicit mechanism, effect direction, PMID, and SIGNOR record ID. The large majority of edges carry a signed effect. | TSV bulk dump via https://signor.uniroma2.it/releases/getLatestRelease.php |
| OmniPath | yes | Aggregated enzyme-substrate (PTM) network from many underlying resources. Adds broader site and kinase coverage but contributes zero signed edges in v0, because OmniPath's aggregated enz_sub table doesn't expose per-row effect direction. | pypath-omnipath Python client (heavy; downloads on first use) |
| PhosphoSitePlus | opt-in | Site-level kinase-substrate dataset joined with PSP's Regulatory_sites annotations to recover signed effect direction for the annotated subset. Substantially expands site-level coverage and adds signed edges beyond the SIGNOR baseline. Licensed CC BY-NC-SA 3.0 (academic / non-commercial only). | pypath.inputs.phosphosite (fetches from the OmniPath team's mirror at rescued.omnipathdb.org; see caveat below) |

Default tradeoff: SIGNOR + OmniPath maximizes structural coverage but dilutes the signed share. Pass --sources signor for a smaller, more-signed graph. Add psp to grow site-level coverage with a meaningful signed contribution from PSP's regulatory-sites annotations — at the cost of accepting the CC BY-NC-SA 3.0 terms and depending on the rescued mirror. KEGG, Reactome, iPTMnet, DEPOD, INDRA, and CollecTRI are not used.

PhosphoSitePlus (opt-in)

The PSP source is implemented in ingest/psp_src.py and is disabled by default for two reasons that users should be aware of before opting in:

  1. License: PhosphoSitePlus is distributed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0). This means:
    • Non-commercial use only. If you are using phosphograph in any commercial context (industry, contract research, fee-for-service analysis), do not enable PSP. Building with --sources ...,psp causes PSP data to be downloaded onto your machine; that download itself is subject to PSP's terms.
    • ShareAlike propagates to derivative datasets. If you redistribute a phosphograph-built graph that incorporates PSP edges (e.g., as a parquet file, GraphML export, or downstream model), you must license the redistribution under the same CC BY-NC-SA 3.0 terms.
    • Attribution required. Cite the canonical PSP reference (Hornbeck PV et al., Nucleic Acids Res. 2015, doi:10.1093/nar/gku1267) in any work that uses a PSP-enabled phosphograph build.
  2. Access via a third-party mirror: pypath's PSP downloaders point at https://rescued.omnipathdb.org/phosphosite/... rather than the official phosphosite.org endpoint (which requires registration and a manual web download). This mirror is maintained by the OmniPath team as a courtesy and is not endorsed by PhosphoSitePlus. The redistribution itself sits in a gray area of PSP's terms — by opting into PSP via phosphograph, you accept that:
    • the mirror may disappear at any time, in which case --sources ...,psp builds will start failing with a clear error;
    • you are choosing to obtain PSP data through this informal channel rather than the canonical one;
    • the PSP edges in your graph carry database="psp" and a derivation trail back to the rescued-mirror files (the cache is at ~/.cache/phosphograph/raw/psp/edges.parquet).

The --sources ...,psp opt-in encodes informed consent to both points; phosphograph does not silently pull PSP under any default configuration.

Opting in via environment variable

For environments where you want PSP to be on for every invocation without having to remember --sources signor,omnipath,psp each time, set PHOSPHOGRAPH_ENABLE_PSP=1 (or true, yes, on — case-insensitive). When that variable is set, psp is automatically appended to DEFAULT_SOURCES, so both phosphograph build (no --sources flag) and the MCP server's auto-build on first boot will include PSP. An explicit --sources flag still overrides whatever the default resolves to.

# one-shot
PHOSPHOGRAPH_ENABLE_PSP=1 phosphograph build

# permanent for this shell
export PHOSPHOGRAPH_ENABLE_PSP=1
phosphograph mcp --transport stdio
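
How the variable might feed into the default source list can be sketched as follows. The names BASE_SOURCES and resolve_default_sources are hypothetical; only the variable name, the accepted truthy values, and the append-to-defaults behavior come from the description above.

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}
BASE_SOURCES = ("signor", "omnipath")  # stand-in for the package's DEFAULT_SOURCES


def resolve_default_sources(environ=None):
    """Append 'psp' to the defaults when PHOSPHOGRAPH_ENABLE_PSP is truthy."""
    environ = os.environ if environ is None else environ
    flag = environ.get("PHOSPHOGRAPH_ENABLE_PSP", "").strip().lower()
    sources = list(BASE_SOURCES)
    if flag in _TRUTHY:
        sources.append("psp")
    return tuple(sources)
```

An explicit --sources value would bypass this function entirely, which matches the override behavior described above.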

For Claude Desktop, put the env var into claude_desktop_config.json (the standard MCP server spec supports env per-server):

{
  "mcpServers": {
    "phosphograph": {
      "command": "phosphograph",
      "args": ["mcp", "--transport", "stdio"],
      "env": { "PHOSPHOGRAPH_ENABLE_PSP": "1" }
    }
  }
}

For an .mcpb bundle (Claude Desktop's MCP bundle format) the same env block goes inside server.mcp_config in manifest.json. The bundled manifest.json in this repository ships with PHOSPHOGRAPH_ENABLE_PSP=1 pre-set, which means installing the bundled MCP server implicitly accepts PSP's CC BY-NC-SA 3.0 terms on behalf of whoever runs it. If you redistribute the bundle to others, ensure they understand the license implication or remove the env block before redistribution.

For a remote HTTP deployment, set the env var on the server process (systemd unit, container env, etc.) — the operator is the party accepting PSP's terms, not the end user calling the MCP tools.

Schema

Strongly typed via pydantic>=2. Node IDs are deterministic strings; v0 is human-only so the taxid is implicit.

from pydantic import BaseModel, Field
from typing import Literal, Optional

Residue = Literal["S", "T", "Y", "H"]
TAXID_HUMAN = 9606

class ProteinNode(BaseModel):
    kind: Literal["protein"] = "protein"
    uniprot_ac: str
    protein_symbol: str

class PhosphoSiteNode(BaseModel):
    kind: Literal["phosphosite"] = "phosphosite"
    uniprot_ac: str
    protein_symbol: str
    residue: Residue
    position: int = Field(ge=1)              # 1-based, UniProt canonical

Mechanism = Literal[
    "phosphorylation",
    "dephosphorylation",
    "autophosphorylation",
    "binding",                               # protein-protein, no site coordinate
]
Effect = Literal["activates", "inhibits", "unknown"]

class SourceRef(BaseModel):
    database: Literal["omnipath", "psp", "signor"]
    record_id: Optional[str] = None
    pmid: Optional[str] = None
    direct: Optional[bool] = None            # SIGNOR DIRECT column

class PhosphoEdge(BaseModel):
    source_id: str
    target_id: str
    mechanism: Mechanism
    effect: Effect
    references: list[SourceRef]
    n_sources: int = Field(ge=1)             # distinct curated databases
    n_references: int = Field(ge=0)          # distinct PMIDs across references

Node ID conventions:

  • protein:P28482
  • site:P28482:T185
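
A sketch of deterministic node-ID helpers with a structural check. The accession pattern is the one UniProt publishes for its accession numbers; the helper names are hypothetical, and this check only validates the format (it rejects isoform suffixes such as -2 but cannot tell whether an accession actually exists).

```python
import re

# UniProt accession format (6 or 10 characters), no isoform suffix.
_AC = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
_NODE_ID = re.compile(rf"(?:protein:{_AC}|site:{_AC}:[STYH][1-9][0-9]*)")


def protein_id(uniprot_ac: str) -> str:
    return f"protein:{uniprot_ac}"


def site_id(uniprot_ac: str, residue: str, position: int) -> str:
    return f"site:{uniprot_ac}:{residue}{position}"


def is_valid_node_id(node_id: str) -> bool:
    """Structural check only: prefix, accession format, residue, 1-based position."""
    return _NODE_ID.fullmatch(node_id) is not None
```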

Natural-language resolver

harmonize/resolver.py converts free-text input ("p-ERK", "phospho-c-Jun S63", "p38 alpha") to UniProt entries. Pipeline:

  1. phospho_parser.py: regex strips phospho-, p-, pS\d+, pT\d+, pY\d+; returns cleaned name and optional (residue, position).
  2. Query mygene.info (Python mygene client) for species="human". Matches on official symbol, alias, previous symbol, name.
  3. Rank by mygene Lucene score. Both the normalized score (top hit = 1.0) and the raw score are returned. Normalization makes the top hit 1.0 even when it is actually a poor match, so the raw score (and the low_confidence flag derived from it) is what tells you whether to trust the top pick at all.

class ResolutionCandidate(BaseModel):
    uniprot_ac: str
    protein_symbol: str
    matched_via: Literal["symbol", "alias", "previous_symbol", "name"]
    score: float = Field(ge=0.0, le=1.0)     # normalized within this query
    raw_score: float = 0.0                   # mygene Lucene score verbatim

class ResolutionResult(BaseModel):
    query: str
    parsed_site: Optional[tuple[Residue, int]] = None
    parsed_phospho_prefix: bool = False
    candidates: list[ResolutionCandidate]    # sorted by score desc
    ambiguous: bool = False                  # top1 - top2 < AMBIGUITY_THRESHOLD
    low_confidence: bool = False             # top raw_score < LOW_CONFIDENCE_RAW_SCORE

Never auto-pick. Caller decides.
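
The two flags can be sketched from raw scores alone. The threshold values below are placeholders, and dividing by the top raw score is an assumed normalization; the package's actual constants and formula are not given in this document.

```python
AMBIGUITY_THRESHOLD = 0.1         # placeholder value
LOW_CONFIDENCE_RAW_SCORE = 10.0   # placeholder value


def rank_candidates(raw_scores: dict[str, float]):
    """Return (ranked, ambiguous, low_confidence).

    ranked is a list of (name, normalized_score, raw_score), best first.
    """
    ranked = sorted(raw_scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return [], False, True
    top_raw = ranked[0][1]
    normalized = [
        (name, raw / top_raw if top_raw > 0 else 0.0, raw) for name, raw in ranked
    ]
    # Ambiguous: the runner-up is almost as good as the top hit.
    ambiguous = (
        len(normalized) > 1
        and normalized[0][1] - normalized[1][1] < AMBIGUITY_THRESHOLD
    )
    # Low confidence: the top hit is weak in absolute terms.
    low_confidence = top_raw < LOW_CONFIDENCE_RAW_SCORE
    return normalized, ambiguous, low_confidence
```

The function only reports the flags. Picking a candidate is left to the caller, in line with the rule above.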

Graph model

networkx.MultiDiGraph. Edge conventions:

  • Kinase to substrate site: protein (kinase) → phosphosite (substrate), mechanism="phosphorylation".
  • Phosphatase to substrate site: same shape, mechanism="dephosphorylation".
  • Autophosphorylation: protein:X → site:X:Y, re-tagged at build time when source AC equals target AC. Self-loops at the protein level are avoided.
  • Site-to-host "consequence" edge: phosphosite → protein of the same UniProt AC. Synthesized at build time as a structural propagation hop so walks can traverse from a phospho-event to the host protein's activity. Effect is the consensus across the site's phosphorylation/autophosphorylation parents (dephos parents excluded; see Algorithmic pipeline / Build).
  • Binding: protein → protein (no site coordinate). From SIGNOR's binding mechanism rows.

Walks

Two primary entry points:

upstream(target: str, k: int = 2, *,
         include_phosphatases: bool = True,
         include_binding: bool = True,
         min_sources: int = 1,
         require_signed: bool = False,
         max_nodes: int | None = None,
         sources: Iterable[str] | None = None) -> Walk

downstream(source: str, k: int = 2, *,
           include_binding: bool = True,
           min_sources: int = 1,
           require_signed: bool = False,
           max_nodes: int | None = None,
           sources: Iterable[str] | None = None) -> Walk

Walk returns the (filtered) induced subgraph, the enumerated simple paths up to length k, and a per-path propagated sign.

Filters use factual provenance: min_sources=N keeps only edges asserted by at least N curated databases (min_sources=2 is the "consensus only" view the interactive walkthrough prompts for); require_signed=True drops effect="unknown" edges. There is no synthetic confidence score.

Query-time source filter: sources={"signor", "psp"} (a subset of SUPPORTED_SOURCES) restricts the walk to edges whose references include at least one SourceRef with database in the allowed set. This is distinct from build-time --sources — it carves a per-call subset out of the already-built graph.pkl without rebuilding, and pairs naturally with min_sources (e.g. "edges with PSP AND SIGNOR both asserting" = sources={"signor","psp"}, min_sources=2).
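
A literal reading of the two filters can be sketched as a predicate over the set of databases referenced by one edge. The function name edge_passes is hypothetical, and whether phosphograph counts min_sources over all databases or only over the allowed set is not stated here; this sketch counts over all of them.

```python
def edge_passes(ref_databases, sources=None, min_sources=1):
    """Apply min_sources and the query-time sources filter to one edge."""
    databases = set(ref_databases)
    if len(databases) < min_sources:
        return False
    if sources is not None and not (databases & set(sources)):
        # No reference comes from an allowed database.
        return False
    return True
```

Under this reading, an edge backed by SIGNOR and OmniPath also passes sources={"signor", "psp"} with min_sources=2, so on a graph built with OmniPath the combination is looser than "PSP and SIGNOR both asserting". A stricter variant would count only databases inside the allowed set.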

Sign propagation: product of edge effects along the path. activates=+1, inhibits=-1, unknown sets propagated_sign=None for that path. Never aggregated to a single per-node sign: the same node can sit on + and − paths from different starting points.

Hub blow-up: around hubs (AKT, ERK, MTOR) k≥2 neighborhoods can easily exceed the default max_nodes cap. max_nodes triggers best-first expansion ordered by edge n_sources, so when the cap fires the strongest edges are kept. A MaxNodesPruned warning carries the cap, visited count, and remaining-frontier size so the CLI can suggest raising --max-nodes.

Conflict resolution and provenance

harmonize/merge.py. For the same (source_id, target_id, mechanism) triple from multiple databases or multiple rows:

  1. Union references into one PhosphoEdge.
  2. Effect resolution:
    • All sources agree → that effect.
    • One says X, the rest say unknown → X (silence is not contradiction).
    • Genuine disagreement → effect = "unknown", conflict logged to conflicts.tsv.
  3. Provenance counts (factual, not heuristic):
    • n_sources = number of distinct databases asserting the edge.
    • n_references = number of distinct PMIDs across all unioned references.

No synthetic "confidence score" is produced. Walk filters use n_sources directly (--min-sources N) and the boolean --require-signed flag for effect direction.

Source precedence (for downstream consumers picking a representative reference): SIGNOR > PSP > OmniPath. SIGNOR ranks highest because its rows carry explicit per-edge signed effect; PSP is next because regulatory-site annotations recover signed effect for a meaningful subset of K-S edges; OmniPath's enz_sub aggregation carries no per-row direction and ranks last.

Orthology

Not in v0. v0 is human only. Mouse would require sequence-aligned site coordinate translation between orthologs, which v0 does not implement honestly; the prior "copy residue+position verbatim" inheritance was biologically unreliable and has been removed. Mouse may return in v0.1 with proper alignment-aware site translation.

Output formats

graph/io.py:

to_graphml(g, path)            # interchange, Cytoscape desktop, yEd
to_gexf(g, path)               # Gephi
to_cytoscape_json(g, path)     # web viewers, .cyjs
to_graphviz(g, path, layout="dot")  # SVG/PDF/PNG via system Graphviz
to_pickle(g, path)             # full round-trip with typed attributes
to_parquet_edges(g, path)      # pandas-friendly edge list

Format inferred from file extension unless explicit. Graphviz requires the system binary; layouts: dot for hierarchical (upstream/downstream views), sfdp for large neighborhoods.
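
Extension-based inference can be sketched with a lookup table. The extension-to-format mapping and the function name are assumptions for illustration; the set of extensions graph/io.py actually recognizes is not listed in this document.

```python
from pathlib import Path

# Hypothetical mapping from file extension to export format.
_EXT_TO_FORMAT = {
    ".graphml": "graphml",
    ".gexf": "gexf",
    ".cyjs": "cyjs",
    ".svg": "svg",
    ".pdf": "pdf",
    ".png": "png",
    ".parquet": "parquet",
    ".pkl": "pickle",
}


def infer_format(path, explicit=None):
    """An explicit format wins; otherwise fall back to the file extension."""
    if explicit:
        return explicit
    ext = Path(path).suffix.lower()
    try:
        return _EXT_TO_FORMAT[ext]
    except KeyError:
        raise ValueError(f"cannot infer export format from {ext!r}") from None
```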

Module layout

phosphograph/
  __init__.py
  config.py                # paths, species toggles, source toggles, cache dir, weights
  models.py                # pydantic schemas above
  util/
    node_id.py             # deterministic node-ID helpers
  ingest/
    base.py                # Ingestor protocol -> Iterator[PhosphoEdge]
    omnipath_src.py        # enz_sub via pypath (lazy import, human only)
    psp_src.py             # PhosphoSitePlus K-S + regulatory sites via pypath (opt-in)
    signor_src.py          # SIGNOR bulk TSV
  harmonize/
    ids.py                 # UniProt canonical resolution
    sites.py               # residue+position normalization, isoform handling
    merge.py               # consensus-effect merge + conflict logging
    resolver.py            # mygene-backed free-text -> UniProt resolver
    phospho_parser.py      # regex parser for "p-X S123"-style input
  graph/
    build.py               # MultiDiGraph assembly
    io.py                  # all exports
    invariants.py          # property tests (no orphan sites, valid node IDs, etc.)
    collapse.py            # protein-only collapsed view for high-level overview
  walk/
    neighborhood.py        # bidirectional k-hop best-first walk
    paths.py               # all simple paths up to length k
    sign.py                # per-path sign accumulation
  query/
    upstream.py
    downstream.py
  mcp/
    server.py              # FastMCP server: tools, resources, prompts, run()
    resolution.py          # free-text -> node ID with MCP elicitation
    payload.py             # Walk / paths -> MCP wire payload (cytoscape + summary + structured)
    view.py                # ui://phosphograph/view.html Cytoscape app
  cli.py                   # click entry point (including `phosphograph mcp`)
tests/                     # pytest + hypothesis

Dependencies

Required: pypath-omnipath, httpx, pandas, pydantic>=2, networkx>=3, click>=8, mygene, graphviz (Python wrapper), pyarrow, fastmcp (powers the MCP server), pytest, hypothesis.

System: Graphviz binaries (apt install graphviz or equivalent).

Optional extras: pyvis (interactive HTML preview).

CLI

phosphograph build [--sources signor[,omnipath][,psp]] [--force]
phosphograph resolve <query> [--top-k 5]
phosphograph upstream <gene_or_ac>   [--depth 2] [--include-phosphatases] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph downstream <gene_or_ac> [--depth 2] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph neighborhood <gene_or_ac> [--upstream-depth N] [--downstream-depth N] [--upstream-max-nodes N] [--downstream-max-nodes N] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph paths <source> <target> [--max-length 4] [--sources signor,psp] [--output FILE]
phosphograph export [--format graphml|gexf|cyjs|svg|pdf|parquet] <output>
phosphograph info <gene_or_ac>
phosphograph conflicts [--output conflicts.tsv]
phosphograph walkthrough

Output format is inferred from the file extension unless --format is set. --orientation horizontal|vertical controls Graphviz layout direction (LR vs TB) for upstream, downstream, neighborhood, paths, export; ignored for non-Graphviz formats.

Note on --sources semantics. On build, --sources is a build-time directive that decides which curated databases get merged into the cached graph.pkl. On the walk subcommands (upstream, downstream, neighborhood, paths), --sources is a query-time filter that carves a per-call subset out of the existing cached graph: an edge passes the filter iff at least one of its references has database in the allowed set. The walk filter never triggers a rebuild.

MCP server

phosphograph mcp runs a FastMCP-based Model Context Protocol server so the walks are callable directly from LLM agents (Claude.ai, Claude Desktop, custom hosts). The same query semantics as the CLI, but with an inline interactive Cytoscape viewer rendered in the chat window and MCP elicitation for ambiguous protein names.

phosphograph mcp                              # streamable HTTP on 127.0.0.1:8765/mcp (default)
phosphograph mcp --transport stdio            # Claude Desktop / subprocess hosts
phosphograph mcp --host 0.0.0.0 --port 8765   # autodeploy / container

Transports. http (alias streamable-http) is the default and the modern MCP HTTP transport — use it for Claude.ai and most autodeploy setups. stdio is for hosts that spawn the server as a subprocess (Claude Desktop).

Auto-build on first boot. If the cached graph is missing, the server runs the build step automatically before accepting tool calls, so a freshly deployed container is usable without a manual phosphograph build. Disable with --no-auto-build; override sources on cache miss with --sources signor,omnipath (the default, both sources), --sources signor for the smaller SIGNOR-only graph with a higher signed share, or --sources signor,omnipath,psp to opt into PhosphoSitePlus (CC BY-NC-SA 3.0; see PhosphoSitePlus (opt-in)). Note that PSP must be opted in by whoever runs the server: for a hosted MCP deployment that means the server operator, not the end user, accepts PSP's license terms.

Tools (all read-only, annotated for hosts):

| Tool | Purpose |
|---|---|
| upstream | Walk upstream from a query (gene symbol / UniProt AC / SYMBOL:T185). |
| downstream | Walk downstream from a query. |
| neighborhood | Bidirectional neighborhood with independent up/down depth and node caps. |
| paths | Enumerate signed simple paths between two proteins. |
| resolve_protein | Free-text → ranked UniProt candidates (fallback when the client doesn't support elicitation). |
| node_info | Attributes + in/out degrees for a single node. |

Walk-tool parameters at parity with the CLI walkthrough. Every walk tool (upstream, downstream, neighborhood) takes the full filter set the interactive walkthrough prompts for: depth, max_nodes (and per-direction variants for neighborhood), include_phosphatases, include_binding, min_sources (a.k.a. the consensus knob — min_sources=2 keeps only edges asserted by ≥2 curated DBs), require_signed, plus two flags unique to v0:

  • sources: list[str] | None — query-time database filter (["signor", "psp"] etc.). Restricts to edges with at least one reference from the named databases. Does not trigger a rebuild — pairs with build-time --sources (the latter decides what lands in the cache; this one carves a subset out at query time).
  • collapse: bool — return the protein-only aggregated view (phosphosites hidden, parallel edges merged by (source, target, mechanism)). Path enumeration is omitted under collapse=True because paths reference site nodes that the protein-only view hides. paths does not take collapse (path enumeration is inherently node-level) but does take sources.

The MCP tool surface for filters matches the CLI walkthrough 1:1, so anything a user can do interactively is also reachable from an LLM agent.

Resources. ui://phosphograph/view.html — the Cytoscape viewer (loaded into a sandboxed iframe by the host). phosphograph://stats — graph statistics as JSON.

Prompts. Canonical query templates the LLM (and slash-command UIs) can discover and invoke: kinase_network, regulators_of, path_between.

Cytoscape rendering. Each walk tool returns three things in its result: a short text summary (for the LLM), a Cytoscape elements JSON blob (picked up by the bundled viewer via app.ontoolresult and rendered as an interactive graph in the chat window), and a structured payload (focus node, counts, full path list, prune warnings) for programmatic consumption. The viewer styles activating edges green, inhibitory red, and binding edges dashed; protein nodes are ellipses, phosphosite nodes are boxes. Toolbar buttons: fit, re-layout, toggle phosphosites, PNG export.

Interactive disambiguation. When a free-text query maps to multiple candidates in the graph, the tool issues an MCP elicitation so the user picks one inline. If the client does not support elicitation, the tool raises a ToolError pointing the agent at resolve_protein to do an explicit candidate listing first.

Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "phosphograph": {
      "command": "phosphograph",
      "args": ["mcp", "--transport", "stdio"]
    }
  }
}

Claude.ai or other streamable-HTTP clients: point them at http://<host>:<port>/mcp.

For programmatic use:

from phosphograph.mcp import build_server, run

# Option 1: start the server directly (autodeploy entry point).
run(transport="http", host="0.0.0.0", port=8765)

# Option 2: inject a pre-loaded graph (tests/scripts); `g` must already be defined.
mcp = build_server(graph=g)

Caching

~/.cache/phosphograph/ (override via PHOSPHOGRAPH_CACHE_DIR env var):

  • raw/: source-version-stamped JSON/TSV downloads
  • edges/: parquet edge lists per source
  • graph/: built MultiDiGraph pickle keyed by (sources, species, build-timestamp)
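
A minimal sketch of how the cache root could be resolved, assuming the environment variable simply replaces the default path. The function names are hypothetical and are not taken from phosphograph's code.

```python
import os
from pathlib import Path


def cache_root(env=None) -> Path:
    """Return the cache directory, honouring PHOSPHOGRAPH_CACHE_DIR if set."""
    env = os.environ if env is None else env
    override = env.get("PHOSPHOGRAPH_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "phosphograph"


def cache_subdirs(root: Path) -> dict:
    """Map each documented subdirectory name to its path under the root."""
    return {name: root / name for name in ("raw", "edges", "graph")}
```

Passing env explicitly keeps the helper easy to test without touching the real environment.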

Full rebuild with fresh upstream downloads: phosphograph build --force.

Implementation invariants (enforced by graph/invariants.py)

Asserted after every build_graph(..., merge=True):

  1. Every phosphosite node has an incoming phosphorylation/dephosphorylation edge from a protein node, OR an autophosphorylation edge whose source is exactly its own host protein.
  2. Every site:X:Y node has a matching protein:X node in the graph.
  3. All node IDs validate (structural regex + canonical UniProt AC).
  4. Post-merge: no two parallel edges sharing the same (source, target, mechanism) carry disagreeing signed effects. Different mechanisms between the same protein pair (e.g. phos:activates + binding:inhibits) are not flagged — they describe distinct biological events, not a contradiction.
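
To make invariants 2 and 4 concrete, here is a small sketch over plain node IDs and edge tuples. It is not the code in graph/invariants.py; the tuple layout and function names are assumptions, and an "unknown" effect is treated as unsigned so it never counts as a disagreement.

```python
from collections import defaultdict


def check_site_hosts(nodes):
    """Invariant 2: every site:X:Y needs a protein:X node. Returns offenders."""
    node_set = set(nodes)
    missing = []
    for n in nodes:
        parts = n.split(":")
        if parts[0] == "site" and f"protein:{parts[1]}" not in node_set:
            missing.append(n)
    return missing


def check_sign_conflicts(edges):
    """Invariant 4: edges sharing (source, target, mechanism) must not carry
    disagreeing signed effects. Returns the conflicting keys."""
    effects = defaultdict(set)
    for src, dst, mechanism, effect in edges:
        if effect != "unknown":  # only signed effects can disagree
            effects[(src, dst, mechanism)].add(effect)
    return [key for key, seen in effects.items() if len(seen) > 1]


nodes = ["protein:P1", "site:P1:T10", "site:P2:S5"]
edges = [
    ("protein:P1", "protein:P2", "phosphorylation", "activates"),
    ("protein:P1", "protein:P2", "binding", "inhibits"),  # different mechanism: fine
    ("protein:P3", "site:P1:T10", "phosphorylation", "activates"),
    ("protein:P3", "site:P1:T10", "phosphorylation", "inhibits"),  # conflict
]
orphans = check_site_hosts(nodes)
conflicts = check_sign_conflicts(edges)
```

Only the last pair of edges is reported, since the first pair differs in mechanism and therefore describes distinct events.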

Out of scope for v0

  • CollecTRI / transcription factor regulatory edges (TFs are gene-level, off-mission for a phospho-signaling tool)
  • Antibody catalog / Antibody Registry integration
  • INDRA / text-mined statements
  • Panel optimization / set-cover suggestions
  • Kinetic or quantitative modeling
  • KEGG, Reactome, iPTMnet, DEPOD as separate ingestors

Project state and continuation notes

This appendix documents the actual runtime quirks future contributors (or future sessions) need to know — things not derivable from the code alone.

Locked dependency pins (do not bump without testing pypath end-to-end)

Pin | Reason
paramiko<3 | paramiko 3.x removed DSSKey; the unmaintained pysftp (which pypath-omnipath imports unconditionally in pypath/share/curl.py) crashes on import. Pinning to 2.x is the cleanest workaround.
pandas>=2.2,<3 | pypath.inputs.uniprot_idmapping.idtypes() calls groups.fillna(-1.0, inplace=True) on a string column. Pandas 3.x uses Arrow-backed string arrays that reject float fill values.

If pypath upstream fixes either issue, the corresponding pin can be relaxed. Verify with uv run python -c "from pypath import omnipath; omnipath.db.get_db('enz_sub').make_df(tax_id=True)" after any bump.

pypath API surface actually used

from pypath import omnipath
es = omnipath.db.get_db('enz_sub')   # EnzymeSubstrateAggregator
es.make_df(tax_id=True)              # populates es.df
df = es.df                           # pd.DataFrame

DataFrame columns (verified against the pinned pypath-omnipath): enzyme, enzyme_genesymbol, substrate, substrate_genesymbol, isoforms, residue_type, residue_offset, modification, sources, references, curation_effort, ncbi_tax_id.

We rename residue_type → residue_letter in phosphograph/ingest/omnipath_src.py:_fetch_enz_sub_live so the rest of the pipeline keeps a single column contract.

There is no to_dataframe() method — earlier docs hinted at one but the supported API is make_df() + .df.

SIGNOR API

SIGNOR ships its full corpus as a single TSV at https://signor.uniroma2.it/releases/getLatestRelease.php.

Columns we use: IDA, IDB, DATABASEA, DATABASEB, EFFECT, MECHANISM, RESIDUE, TAX_ID, PMID, DIRECT, SIGNOR_ID.

  • EFFECT collapses to activates|inhibits|unknown via the up-regulates* / down-regulates* prefixes (see signor_src._effect_to_enum).
  • MECHANISM is kept iff one of phosphorylation, dephosphorylation, binding. Phos/dephos rows go protein → site:residue:position; binding rows go protein → protein (no site coordinate).
  • DIRECT is propagated to SourceRef.direct (True for t, False for f, None when blank). This is a real per-row signal distinguishing directly observed interactions from inferred ones.
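
The column handling described above can be sketched as three small helpers. These are illustrative stand-ins, not the contents of signor_src (the real _effect_to_enum is not shown in this document), and the function names are assumptions.

```python
KEPT_MECHANISMS = {"phosphorylation", "dephosphorylation", "binding"}


def effect_to_enum(raw):
    """Collapse a SIGNOR EFFECT string to activates | inhibits | unknown."""
    text = (raw or "").strip().lower()
    if text.startswith("up-regulates"):
        return "activates"
    if text.startswith("down-regulates"):
        return "inhibits"
    return "unknown"


def keep_row(mechanism):
    """A row is kept iff its MECHANISM is one of the three supported values."""
    return (mechanism or "").strip().lower() in KEPT_MECHANISMS


def direct_flag(raw):
    """Map DIRECT to True for 't', False for 'f', None when blank."""
    text = (raw or "").strip().lower()
    return {"t": True, "f": False}.get(text)
```

Matching on the prefix means every up-regulates or down-regulates variant lands in the same bucket, and anything else falls through to unknown.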

SIGNOR trust score is NOT in the bulk TSV. The published per-edge score combines several features (PMID count, pathway co-occurrence, Reactome cross-reference, UniProt co-mention) but the bulk download omits it. Recomputing locally would require pulling Reactome and UniProt sidecars. v0 uses the readily-available signals (n_sources, n_references, direct) instead.

OmniPath REST endpoint naming

The query type for enzyme-substrate is /enz_sub (with underscore), not /enzsub. The /ptms alias also works. Metadata at /queries/enz_sub returns the parameter dictionary. Both /enzsub and /enz-sub return HTTP 502. (phosphograph itself uses pypath-omnipath rather than the REST endpoint directly; this note is for orientation if you ever need to verify column shapes against the web service.)

Build pipeline non-obvious behaviors

  • Protein-symbol plumbing: ingestors do not contribute labels. After build_graph assembles the graph, a single post-build pass in phosphograph.harmonize.symbols.apply_protein_symbols(g) collects every UniProt AC on the graph and asks pypath.utils.mapping.label(ac, id_type='uniprot', ncbi_tax_id=9606) for the HGNC primary symbol, writing it onto each node as protein_symbol. This runs unconditionally when the CLI / MCP autobuild constructs the graph (build_graph(..., enrich_symbols=True)) and is opt-in elsewhere — tests that pre-set protein_symbol on synthetic graphs leave the default enrich_symbols=False so the pypath touch stays out of unit tests. The node_label(data) helper in graph/io.py reads protein_symbol first, falling back to uniprot_ac when missing, and produces the human-readable label used everywhere (MAPK1, MAPK1:T185).
  • Consequence edges: synthesize_consequence_edges emits one site → host_protein edge per site with at least one kinase/phosphatase parent. These are structural propagation hops required for upstream walks (gating them on known effect would break reachability). Effect is the consensus across phosphorylation/autophosphorylation parents only — dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. Consensus rule mirrors merge_edges. References are unioned across phos parents; n_sources and n_references recomputed from the union.
  • Autophosphorylation: detected at build time by AC equality between source protein and target site (detect_autophosphorylation). Re-tags the mechanism but leaves source/target IDs as protein:X → site:X:Y so self-loops at the protein level are avoided. Invariant 1 verifies that any autophosphorylation edge originates at the site's host protein.
  • Merge produces factual provenance counts (n_sources = distinct curated databases; n_references = distinct PMIDs). No synthetic confidence score is computed — earlier heuristic weights (manual-curation flag, LTP boost) were not grounded in real evidence quality and have been removed. Walk filters operate directly on n_sources and on whether the effect is signed.
  • Walks and signs share one filtered view: query/{downstream,upstream}.py build a single filtered_subgraph from the induced subgraph and pass it to BOTH path enumeration and sign reading. The two cannot disagree about which parallel edge "exists." Path enumeration runs a single DFS via nx's container-target overload.
  • K-hop pruning is best-first, ordered by -n_sources of the next edge. When max_nodes fires, a MaxNodesPruned warning is emitted with {max_nodes, visited_count, remaining_candidates, direction, source} for the CLI to surface.
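
The best-first pruning in the last bullet can be sketched with a heap keyed on -n_sources. This is a simplified illustration under stated assumptions: the adjacency format, function name, and returned dictionary are hypothetical, only a subset of the warning fields is modelled, and a node's hop count is taken from the path by which it was first popped.

```python
import heapq
from itertools import count


def best_first_walk(adj, start, depth, max_nodes):
    """Expand from start, always taking the best-supported edge next.

    adj maps node -> list of (neighbor, n_sources). Returns the visited set
    and a prune-info dict (or None if the max_nodes cap never fired).
    """
    visited = {start}
    tie = count()  # stable tie-breaker for edges with equal n_sources
    heap = []

    def push_neighbors(node, hops):
        if hops >= depth:
            return  # hop budget exhausted along this path
        for nbr, n_sources in adj.get(node, []):
            heapq.heappush(heap, (-n_sources, next(tie), nbr, hops + 1))

    push_neighbors(start, 0)
    pruned = None
    while heap:
        _, _, node, hops = heapq.heappop(heap)
        if node in visited:
            continue
        if len(visited) >= max_nodes:
            remaining = {n for _, _, n, _ in heap if n not in visited} | {node}
            pruned = {
                "max_nodes": max_nodes,
                "visited_count": len(visited),
                "remaining_candidates": len(remaining),
            }
            break
        visited.add(node)
        push_neighbors(node, hops)
    return visited, pruned


adj = {"K": [("A", 3), ("B", 1), ("C", 2)], "A": [("D", 1)]}
visited, pruned = best_first_walk(adj, "K", depth=2, max_nodes=3)
```

With a cap of three nodes the walk keeps K, A, and C (the two best-supported neighbours) and reports B and D as remaining candidates, so well-corroborated edges survive pruning first.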

Tests

  • pytest + hypothesis, organized under tests/. Run with uv run pytest tests/.
  • No test hits the network: OmniPath / SIGNOR / mygene are mocked or fixtured; the MCP layer uses FastMCP's in-process client (fastmcp.utilities.tests.run_server_async for HTTP roundtrips).
  • The pypath_log/ directory at the repo root is created by pypath the first time it runs from any cwd. Gitignored.

Cache layout

~/.cache/phosphograph/ (override via PHOSPHOGRAPH_CACHE_DIR):

raw/omnipath/enz_sub.parquet         # cached pypath dataframe (only if opted in)
raw/signor/all_data.tsv              # SIGNOR bulk dump
raw/psp/edges.parquet                # cached PSP K-S × regsites join (only if --sources ...,psp)
raw/mygene/                          # resolver responses
edges/                               # per-source parquet (currently unused)
graph/graph.pkl                      # merged MultiDiGraph pickle (consumed by CLI)
graph/graph.meta.json                # sidecar recording which sources produced the pickle
conflicts.tsv                        # merge conflict log

phosphograph build always rebuilds and replaces graph/graph.pkl so that changes to --sources (or to PHOSPHOGRAPH_ENABLE_PSP) take effect immediately. The raw per-source caches under raw/ are reused on rebuild unless you pass --force, which re-fetches every source from upstream. The MCP server's auto-build (ensure_graph_cached) keeps cache-hit behavior idempotent, but invalidates the pickle when the requested source set differs from the one recorded in graph.meta.json — so restarting the server after flipping PHOSPHOGRAPH_ENABLE_PSP=1 triggers a clean rebuild on first tool call.
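
The invalidation rule for the auto-build can be sketched as a comparison between the requested source set and the one recorded in the sidecar. This is not the implementation of ensure_graph_cached; the function name and the "sources" key inside graph.meta.json are assumptions made for the example.

```python
import json
from pathlib import Path


def needs_rebuild(graph_dir: Path, requested_sources) -> bool:
    """Return True when the cached pickle cannot serve the requested sources."""
    pickle_path = graph_dir / "graph.pkl"
    meta_path = graph_dir / "graph.meta.json"
    if not pickle_path.exists() or not meta_path.exists():
        return True  # nothing cached yet, or no sidecar to trust
    try:
        # "sources" is a hypothetical key name for the recorded source list
        recorded = json.loads(meta_path.read_text()).get("sources", [])
    except (json.JSONDecodeError, AttributeError):
        return True  # unreadable sidecar: safest to rebuild
    return set(recorded) != set(requested_sources)
```

Comparing sets rather than lists means the order in which sources are named does not trigger a spurious rebuild.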

Live build expectations

Build | Signed coverage | First-build cost
SIGNOR + OmniPath (default) | lower: OmniPath dilutes the signed share because its enz_sub rows are unsigned | slower: pypath downloads and caches OmniPath on first use
SIGNOR only (--sources signor) | high: the large majority of edges are signed | fast: no pypath import
SIGNOR + OmniPath + PSP (--sources signor,omnipath,psp) | meaningfully higher signed-edge count than the default: PSP regulatory-sites annotations recover signed direction for an annotated subset of K-S edges; the rest land as unknown | slowest on first build: phosphosite_regsites_one_organism pulls multiple PSP files plus SwissProt and runs orthology translation. The joined edge frame is then cached at raw/psp/edges.parquet so subsequent builds skip the pypath roundtrip

Run phosphograph build once to see the actual node/edge counts and signed percentage for your build (printed on stdout). Every invocation rebuilds the merged pickle from the raw caches; subsequent builds take seconds when the raw caches are warm. Pass --force to also re-fetch the raw upstream downloads (equivalent to deleting ~/.cache/phosphograph/raw/ first).

CLI surface

Subcommands: build, resolve, upstream, downstream, neighborhood, paths, export, info, conflicts, walkthrough, mcp. Run phosphograph --help for the canonical list. walkthrough is an interactive wizard that auto-builds when no cache exists, then loops a menu — the recommended entry point for new users. mcp starts the FastMCP server (see MCP server).

--orientation horizontal|vertical controls Graphviz layout (rankdir=LR vs rankdir=TB) on upstream, downstream, neighborhood, paths, export. Ignored for non-Graphviz formats.
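
The mapping from --orientation to Graphviz's rankdir attribute is small enough to show in full. The sketch below emits a minimal DOT string; the function name and the default of horizontal are assumptions, not phosphograph's export code.

```python
RANKDIR = {"horizontal": "LR", "vertical": "TB"}


def to_dot(edges, orientation="horizontal"):
    """Render (source, target) pairs as a minimal Graphviz DOT string."""
    rankdir = RANKDIR[orientation]  # KeyError on an unsupported value
    lines = ["digraph G {", f"  rankdir={rankdir};"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


dot = to_dot([("MAPK1", "MAPK1:T185")], orientation="vertical")
```

rankdir=LR lays ranks out left to right and rankdir=TB top to bottom, which is why the flag has no effect on non-Graphviz formats.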

Things that bit us, briefly

  • omnipath (the lighter REST-only client at https://github.com/saezlab/omnipath) is a different package from pypath-omnipath. We depend on pypath-omnipath because of its richer database building.
  • The docs at omnipath.readthedocs.io cover that other omnipath PyPI package; do not follow them for pypath's API.
  • The parquet PyPI package is unmaintained — we use pyarrow for parquet I/O. The parquet entry in pyproject is a vestige and can be removed.
  • pyproject originally declared pandas>=3.0.3 and pypath-omnipath>=0.16.20 without pinning paramiko or pandas upper bounds. Both turned out to be wrong; fixed.

License

phosphograph is released under the GNU General Public License v3.0 or later (GPL-3.0-or-later). The full license text is in LICENSE.

The GPL choice is dictated by a runtime dependency: pypath-omnipath is GPL-3.0, and importing it as a library makes the combined work a derivative work under GPL terms. Anyone redistributing phosphograph — or a program that imports it — must therefore comply with the GPL (source availability, same-license redistribution, no additional restrictions). Internal academic use and modification are unrestricted; the obligations only kick in on distribution.

Download files

Source Distribution

phosphograph-0.1.2.tar.gz (246.5 kB)

Uploaded Source

Built Distribution

phosphograph-0.1.2-py3-none-any.whl (100.7 kB)

Uploaded Python 3

File details

Details for the file phosphograph-0.1.2.tar.gz.

File metadata

  • Download URL: phosphograph-0.1.2.tar.gz
  • Upload date:
  • Size: 246.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for phosphograph-0.1.2.tar.gz
Algorithm Hash digest
SHA256 14121fe87ee18bf21f793141f5b7e4473d6dd87622887b638c5db6929ddb4502
MD5 406a9edb1c3d38e2dc2a2df8503168f2
BLAKE2b-256 d736baca62a07720650bf2fec67c7471d2bd905606c5abdc268d372a0323d669

Provenance

The following attestation bundles were made for phosphograph-0.1.2.tar.gz:

Publisher: release.yaml on complextissue/phosphograph

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file phosphograph-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: phosphograph-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 100.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for phosphograph-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 78c121595ef6575913783e7ccbe107f4eeb0d8b0b930af89779f9ff6cc655fe2
MD5 275d9887160bb0725fde265210589807
BLAKE2b-256 0b4ccdd02ef253e6b08cdaea0d452a77bea9af62d56ee98403db9faca1ee709d

Provenance

The following attestation bundles were made for phosphograph-0.1.2-py3-none-any.whl:

Publisher: release.yaml on complextissue/phosphograph

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
