Directed, signed, provenance-annotated phospho-signaling graph builder and query CLI.

phosphograph

What phosphograph is

phosphograph is a Python library that builds and queries a directed, signed, provenance-annotated graph of phospho-signaling relationships among human proteins. Nodes are proteins and individual phosphosites; edges are kinase, phosphatase, autophosphorylation, and protein-protein binding relationships drawn from manually curated public databases (SIGNOR 4.0 by default; OmniPath opt-in for broader structural coverage at the cost of unsigned edges).

Why it exists

Spatial proteomics with phospho-specific stainings (e.g. p-ERK1/2, p-c-Jun, p-AKT, p-STAT3) reports on the activity state of signaling pathways at single-cell, in-tissue resolution. A single observed phospho-state is rarely interpretable on its own: the relevant questions are always what upstream input produced it and what downstream events it predicts. Designing a multiplexed-IF panel that resolves these questions requires knowing, for any given phospho-target, which other phosphorylation events are mechanistically coupled to it and could be co-stained to corroborate or refute the inferred pathway state.

Existing pathway resources address parts of this problem but require manual cross-referencing. KEGG encodes topology but not consistent effect direction at the phospho level. PhosphoSitePlus has site-level kinase-substrate data but no network view. SIGNOR has signed phospho-edges. OmniPath integrates many sources but exposes them as a general signaling network rather than as a phospho-measurable subgraph. None of them directly answer "given p-ERK1/2 T202/Y204 is elevated in this region, which other antibodies would test or extend my inference of MAPK pathway state in the same section?"

phosphograph exists to make that question scriptable and reproducible.

What phosphograph does

  1. Ingests phospho-relevant edges from SIGNOR and (optionally) OmniPath.
  2. Harmonizes identifiers to UniProt canonical accessions and normalizes site nomenclature (residue letter + 1-based position on UniProt canonical).
  3. Merges edges across sources with consensus-effect resolution, conflict logging, and factual per-edge provenance counts.
  4. Detects autophosphorylation, synthesizes site-to-host "consequence" propagation edges, and assembles a networkx.MultiDiGraph of (protein, phosphosite) nodes.
  5. Resolves free-text protein names ("p-ERK", "phospho-c-Jun S63") to ranked UniProt candidates.
  6. Runs bidirectional k-hop walks from a query node with best-first source-count pruning and returns the induced subgraph, enumerated paths, and per-path signed predictions.
  7. Optionally collapses the result into a protein-only view with aggregated effect counts ("3 activating, 1 inhibiting").
  8. Exports to GraphML, Cytoscape JSON, GEXF, parquet edge lists, and Graphviz-rendered SVG/PDF/PNG.
  9. Exposes all functionality through a click CLI (with an interactive walkthrough wizard), a Python API, and a FastMCP server (phosphograph mcp) that surfaces the walks as MCP tools with an inline Cytoscape viewer for LLM clients.

What phosphograph is not

  • Not an image analysis tool for spatial proteomics data.
  • Not a predictor of phospho-state magnitude or kinetics.
  • Not a panel optimizer in v0; walks inform manual panel decisions but do not solve set-cover automatically.
  • Not a quantitative or mechanistic model of signaling.
  • Not a substitute for experimental validation of any kinase-substrate relationship.

Intended users

Bioinformaticians and computational biologists designing multiplexed-IF panels for spatial proteomics, who already work with phospho-target stainings and want a scriptable, license-clean, reproducible way to retrieve the mechanistic neighborhood around a phospho-target as a queryable graph.

Algorithmic pipeline

These are the steps from raw curated data to a query result. Each is implemented in one small module and documented inline.

1. Ingest (ingest/signor_src.py, ingest/omnipath_src.py)

  • SIGNOR: bulk TSV download parsed row-by-row. Each row becomes one or more PhosphoEdges. Filtered to human (TAX_ID==9606) and to mechanisms we model (phosphorylation, dephosphorylation, binding). The EFFECT column collapses to activates|inhibits|unknown. The DIRECT column ("t" = directly observed, "f" = inferred) flows through to SourceRef.direct as real per-row provenance.
  • OmniPath: lazy import of pypath-omnipath, pulled only when the user opts in. Adds enzyme-substrate coverage. OmniPath's aggregated enz_sub table carries no per-row effect direction, so OmniPath-only edges are effect="unknown" by construction.

2. Resolve (harmonize/resolver.py, harmonize/phospho_parser.py)

Free-text input like "p-ERK" or "phospho-c-Jun S63" is normalized:

  1. A regex strips phospho prefixes/suffixes and extracts an optional (residue, position).
  2. The cleaned symbol is sent to mygene.info (human only, cached).
  3. Candidates are ranked by mygene's Lucene score. Both the normalized score (top hit = 1.0) AND the raw score are returned so the caller can distinguish "top of a strong field" from "top of nothing."
  4. low_confidence=True when the top hit's raw score is below a threshold; ambiguous=True when the gap between top-1 and top-2 normalized scores is below AMBIGUITY_THRESHOLD. Never auto-pick — the caller decides.
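
The parsing step (item 1 above) can be sketched as follows. This is a minimal illustration, not the contents of phospho_parser.py: the function name parse_phospho_query and both regular expressions are assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns; the real phospho_parser.py may differ.
# A phospho prefix needs a hyphen, so "p38 alpha" is left untouched.
_PREFIX = re.compile(r"^(?:phospho|p)-\s*", re.IGNORECASE)
# A trailing site such as " S63", "-pT202" or "(Y204)", preceded by a separator.
_SITE = re.compile(r"[\s(_-]p?([STYH])(\d+)\)?\s*$")


def parse_phospho_query(text: str) -> Tuple[str, Optional[Tuple[str, int]], bool]:
    """Return (cleaned_name, optional (residue, position), had_phospho_prefix)."""
    cleaned = text.strip()
    had_prefix = bool(_PREFIX.match(cleaned))
    cleaned = _PREFIX.sub("", cleaned)
    site = None
    match = _SITE.search(cleaned)
    if match:
        site = (match.group(1), int(match.group(2)))
        cleaned = cleaned[: match.start()].strip()
    return cleaned, site, had_prefix
```

With these patterns, "phospho-c-Jun S63" yields the symbol "c-Jun" plus the site ("S", 63), while "p38 alpha" passes through unchanged because its leading "p" is not followed by a hyphen.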

3. Merge (harmonize/merge.py)

For each (source_id, target_id, mechanism) triple seen across sources:

  • Union the references from contributing edges.
  • Effect consensus: all agree → that effect; one says X and the rest say unknown → X (silence is not contradiction); two distinct signed effects → unknown and the disagreement is logged to conflicts.tsv.
  • Factual provenance counts (no synthetic confidence): n_sources = distinct curated databases; n_references = distinct PMIDs. These drive the --min-sources and --require-signed walk filters directly.
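
The effect-consensus rule above is small enough to show directly. The sketch below is an illustration under assumptions: the function name resolve_effect and the (effect, conflict) return shape are invented for the example and are not taken from harmonize/merge.py.

```python
from typing import Iterable, Tuple


def resolve_effect(effects: Iterable[str]) -> Tuple[str, bool]:
    """Return (consensus_effect, conflict_flag) for one merged edge.

    "unknown" is treated as silence: it never contradicts a signed effect.
    """
    signed = {e for e in effects if e in ("activates", "inhibits")}
    if len(signed) == 1:
        # All signed sources agree, possibly alongside unknowns.
        return next(iter(signed)), False
    if len(signed) > 1:
        # Two distinct signed effects: the caller would log this to conflicts.tsv.
        return "unknown", True
    return "unknown", False
```

A conflict flag of True marks the cases that would end up as a row in conflicts.tsv.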

4. Build the graph (graph/build.py)

  • Autophosphorylation detection: any phosphorylation edge whose kinase and substrate share a UniProt AC is re-tagged mechanism="autophosphorylation". Source/target stay protein:X → site:X:Y so the graph never grows a self-loop at the protein level.
  • Consequence edges (site → host protein): for every site with at least one phos/dephos parent, emit one synthetic edge that lets walks traverse from a phospho-event to "the host protein is now active/inactive." Effect is the consensus across phosphorylation/autophosphorylation parents only — dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. References are unioned across phos parents; n_sources / n_references recomputed from that union.
  • Add to MultiDiGraph: nodes are created on demand; every site node gets its host protein materialized if not already present (invariant 2).
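
Autophosphorylation detection reduces to comparing the accession embedded in the two node IDs. A minimal sketch, assuming edges are plain dicts (the package itself uses PhosphoEdge models) and using the hypothetical helper names host_ac and retag_autophosphorylation:

```python
def host_ac(node_id: str) -> str:
    """Extract the UniProt AC from 'protein:AC' or 'site:AC:RES'."""
    return node_id.split(":")[1]


def retag_autophosphorylation(edge: dict) -> dict:
    """Re-tag a phosphorylation edge whose kinase and substrate share an AC.

    Source and target IDs are left untouched, so the edge still runs
    protein:X -> site:X:Y and no protein-level self-loop is created.
    """
    is_phos = edge["mechanism"] == "phosphorylation"
    targets_site = edge["target_id"].startswith("site:")
    if is_phos and targets_site and host_ac(edge["source_id"]) == host_ac(edge["target_id"]):
        return {**edge, "mechanism": "autophosphorylation"}
    return edge


# Node IDs below are used purely to exercise the logic.
same_host = {"source_id": "protein:P28482", "target_id": "site:P28482:T185",
             "mechanism": "phosphorylation"}
other_kinase = {"source_id": "protein:Q02750", "target_id": "site:P28482:T185",
                "mechanism": "phosphorylation"}
```

Only the first example edge is re-tagged; the second keeps mechanism="phosphorylation" because the accessions differ.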

5. K-hop neighborhood walk (walk/neighborhood.py)

Best-first expansion using a heap keyed by -n_sources of the next edge. Edges supported by more curated databases are explored first, so when max_nodes is hit we have kept the strongest edges. Filters happen during expansion (min_sources, allow_dephosphorylation, allow_binding, require_signed), never post-hoc. When the cap fires, a MaxNodesPruned warning is emitted with the visited count and remaining-frontier size so the CLI can surface "you hit the cap; raise --max-nodes to see more."
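
The expansion described above can be sketched with the standard-library heapq. This is an illustration under assumptions, not the code in walk/neighborhood.py: the adjacency-dict input, the function name best_first_walk, and the returned dict are invented for the example, and only the min_sources filter is shown.

```python
import heapq
from itertools import count
from typing import Dict, List, Optional, Set, Tuple

Adjacency = Dict[str, List[Tuple[str, int]]]  # node -> [(neighbor, n_sources)]


def best_first_walk(adj: Adjacency, start: str, k: int,
                    max_nodes: Optional[int] = None,
                    min_sources: int = 1) -> Tuple[Set[str], Optional[dict]]:
    """Visit nodes within k hops of start, strongest edges first."""
    depth_of = {start: 0}
    heap: list = []
    tie = count()  # tie-breaker so heap entries never compare node names

    def push_edges(node: str, depth: int) -> None:
        if depth >= k:
            return
        for nbr, n_sources in adj.get(node, ()):
            if n_sources >= min_sources:  # filter during expansion, not post-hoc
                heapq.heappush(heap, (-n_sources, next(tie), nbr, depth + 1))

    push_edges(start, 0)
    while heap:
        _, _, node, depth = heapq.heappop(heap)
        if node in depth_of:
            if depth < depth_of[node]:  # reached again by a shorter route
                depth_of[node] = depth
                push_edges(node, depth)
            continue
        if max_nodes is not None and len(depth_of) >= max_nodes:
            remaining = {node} | {item[2] for item in heap if item[2] not in depth_of}
            return set(depth_of), {"max_nodes": max_nodes,
                                   "visited_count": len(depth_of),
                                   "remaining_candidates": len(remaining)}
        depth_of[node] = depth
        push_edges(node, depth)
    return set(depth_of), None


toy = {"A": [("B", 2), ("C", 1)], "B": [("D", 1)], "C": [("E", 2)]}
```

The second return value stands in for the MaxNodesPruned warning payload. For an upstream walk the same routine would run over the reversed adjacency.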

6. Path enumeration and sign propagation (walk/paths.py, walk/sign.py)

  • The caller builds one filtered_subgraph(induced_subgraph(g, visited), ...) and passes it to both path enumeration AND sign reading — so the two cannot disagree about which parallel edge "exists."
  • all_simple_paths_up_to(g, source, cutoff) runs a single DFS via nx's container-target overload and yields each simple path once.
  • For each path, the sign is the product of per-step effects (activates=+1, inhibits=-1). Any unknown step makes the whole path's sign None. Per-path only: the same node can sit on + and − paths from different starting points, so we never collapse to per-node sign.
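
Sign propagation along one path is a short product. A minimal sketch, assuming the per-step effects have already been read off the filtered subgraph; the function name path_sign is hypothetical:

```python
from typing import Optional, Sequence

_STEP_SIGN = {"activates": 1, "inhibits": -1}


def path_sign(effects: Sequence[str]) -> Optional[int]:
    """Product of per-step signs, or None if any step is unknown."""
    sign = 1
    for effect in effects:
        step = _STEP_SIGN.get(effect)
        if step is None:  # "unknown" (or anything unrecognized) poisons the path
            return None
        sign *= step
    return sign
```

Two inhibitory steps in a row give +1 (double negation), and a single unknown step gives None for the whole path.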

7. Protein-collapsed view (graph/collapse.py)

A high-level overview for visualization. Rules:

  • protein → site:X:Y routes through to protein:X (host materialized if missing).
  • site → protein (consequence) is dropped; already accounted for via the kinase→site that produced it.
  • protein → protein (e.g. binding) kept as-is.

Per (source, target, mechanism) bucket: effect counts {activates, inhibits, unknown}, aggregated effect ∈ {activates, inhibits, mixed, unknown}, n_underlying_edges, and a summary_label like "3 activating, 2 inhibiting". References are intentionally dropped in the collapsed view — switch back to the full graph if you need PMIDs.
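
The per-bucket aggregation can be sketched as below. The function name aggregate_effects is hypothetical, and two details are assumptions not stated above: unknown edges do not turn a one-sided bucket into "mixed", and unknown edges are listed in the label.

```python
from collections import Counter
from typing import Iterable


def aggregate_effects(effects: Iterable[str]) -> dict:
    """Summarize the effects of all edges collapsed into one protein-level bucket."""
    counts = Counter(effects)
    n_act, n_inh, n_unk = counts["activates"], counts["inhibits"], counts["unknown"]
    if n_act and n_inh:
        effect = "mixed"
    elif n_act:
        effect = "activates"   # assumption: unknowns do not dilute the sign
    elif n_inh:
        effect = "inhibits"
    else:
        effect = "unknown"
    parts = []
    if n_act:
        parts.append(f"{n_act} activating")
    if n_inh:
        parts.append(f"{n_inh} inhibiting")
    if n_unk:
        parts.append(f"{n_unk} unknown")  # assumption: label wording for unknowns
    return {
        "effect_counts": {"activates": n_act, "inhibits": n_inh, "unknown": n_unk},
        "effect": effect,
        "n_underlying_edges": n_act + n_inh + n_unk,
        "summary_label": ", ".join(parts),
    }
```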

8. Invariants (graph/invariants.py)

Checked after every build:

  1. Every phosphosite has an incoming kinase/phosphatase edge from a protein, OR an autophosphorylation edge from its own host protein.
  2. Every site:X:Y has a matching protein:X node.
  3. All node IDs validate (structural regex + canonical UniProt AC).
  4. Post-merge: no two parallel edges with the same (source, target, mechanism) carry disagreeing signed effects. Different mechanisms between the same protein pair are not flagged (phos can activate while binding inhibits — these are two distinct biological events, not a contradiction).
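
Invariants 1 and 2 can be expressed as a plain scan over nodes and edges. The sketch below is an assumption-laden illustration, not graph/invariants.py: it takes a set of node IDs and (source, target, mechanism) triples instead of a MultiDiGraph, and the function name check_site_invariants is invented.

```python
from typing import Iterable, List, Set, Tuple


def check_site_invariants(nodes: Set[str],
                          edges: Iterable[Tuple[str, str, str]]) -> List[str]:
    """Return human-readable violations of invariants 1 and 2 (empty list = OK)."""
    incoming: dict = {}
    for src, dst, mechanism in edges:
        incoming.setdefault(dst, []).append((src, mechanism))

    problems = []
    for node in sorted(nodes):
        if not node.startswith("site:"):
            continue
        host = "protein:" + node.split(":")[1]
        if host not in nodes:  # invariant 2
            problems.append(f"{node}: host node {host} is missing")
        has_parent = any(
            (mech in ("phosphorylation", "dephosphorylation") and src.startswith("protein:"))
            or (mech == "autophosphorylation" and src == host)
            for src, mech in incoming.get(node, ())
        )
        if not has_parent:  # invariant 1
            problems.append(f"{node}: no kinase, phosphatase, or autophosphorylation parent")
    return problems
```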

Scope (v0)

Item | Decision
Species | Human only (taxid 9606). Mouse deferred to v0.1; cross-species inheritance has biological caveats around residue translation that v0 does not solve.
Node resolution | Protein-level required; phosphosite-level where annotated
Antibody filter | None in v0
Use case | Academic
Secondary scope | Autophosphorylation detection
Deliverable | Python package + click CLI + graph export (GraphML, Cytoscape JSON, GEXF, SVG/PDF via Graphviz, parquet)

Data sources

SIGNOR is the only default source. OmniPath is available as opt-in (--sources omnipath,signor) for users who want broader structural coverage at the cost of unsigned edges. PhosphoSitePlus and CollecTRI are not used.

Source | Default? | Role | Access
SIGNOR | yes | Manually curated, signed phospho/dephospho edges with explicit mechanism, effect direction, PMID, and SIGNOR record ID. The large majority of edges carry a signed effect. | TSV bulk dump via https://signor.uniroma2.it/releases/getLatestRelease.php
OmniPath | opt-in | Aggregated enzyme-substrate (PTM) network from many underlying resources. Adds broader site and kinase coverage but contributes zero signed edges in v0, because OmniPath's aggregated enz_sub table doesn't expose per-row effect direction. | pypath-omnipath Python client (heavy; downloads on first use)

Why this asymmetry: OmniPath adds substantial structural coverage (extra unsigned edges and substrate sites) but contributes no signed edges. SIGNOR alone gives signed direction for the large majority of its phospho edges. Going SIGNOR-only roughly halves the edge count but lifts the signal-to-noise ratio significantly. KEGG, Reactome, iPTMnet, DEPOD, INDRA, PSP, and CollecTRI are not used.

Schema

Strongly typed via pydantic>=2. Node IDs are deterministic strings; v0 is human-only so the taxid is implicit.

from pydantic import BaseModel, Field
from typing import Literal, Optional

Residue = Literal["S", "T", "Y", "H"]
TAXID_HUMAN = 9606

class ProteinNode(BaseModel):
    kind: Literal["protein"] = "protein"
    uniprot_ac: str
    gene_symbol: str

class PhosphoSiteNode(BaseModel):
    kind: Literal["phosphosite"] = "phosphosite"
    uniprot_ac: str
    gene_symbol: str
    residue: Residue
    position: int = Field(ge=1)              # 1-based, UniProt canonical

Mechanism = Literal[
    "phosphorylation",
    "dephosphorylation",
    "autophosphorylation",
    "binding",                               # protein-protein, no site coordinate
]
Effect = Literal["activates", "inhibits", "unknown"]

class SourceRef(BaseModel):
    database: Literal["omnipath", "signor"]
    record_id: Optional[str] = None
    pmid: Optional[str] = None
    direct: Optional[bool] = None            # SIGNOR DIRECT column

class PhosphoEdge(BaseModel):
    source_id: str
    target_id: str
    mechanism: Mechanism
    effect: Effect
    references: list[SourceRef]
    n_sources: int = Field(ge=1)             # distinct curated databases
    n_references: int = Field(ge=0)          # distinct PMIDs across references

Node ID conventions:

  • protein:P28482
  • site:P28482:T185
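
Helpers for building and validating these IDs might look like the sketch below. The function names are hypothetical and not taken from util/node_id.py; the accession pattern is the one published in the UniProt help pages, and the residue set mirrors the Residue literal above.

```python
import re
from typing import Optional, Tuple

# UniProt accession pattern (canonical ACs only, no isoform suffix).
_AC = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
_PROTEIN_RE = re.compile(rf"^protein:({_AC})$")
_SITE_RE = re.compile(rf"^site:({_AC}):([STYH])([1-9][0-9]*)$")


def parse_node_id(node_id: str) -> Tuple[str, str, Optional[str], Optional[int]]:
    """Return (kind, uniprot_ac, residue, position); raise ValueError if malformed."""
    match = _PROTEIN_RE.match(node_id)
    if match:
        return ("protein", match.group(1), None, None)
    match = _SITE_RE.match(node_id)
    if match:
        return ("site", match.group(1), match.group(2), int(match.group(3)))
    raise ValueError(f"invalid node ID: {node_id!r}")


def protein_id(ac: str) -> str:
    node_id = f"protein:{ac}"
    parse_node_id(node_id)  # validate before handing the ID out
    return node_id


def site_id(ac: str, residue: str, position: int) -> str:
    node_id = f"site:{ac}:{residue}{position}"
    parse_node_id(node_id)
    return node_id
```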

Natural-language resolver

harmonize/resolver.py converts free-text input ("p-ERK", "phospho-c-Jun S63", "p38 alpha") to UniProt entries. Pipeline:

  1. phospho_parser.py: regex strips phospho-, p-, pS\d+, pT\d+, pY\d+; returns cleaned name and optional (residue, position).
  2. Query mygene.info (Python mygene client) for species="human". Matches on official symbol, alias, previous symbol, name.
  3. Rank by mygene Lucene score. Both the normalized score (top hit = 1.0) and the raw score are returned. Normalization makes the top hit 1.0 even when it is actually a poor match, so the raw score (and the low_confidence flag derived from it) is what tells you whether to trust the top pick at all.

class ResolutionCandidate(BaseModel):
    uniprot_ac: str
    gene_symbol: str
    matched_via: Literal["symbol", "alias", "previous_symbol", "name"]
    score: float = Field(ge=0.0, le=1.0)     # normalized within this query
    raw_score: float = 0.0                   # mygene Lucene score verbatim

class ResolutionResult(BaseModel):
    query: str
    parsed_site: Optional[tuple[Residue, int]] = None
    parsed_phospho_prefix: bool = False
    candidates: list[ResolutionCandidate]    # sorted by score desc
    ambiguous: bool = False                  # top1 - top2 < AMBIGUITY_THRESHOLD
    low_confidence: bool = False             # top raw_score < LOW_CONFIDENCE_RAW_SCORE

Never auto-pick. Caller decides.
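
How the two flags fall out of the scores can be sketched as follows. Everything numeric here is an assumption: the threshold values are placeholders, and dividing by the top raw score is just one plausible way to get a normalization in which the top hit is 1.0.

```python
from typing import List, Sequence, Tuple

AMBIGUITY_THRESHOLD = 0.1        # placeholder value, not the package's setting
LOW_CONFIDENCE_RAW_SCORE = 5.0   # placeholder value, not the package's setting


def flag_candidates(raw_scores: Sequence[float]) -> Tuple[List[float], bool, bool]:
    """Return (normalized scores sorted desc, ambiguous, low_confidence)."""
    ranked = sorted(raw_scores, reverse=True)
    if not ranked:
        return [], False, True  # nothing matched at all
    top = ranked[0]
    normalized = [r / top if top > 0 else 0.0 for r in ranked]
    # Ambiguous: the runner-up is nearly as good as the top hit.
    ambiguous = len(normalized) > 1 and (normalized[0] - normalized[1]) < AMBIGUITY_THRESHOLD
    # Low confidence is judged on the raw score, because the normalized top is always 1.0.
    low_confidence = top < LOW_CONFIDENCE_RAW_SCORE
    return normalized, ambiguous, low_confidence
```

The function only reports the flags; picking a candidate is still left to the caller.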

Graph model

networkx.MultiDiGraph. Edge conventions:

  • Kinase to substrate site: protein (kinase) → phosphosite (substrate), mechanism="phosphorylation".
  • Phosphatase to substrate site: same shape, mechanism="dephosphorylation".
  • Autophosphorylation: protein:X → site:X:Y, re-tagged at build time when source AC equals target AC. Self-loops at the protein level are avoided.
  • Site-to-host "consequence" edge: phosphosite → protein of the same UniProt AC. Synthesized at build time as a structural propagation hop so walks can traverse from a phospho-event to the host protein's activity. Effect is the consensus across the site's phosphorylation/autophosphorylation parents (dephos parents excluded; see Algorithmic pipeline / Build).
  • Binding: protein → protein (no site coordinate). From SIGNOR's binding mechanism rows.

Walks

Two primary entry points:

upstream(target: str, k: int = 2, *,
         include_phosphatases: bool = True,
         include_binding: bool = True,
         min_sources: int = 1,
         require_signed: bool = False,
         max_nodes: int | None = None) -> Walk

downstream(source: str, k: int = 2, *,
           include_binding: bool = True,
           min_sources: int = 1,
           require_signed: bool = False,
           max_nodes: int | None = None) -> Walk

Walk returns the (filtered) induced subgraph, the enumerated simple paths up to length k, and a per-path propagated sign.

Filters use factual provenance: min_sources=N keeps only edges asserted by at least N curated databases; require_signed=True drops effect="unknown" edges. There is no synthetic confidence score.

Sign propagation: product of edge effects along the path. activates=+1, inhibits=-1, unknown sets propagated_sign=None for that path. Signs are never aggregated to a single per-node sign, because the same node can sit on + and − paths from different starting points.

Hub blow-up: around hubs (AKT, ERK, MTOR), k≥2 neighborhoods can easily exceed the max_nodes cap. Expansion is best-first, ordered by edge n_sources, so when the cap fires the strongest edges have already been kept. A MaxNodesPruned warning carries the cap, visited count, and remaining-frontier size so the CLI can suggest raising --max-nodes.

Conflict resolution and provenance

harmonize/merge.py. For the same (source_id, target_id, mechanism) triple from multiple databases or multiple rows:

  1. Union references into one PhosphoEdge.
  2. Effect resolution:
    • All sources agree → that effect.
    • One says X, the rest say unknown → X (silence is not contradiction).
    • Genuine disagreement → effect = "unknown", conflict logged to conflicts.tsv.
  3. Provenance counts (factual, not heuristic):
    • n_sources = number of distinct databases asserting the edge.
    • n_references = number of distinct PMIDs across all unioned references.

No synthetic "confidence score" is produced. Walk filters use n_sources directly (--min-sources N) and the boolean --require-signed flag for effect direction.
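
The provenance counts and the two walk filters can be sketched in a few lines. References are modeled as plain dicts with the SourceRef field names, the helper names are hypothetical, and the PMID strings are placeholders rather than real citations.

```python
from typing import Iterable, Tuple


def provenance_counts(references: Iterable[dict]) -> Tuple[int, int]:
    """Return (n_sources, n_references) for the unioned references of one edge."""
    refs = list(references)
    n_sources = len({r["database"] for r in refs})
    n_references = len({r["pmid"] for r in refs if r.get("pmid")})
    return n_sources, n_references


def passes_filters(n_sources: int, effect: str,
                   min_sources: int = 1, require_signed: bool = False) -> bool:
    """Mirror of --min-sources N and --require-signed as a single predicate."""
    if n_sources < min_sources:
        return False
    if require_signed and effect == "unknown":
        return False
    return True


refs = [
    {"database": "signor", "pmid": "PMID_A"},     # placeholder PMIDs
    {"database": "signor", "pmid": "PMID_A"},     # duplicate row, counted once
    {"database": "omnipath", "pmid": "PMID_B"},
    {"database": "omnipath", "pmid": None},       # no PMID: adds no reference
]
```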

Source precedence (for downstream consumers picking a representative reference): SIGNOR > OmniPath.

Orthology

Not in v0. v0 is human only. Mouse would require sequence-aligned site coordinate translation between orthologs, which v0 does not implement; the prior "copy residue+position verbatim" inheritance was biologically unreliable and has been removed. Mouse may return in v0.1 with proper alignment-aware site translation.

Output formats

graph/io.py:

to_graphml(g, path)            # interchange, Cytoscape desktop, yEd
to_gexf(g, path)               # Gephi
to_cytoscape_json(g, path)     # web viewers, .cyjs
to_graphviz(g, path, layout="dot")  # SVG/PDF/PNG via system Graphviz
to_pickle(g, path)             # full round-trip with typed attributes
to_parquet_edges(g, path)      # pandas-friendly edge list

Format inferred from file extension unless explicit. Graphviz requires the system binary; layouts: dot for hierarchical (upstream/downstream views), sfdp for large neighborhoods.
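
Extension-based format inference can be sketched as a lookup table. The mapping and the function name infer_format are assumptions for illustration; the extensions are taken from the export list above plus the .pkl used by the cached graph.

```python
from pathlib import Path
from typing import Optional

# Assumed extension -> format mapping; graph/io.py may use different keys.
_FORMAT_BY_EXT = {
    ".graphml": "graphml",
    ".gexf": "gexf",
    ".cyjs": "cyjs",
    ".svg": "svg",
    ".pdf": "pdf",
    ".png": "png",
    ".parquet": "parquet",
    ".pkl": "pickle",
}


def infer_format(path: str, explicit: Optional[str] = None) -> str:
    """An explicit format wins; otherwise look at the file extension."""
    if explicit:
        return explicit
    ext = Path(path).suffix.lower()
    if ext not in _FORMAT_BY_EXT:
        raise ValueError(f"cannot infer export format from {path!r}")
    return _FORMAT_BY_EXT[ext]
```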

Module layout

phosphograph/
  __init__.py
  config.py                # paths, species toggles, source toggles, cache dir, weights
  models.py                # pydantic schemas above
  util/
    node_id.py             # deterministic node-ID helpers
  ingest/
    base.py                # Ingestor protocol -> Iterator[PhosphoEdge]
    omnipath_src.py        # enz_sub via pypath (opt-in, human only)
    signor_src.py          # SIGNOR 4.0 bulk TSV (default)
  harmonize/
    ids.py                 # UniProt canonical resolution
    sites.py               # residue+position normalization, isoform handling
    merge.py               # consensus-effect merge + conflict logging
    resolver.py            # mygene-backed free-text -> UniProt resolver
    phospho_parser.py      # regex parser for "p-X S123"-style input
  graph/
    build.py               # MultiDiGraph assembly
    io.py                  # all exports
    invariants.py          # property tests (no orphan sites, valid node IDs, etc.)
    collapse.py            # protein-only collapsed view for high-level overview
  walk/
    neighborhood.py        # bidirectional k-hop best-first walk
    paths.py               # all simple paths up to length k
    sign.py                # per-path sign accumulation
  query/
    upstream.py
    downstream.py
  mcp/
    server.py              # FastMCP server: tools, resources, prompts, run()
    resolution.py          # free-text -> node ID with MCP elicitation
    payload.py             # Walk / paths -> MCP wire payload (cytoscape + summary + structured)
    view.py                # ui://phosphograph/view.html Cytoscape app
  cli.py                   # click entry point (including `phosphograph mcp`)
tests/                     # pytest + hypothesis

Dependencies

Required: pypath-omnipath, httpx, pandas, pydantic>=2, networkx>=3, click>=8, mygene, graphviz (Python wrapper), pyarrow, fastmcp (powers the MCP server), pytest, hypothesis.

System: Graphviz binaries (apt install graphviz or equivalent).

Optional extras: pyvis (interactive HTML preview).

CLI

phosphograph build [--sources signor[,omnipath]] [--force]
phosphograph resolve <query> [--top-k 5]
phosphograph upstream <gene_or_ac>   [--depth 2] [--include-phosphatases] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--output FILE]
phosphograph downstream <gene_or_ac> [--depth 2] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--output FILE]
phosphograph neighborhood <gene_or_ac> [--upstream-depth N] [--downstream-depth N] [--upstream-max-nodes N] [--downstream-max-nodes N] [--collapse] [--output FILE]
phosphograph paths <source> <target> [--max-length 4] [--output FILE]
phosphograph export [--format graphml|gexf|cyjs|svg|pdf|parquet] <output>
phosphograph info <gene_or_ac>
phosphograph conflicts [--output conflicts.tsv]
phosphograph walkthrough

Output format is inferred from the file extension unless --format is set. --orientation horizontal|vertical controls Graphviz layout direction (LR vs TB) for upstream, downstream, neighborhood, paths, export; ignored for non-Graphviz formats.

MCP server

phosphograph mcp runs a FastMCP-based Model Context Protocol server so the walks are callable directly from LLM agents (Claude.ai, Claude Desktop, custom hosts). The same query semantics as the CLI, but with an inline interactive Cytoscape viewer rendered in the chat window and MCP elicitation for ambiguous protein names.

phosphograph mcp                              # streamable HTTP on 127.0.0.1:8765/mcp (default)
phosphograph mcp --transport stdio            # Claude Desktop / subprocess hosts
phosphograph mcp --host 0.0.0.0 --port 8765   # autodeploy / container

Transports. http (alias streamable-http) is the default and the modern MCP HTTP transport — use it for Claude.ai and most autodeploy setups. stdio is for hosts that spawn the server as a subprocess (Claude Desktop).

Auto-build on first boot. If the cached graph is missing, the server runs the build step automatically before accepting tool calls, so a freshly deployed container is usable without a manual phosphograph build. Disable with --no-auto-build; control which sources are used on cache miss with --sources signor[,omnipath].

Tools (all read-only, annotated for hosts):

Tool | Purpose
upstream | Walk upstream from a query (gene symbol / UniProt AC / SYMBOL:T185).
downstream | Walk downstream from a query.
neighborhood | Bidirectional neighborhood with independent up/down depth and node caps.
paths | Enumerate signed simple paths between two proteins.
resolve_protein | Free-text → ranked UniProt candidates (fallback when the client doesn't support elicitation).
node_info | Attributes + in/out degrees for a single node.

Resources. ui://phosphograph/view.html — the Cytoscape viewer (loaded into a sandboxed iframe by the host). phosphograph://stats — graph statistics as JSON.

Prompts. Canonical query templates the LLM (and slash-command UIs) can discover and invoke: kinase_network, regulators_of, path_between.

Cytoscape rendering. Each walk tool returns three things in its result: a short text summary (for the LLM), a Cytoscape elements JSON blob (picked up by the bundled viewer via app.ontoolresult and rendered as an interactive graph in the chat window), and a structured payload (focus node, counts, full path list, prune warnings) for programmatic consumption. The viewer styles activating edges green, inhibitory red, and binding edges dashed; protein nodes are ellipses, phosphosite nodes are boxes. Toolbar buttons: fit, re-layout, toggle phosphosites, PNG export.

Interactive disambiguation. When a free-text query maps to multiple candidates in the graph, the tool issues an MCP elicitation so the user picks one inline. If the client does not support elicitation, the tool raises a ToolError pointing the agent at resolve_protein to do an explicit candidate listing first.

Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "phosphograph": {
      "command": "phosphograph",
      "args": ["mcp", "--transport", "stdio"]
    }
  }
}

Claude.ai or other streamable-HTTP clients: point them at http://<host>:<port>/mcp.

For programmatic use:

from phosphograph.mcp import build_server, run

run(transport="http", host="0.0.0.0", port=8765)          # autodeploy entry point
mcp = build_server(graph=g)                                # inject a pre-loaded graph (tests/scripts)

Caching

~/.cache/phosphograph/ (override via PHOSPHOGRAPH_CACHE_DIR env var):

  • raw/: source-version-stamped JSON/TSV downloads
  • edges/: parquet edge lists per source
  • graph/: built MultiDiGraph pickle keyed by (sources, species, build-timestamp)

Idempotent rebuild: phosphograph build --force.
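
Resolving the cache root with the environment override can be sketched as below. The function names are hypothetical and the code is not taken from config.py; only the default location, the variable name, and the three subdirectories come from the text above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache directory, honoring PHOSPHOGRAPH_CACHE_DIR when set."""
    env = os.environ if env is None else env
    override = env.get("PHOSPHOGRAPH_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "phosphograph"


def cache_subdirs(root: Path) -> dict:
    """The three top-level cache areas listed above."""
    return {name: root / name for name in ("raw", "edges", "graph")}
```

Passing env explicitly keeps the helper testable without touching the real environment.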

Implementation invariants (enforced by graph/invariants.py)

Asserted after every build_graph(..., merge=True):

  1. Every phosphosite node has an incoming phosphorylation/dephosphorylation edge from a protein node, OR an autophosphorylation edge whose source is exactly its own host protein.
  2. Every site:X:Y node has a matching protein:X node in the graph.
  3. All node IDs validate (structural regex + canonical UniProt AC).
  4. Post-merge: no two parallel edges sharing the same (source, target, mechanism) carry disagreeing signed effects. Different mechanisms between the same protein pair (e.g. phos:activates + binding:inhibits) are not flagged — they describe distinct biological events, not a contradiction.

Out of scope for v0

  • Direct PhosphoSitePlus ingestion (deferred to v0.1; PSP content reachable indirectly via opt-in OmniPath, but no signed direction)
  • CollecTRI / transcription factor regulatory edges (TFs are gene-level, off-mission for a phospho-signaling tool)
  • Antibody catalog / Antibody Registry integration
  • INDRA / text-mined statements
  • Panel optimization / set-cover suggestions
  • Kinetic or quantitative modeling
  • KEGG, Reactome, iPTMnet, DEPOD as separate ingestors

Project state and continuation notes

This appendix documents the actual runtime quirks future contributors (or future sessions) need to know — things not derivable from the code alone.

Locked dependency pins (do not bump without testing pypath end-to-end)

Pin | Reason
paramiko<3 | paramiko 3.x removed DSSKey; the unmaintained pysftp (which pypath-omnipath imports unconditionally in pypath/share/curl.py) crashes on import. Pinning to 2.x is the cleanest workaround.
pandas>=2.2,<3 | pypath.inputs.uniprot_idmapping.idtypes() calls groups.fillna(-1.0, inplace=True) on a string column. Pandas 3.x uses Arrow-backed string arrays that reject float fill values.

If pypath upstream fixes either issue, the corresponding pin can be relaxed. Verify with uv run python -c "from pypath import omnipath; omnipath.db.get_db('enz_sub').make_df(tax_id=True)" after any bump.

pypath API surface actually used

from pypath import omnipath
es = omnipath.db.get_db('enz_sub')   # EnzymeSubstrateAggregator
es.make_df(tax_id=True)              # populates es.df
df = es.df                           # pd.DataFrame

DataFrame columns (verified against the pinned pypath-omnipath): enzyme, enzyme_genesymbol, substrate, substrate_genesymbol, isoforms, residue_type, residue_offset, modification, sources, references, curation_effort, ncbi_tax_id.

We rename residue_type → residue_letter in phosphograph/ingest/omnipath_src.py:_fetch_enz_sub_live so the rest of the pipeline keeps a single column contract.

There is no to_dataframe() method — earlier docs hinted at one but the supported API is make_df() + .df.

SIGNOR API

SIGNOR ships its full corpus as a single TSV at https://signor.uniroma2.it/releases/getLatestRelease.php.

Columns we use: IDA, IDB, DATABASEA, DATABASEB, EFFECT, MECHANISM, RESIDUE, TAX_ID, PMID, DIRECT, SIGNOR_ID.

  • EFFECT collapses to activates|inhibits|unknown via the up-regulates* / down-regulates* prefixes (see signor_src._effect_to_enum).
  • MECHANISM is kept iff one of phosphorylation, dephosphorylation, binding. Phos/dephos rows go protein → site:residue:position; binding rows go protein → protein (no site coordinate).
  • DIRECT is propagated to SourceRef.direct (True for t, False for f, None when blank). This is a real per-row signal distinguishing directly observed interactions from inferred ones.
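
The two column mappings can be sketched as below. These are illustrative stand-ins written from the description above, not the code of signor_src._effect_to_enum; the function names are hypothetical.

```python
from typing import Optional


def effect_to_enum(raw: Optional[str]) -> str:
    """Collapse a SIGNOR EFFECT string to activates|inhibits|unknown by prefix."""
    value = (raw or "").strip().lower()
    if value.startswith("up-regulates"):
        return "activates"
    if value.startswith("down-regulates"):
        return "inhibits"
    return "unknown"


def direct_to_bool(raw: Optional[str]) -> Optional[bool]:
    """Map the DIRECT column: 't' -> True, 'f' -> False, blank or other -> None."""
    value = (raw or "").strip().lower()
    if value == "t":
        return True
    if value == "f":
        return False
    return None
```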

SIGNOR trust score is NOT in the bulk TSV. The published per-edge score combines several features (PMID count, pathway co-occurrence, Reactome cross-reference, UniProt co-mention) but the bulk download omits it. Recomputing locally would require pulling Reactome and UniProt sidecars. v0 uses the readily-available signals (n_sources, n_references, direct) instead.

OmniPath REST endpoint naming

The query type for enzyme-substrate is /enz_sub (with underscore), not /enzsub. The /ptms alias also works. Metadata at /queries/enz_sub returns the parameter dictionary. Both /enzsub and /enz-sub return HTTP 502. (phosphograph itself uses pypath-omnipath rather than the REST endpoint directly; this note is for orientation if you ever need to verify column shapes against the web service.)

Build pipeline non-obvious behaviors

  • Gene-symbol plumbing: ingestors expose a gene_symbols: dict[str, str] attribute populated as a side effect of iter_edges. build_graph aggregates these across ingestors and calls apply_gene_symbols(g, map) to set node gene_symbol attributes. The node_label(data) helper in graph/io.py produces the human-readable label used everywhere (MAPK1, MAPK1:T185).
  • Consequence edges: synthesize_consequence_edges emits one site → host_protein edge per site with at least one kinase/phosphatase parent. These are structural propagation hops required for upstream walks (gating them on known effect would break reachability). Effect is the consensus across phosphorylation/autophosphorylation parents only — dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. Consensus rule mirrors merge_edges. References are unioned across phos parents; n_sources and n_references recomputed from the union.
  • Autophosphorylation: detected at build time by AC equality between source protein and target site (detect_autophosphorylation). Re-tags the mechanism but leaves source/target IDs as protein:X → site:X:Y so self-loops at the protein level are avoided. Invariant 1 verifies that any autophosphorylation edge originates at the site's host protein.
  • Merge produces factual provenance counts (n_sources = distinct curated databases; n_references = distinct PMIDs). No synthetic confidence score is computed — earlier heuristic weights (manual-curation flag, LTP boost) were not grounded in real evidence quality and have been removed. Walk filters operate directly on n_sources and on whether the effect is signed.
  • Walks and signs share one filtered view: query/{downstream,upstream}.py build a single filtered_subgraph from the induced subgraph and pass it to BOTH path enumeration and sign reading. The two cannot disagree about which parallel edge "exists." Path enumeration runs a single DFS via nx's container-target overload.
  • K-hop pruning is best-first, ordered by -n_sources of the next edge. When max_nodes fires, a MaxNodesPruned warning is emitted with {max_nodes, visited_count, remaining_candidates, direction, source} for the CLI to surface.

Tests

  • pytest + hypothesis, organized under tests/. Run with uv run pytest tests/.
  • No test hits the network: OmniPath / SIGNOR / mygene are mocked or fixtured; the MCP layer uses FastMCP's in-process client (fastmcp.utilities.tests.run_server_async for HTTP roundtrips).
  • The pypath_log/ directory at the repo root is created by pypath the first time it runs from any cwd. Gitignored.

Cache layout

~/.cache/phosphograph/ (override via PHOSPHOGRAPH_CACHE_DIR):

raw/omnipath/enz_sub.parquet         # cached pypath dataframe (only if opted in)
raw/signor/all_data.tsv              # SIGNOR bulk dump
raw/mygene/                          # resolver responses
edges/                               # per-source parquet (currently unused)
graph/graph.pkl                      # merged MultiDiGraph pickle (consumed by CLI)
conflicts.tsv                        # merge conflict log

phosphograph build --force invalidates only graph/graph.pkl. To force re-download from upstream sources, delete ~/.cache/phosphograph/raw/.
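The layout above can also be handled from a script. The following is a minimal sketch using only the standard library; the helper names cache_root and clear_raw_downloads are hypothetical, and expanding ~ in the override value is an assumption about how PHOSPHOGRAPH_CACHE_DIR is interpreted.

```python
import os
import shutil
from pathlib import Path


def cache_root(env=None):
    """Return the cache root, honoring PHOSPHOGRAPH_CACHE_DIR when set."""
    env = os.environ if env is None else env
    override = env.get("PHOSPHOGRAPH_CACHE_DIR")
    if override:
        # Assumption: a leading ~ in the override is expanded.
        return Path(override).expanduser()
    return Path.home() / ".cache" / "phosphograph"


def clear_raw_downloads(root):
    """Delete raw/ so the next build re-downloads from upstream sources.

    Leaves graph/ and conflicts.tsv untouched. Returns True if raw/ existed.
    """
    raw = Path(root) / "raw"
    existed = raw.is_dir()
    shutil.rmtree(raw, ignore_errors=True)
    return existed
```

Calling clear_raw_downloads(cache_root()) and then running phosphograph build corresponds to the manual "delete raw/ and rebuild" procedure described above.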

Live build expectations

| Build | Signed coverage | First-build cost |
|---|---|---|
| SIGNOR only (default) | High: the large majority of edges are signed | Fast: no pypath import |
| SIGNOR + OmniPath | Low: OmniPath dilutes the signed share because its enz_sub rows are unsigned | Slower: pypath downloads and caches OmniPath on first use |

Run phosphograph build once to see the actual node/edge counts and signed percentage for your build (printed to stdout). Subsequent builds take seconds when the cache is warm. As noted under Cache layout, phosphograph build --force invalidates only the merged pickle, not the raw downloads; delete ~/.cache/phosphograph/raw/ to force a re-pull from upstream.

CLI surface

Subcommands: build, resolve, upstream, downstream, neighborhood, paths, export, info, conflicts, walkthrough, mcp. Run phosphograph --help for the canonical list. walkthrough is an interactive wizard that auto-builds when no cache exists, then loops a menu — the recommended entry point for new users. mcp starts the FastMCP server (see MCP server).

--orientation horizontal|vertical controls Graphviz layout (rankdir=LR vs rankdir=TB) on upstream, downstream, neighborhood, paths, export. Ignored for non-Graphviz formats.
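The mapping from --orientation to Graphviz's rankdir attribute can be illustrated with a small DOT emitter. This is a sketch only: the function to_dot, the RANKDIR table, and the horizontal default are assumptions for the example; the LR/TB mapping itself is the one stated above.

```python
# Orientation names from the CLI mapped to Graphviz rankdir values.
RANKDIR = {"horizontal": "LR", "vertical": "TB"}


def to_dot(edges, orientation="horizontal"):
    """Render (source, target) pairs as a DOT digraph with the chosen layout."""
    rankdir = RANKDIR[orientation]  # KeyError on an unknown orientation
    lines = ["digraph G {", f"  rankdir={rankdir};"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)
```

With rankdir=LR the graph is laid out left to right, which tends to suit long linear cascades; rankdir=TB lays it out top to bottom.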

Things that bit us, briefly

  • omnipath (the lighter REST-only client at https://github.com/saezlab/omnipath) is a different package from pypath-omnipath. We are locked in to pypath because of its richer database building.
  • The Read the Docs site at omnipath.readthedocs.io documents the omnipath client, not pypath; don't follow it for pypath's API.
  • The parquet PyPI package is unmaintained — we use pyarrow for parquet I/O. The parquet entry in pyproject is a vestige and can be removed.
  • pyproject originally declared pandas>=3.0.3 and pypath-omnipath>=0.16.20 without pinning paramiko or pandas upper bounds. Both turned out to be wrong; fixed.

License

phosphograph is released under the GNU General Public License v3.0 or later (GPL-3.0-or-later). The full license text is in LICENSE.

The GPL choice is dictated by a runtime dependency: pypath-omnipath is GPL-3.0, and importing it as a library makes the combined work a derivative work under GPL terms. Anyone redistributing phosphograph — or a program that imports it — must therefore comply with the GPL (source availability, same-license redistribution, no additional restrictions). Internal academic use and modification are unrestricted; the obligations only kick in on distribution.
