Skip to main content

ExposoGraph — LLM-powered extraction and manual curation of carcinogen metabolism pathways

Project description

ExposoGraph

CI Python 3.10+ License: MIT

Build, curate, and export carcinogen metabolism knowledge graphs using LLM-powered extraction and manual entry.

Part of the CarcinoGenomic Platform — a 5-layer computational pipeline for individualized carcinogen metabolism risk assessment from germline DNA.

Version: 0.0.3 Developed by: Data analysis team @ KaziLab Contact: exposograph@kazilab.se Copyright: KaziLab

Acknowledgement

Parts of this documentation and code were created with assistance from ChatGPT Codex and Claude Code.

Features

  • Multi-LLM Extraction — Describe a carcinogen metabolism pathway in plain English; OpenAI (GPT-4o) or local Ollama models extract structured nodes and edges automatically
  • Mode-Aware Ingestion — Use exploratory mode to keep provisional entities or strict mode to keep only canonically grounded content
  • Manual Entry — Add and annotate nodes and edges with full provenance and curation tracking
  • Graph Analysis — Shortest path, centrality, metabolism chain traversal, pathway subgraph, variant impact scoring
  • Public DB Integration — KEGG pathway lookups, CTD chemical-gene interactions, IARC carcinogen classifications
  • Interactive Preview — Color-coded Streamlit AGraph visualization with hover metadata, search/filter controls, zoom, and downloadable Plotly HTML exports
  • Advanced Graph Viewer — Dash Cytoscape viewer with sidebar search, legends, carcinogen filters, detail panel, image export, and saved layout JSON
  • Validated vs Exploratory Views — Filter the current graph to all, validated only, or exploratory only in the UI and export pipeline
  • Rich Annotations — Structured provenance records, source manifests, curated KEGG pathway coverage, PubMed IDs, tissue context, pharmacogenomic variants, activity scores
  • Multiple Export Formats — Standalone Plotly HTML, parseable app HTML, JSON, D3.js viewer (graph-data.js), GEXF (Gephi)
  • Viewer Data Contract — Export a Cytoscape-ready JSON bundle and saved preset layout for richer web-style exploration without maintaining custom JavaScript
  • Validation — Referential integrity checks at model level, dangling edge detection, carcinogen context validation
  • Persistent Storage — SQLite-backed revision history with explicit export visibility tracking and atomic operations

Quick Start

Try Without an API Key

A pre-built Benzo[a]pyrene metabolism graph is included:

pip install -e .
python examples/build_bap_graph.py

This loads examples/bap_graph.json (20 nodes, 20 edges covering the full BaP → BPDE → DNA adduct pathway), runs graph analysis, and exports to HTML and JSON.

For a no-API-key demonstration of strict vs exploratory handling and validated-only exports:

python examples/mode_visibility_demo.py
Sample output
Loaded graph: 20 nodes, 20 edges

Shortest path CYP1A1 → BPDE-dG: CYP1A1 → BPDE → BPDE_dG

Top-5 nodes by degree centrality:
  CYP1A1        0.263
  BPDE          0.211
  CYP1B1        0.158
  BPDE_dG       0.158
  BPDE_GSH      0.158

BaP metabolism chain: 16 nodes, 13 edges
  Activation edges:    4
  Detoxification edges: 3
  Adduct edges:        1
  Repair edges:        3

Variant impact score for CYP1A1:
  Activity score:        1.0
  Downstream adducts:    1
  Impact score:          1.00

Sample JSON

{
  "nodes": [
    {"id": "BaP", "label": "Benzo[a]pyrene", "type": "Carcinogen", "group": "PAH", "iarc": "Group 1"},
    {"id": "CYP1A1", "label": "CYP1A1", "type": "Enzyme", "phase": "I", "role": "Activation"},
    {"id": "BPDE", "label": "BPDE", "type": "Metabolite", "reactivity": "High"},
    {"id": "BPDE_dG", "label": "BPDE-N2-dG", "type": "DNA_Adduct"}
  ],
  "edges": [
    {"source": "CYP1A1", "target": "BPDE", "type": "ACTIVATES", "carcinogen": "BaP"},
    {"source": "BPDE", "target": "BPDE_dG", "type": "FORMS_ADDUCT", "carcinogen": "BaP"}
  ]
}

Streamlit App

pip install -e ".[streamlit]"
streamlit run ExposoGraph/app.py

App mode defaults to stateless, which disables server-side saves and is appropriate for public web deployment. To enable local revision history and file saves on your own machine:

export ExposoGraph_MODE=local
streamlit run ExposoGraph/app.py

Jupyter

pip install -e ".[notebook]"
jupyter lab

No notebook file is currently bundled in this repository. Use the installed package from your own notebook, or start from the runnable examples in examples/.

Advanced Viewer

pip install -e ".[viewer]"
from ExposoGraph import (
    GraphVisibility,
    ViewerLayoutMode,
    launch_dash_viewer,
    write_cytoscape_bundle,
)

write_cytoscape_bundle(
    engine,
    "exports/graph_cytoscape.json",
    visibility=GraphVisibility.ALL,
    layout_mode=ViewerLayoutMode.COSE,
)

launch_dash_viewer(
    engine,
    visibility=GraphVisibility.ALL,
    layout_mode=ViewerLayoutMode.COSE,
    port=8050,
)

Python Library

pip install -e .
from ExposoGraph import (
    GraphEngine,
    GraphMode,
    GraphVisibility,
    centrality,
    extract_graph,
    metabolism_chain,
    to_json,
)

# LLM extraction (requires OpenAI API key)
# exploratory: keep unmatched or custom content
kg = extract_graph(
    "Benzo[a]pyrene is activated by CYP1A1...",
    mode=GraphMode.EXPLORATORY,
)
engine = GraphEngine()
engine.merge(kg, mode=GraphMode.EXPLORATORY)

# Analysis
scores = centrality(engine, method="degree")
chain = metabolism_chain(engine, "BaP")

# Visibility-aware export
to_json(engine, "graph_validated.json", visibility=GraphVisibility.VALIDATED_ONLY)

Using Ollama (Local LLM)

from ExposoGraph import GraphMode, extract_graph
from ExposoGraph.llm_backend import OllamaBackend

backend = OllamaBackend(base_url="http://localhost:11434")
kg = extract_graph(
    "BaP is activated by CYP1A1...",
    backend=backend,
    model="llama3.1",
    mode=GraphMode.EXPLORATORY,
)

Public Database Integration

from ExposoGraph import GraphMode
from ExposoGraph.db_clients import IARCClassifier
from ExposoGraph.seeder import seed_from_ctd, seed_from_kegg_pathway

# Seed from KEGG pathway
kg = seed_from_kegg_pathway("hsa05204", mode=GraphMode.STRICT)

# Seed from CTD
kg = seed_from_ctd("Benzo(a)pyrene", mode=GraphMode.EXPLORATORY)

# IARC classification lookup
clf = IARCClassifier()
clf.classify("Benzo[a]pyrene")  # → IARCGroup.GROUP_1

Reference Curation Metadata

from ExposoGraph import CURATION_SOURCE_MANIFEST, REFERENCE_KEGG_PATHWAYS

primary_sources = CURATION_SOURCE_MANIFEST["primary_sources"]
kegg_ids = [entry["pathway_id"] for entry in REFERENCE_KEGG_PATHWAYS]

Manuscript-Aligned Showcase Summary

from ExposoGraph import build_full_legends_architecture_summary

summary = build_full_legends_architecture_summary()

summary.node_count          # 96
summary.edge_count          # 102
summary.node_type_counts    # {'Carcinogen': 15, 'Enzyme': 36, ...}
summary.edge_type_counts    # {'ACTIVATES': 30, 'DETOXIFIES': 23, ...}
summary.carcinogen_classes  # grouped class inventories for section 2.2 rewrites

Optional Androgen Module

from ExposoGraph import build_androgen_module_graph, build_full_legends_graph

androgen_only = build_androgen_module_graph()
showcase_with_androgen = build_full_legends_graph(include_androgen_module=True)

Configuration

Set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="sk-..."

Or enter it in the Streamlit sidebar when running the app.

For Streamlit Cloud deployment, add the key to .streamlit/secrets.toml:

OPENAI_API_KEY = "sk-..."

Graph Modes and Visibility

Two separate controls now shape how data moves through the system:

  • Graph mode controls ingestion behavior:
    • exploratory keeps unmatched or custom entities and marks them as provisional
    • strict keeps only canonically grounded nodes and edges
  • Graph visibility controls viewing and export behavior:
    • all
    • validated_only
    • exploratory_only

Typical pattern:

from ExposoGraph import (
    GraphEngine,
    GraphMode,
    GraphRepository,
    GraphVisibility,
    ViewerLayoutMode,
    launch_dash_viewer,
    write_cytoscape_bundle,
    extract_graph,
    to_interactive_html,
    to_plotly_html,
)

engine = GraphEngine()
kg = extract_graph(
    "BaP induces CYP1A1 and forms BPDE adducts",
    mode=GraphMode.STRICT,
)
engine.merge(kg, mode=GraphMode.STRICT)

to_interactive_html(
    engine,
    "validated_graph.html",
    visibility=GraphVisibility.VALIDATED_ONLY,
)

to_plotly_html(
    engine,
    "validated_graph_plotly.html",
    visibility=GraphVisibility.VALIDATED_ONLY,
)

write_cytoscape_bundle(
    engine,
    "validated_graph_cytoscape.json",
    visibility=GraphVisibility.VALIDATED_ONLY,
    layout_mode=ViewerLayoutMode.PRESET,
)

with GraphRepository("data/ExposoGraph.sqlite3") as repo:
    repo.save_engine(
        graph_key="bap_validated",
        graph_name="BaP Validated",
        engine=engine,
        visibility=GraphVisibility.VALIDATED_ONLY,
    )

Project Structure

ExposoGraph/
├── __init__.py          # Public API exports
├── app.py               # Streamlit UI orchestrator
├── branding.py          # Version and metadata
├── config.py            # App modes, graph modes, and visibility enums
├── engine.py            # NetworkX-backed graph engine
├── exporter.py          # JSON, D3.js, HTML, GEXF export
├── graph_filters.py     # Validated/exploratory graph filtering helpers
├── graph_analysis.py    # Shortest path, centrality, metabolism chains
├── grounding.py         # Canonical grounding and strict-mode preparation
├── llm_backend.py       # Pluggable LLM backends (OpenAI, Ollama)
├── llm_extractor.py     # LLM prompt + extraction pipeline
├── models.py            # Pydantic data models (Node, Edge, KnowledgeGraph)
├── reference_data.py    # Gene panels and activity scores
├── seeder.py            # DB-to-KnowledgeGraph conversion
├── storage.py           # SQLite revision history
├── db_clients/
│   ├── kegg.py          # KEGG REST API client
│   ├── ctd.py           # CTD chemical-gene interaction client
│   └── iarc.py          # Bundled IARC classification data
├── ui_extract.py        # Tab: LLM extraction
├── ui_manual.py         # Tab: manual node/edge entry
├── ui_preview.py        # Tab: interactive graph preview
└── ui_data.py           # Tab: raw data view
examples/
├── bap_graph.json       # Pre-built BaP metabolism graph (no API key needed)
├── build_bap_graph.py   # Demo script: load → analyze → export
└── mode_visibility_demo.py  # Demo script: strict ingestion + filtered export/save
tests/
├── test_integration.py  # End-to-end pipeline test
├── test_engine.py
├── test_models.py
├── test_exporter.py
├── test_graph_analysis.py
├── test_llm_backend.py
├── test_llm_extractor.py
├── test_db_clients.py
├── test_seeder.py
├── test_reference_data.py
├── test_config.py
└── test_storage.py

Node & Edge Types

Nodes: Carcinogen, Enzyme, Gene, Metabolite, DNA_Adduct, Pathway, Tissue

Edges: ACTIVATES, DETOXIFIES, TRANSPORTS, FORMS_ADDUCT, REPAIRS, PATHWAY, EXPRESSED_IN, INDUCES, INHIBITS, ENCODES, CUSTOM

Development

pip install -e ".[all]"
pytest                           # runs with --cov, 85% threshold gate
ruff check .
mypy ExposoGraph/

Optional dependency groups

Group Install Provides
llm pip install -e ".[llm]" OpenAI API support
ollama pip install -e ".[ollama]" Ollama local LLM support
db pip install -e ".[db]" KEGG/CTD HTTP clients
streamlit pip install -e ".[streamlit]" Streamlit web app
notebook pip install -e ".[notebook]" Jupyter + Plotly/Matplotlib
dev pip install -e ".[dev]" pytest, ruff, mypy
docs pip install -e ".[docs]" Sphinx + Furo
all pip install -e ".[all]" Everything

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

exposograph-0.0.3.tar.gz (192.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

exposograph-0.0.3-py3-none-any.whl (147.0 kB view details)

Uploaded Python 3

File details

Details for the file exposograph-0.0.3.tar.gz.

File metadata

  • Download URL: exposograph-0.0.3.tar.gz
  • Upload date:
  • Size: 192.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for exposograph-0.0.3.tar.gz
Algorithm Hash digest
SHA256 c2e5e52aa982fa82dcc04160b43ffb0dc02c24dbf36789977e14b360dab67f36
MD5 f48b816aad28c340b56b44994d610e15
BLAKE2b-256 789d5ec8107763ddce1874877d3110994bfc7f4c16d95b8ff98ba50d5cc733c4

See more details on using hashes here.

File details

Details for the file exposograph-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: exposograph-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 147.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for exposograph-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 4fa7d45a817c75e93afb792dbf62e759d4eb97ae7af9ebfeeef991164e266c08
MD5 b5470cc3d4a14ff9c62d08a0a80105d9
BLAKE2b-256 5139d6a38945e9d99d87b7835135b61194728917163bdc362ff60340486cdfa0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page