
Transform enrichment outputs into verifiable, auditable pathway claims with calibrated abstention.


LLM-PathwayCurator

Enrichment interpretations → audited, decision-grade pathway claims.



🚀 What this is

LLM-PathwayCurator is an interpretation quality-assurance (QA) layer for enrichment analysis.
It does not introduce a new enrichment statistic. Instead, it turns enrichment-analysis (EA) outputs into auditable decision objects:

  • Input: enrichment term lists from ORA (e.g., Metascape) or rank-based enrichment (e.g., fgsea, an implementation of the GSEA method)
  • Output: typed, evidence-linked claims + PASS/ABSTAIN/FAIL decisions + reason-coded audit logs
  • Promise: we abstain when claims are unstable, under-supported, contradictory, or context-nonspecific

Selective prediction for pathway interpretation: calibrated abstention is a feature, not a failure.

Fig. 1a. Overview of the LLM-PathwayCurator workflow: EvidenceTable → modules → claims → audits (bioRxiv preprint).


🧭 Why this is different (and why it matters)

Enrichment tools return ranked term lists. In practice, interpretation breaks because:

  1. Representative terms are ambiguous under study context
  2. Gene support is opaque, enabling cherry-picking
  3. Related terms share / bridge evidence in non-obvious ways
  4. There is no mechanical stop condition for fragile narratives

LLM-PathwayCurator replaces narrative endorsement with audit-gated decisions.
We transform ranked terms into machine-auditable claims by enforcing:

  • Evidence-linked constraints: claims must resolve to valid term/module identifiers and supporting-gene evidence
  • Stability audits: supporting-gene perturbations yield stability proxies (operating point: τ)
  • Context validity stress tests: context swap reveals context dependence without external knowledge
  • Contradiction checks: internally inconsistent claims fail mechanically
  • Reason-coded outcomes: every decision is explainable by a finite audit code set
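The evidence-linked constraint can be sketched as a small gate that maps a claim to audit codes. This is a minimal illustration with hypothetical names (`audit_evidence_links`, the `E_*` codes); the library ships its own finite code set.

```python
# Sketch of an evidence-link audit gate (names are illustrative, not the
# library's actual API or reason codes).
def audit_evidence_links(claim, evidence_table):
    """Return a list of audit codes; an empty list means the gate passes."""
    codes = []
    if claim["term_id"] not in evidence_table:
        codes.append("E_TERM_UNRESOLVED")  # claim cites an unknown term
    else:
        known = set(evidence_table[claim["term_id"]])
        missing = [g for g in claim["evidence_genes"] if g not in known]
        if missing:
            codes.append("E_GENE_DRIFT")  # cited genes absent from the table
    return codes

table = {"GO:0006955": ["TNF", "IL6", "CXCL8"]}
ok = {"term_id": "GO:0006955", "evidence_genes": ["TNF", "IL6"]}
bad = {"term_id": "GO:0006955", "evidence_genes": ["TP53"]}
print(audit_evidence_links(ok, table))   # []
print(audit_evidence_links(bad, table))  # ['E_GENE_DRIFT']
```

Because every outcome is a code rather than prose, decisions stay machine-checkable and explainable.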

🔍 What this is not

  • Not an enrichment method; it audits enrichment outputs.
  • Not a free-text summarizer; claims are schema-bounded (typed JSON; no narrative prose as “evidence”).
  • Not a biological truth oracle; it checks internal consistency and evidence integrity, not mechanistic truth.

🧩 Core pipeline (A → B → C)

A) Stability distillation (evidence hygiene)
Perturb supporting genes (seeded) to compute stability proxies (e.g., LOO/jackknife-like survival scores).
Output: distilled.tsv
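A leave-one-out survival score of the kind referenced above can be sketched as follows. The enrichment criterion here is a toy lambda, not the library's actual scoring or seeding.

```python
# Minimal leave-one-out (LOO) style stability proxy: the fraction of
# leave-one-out gene subsets for which the term remains enriched.
def loo_survival(genes, enriched_fn):
    if not genes:
        return 0.0
    hits = 0
    for i in range(len(genes)):
        subset = genes[:i] + genes[i + 1:]  # drop one supporting gene
        if enriched_fn(subset):
            hits += 1
    return hits / len(genes)

# Toy criterion: the term "survives" while at least 3 supporting genes remain.
genes = ["TNF", "IL6", "CXCL8", "IL1B"]
print(loo_survival(genes, lambda g: len(g) >= 3))  # 1.0
```

A term whose enrichment collapses when any single gene is removed would score near 0 and be a candidate for ABSTAIN at the chosen operating point τ.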

B) Evidence factorization (modules)
Factorize the term–gene bipartite graph into evidence modules that preserve shared vs distinct support.
Outputs: modules.tsv, term_modules.tsv, term_gene_edges.tsv
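One simple way to factorize a term–gene bipartite graph is by connected components, sketched below with a small union-find. This is illustrative only; the library's factorization additionally tracks shared vs distinct support within a module.

```python
# Group terms into modules: terms that share supporting genes land in the
# same connected component of the term-gene bipartite graph.
from collections import defaultdict

def modules_from_edges(term_genes):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for term, genes in term_genes.items():
        for g in genes:
            union(("term", term), ("gene", g))

    groups = defaultdict(set)
    for term in term_genes:
        groups[find(("term", term))].add(term)
    return sorted(sorted(g) for g in groups.values())

edges = {"T1": ["A", "B"], "T2": ["B", "C"], "T3": ["D"]}
print(modules_from_edges(edges))  # [['T1', 'T2'], ['T3']]
```

T1 and T2 bridge through gene B, so they share a module; T3 has distinct support and stands alone.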

C) Claims → audit → report

  • C1 (proposal-only): deterministic baseline or optional LLM proposes typed claims with resolvable evidence links
  • C2 (audit/decider): mechanical rules assign PASS/ABSTAIN/FAIL with precedence (FAIL > ABSTAIN > PASS)
  • C3 (report): decision-grade report + audit log (audit_log.tsv) + provenance
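The C2 precedence rule (FAIL > ABSTAIN > PASS) reduces to a few lines; the code names below are illustrative.

```python
# Mechanical decider: FAIL beats ABSTAIN beats PASS. The LLM never runs this.
def decide(fail_codes, abstain_codes):
    if fail_codes:
        return "FAIL"
    if abstain_codes:
        return "ABSTAIN"
    return "PASS"

print(decide([], []))                               # PASS
print(decide([], ["A_UNSTABLE"]))                   # ABSTAIN
print(decide(["F_CONTRADICTION"], ["A_UNSTABLE"]))  # FAIL
```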

⚡ Quick start (library entrypoint)

llm-pathway-curator run \
  --sample-card examples/demo/sample_card.json \
  --evidence-table examples/demo/evidence_table.tsv \
  --out out/demo/

Key outputs (stable contract)

  • distilled.tsv, modules.tsv, term_modules.tsv, term_gene_edges.tsv
  • claims + audit_log.tsv (reason-coded decisions)
  • run_meta.json (run provenance)


📊 Rank & visualize ranked terms (rank / plot-ranked)

LLM-PathwayCurator includes two small post-processing commands for ranking and publication-ready visualization of ranked terms/modules:

  • llm-pathway-curator rank — produces a ranked table (claims_ranked.tsv) for downstream plots and summaries.
  • llm-pathway-curator plot-ranked — renders ranked terms/modules as either:
    • bars (Metascape-like horizontal bars), or
    • packed circles (module-level circle packing with term circles inside).

A) Rank (produce claims_ranked.tsv)

Use rank to generate a deterministic ranked table from a run output directory.

llm-pathway-curator rank --help
# Typical workflow: point rank to a run directory and write claims_ranked.tsv
# (See --help for the exact flags supported by your installed version.)

B) Plot (bars or packed circles)

plot-ranked auto-detects claims_ranked.tsv (recommended) or falls back to audit_log.tsv under --run-dir.

Packed circles require an extra dependency: python -m pip install circlify

Bars (Metascape-like)

llm-pathway-curator plot-ranked \
  --mode bars \
  --run-dir out/demo \
  --out-png out/demo/plots/ranked_bars.png \
  --decision PASS \
  --group-by-module \
  --left-strip \
  --strip-labels \
  --bar-color-mode module

Packed circles (modules → terms)

llm-pathway-curator plot-ranked \
  --mode packed \
  --run-dir out/demo \
  --out-png out/demo/plots/ranked_packed.png \
  --decision PASS \
  --term-color-mode module

Packed circles (direction shading)

llm-pathway-curator plot-ranked \
  --mode packed \
  --run-dir out/demo \
  --out-png out/demo/plots/ranked_packed.direction.png \
  --decision PASS \
  --term-color-mode direction

Consistent module labels/colors across plots

plot-ranked assigns a single module display rank (M01, M02, ...) and a stable module color per module_id, so bars and packed circles can be placed side-by-side without label/color drift.
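Drift-free styling of this kind can be sketched as a deterministic mapping from `module_id` to a display rank and a color. The function and palette below are assumptions for illustration, not the library's actual scheme.

```python
# Derive a stable (rank, color) pair per module_id so repeated plots of the
# same run never swap labels or colors.
import hashlib

PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e"]

def module_style(module_ids):
    styles = {}
    for rank, mid in enumerate(sorted(set(module_ids)), start=1):
        digest = int(hashlib.sha256(mid.encode()).hexdigest(), 16)
        styles[mid] = (f"M{rank:02d}", PALETTE[digest % len(PALETTE)])
    return styles

styles = module_style(["mod_b", "mod_a", "mod_b"])
print(styles["mod_a"][0])  # M01
```

Hashing the identifier (rather than enumeration order of the input) keeps the color fixed even when the set of plotted modules changes.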


⚖️ Inputs (contracts)

EvidenceTable (minimum required columns)

Each row is one enriched term.

Required columns:

  • term_id, term_name, source
  • stat, qval, direction
  • evidence_genes (supporting genes; TSV uses ; join)
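A minimal check of one EvidenceTable row against the required columns above, using only the stdlib. The sample row is fabricated for illustration; the library's loader enforces the full contract.

```python
# Validate an EvidenceTable TSV row: all required columns present, and
# evidence_genes joined with ';' as the contract specifies.
import csv
import io

REQUIRED = {"term_id", "term_name", "source", "stat", "qval",
            "direction", "evidence_genes"}

tsv = ("term_id\tterm_name\tsource\tstat\tqval\tdirection\tevidence_genes\n"
       "GO:0006955\timmune response\tGO_BP\t2.1\t0.003\tup\tTNF;IL6;CXCL8\n")

row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
missing = REQUIRED - set(row)
assert not missing, f"missing columns: {missing}"
genes = row["evidence_genes"].split(";")
print(genes)  # ['TNF', 'IL6', 'CXCL8']
```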

Sample Card (study context)

Structured context record used for proposal and context gating, e.g.:

  • condition/disease, tissue, perturbation, comparison
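An illustrative Sample Card following the bullets above; the exact field names and schema may differ in your installed version.

```python
# A structured study-context record used for proposal and context gating.
import json

sample_card = {
    "condition": "BRCA",
    "tissue": "breast",
    "perturbation": "anti-PD-1",
    "comparison": "responder_vs_nonresponder",
}
print(json.dumps(sample_card, indent=2))
```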

Adapters for common tools live under src/llm_pathway_curator/adapters/.


🔧 Adapters (Input → EvidenceTable)

Adapters are intentionally conservative:

  • preserve evidence identity (term × genes)
  • avoid destructive parsing
  • keep TSV round-trips stable (contract drift is treated as a bug)

See: src/llm_pathway_curator/adapters/README.md
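The round-trip invariant adapters must keep can be expressed directly: writing an EvidenceTable to TSV and reading it back yields identical rows. A minimal stdlib sketch:

```python
# Write rows to TSV, read them back, and require exact equality; any
# divergence would count as contract drift (a bug).
import csv
import io

rows = [{"term_id": "GO:0006955", "evidence_genes": "TNF;IL6"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["term_id", "evidence_genes"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

reread = list(csv.DictReader(io.StringIO(buf.getvalue()), delimiter="\t"))
assert reread == rows
print("round-trip stable")
```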


🛡️ Decisions: PASS / ABSTAIN / FAIL

LLM-PathwayCurator assigns decisions by mechanical audit gates:

  • FAIL: auditable violations (evidence-link drift, schema violations, contradictions, forbidden fields, etc.)
  • ABSTAIN: non-specific, under-supported, or unstable under perturbations / stress tests
  • PASS: survives all enabled gates at the chosen operating point (τ)

Important: the LLM (if enabled) never decides acceptance. It may propose candidates; the audit suite is the decider.


🧪 Built-in stress tests (counterfactuals without external knowledge)

  • Context swap: shuffle study context (e.g., BRCA → LUAD) to test context dependence
  • Evidence dropout: randomly remove supporting genes (seeded; min_keep enforced)
  • Contradiction injection (optional): introduce internally contradictory candidates to test FAIL gates

These are specification-driven perturbations intended to validate that the pipeline abstains for the right reasons, with stress-specific reason codes.
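The context-swap perturbation can be sketched as a seeded replacement of the Sample Card condition; `context_swap` and its arguments are illustrative names, not the library's API.

```python
# Seeded context swap: replace the study condition with a different one from
# a pool (e.g., BRCA -> LUAD), then re-audit. Context-dependent claims
# should flip to ABSTAIN with a stress-specific reason code.
import random

def context_swap(sample_card, pool, seed=0):
    rng = random.Random(seed)  # seeded, per the reproducibility contract
    alternatives = [c for c in pool if c != sample_card["condition"]]
    swapped = dict(sample_card)
    swapped["condition"] = rng.choice(alternatives)
    return swapped

card = {"condition": "BRCA", "tissue": "breast"}
swapped = context_swap(card, ["BRCA", "LUAD", "COAD"])
print(swapped["condition"] != "BRCA")  # True
```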


♻️ Reproducibility by default

LLM-PathwayCurator is deterministic by default:

  • fixed seeds (CLI + library defaults)
  • pinned parsing + hashing utilities
  • stable output schemas and reason codes
  • run metadata persisted to run_meta.json (and runner-level manifest.json when used)

Paper-side runners (e.g., paper/scripts/fig2_run_pipeline.py) orchestrate reproducible sweeps and do not implement scientific logic; they call the library entrypoint (llm_pathway_curator.pipeline.run_pipeline).
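The kind of provenance persisted per run can be sketched as a deterministic digest of the run inputs; the field names below are illustrative, not the actual run_meta.json schema.

```python
# Deterministic run metadata: same seed + same inputs -> identical record,
# so a run can be matched to its provenance byte-for-byte.
import hashlib
import json

def run_meta(seed, inputs):
    payload = json.dumps(inputs, sort_keys=True).encode()
    return {"seed": seed, "input_sha256": hashlib.sha256(payload).hexdigest()}

meta = run_meta(0, {"evidence_table": "evidence_table.tsv"})
print(sorted(meta))  # ['input_sha256', 'seed']
```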


📦 Installation


Option A: PyPI (recommended)

pip install llm-pathway-curator

(See PyPI project page: https://pypi.org/project/llm-pathway-curator/)

Option B: From source (development)

git clone https://github.com/kenflab/LLM-PathwayCurator.git
cd LLM-PathwayCurator
pip install -e .

🐳 Docker (recommended for reproducibility)

We provide an official Docker environment (Python + R + Jupyter), sufficient to run LLM-PathwayCurator and most paper figure generation.
Optionally includes Ollama for local LLM annotation (no cloud API key required).

  • Option A: Prebuilt image (recommended)

    Use the published image from GitHub Container Registry (GHCR).

    # pull the published image
    docker pull ghcr.io/kenflab/llm-pathway-curator:official
    

    Run Jupyter:

    docker run --rm -it \
      -p 8888:8888 \
      -v "$PWD":/work \
      -e GEMINI_API_KEY \
      -e OPENAI_API_KEY \
      ghcr.io/kenflab/llm-pathway-curator:official
    

    Open Jupyter: http://localhost:8888

    (Use the token printed in the container logs.)

    Note: for manuscript reproducibility, we also provide versioned tags (e.g., :0.1.0). Prefer a version tag when matching a paper release.

  • Option B: Build locally (development)

    • Option B-1: Build locally with Compose (recommended for dev)
      # from the repo root
      docker compose -f docker/docker-compose.yml build
      docker compose -f docker/docker-compose.yml up
      

      B-1.1) Open Jupyter: http://localhost:8888

      B-1.2) If prompted for "Password or token"

      • Get the tokenized URL from container logs:
        docker compose -f docker/docker-compose.yml logs -f llm-pathway-curator
        
      • Then either:
        • open the printed URL (contains ?token=...) in your browser, or
        • paste the token value into the login prompt.
    • Option B-2: Build locally without Compose (alternative)
      # from the repo root
      docker build -f docker/Dockerfile -t llm-pathway-curator:official .
      

      B-2.1) Run Jupyter

      docker run --rm -it \
        -p 8888:8888 \
        -v "$PWD":/work \
        -e GEMINI_API_KEY \
        -e OPENAI_API_KEY \
        llm-pathway-curator:official
      

      B-2.2) Open Jupyter: http://localhost:8888


🖥️ Apptainer / Singularity (HPC)

  • Option A: Prebuilt image (recommended)

    Use the published image from GitHub Container Registry (GHCR).

    apptainer build llm-pathway-curator.sif docker://ghcr.io/kenflab/llm-pathway-curator:official
    
  • Option B: Build a .sif from the locally built Docker image (development)

    docker compose -f docker/docker-compose.yml build
    apptainer build llm-pathway-curator.sif docker-daemon://llm-pathway-curator:official
    

Run Jupyter (either image):

apptainer exec --cleanenv \
  --bind "$PWD":/work \
  llm-pathway-curator.sif \
  bash -lc 'jupyter lab --ip=0.0.0.0 --port=8888 --no-browser'

🤖 LLM usage (proposal-only; optional)

If enabled, the LLM is confined to proposal steps and must emit schema-bounded JSON with resolvable EvidenceTable links.

Backends (examples):

  • Ollama: LLMPATH_OLLAMA_HOST, LLMPATH_OLLAMA_MODEL
  • Gemini: GEMINI_API_KEY
  • OpenAI: OPENAI_API_KEY

Typical environment:

export LLMPATH_BACKEND="ollama"   # ollama|gemini|openai

Deterministic settings are used by default (e.g., temperature=0), and runs persist prompt/raw/meta artifacts alongside run_meta.json.
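The "schema-bounded JSON with resolvable links" constraint can be sketched as a gate that rejects any proposal that is not valid JSON, carries fields outside the schema, or cites an unknown term. The function and allowed keys below are assumptions for illustration.

```python
# Reject LLM proposals that are not schema-bounded JSON with resolvable
# EvidenceTable links (illustrative names, not the library's schema).
import json

ALLOWED_KEYS = {"term_id", "claim_type", "evidence_genes"}

def accept_proposal(raw, known_terms):
    try:
        claim = json.loads(raw)
    except json.JSONDecodeError:
        return False  # free text is not evidence
    if set(claim) - ALLOWED_KEYS:
        return False  # forbidden fields fail mechanically
    return claim.get("term_id") in known_terms

good = ('{"term_id": "GO:0006955", "claim_type": "direction", '
        '"evidence_genes": ["TNF"]}')
print(accept_proposal(good, {"GO:0006955"}))                        # True
print(accept_proposal("The pathway is clearly activated.", set()))  # False
```

Note the asymmetry: the proposal step may be generative, but acceptance is purely mechanical.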


📄 Manuscript reproduction

paper/ contains manuscript-facing scripts, Source Data exports, and frozen/derived artifacts (when redistributable).


🧾 Citation

If you use LLM-PathwayCurator, please cite:



