Transform enrichment outputs into verifiable, auditable pathway claims with calibrated abstention.


LLM-PathwayCurator

Enrichment interpretations → audited, decision-grade pathway claims.



🚀 What this is

LLM-PathwayCurator is an interpretation quality-assurance (QA) layer for enrichment analysis.
It does not introduce a new enrichment statistic. Instead, it turns enrichment-analysis (EA) outputs into auditable decision objects:

  • Input: enrichment term lists from ORA (e.g., Metascape) or rank-based enrichment (e.g., fgsea, an implementation of the GSEA method)
  • Output: typed, evidence-linked claims + PASS/ABSTAIN/FAIL decisions + reason-coded audit logs
  • Promise: we abstain when claims are unstable, under-supported, contradictory, or context-nonspecific

Selective prediction for pathway interpretation: calibrated abstention is a feature, not a failure.

Fig. 1a. Overview of the LLM-PathwayCurator workflow: EvidenceTable → modules → claims → audits. (bioRxiv preprint: DOI/link will be added)


🧭 Why this is different (and why it matters)

Enrichment tools return ranked term lists. In practice, interpretation breaks because:

  1. Representative terms are ambiguous under study context
  2. Gene support is opaque, enabling cherry-picking
  3. Related terms share or bridge evidence in non-obvious ways
  4. There is no mechanical stop condition for fragile narratives

LLM-PathwayCurator replaces narrative endorsement with audit-gated decisions.
We transform ranked terms into machine-auditable claims by enforcing:

  • Evidence-linked constraints: claims must resolve to valid term/module identifiers and supporting-gene evidence
  • Stability audits: supporting-gene perturbations yield stability proxies (operating point: τ)
  • Context validity stress tests: context swap reveals context dependence without external knowledge
  • Contradiction checks: internally inconsistent claims fail mechanically
  • Reason-coded outcomes: every decision is explainable by a finite audit code set

🔍 What this is not

  • Not an enrichment method; it audits enrichment outputs.
  • Not a free-text summarizer; claims are schema-bounded (typed JSON; no narrative prose as “evidence”).
  • Not a biological truth oracle; it checks internal consistency and evidence integrity, not mechanistic truth.
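
As a concrete illustration of "schema-bounded," a typed claim might look like the following JSON. This shape is hypothetical: the field names are invented for illustration and are not the actual claim schema.

```json
{
  "claim_type": "pathway_activity",
  "term_id": "GO:0006955",
  "module_id": "M01",
  "direction": "up",
  "evidence_genes": ["CD3D", "CD8A", "GZMB"],
  "decision": "PASS",
  "reason_codes": []
}
```

Every identifier in a claim must resolve back to the EvidenceTable; free-text prose is never accepted as evidence.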

🧩 Core pipeline (A → B → C)

A) Stability distillation (evidence hygiene)
Perturb supporting genes (seeded) to compute stability proxies (e.g., LOO/jackknife-like survival scores).
Output: distilled.tsv
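
To make the idea concrete, a jackknife-like survival proxy can be sketched as follows. This is a toy illustration of the concept, not the library's implementation (the function name and scoring function are invented).

```python
# Toy sketch of a leave-one-out (LOO) stability proxy:
# the fraction of LOO subsets whose score still clears a threshold.
def loo_survival(score_fn, genes, threshold):
    hits = sum(
        score_fn([g for g in genes if g != left_out]) >= threshold
        for left_out in genes
    )
    return hits / len(genes)

# Toy score: subset size stands in for an enrichment statistic.
print(loo_survival(len, ["A", "B", "C", "D"], threshold=3))  # 1.0
```

A term whose support collapses when single genes are removed gets a low survival score and becomes a candidate for abstention.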

B) Evidence factorization (modules)
Factorize the term–gene bipartite graph into evidence modules that preserve shared vs distinct support.
Outputs: modules.tsv, term_modules.tsv, term_gene_edges.tsv

C) Claims → audit → report

  • C1 (proposal-only): deterministic baseline or optional LLM proposes typed claims with resolvable evidence links
  • C2 (audit/decider): mechanical rules assign PASS/ABSTAIN/FAIL with precedence (FAIL > ABSTAIN > PASS)
  • C3 (report): decision-grade report + audit log (audit_log.tsv) + provenance
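
The precedence rule in C2 can be sketched in a few lines; the gate outcomes and function name below are illustrative, not the library's API.

```python
# Sketch of FAIL > ABSTAIN > PASS precedence:
# the most severe outcome across all enabled audit gates wins.
PRECEDENCE = {"FAIL": 0, "ABSTAIN": 1, "PASS": 2}

def decide(gate_outcomes):
    return min(gate_outcomes, key=PRECEDENCE.__getitem__)

print(decide(["PASS", "ABSTAIN", "PASS"]))  # ABSTAIN
print(decide(["PASS", "FAIL", "ABSTAIN"]))  # FAIL
```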

⚡ Quick start (library entrypoint)

llm-pathway-curator run \
  --sample-card examples/demo/sample_card.json \
  --evidence-table examples/demo/evidence_table.tsv \
  --out out/demo/

Key outputs (stable contract)

  • audit_log.tsv — PASS/ABSTAIN/FAIL decisions with reason codes
  • run_meta.json — run metadata and provenance
  • distilled.tsv, modules.tsv, term_modules.tsv, term_gene_edges.tsv — intermediate evidence artifacts from the pipeline stages above

📊 Rank & visualize ranked terms (rank / plot-ranked)

LLM-PathwayCurator includes two small post-processing commands for ranking and publication-ready visualization of ranked terms/modules:

  • llm-pathway-curator rank — produces a ranked table (claims_ranked.tsv) for downstream plots and summaries.
  • llm-pathway-curator plot-ranked — renders ranked terms/modules as either:
    • bars (Metascape-like horizontal bars), or
    • packed circles (module-level circle packing with term circles inside).

A) Rank (produce claims_ranked.tsv)

Use rank to generate a deterministic ranked table from a run output directory.

llm-pathway-curator rank --help
# Typical workflow: point rank to a run directory and write claims_ranked.tsv
# (See --help for the exact flags supported by your installed version.)

B) Plot (bars or packed circles)

plot-ranked auto-detects claims_ranked.tsv (recommended) or falls back to audit_log.tsv under --run-dir.

Packed circles require an extra dependency: python -m pip install circlify

Bars (Metascape-like)

llm-pathway-curator plot-ranked \
  --mode bars \
  --run-dir out/demo \
  --out-png out/demo/plots/ranked_bars.png \
  --decision PASS \
  --group-by-module \
  --left-strip \
  --strip-labels \
  --bar-color-mode module

Packed circles (modules → terms)

llm-pathway-curator plot-ranked \
  --mode packed \
  --run-dir out/demo \
  --out-png out/demo/plots/ranked_packed.png \
  --decision PASS \
  --term-color-mode module

Packed circles (direction shading)

llm-pathway-curator plot-ranked \
  --mode packed \
  --run-dir out/demo \
  --out-png out/demo/plots/ranked_packed.direction.png \
  --decision PASS \
  --term-color-mode direction

Consistent module labels/colors across plots

plot-ranked assigns a single module display rank (M01, M02, ...) and a stable module color per module_id, so bars and packed circles can be placed side-by-side without label/color drift.
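
One simple way to get a stable per-module color is to derive it deterministically from module_id; the sketch below is illustrative and may differ from what plot-ranked actually does.

```python
import hashlib

# Hypothetical stable color assignment: hash module_id into a fixed palette
# so the same module gets the same color in every plot.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

def module_color(module_id: str) -> str:
    digest = hashlib.sha256(module_id.encode()).hexdigest()
    return PALETTE[int(digest, 16) % len(PALETTE)]

print(module_color("M01") == module_color("M01"))  # True (deterministic)
```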


⚖️ Inputs (contracts)

EvidenceTable (minimum required columns)

Each row is one enriched term.

Required columns:

  • term_id, term_name, source
  • stat, qval, direction
  • evidence_genes (supporting genes, joined with ; in the TSV)
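
A minimal EvidenceTable row (values invented for illustration) and how the ;-joined gene list parses:

```python
import csv
import io

# One illustrative EvidenceTable row with the required columns.
header = ["term_id", "term_name", "source", "stat", "qval", "direction", "evidence_genes"]
row = ["GO:0006955", "immune response", "GO_BP", "2.31", "0.0042", "up", "CD3D;CD8A;GZMB"]
tsv = "\t".join(header) + "\n" + "\t".join(row) + "\n"

rec = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
genes = rec["evidence_genes"].split(";")  # supporting genes are ';'-joined
print(genes)  # ['CD3D', 'CD8A', 'GZMB']
```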

Sample Card (study context)

Structured context record used for proposal and context gating, e.g.:

  • condition/disease, tissue, perturbation, comparison
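
A hypothetical Sample Card using the fields above (the exact schema may differ; see examples/demo/sample_card.json for a real one):

```json
{
  "condition": "BRCA",
  "tissue": "breast",
  "perturbation": "anti-PD-1",
  "comparison": "treated_vs_control"
}
```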

Adapters for common tools live under src/llm_pathway_curator/adapters/.


🔧 Adapters (Input → EvidenceTable)

Adapters are intentionally conservative:

  • preserve evidence identity (term × genes)
  • avoid destructive parsing
  • keep TSV round-trips stable (contract drift is treated as a bug)

See: src/llm_pathway_curator/adapters/README.md


🛡️ Decisions: PASS / ABSTAIN / FAIL

LLM-PathwayCurator assigns decisions by mechanical audit gates:

  • FAIL: auditable violations (evidence-link drift, schema violations, contradictions, forbidden fields, etc.)
  • ABSTAIN: non-specific, under-supported, or unstable under perturbations / stress tests
  • PASS: survives all enabled gates at the chosen operating point (τ)

Important: the LLM (if enabled) never decides acceptance. It may propose candidates; the audit suite is the decider.


🧪 Built-in stress tests (counterfactuals without external knowledge)

  • Context swap: shuffle study context (e.g., BRCA → LUAD) to test context dependence
  • Evidence dropout: randomly remove supporting genes (seeded; min_keep enforced)
  • Contradiction injection (optional): introduce internally contradictory candidates to test FAIL gates

These are specification-driven perturbations intended to validate that the pipeline abstains for the right reasons, with stress-specific reason codes.
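
Evidence dropout, for example, reduces to a seeded random removal that never drops below a floor; the function and parameter names below are illustrative, not the library's API.

```python
import random

# Sketch of seeded evidence dropout with a min_keep floor.
def evidence_dropout(genes, drop_frac=0.3, min_keep=3, seed=42):
    rng = random.Random(seed)  # fixed seed => deterministic perturbation
    n_drop = min(int(len(genes) * drop_frac), max(len(genes) - min_keep, 0))
    dropped = set(rng.sample(genes, n_drop))
    return [g for g in genes if g not in dropped]

genes = ["CD3D", "CD8A", "GZMB", "PRF1", "IFNG"]
print(evidence_dropout(genes))  # deterministic for a fixed seed; keeps >= min_keep genes
```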


♻️ Reproducibility by default

LLM-PathwayCurator is deterministic by default:

  • fixed seeds (CLI + library defaults)
  • pinned parsing + hashing utilities
  • stable output schemas and reason codes
  • run metadata persisted to run_meta.json (and runner-level manifest.json when used)

Paper-side runners (e.g., paper/scripts/fig2_run_pipeline.py) orchestrate reproducible sweeps and do not implement scientific logic; they call the library entrypoint (llm_pathway_curator.pipeline.run_pipeline).


📦 Installation


Option A: PyPI (recommended)

pip install llm-pathway-curator

(See PyPI project page: https://pypi.org/project/llm-pathway-curator/)

Option B: From source (development)

git clone https://github.com/kenflab/LLM-PathwayCurator.git
cd LLM-PathwayCurator
pip install -e .

🐳 Docker (recommended for reproducibility)

We provide an official Docker environment (Python + R + Jupyter), sufficient to run LLM-PathwayCurator and most paper figure generation.
Optionally includes Ollama for local LLM annotation (no cloud API key required).

  • Option A: Prebuilt image (recommended)

    Use the published image from GitHub Container Registry (GHCR).

    # pull the prebuilt image from GHCR
    docker pull ghcr.io/kenflab/llm-pathway-curator:official
    

    Run Jupyter:

    # run from the repo root so "$PWD" (notebooks, data) is mounted at /work
    docker run --rm -it \
      -p 8888:8888 \
      -v "$PWD":/work \
      -e GEMINI_API_KEY \
      -e OPENAI_API_KEY \
      ghcr.io/kenflab/llm-pathway-curator:official
    

    Open Jupyter: http://localhost:8888

    (Use the token printed in the container logs.)

    Note: for manuscript reproducibility, we also provide versioned tags (e.g., :0.1.0); prefer a version tag when matching a paper release.

  • Option B: Build locally (development)

    • Option B-1: Build locally with Compose (recommended for dev)
      # from the repo root
      docker compose -f docker/docker-compose.yml build
      docker compose -f docker/docker-compose.yml up
      

      B-1.1) Open Jupyter: http://localhost:8888

      B-1.2) If prompted for "Password or token"

      • Get the tokenized URL from container logs:
        docker compose -f docker/docker-compose.yml logs -f llm-pathway-curator
        
      • Then either:
        • open the printed URL (contains ?token=...) in your browser, or
        • paste the token value into the login prompt.
    • Option B-2: Build locally without Compose (alternative)
      # from the repo root
      docker build -f docker/Dockerfile -t llm-pathway-curator:official .
      

      B-2.1) Run Jupyter

      docker run --rm -it \
        -p 8888:8888 \
        -v "$PWD":/work \
        -e GEMINI_API_KEY \
        -e OPENAI_API_KEY \
        llm-pathway-curator:official
      

      B-2.2) Open Jupyter: http://localhost:8888


🖥️ Apptainer / Singularity (HPC)

  • Option A: Prebuilt image (recommended)

    Use the published image from GitHub Container Registry (GHCR).

    apptainer build llm-pathway-curator.sif docker://ghcr.io/kenflab/llm-pathway-curator:official
    
  • Option B: Build a .sif from the local Docker image (development)

    docker compose -f docker/docker-compose.yml build
    apptainer build llm-pathway-curator.sif docker-daemon://llm-pathway-curator:official
    

Run Jupyter (either image):

apptainer exec --cleanenv \
  --bind "$PWD":/work \
  llm-pathway-curator.sif \
  bash -lc 'jupyter lab --ip=0.0.0.0 --port=8888 --no-browser'

🤖 LLM usage (proposal-only; optional)

If enabled, the LLM is confined to proposal steps and must emit schema-bounded JSON with resolvable EvidenceTable links.

Backends (examples):

  • Ollama: LLMPATH_OLLAMA_HOST, LLMPATH_OLLAMA_MODEL
  • Gemini: GEMINI_API_KEY
  • OpenAI: OPENAI_API_KEY

Typical environment:

export LLMPATH_BACKEND="ollama"   # ollama|gemini|openai

Deterministic settings are used by default (e.g., temperature=0), and runs persist prompt/raw/meta artifacts alongside run_meta.json.


📄 Manuscript reproduction

paper/ contains manuscript-facing scripts, Source Data exports, and frozen/derived artifacts (when redistributable).


🧾 Citation

If you use LLM-PathwayCurator, please cite:



