llm-guard-kit
Real-time reliability monitoring, A2A trust management, and self-repair for LLM agents.
v0.65.1: P(True) NQ AUROC corrected from 0.810 → 0.623 (tie-handling bug in the custom auroc()); the resulting gate FAIL is documented; the DomainRouter NQ signal is unchanged (mfe_sc_pair 0.667).
v0.54.0: SQLSemanticValidator (schema-aware SQL judge; Databricks/Spark/BigQuery/Snowflake/PostgreSQL dialects) + SQLExecutionOracle; ChartValidator (spec-aware judge for wrong axes/units/chart type); AutoGenAdapter + CrewAIAdapter (multi-agent orchestrator hooks → A2A trust); P(True) judge-model staleness auto-warn; FailureTaxonomist: EXCESSIVE_SEARCH unlocked in production, LOW_RISK pass-through, ANSWER_UNSUPPORTED embedding+word gate, CONFLICTING_EVIDENCE observation-contradiction signal, PREMATURE_STOP multi-hop detection.
📋 See CHANGELOG.md for full version history.
What it does
llm-guard-kit wraps any ReAct / tool-calling LLM agent with a reliability stack — no labels required on day one:
| Component | What it does |
|---|---|
| AgentGuard | Scores completed chains with SC_OLD behavioral signals + optional Sonnet judge. Emits A2A trust objects. |
| A2ATrustObject | Structured confidence envelope for agent-to-agent handoff (answer + risk + tier + failure_mode + hint). |
| QueryRewriter | When Agent A has low confidence, generates 3 diverse query reformulations for Agent B. |
| LabelFreeScorer | Raw behavioral risk scoring in <15 ms. Zero cold start. |
| QppgMonitor | Drop-in agent monitor. Auto-calibrates, fires alerts, persists to SQLite, exports CSV. |
| FailureTaxonomist | Diagnoses why a chain failed (retrieval failure, excessive search, hallucination, …). |
| SelfHealer | Converts a failure diagnosis into prompt injections that repair the agent mid-run. Validated: RETRIEVAL_FAILURE +11.1 pp, EXCESSIVE_SEARCH +20 pp pass@1 (exp_selfhealer_ab). |
| AdversarialChainDetector | Detects FARL-style "confident_wrong" adversarial chains. Internal CV AUROC 0.9960 (174 self-curated HP chains, post-hoc features). External holdout AUROC 0.9836 [0.965, 0.997] on FARL phase2/MuSiQue chains (exp_adversarial_holdout). See caveat below. Load via load_default(). |
| MultiTurnGuard | Scores multi-turn conversations for reliability drift across turns. score_factual_turn() uses the P(True) factual-correctness path (is_factual_reliability_signal=True). |
| DeepChainScorer | Scores 5–8-step multi-hop chains (2Wiki-style) using retrieval-trajectory slope. Auto-routes: short chains → SC_OLD; long chains (≥4 Search steps) → RetrievalCascadeScorer (AUROC 0.629 on 2Wiki vs SC_OLD 0.545); very long (≥6) + optional Mamba → MambaRiskScorer. Zero cost, pure numpy. |
| SQLSemanticValidator | Schema-aware SQL semantic judge (Haiku, ~$0.0003/call). Detects wrong JOINs, missing GROUP BY, incorrect filters. Multi-dialect: ANSI, Databricks/Delta, Spark, BigQuery, Snowflake, PostgreSQL, MySQL. Zero-cost heuristic fallback with no API key. |
| SQLExecutionOracle | Runs SQL against any DB-API 2.0 connection, checks shape + exceptions, blends execution risk with the behavioral score. Works with SQLite, Databricks SQL, Snowflake, psycopg2. check_raw() for Spark/BigQuery. |
| ChartValidator | Spec-aware chart/table judge: wrong chart type, swapped axes, unit mismatches, field mismatches. Supports Vega-Lite, Chart.js, and plain-text specs. Zero-cost heuristic fallback. |
| AutoGenAdapter | Converts AutoGen GroupChat message history (tool calls, function calls, replies) → AgentGuard chain scoring + A2A trust handoff. register_hook() for real-time monitoring. |
| CrewAIAdapter | Converts CrewAI TaskOutput / CrewOutput → AgentGuard chain scoring + A2A trust handoff. wrap_task() for transparent scoring on task completion. |
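The SQL components above have no snippet elsewhere in this README, so here is a minimal sketch of the intended flow against an in-memory SQLite connection. The import path, constructor argument, and check() entry point are assumptions (only check_raw() is named above); treat it as illustrative, not canonical.
import sqlite3
from llm_guard import SQLExecutionOracle  # import path assumed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 42.0), (2, "US", 17.5)])

oracle = SQLExecutionOracle(conn)  # any DB-API 2.0 connection
result = oracle.check(             # check() is a hypothetical entry point
    sql="SELECT region, SUM(total) FROM orders GROUP BY region",
    behavioral_risk=0.31,          # blended with execution risk per the table above
)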
Namespace note:
LabelFreeScorer, QppgMonitor, FailureTaxonomist, and SelfHealer live in the qppg_service package, not llm_guard:
# Correct:
from qppg_service import LabelFreeScorer, QppgMonitor
from qppg_service import FailureTaxonomist, SelfHealer
# This will fail:
from llm_guard import LabelFreeScorer  # ImportError
MultiTurnGuard (v0.38.0): Full MT-Bench evaluation (n=2,575 pairs, lmsys/mt_bench_human_judgments) — AUROC 0.667 [0.652, 0.681] with use_ptrue=True. The CI collapsed from ±0.124 (n=50) to ±0.015 (full n). Note: MT-Bench measures response-quality preference, not factual reliability — these are correlated but different constructs. See docs/mt_bench_validation.md. Lexical-only mode AUROC 0.581 — not valid.
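A usage sketch for the factual path. The class, score_factual_turn(), and use_ptrue are confirmed above; the import path and argument names are assumptions.
from llm_guard import MultiTurnGuard  # import path assumed

guard = MultiTurnGuard(use_ptrue=True)  # P(True) factual-correctness path

risk = guard.score_factual_turn(        # argument names assumed
    question="Which planet has the most confirmed moons?",
    answer="Saturn, with 146 confirmed moons.",
)
print(risk)  # higher = more likely unreliable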
AdversarialChainDetector caveat: the external holdout AUROC of 0.9836 should be interpreted carefully — all adversarial chains come from the MuSiQue domain and all correct chains from HotpotQA, so domain differences may partly drive the discrimination. A within-domain holdout (HotpotQA adversarial vs HotpotQA correct, unseen questions) is needed to fully validate. load_default() emits a UserWarning at runtime. Methodology, dataset construction, and honest limitations: docs/adversarial_methodology.md
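A loading sketch that surfaces the documented warning. load_default() and the UserWarning are confirmed above; the import path, the classmethod form, and the score() call shape are assumptions.
import warnings
from llm_guard import AdversarialChainDetector  # import path assumed

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    detector = AdversarialChainDetector.load_default()  # classmethod form assumed
for w in caught:
    print(f"caveat: {w.message}")  # within-domain holdout still pending

steps = [{"thought": "Search.", "action_type": "Search",
          "action_arg": "Dune author", "observation": "Frank Herbert wrote Dune."}]
risk = detector.score("Who wrote Dune?", steps, "Frank Herbert")  # call shape assumed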
Validated AUROC — held-out evaluation (HotpotQA within-domain, TriviaQA cross-domain):
| Method | Within-domain | Cross-domain | Cost/chain |
|---|---|---|---|
| MiniJudge (SC_OLD + LogReg, distilled from Sonnet) | 0.747 ± 0.10 | — | $0 |
| SC_OLD behavioral ensemble (n=0 labels) | 0.817 | 0.703 (2Wiki) / 0.659 (TV) | $0 |
| SC_OLD + Sonnet judge (J5) | 0.777 | 0.741 | ~$0.007 |
| Conformal alert precision at FPR ≤ 10% | 0.908 | — | — |
| Mid-chain Haiku at step 2 | 0.683 | — | ~$0.001/step |
| RetrievalCascadeScorer (long chains ≥4 Search steps) | 0.592 (HP) | 0.629 (2Wiki) / 0.669 (Musique) / 0.742 (TriviaQA) | $0 |
v0.17.0: MiniJudge achieves 0.747 AUROC (HP 5-fold CV) at zero inference cost — within 2.7 pp of the Sonnet judge. Cross-domain live validation (exp156/156b): 2WikiMultiHop 0.703 [0.628, 0.775] ✅ · TriviaQA 0.659 [0.614, 0.705] ✅ · MuSiQue 0.613 (CI wide) · NQ 0.524 (open-domain factoid, near-random). RetrievalCascadeScorer (v0.52.0) is superior on long chains (avg ≥4 Search steps); SC_OLD beats it on short chains (HP AUROC 0.817 vs 0.592). DeepChainScorer auto-selects the right scorer per chain length. All figures are from held-out evaluation. See docs/production_integration.md for full methodology.
MechanismFeatureExtractor — Phase A vs Phase B deployment gap:
The MechanismFeatureExtractor (v0.46.0) achieves AUROC 0.964 on HaluEval2-QA in Phase A (curated benchmark, label-matched training). This number does not transfer to real LLM outputs.
| Dataset | Phase A (curated) | Phase B (model-generated) | Status |
|---|---|---|---|
| HaluEval2-QA | 0.964 [0.947, 0.978] | 0.559 [0.509, 0.608] | ⚠ Large gap |
| Medical QA (PubMed) | 0.530 [0.446, 0.621] | 0.412 [0.338, 0.483] | ❌ Worse than random |
| HaluEval2-Dialogue | 0.702 [0.662, 0.743] | — | ✓ Gate pass |
| TruthfulQA | 0.668 [0.625, 0.712] | — | ✓ Gate pass |
| Sycophancy | 0.503 [0.447, 0.561] | 0.413 [0.342, 0.492] | ❌ Fail |
⚠ Medical domain exclusion:
MechanismFeatureExtractor is actively anti-predictive (AUROC 0.412) on clinical/biomedical QA. Do not use it for medical, clinical, or health-related agent reliability scoring.
Phase A → Phase B gap root cause: text-level features (semantic similarity, answer length, certainty density) are confounded by the LLM's generation style. A confident-sounding wrong answer and a confident-sounding correct answer are nearly indistinguishable via surface text features alone. Phase A benchmark data is annotation-curated (hallucinations injected artificially), which inflates AUROC vs. naturally occurring errors in model-generated traces.
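Given the exclusion, it is safer to fail closed in code than to rely on reviewers remembering it. A sketch with a hypothetical score_mechanism() wrapper; only the class name and the exclusion itself come from this section.
# Fail closed on excluded domains before any MechanismFeatureExtractor call.
# score_mechanism() is a hypothetical wrapper; the extractor's real scoring
# API is not documented here and the .score() call is assumed.
EXCLUDED = {"medical", "clinical", "biomedical", "health"}

def score_mechanism(extractor, question: str, answer: str, domain: str) -> float:
    if domain.lower() in EXCLUDED:
        raise ValueError(
            f"MechanismFeatureExtractor is anti-predictive (AUROC 0.412) "
            f"on {domain!r} content; refusing to score."
        )
    return extractor.score(question, answer)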
Install
# Core (no API key needed)
pip install llm-guard-kit
# With specific framework integrations
pip install "llm-guard-kit[langchain]" # LangChain agents
pip install "llm-guard-kit[openai]" # OpenAI Assistants
pip install "llm-guard-kit[llamaindex]" # LlamaIndex
pip install "llm-guard-kit[haystack]" # Haystack pipelines
# HTTP server + dashboard
pip install "llm-guard-kit[server]"
# Everything
pip install "llm-guard-kit[all]"
Requires Python 3.9+.
Performance
Measured on Apple M-series, behavioral scoring only (v0.48 benchmarks):
| Concurrency | p50 | p95 | p99 | Throughput |
|---|---|---|---|---|
| c=1 serial | 17.6 ms | 19.1 ms | 20.4 ms | — |
| c=10 | 179 ms | — | 209 ms | 56 req/s |
| c=50 | 906 ms | — | 1008 ms | 55 req/s |
| c=100 | 1697 ms | — | 1887 ms | 59 req/s |
Note: Throughput is CPU-bound at ~57 req/s regardless of concurrency. Per-request latency scales linearly with concurrency because the behavioral scorer is single-threaded Python. For high-concurrency deployments, run multiple worker processes (e.g. uvicorn --workers 4). Run python experiments/latency_benchmark.py to reproduce on your hardware.
Table of Contents
- MiniJudge — $0 Local Judge (v0.17.0)
- AgentGuard + A2A Trust (v0.6.0)
- Quick Start — Drop-in Monitor
- Framework Integrations
- Full Pipeline — Detect → Diagnose → Repair
- Persistence & Auto-Calibration
- CLI Reference
- HTTP API Server
- Monitoring Dashboard
- Docker Deployment
- SaaS API Key Auth
- Drift Detection
- Agent Step Format
- Retrieval Quality Diagnostics
- MCP Server — Claude Desktop & Cursor Integration
0. MiniJudge — $0 Local Judge (v0.17.0)
MiniJudge is a logistic regression model over SC_OLD behavioral features, distilled from Sonnet soft labels — AUROC 0.818 on HotpotQA for the 14-feature model (5-fold CV), zero inference cost. The core 11-feature set is tabulated below.
See docs/sc_old_features.md for the complete, reproducible feature reference with ablation results.
from llm_guard import MiniJudge
judge = MiniJudge() # loads pre-trained weights automatically
risk = judge.score(question, steps, final_answer) # float in [0, 1]
SC_OLD Feature Table (11 behavioral features)
All features computed from chain dict in < 1 ms. No API calls. Full definitions and ablation in docs/sc_old_features.md.
| # | Feature | What it measures | Standalone AUROC |
|---|---|---|---|
| 1 | sc1_loop_rate | Fraction of repeated action types | 0.571 |
| 2 | sc2_steps_norm | Normalized step count (÷ 6) | 0.622 |
| 3 | sc3_empty_obs_rate | Fraction of observations < 15 chars | 0.733 |
| 4 | sc5_repeated_search | Query repetition rate | 0.500 |
| 5 | sc6_answer_gap | Short answer relative to question length | 0.624 |
| 6 | sc8_backtrack_rate | Fraction of consecutive same-type steps | 0.575 |
| 7 | sc9_obs_util | Inverse average observation length | 0.605 |
| 8 | sc10_coherence_drop | Thought length variance (pstdev/mean) | 0.596 |
| 9 | sc11_ans_obs_mismatch | 1 − Jaccard(answer, last_observation) | 0.691 |
| 10 | search_count_norm | Search steps / total steps | 0.733 |
| 11 | avg_obs_norm | Normalized average observation length | 0.608 |
Most valuable single contributors (drop-one ablation): f12_ans_question_sim (−0.010 AUROC when dropped), sc9_obs_util (−0.008), sc10_coherence_drop (−0.007).
For maximum accuracy, blend MiniJudge with P(True):
from llm_guard import MiniJudge, AgentGuard, probe_ensemble_blend
guard = AgentGuard(api_key="sk-ant-...")
result = guard.score_with_ptrue(question, steps, final_answer)
mini_risk = MiniJudge().score(question, steps, final_answer)
blended = probe_ensemble_blend(mini_risk, result.ptrue_risk, alpha=0.25) # AUROC ~0.74
Opt-in telemetry — feeds your data back into the retrain pipeline:
guard = AgentGuard(
api_key="sk-ant-...",
contribute_labels=True,
telemetry_token="ghp_...", # GitHub PAT with repo scope
telemetry_repo="your-org/llm-guard-labels",
)
# Every score_chain() + update_isotonic(feedback) call sends 11 floats + 1 bit
1. AgentGuard + A2A Trust (v0.6.0)
Chain scoring with validated behavioral signals
from llm_guard import AgentGuard
# Zero cost — behavioral SC_OLD signals only (~0.81 AUROC within-domain)
guard = AgentGuard()
# With Sonnet judge (~$0.007/chain, ~0.74 AUROC cross-domain)
guard = AgentGuard(api_key="sk-ant-...", use_judge=True)
result = guard.score_chain(
question="When was the Eiffel Tower built?",
steps=[
{"thought": "Search for construction date",
"action_type": "Search", "action_arg": "Eiffel Tower construction year",
"observation": "The Eiffel Tower was built from 1887 to 1889..."},
{"thought": "Completed in 1889",
"action_type": "Finish", "action_arg": "1889", "observation": ""},
],
final_answer="1889",
)
print(result.confidence_tier) # "HIGH" / "MEDIUM" / "LOW"
print(result.risk_score) # 0.0–1.0, higher = more likely wrong
print(result.needs_alert) # True when risk >= 0.70 (Precision=0.908)
print(result.failure_mode) # "retrieval_fail" | "long_chain" | None
print(result.judge_label) # "GOOD" | "BORDERLINE" | "POOR" | None
A2A trust handoff
# Agent A produces a trust object
trust = guard.generate_trust_object(question, steps, final_answer)
payload = trust.to_dict() # JSON-serialisable for queue/API transport
# Agent B receives it and conditions its strategy
from llm_guard import A2ATrustObject, QueryRewriter
trust = A2ATrustObject.from_dict(payload)
print(trust.downstream_hint) # "proceed" / "proceed_with_caution" /
# "rewrite_query" / "escalate_to_human"
# When Agent A had low confidence, diversify Agent B's queries
rewriter = QueryRewriter(api_key="sk-ant-...")
variants = rewriter.rewrite_if_needed(question, trust)
# variants = [paraphrase, decomposed_sub_question, alternative_angle]
# Returns [] when no rewrite needed (HIGH/MEDIUM confidence)
Tier degradation
Without a judge configured (AgentGuard(api_key=None)), generate_trust_object()
uses 2-tier routing: PROCEED (risk < 0.50) or ESCALATE (risk ≥ 0.50).
The trust object's routing_mode field will be "behavioral_only_2tier".
With a judge configured, full 4-tier routing is active:
PROCEED → REVIEW → ESCALATE → REJECT, and routing_mode is "full_4tier".
Downstream agents must check routing_mode before applying 4-tier routing logic:
trust = guard.generate_trust_object(question, steps, final_answer)
if trust.routing_mode == "behavioral_only_2tier":
# Only PROCEED / ESCALATE are meaningful
if trust.confidence_tier == "ESCALATE":
route_to_human_review(trust)
else:
# Full 4-tier routing available
if trust.confidence_tier == "LOW":
rewrite_and_retry(trust)
elif trust.confidence_tier == "MEDIUM":
proceed_with_monitoring(trust)
Mid-chain monitoring
# Call inside your agent loop BEFORE each step executes
step = guard.monitor_step(
question="When was the Eiffel Tower built?",
steps_so_far=[step1],
current_action="Search[Eiffel Tower date]",
)
if step.risk == "high":
pass # intervene early — AUROC 0.683 at step 2 (Δ+0.156 vs behavioral)
2. Quick Start
Zero setup — monitor from query 1
from qppg_service import QppgMonitor
monitor = QppgMonitor(threshold=0.65) # alert above this risk score
# Call after every agent run
alert = monitor.track(
question = "Which city is older, Rome or Athens?",
steps = agent_steps, # see step format below
final_answer= "Athens",
finished = True,
)
if alert:
print(f"HIGH RISK ({alert.risk_score:.2f}): {alert.recommendation}")
# Get a stats report
print(monitor.export_report())
monitor.export_csv("agent_risk_log.csv")
Works on query 1. No training. No labels. AUROC 0.817 within-domain out of the box (HotpotQA, exp156).
With SQLite persistence (recommended for production)
monitor = QppgMonitor(
threshold = 0.65,
db_path = "~/.qppg/chains.db", # auto-creates on first run
domain = "prod", # namespace for multi-domain setups
model_name = "claude-opus-4-6",
recal_every = 100, # re-calibrate GMM every N new chains
check_drift = True, # auto-detect distributional drift
)
3. Framework Integrations
LangChain
Drop a callback into any LangChain AgentExecutor — no code changes to your agent:
pip install "llm-guard-kit[langchain]"
from langchain.agents import AgentExecutor, create_react_agent
from qppg_service.integrations.langchain_callback import QppgLangChainCallback
from qppg_service import QppgMonitor
# With persistence
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="prod")
callback = QppgLangChainCallback(monitor=monitor, threshold=0.65)
# Attach to any AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools)
result = agent_executor.invoke(
{"input": "What year was the Eiffel Tower built?"},
config={"callbacks": [callback]},
)
# Check result
score = callback.get_last_result()
if score and score.needs_review:
print(f"HIGH RISK: {score.risk_score:.3f}")
print(f"Behavioral: {score.behavioral_score:.3f}")
What gets captured automatically:
- on_agent_action → thought + tool name + tool input → Search/Finish step
- on_tool_end → tool output → observation
- on_agent_finish → final answer + full chain scoring
Tool name mapping: tavily_search, duckduckgo_search, wikipedia, retriev* → "Search".
OpenAI Assistants API
pip install "llm-guard-kit[openai]"
from openai import OpenAI
from qppg_service.integrations.openai_adapter import score_assistants_run
from qppg_service import QppgMonitor
client = OpenAI()
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="assistants")
# Run your assistant normally
thread = client.beta.threads.create()
client.beta.threads.messages.create(thread.id, role="user", content="When was Python created?")
run = client.beta.threads.runs.create_and_poll(thread.id, assistant_id="asst_xxx")
# Score the completed run
result = score_assistants_run(
client,
thread_id = thread.id,
run_id = run.id,
monitor = monitor, # optional: persists to SQLite
question = "When was Python created?", # optional: auto-extracted from thread
)
print(f"Risk: {result.risk_score:.3f} Review: {result.needs_review}")
LlamaIndex
pip install "llm-guard-kit[llamaindex]"
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager
from qppg_service.integrations.llamaindex_callback import QppgLlamaIndexCallback
from qppg_service import QppgMonitor
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="llamaindex")
qppg_cb = QppgLlamaIndexCallback(monitor=monitor, threshold=0.65)
Settings.callback_manager = CallbackManager([qppg_cb])
# Your index and query engine work unchanged
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is RAG?")
result = qppg_cb.get_last_result()
if result and result.needs_review:
print(f"HIGH RISK: {result.risk_score:.3f}")
Haystack
pip install "llm-guard-kit[haystack]"
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.generators import OpenAIGenerator
from qppg_service.integrations.haystack_callback import QppgHaystackMonitor
from qppg_service import QppgMonitor
# Build your pipeline normally
pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=store))
pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o-mini"))
pipeline.connect("retriever", "generator")
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="haystack")
qppg = QppgHaystackMonitor(pipeline, monitor=monitor)
# Run through the wrapper instead of pipeline.run()
outputs, result = qppg.run({"retriever": {"query": "Who invented Python?"}})
if result and result.needs_review:
print(f"HIGH RISK: {result.risk_score:.3f}")
4. Full Pipeline
Detect → Diagnose → Repair in one block:
from qppg_service import QppgMonitor, FailureTaxonomist, SelfHealer
monitor = QppgMonitor(threshold=0.65, db_path="~/.qppg/prod.db")
tx = FailureTaxonomist()
healer = SelfHealer()
alert = monitor.track(question, steps, final_answer, finished=True)
if alert:
# Diagnose WHY it failed
failure = tx.classify(question, steps, final_answer, finished=True)
print(failure.primary_mode) # "EXCESSIVE_SEARCH" | "RETRIEVAL_FAILURE" | ...
print(failure.explanation) # human-readable explanation
print(failure.confidence) # 0–1
# Get a repair prompt to inject back into the agent
action = healer.suggest(failure, question, steps, final_answer)
print(action.action_type) # "FORCE_FINISH" | "REPHRASE_QUERY" | ...
print(action.prompt_injection) # ready to inject as next agent message
print(action.urgency) # "HIGH" | "MEDIUM" | "LOW"
Failure modes detected:
Production-validated modes (FailureTaxonomist(), default):
| Mode | When triggered | Suggested repair | Validation |
|---|---|---|---|
| RETRIEVAL_FAILURE | mean cosine(obs, question) < 0.35 | REPHRASE_QUERY | F1=0.701 (exp_failure_taxonomy_llm) |
| EXCESSIVE_SEARCH | > 4 search steps | CONSOLIDATE or FORCE_FINISH | +20pp pass@1 (exp_selfhealer_ab) |
| LOW_RISK | No failure mode triggered | — (pass-through) | Production sentinel |
Experimental modes (FailureTaxonomist(experimental=True) — improved heuristics in v0.54, no published AUROC):
| Mode | When triggered |
|---|---|
| CONFLICTING_EVIDENCE | Cosine similarity drop between first/second-half observations |
| INSUFFICIENT_EVIDENCE | Low retrieval coverage + uncertainty density |
| ANSWER_UNSUPPORTED | High answer-gap AND low answer-chain similarity |
| PREMATURE_STOP | Multi-hop question with only 1 search step |
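Everything in the following sketch is confirmed by this section: the experimental=True flag, the classify() signature, and the primary_mode / explanation / confidence fields.
from qppg_service import FailureTaxonomist

question = "Which was released first, Alien or Blade Runner?"
steps = [
    {"thought": "One search should settle this.", "action_type": "Search",
     "action_arg": "Alien release year", "observation": "Alien (1979) ..."},
    {"thought": "Done.", "action_type": "Finish",
     "action_arg": "Alien", "observation": ""},
]

tx = FailureTaxonomist(experimental=True)  # unlocks the four modes above
failure = tx.classify(question, steps, "Alien", finished=True)
if failure.primary_mode == "PREMATURE_STOP":
    # Multi-hop comparison answered after a single search step.
    print(failure.explanation, failure.confidence)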
5. Persistence & Auto-Calibration
SQLite store (ChainStore)
from qppg_service import ChainStore
store = ChainStore("~/.qppg/chains.db")
# Query your stored data
domains = store.get_domains() # ["prod", "staging"]
pool = store.get_calibration_pool("prod", n=200) # last 200 chains as dicts
stats = store.get_domain_stats("prod")
# {"n_chains": 523, "n_alerts": 41, "avg_risk": 0.47, "last_auroc": 0.83}
# Export audit log
csv_str = store.export_audit("prod", fmt="csv")
json_str = store.export_audit("prod", fmt="json")
# Clear a domain (e.g. after model upgrade)
store.clear_domain("prod")
Mixed-domain warm-up
If you have an existing domain with chains, you can bootstrap a new domain from it:
from qppg_service import ChainStore, LabelFreeScorer
store = ChainStore("~/.qppg/chains.db")
scorer = LabelFreeScorer()
# Copy 25 chains from "staging" to "prod" as cross-domain calibration
# (boosts cross-domain AUROC from ~0.50 to ~0.81)
pool = store.get_calibration_pool("staging", n=25)
scorer.calibrate(pool)
# Now scorer works on "prod" questions with much better accuracy
6. CLI Reference
pip install "llm-guard-kit[server]"
Status — see all monitored domains
llm-guard-kit status
llm-guard-kit status --domain prod
Output:
DOMAIN CHAINS ALERTS AVG RISK AUROC DRIFT
--------------------------------------------------------------------
prod 523 41 0.467 0.883 OK
staging 89 7 0.512 n/a OK
Score — score a single chain from a JSON file
# Prepare a chain file
cat > chain.json << 'EOF'
{
"question": "Who invented the telephone?",
"steps": [
{"thought": "I should search.", "action_type": "Search",
"action_arg": "telephone inventor", "observation": "Alexander Graham Bell..."},
{"thought": "Found it.", "action_type": "Finish",
"action_arg": "Alexander Graham Bell", "observation": ""}
],
"final_answer": "Alexander Graham Bell",
"finished": true
}
EOF
llm-guard-kit score --steps-file chain.json --domain prod
Output:
Risk score : 0.312 (OK)
Behavioral : 0.287
GMM score : 0.291
Retrieval : mean_sim=0.612 min_sim=0.489 (GOOD)
Search steps : 1
Calibrate — warm up a domain
# From a JSON file of chain dicts
llm-guard-kit calibrate --domain prod --chains-file my_chains.json
# Mixed-domain warm-up (copy from another domain)
llm-guard-kit calibrate --domain prod --source-domain staging --chains 25
Recalibrate — after a model upgrade
# IMPORTANT: calibration is model-specific. Cross-model AUROC ≈ 0.508 (chance).
llm-guard-kit recalibrate --domain prod --new-model claude-opus-4-6
Export — audit log
llm-guard-kit export --domain prod --format csv > audit.csv
llm-guard-kit export --domain prod --start 2026-01-01 --end 2026-03-01 --format json
Serve — launch the API server
llm-guard-kit serve --domain prod --port 8000 --host 0.0.0.0
Dashboard — launch with monitoring UI
llm-guard-kit dashboard --domain prod --port 8000
# Open http://localhost:8000/dashboard
7. HTTP API Server
For multi-language / microservice deployments:
pip install "llm-guard-kit[server]"
llm-guard-kit serve --port 8000 --host 0.0.0.0
Interactive docs: http://localhost:8000/docs
Score a chain
curl -X POST http://localhost:8000/score \
-H "Content-Type: application/json" \
-d '{
"question": "Who invented Python?",
"steps": [
{"thought": "Search for it.", "action_type": "Search",
"action_arg": "Python creator", "observation": "Guido van Rossum..."},
{"thought": "Done.", "action_type": "Finish",
"action_arg": "Guido van Rossum", "observation": ""}
]
}'
Response:
{
"confidence": 0.72,
"needs_review": false,
"deployment_status": "DEPLOYED",
"risk_score": 0.31,
"behavioral_score": 0.29,
"gmm_score": 0.33
}
Calibrate (add a verified chain)
curl -X POST http://localhost:8000/calibrate \
-H "Content-Type: application/json" \
-d '{"question": "...", "steps": [...], "correct": true}'
Bulk calibrate (seed from existing logs)
curl -X POST http://localhost:8000/bulk-calibrate \
-H "Content-Type: application/json" \
-d '{"chains": [{"question":"...", "steps":[...], "correct":true}, ...]}'
Deployment status
curl http://localhost:8000/status
Reset (destructive — clears calibration)
curl -X POST "http://localhost:8000/reset?confirm=YES_RESET"
8. Monitoring Dashboard
Start the server with dashboard enabled:
llm-guard-kit serve --domain prod --port 8000 --db ~/.qppg/chains.db --dashboard
# or:
python -m qppg_service.server --domain prod --port 8000 --db ~/.qppg/chains.db --dashboard
Open http://localhost:8000/dashboard
Dashboard features:
- Deployment status banner — COLD START / WARMING / DEPLOYED + Est. AUROC
- Drift alert banner — fires when mean risk shifts > 0.10 over a 7-day window
- 5 KPI cards — Total queries, Alerts, Avg risk, Calibration pool size, Avg search steps
- Risk timeline — 24h / 7d / 30d selector, alert threshold line, color-coded points
- Failure mode breakdown — horizontal bar chart of classified failure types
- Chain log — paginated table, sortable, filter All/Alerts-only, search, model column
- Failure Modes section — detailed view with descriptions
- Domain switcher — switch between domains without restarting
- Export CSV — download the current view as CSV
9. Docker Deployment
Single command
git clone https://github.com/avighan/qppg
cd qppg
docker compose up -d
The server starts on http://localhost:8000.
docker-compose.yml (included)
version: "3.9"
services:
qppg:
build: .
ports:
- "8000:8000"
volumes:
- qppg_data:/data
environment:
QPPG_DOMAIN: "default"
QPPG_DB: "/data/chains.db"
QPPG_PORT: "8000"
QPPG_HOST: "0.0.0.0"
# QPPG_ADMIN_KEY: "your-secret-admin-key" # enable API key creation
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/status"]
interval: 30s
volumes:
qppg_data:
Environment variables
| Variable | Default | Description |
|---|---|---|
QPPG_DOMAIN |
default |
Default scoring domain |
QPPG_DB |
/data/chains.db |
SQLite database path |
QPPG_PORT |
8000 |
HTTP port |
QPPG_HOST |
0.0.0.0 |
Bind address |
QPPG_ADMIN_KEY |
(unset) | Secret to enable /api/keys endpoint |
Production with nginx reverse proxy
server {
listen 443 ssl;
server_name qppg.yourcompany.com;
location / {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
10. SaaS API Key Auth
For multi-tenant deployments where different customers score their own agents:
Step 1: Start server with admin key and SQLite
python -m qppg_service.server \
--port 8000 --db /data/chains.db \
--admin-key "your-secret-admin-key" \
--dashboard
Step 2: Create a customer API key
curl -X POST http://localhost:8000/api/keys \
-H "Content-Type: application/json" \
-H "X-Admin-Key: your-secret-admin-key" \
-d '{"customer_id": "acme-corp", "domain_prefix": "prod"}'
Response:
{
"api_key": "jN2xR...abc", ← Save this — shown once only
"key_id": "a3f9b2d1c0e4",
"customer_id": "acme-corp",
"domain_prefix": "prod"
}
Step 3: Score with Bearer token auth
curl -X POST http://localhost:8000/api/prod/score \
-H "Content-Type: application/json" \
-H "Authorization: Bearer jN2xR...abc" \
-d '{"question": "When was Python created?", "steps": [...]}'
Response:
{
"risk_score": 0.34,
"needs_review": false,
"behavioral_score": 0.31,
"gmm_score": 0.37,
"calibration_size": 47
}
Step 4: Export audit log
curl "http://localhost:8000/api/prod/audit-log?fmt=csv&start=1700000000" \
-H "Authorization: Bearer jN2xR...abc" \
> audit.csv
Rate limit: 1,000 requests/hour per API key (in-memory sliding window).
Domain isolation: Each customer's data is stored under {customer_id}:{domain} — fully isolated.
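For reference, the in-memory sliding-window rate limit above amounts to a deque of timestamps per key. The sketch below illustrates the mechanism only; it is not the package's internal implementation.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # 1 hour
LIMIT = 1000            # requests per key per window
_hits = defaultdict(deque)

def allow(api_key: str) -> bool:
    """Return True while the key has fewer than 1,000 hits in the trailing hour."""
    now = time.monotonic()
    window = _hits[api_key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # expire hits that left the window
    if len(window) >= LIMIT:
        return False
    window.append(now)
    return True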
11. Drift Detection
DriftDetector compares mean risk score in the current 7-day window vs the previous 7-day window and fires an alert if the shift exceeds a threshold:
from qppg_service import ChainStore, DriftDetector, DriftAlert
store = ChainStore("~/.qppg/chains.db")
detector = DriftDetector(threshold=0.10, window_days=7, min_samples=10)
alert = detector.check("prod", store)
if alert:
print(f"Drift detected!")
print(f" Direction: {alert.direction}") # "UP" or "DOWN"
print(f" Delta: {alert.delta:+.3f}") # +0.142
print(f" Current mean: {alert.current_mean:.3f}") # 0.612
print(f" Prev mean: {alert.previous_mean:.3f}") # 0.470
print(f" Samples: {alert.n_current}") # 47
print(alert.recommendation)
When drift fires: Run llm-guard-kit recalibrate if you recently upgraded your LLM model (cross-model AUROC ≈ 0.508 — calibration does not transfer across models).
Agent Step Format
Every integration expects steps in this format:
steps = [
{
"thought": "I need to find when the Eiffel Tower was built.",
"action_type": "Search", # "Search" | "Finish" | any custom tool name
"action_arg": "Eiffel Tower construction date",
"observation": "The Eiffel Tower was built between 1887 and 1889..."
},
{
"thought": "I now have the answer.",
"action_type": "Finish",
"action_arg": "1889",
"observation": ""
}
]
Framework adapters (LangChain, OpenAI, LlamaIndex, Haystack) build this format automatically.
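For frameworks without an adapter, the conversion is a small mapping into this format. In the sketch below the raw-trace field names are invented for illustration; QppgMonitor.track() and the target step keys are exactly as documented above.
from qppg_service import QppgMonitor

# Hypothetical custom-agent trace; only the target step format is prescribed.
raw_trace = [
    {"reasoning": "Look up the launch year.", "tool": "web_search",
     "tool_input": "Hubble telescope launch year",
     "tool_output": "Hubble was launched in 1990..."},
    {"reasoning": "Got it.", "tool": "finish",
     "tool_input": "1990", "tool_output": ""},
]

steps = [
    {"thought": r["reasoning"],
     "action_type": "Finish" if r["tool"] == "finish" else "Search",
     "action_arg": r["tool_input"],
     "observation": r["tool_output"]}
    for r in raw_trace
]

monitor = QppgMonitor(threshold=0.65)
alert = monitor.track(question="When was Hubble launched?", steps=steps,
                      final_answer="1990", finished=True)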
Retrieval Quality Diagnostics
A standalone signal you can use independently of the full monitor:
from qppg_service import LabelFreeScorer
scorer = LabelFreeScorer()
rq = scorer.retrieval_quality(question, steps)
# {
# "mean_sim": 0.41, # average cosine(observation, question)
# "min_sim": 0.22, # worst retrieval step
# "quality_label":"POOR", # "GOOD" | "OK" | "POOR"
# "per_step": [0.52, 0.22, ...]
# }
Correct agents: mean_sim = 0.554; wrong agents: 0.458 (Δ+0.096, p<0.01).
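Continuing the snippet above: RETRIEVAL_FAILURE triggers at mean cosine < 0.35 (see the failure-mode table in the Full Pipeline section), so the same output doubles as a standalone pre-answer gate.
# rq comes from scorer.retrieval_quality() above; 0.35 is the
# RETRIEVAL_FAILURE trigger from the failure-mode table.
if rq["mean_sim"] < 0.35 or rq["quality_label"] == "POOR":
    print(f"Weak retrieval (worst step {rq['min_sim']:.2f}): "
          "rephrase the query before trusting the answer.")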
12. MCP Server — Claude Desktop & Cursor Integration (v0.9.0)
llm-guard-kit ships a Model Context Protocol server that exposes all scoring tools directly inside Claude Desktop and Cursor — no code required.
Setup
pip install "llm-guard-kit[mcp]"
Start the server
# stdio transport — for Claude Desktop and Cursor (default)
llm-guard-mcp
# SSE transport — for web clients
llm-guard-mcp --sse --port 8765
Add to Claude Desktop
Open claude_desktop_config.json (on macOS this lives at ~/Library/Application Support/Claude/claude_desktop_config.json) and add:
{
"mcpServers": {
"llm-guard-kit": {
"command": "python3",
"args": ["-m", "llm_guard_mcp.server"],
"cwd": "/path/to/your/project",
"env": { "ANTHROPIC_API_KEY": "sk-ant-..." }
}
}
}
Restart Claude Desktop. Eight new tools appear in the tool panel.
Add to Cursor
Add the same block to ~/.cursor/mcp.json, then invoke with @llm-guard-kit in any Cursor chat.
Tool examples (all outputs verified)
score_chain — score a completed agent chain
# In Claude / Cursor, call the tool directly. In Python:
import asyncio
from llm_guard_mcp.server import _dispatch
steps = [
{"thought": "Search for Python creation date.",
"action_type": "Search", "action_arg": "Python creation year",
"observation": "Python was created by Guido van Rossum, released in 1991."},
{"thought": "Found the answer.",
"action_type": "Finish", "action_arg": "1991", "observation": ""},
]
result = asyncio.run(_dispatch("score_chain", {
"question": "What year was Python created?",
"steps": steps,
"final_answer": "1991",
}))
Response:
{
"chain_id": 1,
"risk_score": 0.2958,
"tier": "HIGH",
"needs_review": true,
"needs_alert": true,
"beh_score": 0.2958,
"ptrue_score": 0.0,
"rl_score": null,
"interpretation": "HIGH risk (0.296): Agent failed, contradicted itself, or gave up. Do not use without verification.",
"action": "Block answer. Alert human reviewer. Do not send to end user."
}
tier is behavioral-only when no ANTHROPIC_API_KEY is set. With a key, ptrue_score is populated and AUROC improves from 0.682 → 0.775.
stream_check — mid-chain abort (call after step 2)
# Check after 2 steps — abort early if chain is failing
result = asyncio.run(_dispatch("stream_check", {
"question": "What year was Python created?",
"steps_so_far": steps[:1], # just the first step
}))
Response (on-track chain):
{
"should_abort": false,
"risk_score": 0.5376,
"step": 1,
"chain_id": null,
"message": "CONTINUE: Chain looks on-track so far."
}
Response (failing chain — empty observations, repeated queries):
{
"should_abort": false,
"risk_score": 0.4964,
"step": 2,
"chain_id": null,
"message": "CONTINUE: Chain looks on-track so far."
}
should_abort=true fires when risk_score >= 0.65. Abort the agent immediately and surface a failure message to the user. Note: mid-chain abort uses SC_OLD at full depth (AUROC 0.683 at step 2). A dedicated partial-chain calibrator is in development and will ship when validated.
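Wired into an agent loop, the abort pattern looks like the sketch below. _dispatch() and the stream_check payload/response fields come from the examples above (the response is assumed to arrive as a dict, as those examples suggest); run_one_step() is a stand-in for your agent's own step function.
import asyncio
from llm_guard_mcp.server import _dispatch

def run_one_step(question, steps_so_far):
    # Stand-in for your agent: produce the next step dict.
    return {"thought": "Searching...", "action_type": "Search",
            "action_arg": question, "observation": "..."}

async def guarded_run(question: str, max_steps: int = 8):
    steps = []
    for _ in range(max_steps):
        steps.append(run_one_step(question, steps))
        check = await _dispatch("stream_check",
                                {"question": question, "steps_so_far": steps})
        if check["should_abort"]:  # fires at risk_score >= 0.65
            return {"aborted": True, "reason": check["message"]}
    return {"aborted": False, "steps": steps}

print(asyncio.run(guarded_run("What year was Python created?")))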
submit_feedback — human label → RL signal
# After a user confirms an answer is correct or wrong
result = asyncio.run(_dispatch("submit_feedback", {
"chain_id": 1,
"correct": True,
"note": "Verified via Wikipedia",
}))
Response:
{
"status": "feedback_recorded",
"chain_id": 1,
"label": "correct",
"n_labeled": 1,
"retrain_ready": false,
"retrain_reason": "Need 29 more labels before first retrain (have 1)",
"note": "Verified via Wikipedia"
}
After 30 labels (min 5 per class), retrain_ready becomes true. Call trigger_retrain or let score_chain auto-retrain.
get_metrics — summary stats for last N days
result = asyncio.run(_dispatch("get_metrics", {"days": 7}))
Response:
{
"summary": {
"days": 7.0,
"total": 5,
"aborted": 0,
"tier_LOW": 0,
"tier_MEDIUM": 0,
"tier_HIGH": 5,
"mean_risk": 0.2952,
"n_labeled": 2,
"n_wrong_labeled": 0,
"label_error_rate": 0.0
},
"recent_chains": [
{"id": 5, "question": "What year was Python created?...", "risk": 0.296, "tier": "HIGH", "labeled": true}
],
"rl_status": {
"model_trained": false,
"labels_since_last_train": 2
}
}
get_auroc — rolling AUROC from labeled feedback
result = asyncio.run(_dispatch("get_auroc", {"days": 30}))
Response (insufficient labels):
{
"auroc": null,
"window_days": 30.0,
"n_labeled": 2,
"drift_alert": null,
"auroc_history": [],
"baseline_ref": {
"behavioral_only": 0.682,
"ptrue_ensemble": 0.775,
"integration_target": 0.795
},
"status": "INSUFFICIENT DATA (need ≥10 labels)"
}
Once you have ≥10 labels, auroc is populated and drift_alert fires if AUROC drops ≥5pp from its peak.
trigger_retrain — retrain RL model from labels
result = asyncio.run(_dispatch("trigger_retrain", {}))
Response (not enough labels yet):
{
"status": "skipped",
"reason": "Need 28 more labels before first retrain (have 2)"
}
Response after 30+ labels (balanced classes):
{
"status": "trained",
"n_labels": 42,
"cv_auroc": 0.831,
"precision": 0.867,
"recall": 0.714,
"model": "LogisticRegression",
"features": ["risk_score", "beh_score", "ptrue_score", "n_steps_norm", "tier_num"]
}
get_pending_review — HIGH-risk chains awaiting labels
result = asyncio.run(_dispatch("get_pending_review", {"limit": 5}))
Response:
{
"pending_count": 3,
"pending": [
{
"chain_id": 3,
"question": "What year was Python created?",
"answer": "1991",
"risk_score": 0.295,
"tier": "HIGH",
"how_to_label": "Call submit_feedback with chain_id=3, correct=true/false"
}
],
"total_labeled": 2,
"retrain_ready": false,
"retrain_reason": "Need 28 more labels before first retrain (have 2)"
}
get_rl_status — full RL training history and drift
result = asyncio.run(_dispatch("get_rl_status", {}))
Response:
{
"model_trained": false,
"total_labels": 2,
"labels_wrong": 0,
"labels_correct": 2,
"labels_since_last_train": 2,
"retrain_ready": false,
"retrain_reason": "Need 28 more labels before first retrain (have 2)",
"training_history": [],
"auroc_trend": [],
"drift_alert": null,
"rl_loop_config": {
"retrain_every_n_labels": 20,
"min_labels_for_first_train": 30,
"min_per_class": 5,
"features": ["risk_score", "beh_score", "ptrue_score", "n_steps_norm", "tier_num"]
}
}
RL feedback loop summary
| Step | What to call | When |
|---|---|---|
| Agent finishes | score_chain | Every chain |
| User checks step 2 | stream_check | Optional — saves API cost |
| User marks answer | submit_feedback | Every manual review |
| After 30 labels | trigger_retrain | Auto or manual |
| Monitor quality | get_auroc | Daily / weekly |
| Review queue | get_pending_review | When assigning to human reviewers |
Labels are stored in SQLite at ~/.llm_guard_mcp/metrics.db and persist across restarts.
License
MIT — see LICENSE.
Research Background
Built on experiments exp18–53 validating against HotpotQA, NaturalQuestions, TriviaQA, and GSM8K.
Paper draft: docs/research_paper.md