llm-guard-kit
Real-time reliability monitoring, A2A trust management, and self-repair for LLM agents.
v0.65.1: P(True) NQ AUROC corrected from 0.810 → 0.623 (tie-handling bug in the custom auroc()); the resulting gate FAIL is documented; the DomainRouter NQ signal is unchanged (mfe_sc_pair 0.667).
v0.54.0: SQLSemanticValidator (schema-aware SQL judge; Databricks/Spark/BigQuery/Snowflake/PostgreSQL dialects) + SQLExecutionOracle; ChartValidator (spec-aware judge for wrong axes/units/chart type); AutoGenAdapter + CrewAIAdapter (multi-agent orchestrator hooks → A2A trust); P(True) judge-model staleness auto-warn; FailureTaxonomist: EXCESSIVE_SEARCH unlocked in production, LOW_RISK pass-through, ANSWER_UNSUPPORTED embedding+word gate, CONFLICTING_EVIDENCE observation-contradiction signal, PREMATURE_STOP multi-hop detection.
📋 See CHANGELOG.md for full version history.
What it does
llm-guard-kit wraps any ReAct / tool-calling LLM agent with a reliability stack — no labels required on day one:
| Component | What it does |
|---|---|
| AgentGuard | Scores completed chains with SC_OLD behavioral signals + optional Sonnet judge. Emits A2A trust objects. |
| A2ATrustObject | Structured confidence envelope for agent-to-agent handoff (answer + risk + tier + failure_mode + hint). |
| QueryRewriter | When Agent A has low confidence, generates 3 diverse query reformulations for Agent B. |
| LabelFreeScorer | Raw behavioral risk scoring in <15 ms. Zero cold start. |
| QppgMonitor | Drop-in agent monitor. Auto-calibrates, fires alerts, persists to SQLite, exports CSV. |
| FailureTaxonomist | Diagnoses why a chain failed (retrieval failure, excessive search, hallucination, …). |
| SelfHealer | Converts a failure diagnosis into prompt injections that repair the agent mid-run. Validated: RETRIEVAL_FAILURE +11.1 pp, EXCESSIVE_SEARCH +20 pp pass@1 (exp_selfhealer_ab). |
| AdversarialChainDetector | Detects FARL-style "confident_wrong" adversarial chains. Internal CV AUROC 0.9960 (174 self-curated HP chains, post-hoc features). External holdout AUROC 0.9836 [0.965, 0.997] on FARL phase2/MuSiQue chains (exp_adversarial_holdout). See caveat below. Load via load_default(). |
| MultiTurnGuard | Scores multi-turn conversations for reliability drift across turns. score_factual_turn() uses the P(True) factual-correctness path (is_factual_reliability_signal=True). |
| DeepChainScorer | Scores 5–8-step multi-hop chains (2Wiki-style) using retrieval-trajectory slope. Auto-routes: short chains → SC_OLD; long chains (≥4 Search steps) → RetrievalCascadeScorer (AUROC 0.629 on 2Wiki vs SC_OLD 0.545); very long (≥6) + optional Mamba → MambaRiskScorer. Zero cost, pure numpy. |
| SQLSemanticValidator | Schema-aware SQL semantic judge (Haiku, ~$0.0003/call). Detects wrong JOINs, missing GROUP BY, incorrect filters. Multi-dialect: ANSI, Databricks/Delta, Spark, BigQuery, Snowflake, PostgreSQL, MySQL. Zero-cost heuristic fallback with no API key. |
| SQLExecutionOracle | Runs SQL against any DB-API 2.0 connection, checks shape + exceptions, blends execution risk with the behavioral score. Works with SQLite, Databricks SQL, Snowflake, psycopg2. check_raw() for Spark/BigQuery. |
| ChartValidator | Spec-aware chart/table judge: wrong chart type, swapped axes, unit mismatches, field mismatches. Supports Vega-Lite, Chart.js, and plain-text specs. Zero-cost heuristic fallback. |
| AutoGenAdapter | Converts AutoGen GroupChat message history (tool calls, function calls, replies) → AgentGuard chain scoring + A2A trust handoff. register_hook() for real-time monitoring. |
| CrewAIAdapter | Converts CrewAI TaskOutput / CrewOutput → AgentGuard chain scoring + A2A trust handoff. wrap_task() for transparent scoring on task completion. |
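The SQL components above have no snippet elsewhere in this README, so here is a minimal sketch of the intended flow against an in-memory SQLite connection. The import path, constructor argument, and check() entry point are assumptions (only check_raw() is named above); treat it as illustrative, not canonical.
import sqlite3
from llm_guard import SQLExecutionOracle  # import path assumed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 42.0), (2, "US", 17.5)])

oracle = SQLExecutionOracle(conn)  # any DB-API 2.0 connection
result = oracle.check(             # check() is a hypothetical entry point
    sql="SELECT region, SUM(total) FROM orders GROUP BY region",
    behavioral_risk=0.31,          # blended with execution risk per the table above
)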
Namespace note:
LabelFreeScorer, QppgMonitor, FailureTaxonomist, and SelfHealer live in the qppg_service package, not llm_guard:
# Correct:
from qppg_service import LabelFreeScorer, QppgMonitor
from qppg_service import FailureTaxonomist, SelfHealer
# This will fail:
from llm_guard import LabelFreeScorer  # ImportError
MultiTurnGuard (v0.38.0): Full MT-Bench evaluation (n=2,575 pairs, lmsys/mt_bench_human_judgments) — AUROC 0.667 [0.652, 0.681] with use_ptrue=True. The CI collapsed from ±0.124 (n=50) to ±0.015 (full n). Note: MT-Bench measures response-quality preference, not factual reliability — these are correlated but different constructs. See docs/mt_bench_validation.md. Lexical-only mode AUROC 0.581 — not valid.
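A usage sketch for the factual path. The class, score_factual_turn(), and use_ptrue are confirmed above; the import path and argument names are assumptions.
from llm_guard import MultiTurnGuard  # import path assumed

guard = MultiTurnGuard(use_ptrue=True)  # P(True) factual-correctness path

risk = guard.score_factual_turn(        # argument names assumed
    question="Which planet has the most confirmed moons?",
    answer="Saturn, with 146 confirmed moons.",
)
print(risk)  # higher = more likely unreliable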
AdversarialChainDetector caveat: the external holdout AUROC of 0.9836 should be interpreted carefully — all adversarial chains come from the MuSiQue domain and all correct chains from HotpotQA, so domain differences may partly drive the discrimination. A within-domain holdout (HotpotQA adversarial vs HotpotQA correct, unseen questions) is needed to fully validate. load_default() emits a UserWarning at runtime. Methodology, dataset construction, and honest limitations: docs/adversarial_methodology.md
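A loading sketch that surfaces the documented warning. load_default() and the UserWarning are confirmed above; the import path, the classmethod form, and the score() call shape are assumptions.
import warnings
from llm_guard import AdversarialChainDetector  # import path assumed

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    detector = AdversarialChainDetector.load_default()  # classmethod form assumed
for w in caught:
    print(f"caveat: {w.message}")  # within-domain holdout still pending

steps = [{"thought": "Search.", "action_type": "Search",
          "action_arg": "Dune author", "observation": "Frank Herbert wrote Dune."}]
risk = detector.score("Who wrote Dune?", steps, "Frank Herbert")  # call shape assumed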
Validated AUROC — held-out evaluation (HotpotQA within-domain, TriviaQA cross-domain):
| Method | Within-domain | Cross-domain | Cost/chain |
|---|---|---|---|
| MiniJudge (SC_OLD + LogReg, distilled from Sonnet) | 0.747 ± 0.10 | — | $0 |
| SC_OLD behavioral ensemble (n=0 labels) | 0.817 | 0.703 (2Wiki) / 0.659 (TV) | $0 |
| SC_OLD + Sonnet judge (J5) | 0.777 | 0.741 | ~$0.007 |
| Conformal alert precision at FPR ≤ 10% | 0.908 | — | — |
| Mid-chain Haiku at step 2 | 0.683 | — | ~$0.001/step |
| RetrievalCascadeScorer (long chains ≥4 Search steps) | 0.592 (HP) | 0.629 (2Wiki) / 0.669 (Musique) / 0.742 (TriviaQA) | $0 |
v0.17.0: MiniJudge achieves 0.747 AUROC (HP 5-fold CV) at zero inference cost — within 2.7 pp of the Sonnet judge. Cross-domain live validation (exp156/156b): 2WikiMultiHop 0.703 [0.628, 0.775] ✅ · TriviaQA 0.659 [0.614, 0.705] ✅ · MuSiQue 0.613 (CI wide) · NQ 0.524 (open-domain factoid, near-random). RetrievalCascadeScorer (v0.52.0) is superior on long chains (avg ≥4 Search steps); SC_OLD beats it on short chains (HP AUROC 0.817 vs 0.592). DeepChainScorer auto-selects the right scorer per chain length. All figures are from held-out evaluation. See docs/production_integration.md for full methodology.
MechanismFeatureExtractor — Phase A vs Phase B deployment gap:
The MechanismFeatureExtractor (v0.46.0) achieves AUROC 0.964 on HaluEval2-QA in Phase A (curated benchmark, label-matched training). This number does not transfer to real LLM outputs.
| Dataset | Phase A (curated) | Phase B (model-generated) | Status |
|---|---|---|---|
| HaluEval2-QA | 0.964 [0.947, 0.978] | 0.559 [0.509, 0.608] | ⚠ Large gap |
| Medical QA (PubMed) | 0.530 [0.446, 0.621] | 0.412 [0.338, 0.483] | ❌ Worse than random |
| HaluEval2-Dialogue | 0.702 [0.662, 0.743] | — | ✓ Gate pass |
| TruthfulQA | 0.668 [0.625, 0.712] | — | ✓ Gate pass |
| Sycophancy | 0.503 [0.447, 0.561] | 0.413 [0.342, 0.492] | ❌ Fail |
⚠ Medical domain exclusion:
MechanismFeatureExtractor is actively anti-predictive (AUROC 0.412) on clinical/biomedical QA. Do not use it for medical, clinical, or health-related agent reliability scoring.
Phase A → Phase B gap root cause: text-level features (semantic similarity, answer length, certainty density) are confounded by the LLM's generation style. A confident-sounding wrong answer and a confident-sounding correct answer are nearly indistinguishable via surface text features alone. Phase A benchmark data is annotation-curated (hallucinations injected artificially), which inflates AUROC vs. naturally occurring errors in model-generated traces.
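Given the exclusion, it is safer to fail closed in code than to rely on reviewers remembering it. A sketch with a hypothetical score_mechanism() wrapper; only the class name and the exclusion itself come from this section.
# Fail closed on excluded domains before any MechanismFeatureExtractor call.
# score_mechanism() is a hypothetical wrapper; the extractor's real scoring
# API is not documented here and the .score() call is assumed.
EXCLUDED = {"medical", "clinical", "biomedical", "health"}

def score_mechanism(extractor, question: str, answer: str, domain: str) -> float:
    if domain.lower() in EXCLUDED:
        raise ValueError(
            f"MechanismFeatureExtractor is anti-predictive (AUROC 0.412) "
            f"on {domain!r} content; refusing to score."
        )
    return extractor.score(question, answer)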
Install
# Core (no API key needed)
pip install llm-guard-kit
# With specific framework integrations
pip install "llm-guard-kit[langchain]" # LangChain agents
pip install "llm-guard-kit[openai]" # OpenAI Assistants
pip install "llm-guard-kit[llamaindex]" # LlamaIndex
pip install "llm-guard-kit[haystack]" # Haystack pipelines
# HTTP server + dashboard
pip install "llm-guard-kit[server]"
# Everything
pip install "llm-guard-kit[all]"
Requires Python 3.9+.
Performance
Measured on Apple M-series, behavioral scoring only (v0.48 benchmarks):
| Concurrency | p50 | p95 | p99 | Throughput |
|---|---|---|---|---|
| c=1 serial | 17.6 ms | 19.1 ms | 20.4 ms | — |
| c=10 | 179 ms | — | 209 ms | 56 req/s |
| c=50 | 906 ms | — | 1008 ms | 55 req/s |
| c=100 | 1697 ms | — | 1887 ms | 59 req/s |
Note: Throughput is CPU-bound at ~57 req/s regardless of concurrency. Per-request latency scales linearly with concurrency because the behavioral scorer is single-threaded Python. For high-concurrency deployments, run multiple worker processes (e.g. uvicorn --workers 4). Run python experiments/latency_benchmark.py to reproduce on your hardware.
Table of Contents
- MiniJudge — $0 Local Judge (v0.17.0)
- AgentGuard + A2A Trust (v0.6.0)
- Quick Start — Drop-in Monitor
- Framework Integrations
- Full Pipeline — Detect → Diagnose → Repair
- Persistence & Auto-Calibration
- CLI Reference
- HTTP API Server
- Monitoring Dashboard
- Docker Deployment
- SaaS API Key Auth
- Drift Detection
- Agent Step Format
- Retrieval Quality Diagnostics
- MCP Server — Claude Desktop & Cursor Integration
0. MiniJudge — $0 Local Judge (v0.17.0)
MiniJudge is a logistic regression model over SC_OLD behavioral features, distilled from Sonnet soft labels — AUROC 0.818 on HotpotQA for the 14-feature model (5-fold CV), zero inference cost. The core 11-feature set is tabulated below.
See docs/sc_old_features.md for the complete, reproducible feature reference with ablation results.
from llm_guard import MiniJudge
judge = MiniJudge() # loads pre-trained weights automatically
risk = judge.score(question, steps, final_answer) # float in [0, 1]
SC_OLD Feature Table (11 behavioral features)
All features computed from chain dict in < 1 ms. No API calls. Full definitions and ablation in docs/sc_old_features.md.
| # | Feature | What it measures | Standalone AUROC |
|---|---|---|---|
| 1 | sc1_loop_rate | Fraction of repeated action types | 0.571 |
| 2 | sc2_steps_norm | Normalized step count (÷ 6) | 0.622 |
| 3 | sc3_empty_obs_rate | Fraction of observations < 15 chars | 0.733 |
| 4 | sc5_repeated_search | Query repetition rate | 0.500 |
| 5 | sc6_answer_gap | Short answer relative to question length | 0.624 |
| 6 | sc8_backtrack_rate | Fraction of consecutive same-type steps | 0.575 |
| 7 | sc9_obs_util | Inverse average observation length | 0.605 |
| 8 | sc10_coherence_drop | Thought length variance (pstdev/mean) | 0.596 |
| 9 | sc11_ans_obs_mismatch | 1 − Jaccard(answer, last_observation) | 0.691 |
| 10 | search_count_norm | Search steps / total steps | 0.733 |
| 11 | avg_obs_norm | Normalized average observation length | 0.608 |
Most valuable single contributors (drop-one ablation): f12_ans_question_sim (−0.010 AUROC when dropped), sc9_obs_util (−0.008), sc10_coherence_drop (−0.007).
For maximum accuracy, blend MiniJudge with P(True):
from llm_guard import MiniJudge, AgentGuard, probe_ensemble_blend
guard = AgentGuard(api_key="sk-ant-...")
result = guard.score_with_ptrue(question, steps, final_answer)
mini_risk = MiniJudge().score(question, steps, final_answer)
blended = probe_ensemble_blend(mini_risk, result.ptrue_risk, alpha=0.25) # AUROC ~0.74
Opt-in telemetry — feeds your data back into the retrain pipeline:
guard = AgentGuard(
api_key="sk-ant-...",
contribute_labels=True,
telemetry_token="ghp_...", # GitHub PAT with repo scope
telemetry_repo="your-org/llm-guard-labels",
)
# Every score_chain() + update_isotonic(feedback) call sends 11 floats + 1 bit
1. AgentGuard + A2A Trust (v0.6.0)
Chain scoring with validated behavioral signals
from llm_guard import AgentGuard
# Zero cost — behavioral SC_OLD signals only (~0.81 AUROC within-domain)
guard = AgentGuard()
# With Sonnet judge (~$0.007/chain, ~0.74 AUROC cross-domain)
guard = AgentGuard(api_key="sk-ant-...", use_judge=True)
result = guard.score_chain(
question="When was the Eiffel Tower built?",
steps=[
{"thought": "Search for construction date",
"action_type": "Search", "action_arg": "Eiffel Tower construction year",
"observation": "The Eiffel Tower was built from 1887 to 1889..."},
{"thought": "Completed in 1889",
"action_type": "Finish", "action_arg": "1889", "observation": ""},
],
final_answer="1889",
)
print(result.confidence_tier) # "HIGH" / "MEDIUM" / "LOW"
print(result.risk_score) # 0.0–1.0, higher = more likely wrong
print(result.needs_alert) # True when risk >= 0.70 (Precision=0.908)
print(result.failure_mode) # "retrieval_fail" | "long_chain" | None
print(result.judge_label) # "GOOD" | "BORDERLINE" | "POOR" | None
A2A trust handoff
# Agent A produces a trust object
trust = guard.generate_trust_object(question, steps, final_answer)
payload = trust.to_dict() # JSON-serialisable for queue/API transport
# Agent B receives it and conditions its strategy
from llm_guard import A2ATrustObject, QueryRewriter
trust = A2ATrustObject.from_dict(payload)
print(trust.downstream_hint) # "proceed" / "proceed_with_caution" /
# "rewrite_query" / "escalate_to_human"
# When Agent A had low confidence, diversify Agent B's queries
rewriter = QueryRewriter(api_key="sk-ant-...")
variants = rewriter.rewrite_if_needed(question, trust)
# variants = [paraphrase, decomposed_sub_question, alternative_angle]
# Returns [] when no rewrite needed (HIGH/MEDIUM confidence)
Tier degradation
Without a judge configured (AgentGuard(api_key=None)), generate_trust_object()
uses 2-tier routing: PROCEED (risk < 0.50) or ESCALATE (risk ≥ 0.50).
The trust object's routing_mode field will be "behavioral_only_2tier".
With a judge configured, full 4-tier routing is active:
PROCEED → REVIEW → ESCALATE → REJECT, and routing_mode is "full_4tier".
Downstream agents must check routing_mode before applying 4-tier routing logic:
trust = guard.generate_trust_object(question, steps, final_answer)
if trust.routing_mode == "behavioral_only_2tier":
# Only PROCEED / ESCALATE are meaningful
if trust.confidence_tier == "ESCALATE":
route_to_human_review(trust)
else:
# Full 4-tier routing available
if trust.confidence_tier == "LOW":
rewrite_and_retry(trust)
elif trust.confidence_tier == "MEDIUM":
proceed_with_monitoring(trust)
Mid-chain monitoring
# Call inside your agent loop BEFORE each step executes
step = guard.monitor_step(
question="When was the Eiffel Tower built?",
steps_so_far=[step1],
current_action="Search[Eiffel Tower date]",
)
if step.risk == "high":
pass # intervene early — AUROC 0.683 at step 2 (Δ+0.156 vs behavioral)
2. Quick Start
Zero setup — monitor from query 1
from qppg_service import QppgMonitor
monitor = QppgMonitor(threshold=0.65) # alert above this risk score
# Call after every agent run
alert = monitor.track(
question = "Which city is older, Rome or Athens?",
steps = agent_steps, # see step format below
final_answer= "Athens",
finished = True,
)
if alert:
print(f"HIGH RISK ({alert.risk_score:.2f}): {alert.recommendation}")
# Get a stats report
print(monitor.export_report())
monitor.export_csv("agent_risk_log.csv")
Works on query 1. No training. No labels. AUROC 0.817 within-domain out of the box (HotpotQA, exp156).
With SQLite persistence (recommended for production)
monitor = QppgMonitor(
threshold = 0.65,
db_path = "~/.qppg/chains.db", # auto-creates on first run
domain = "prod", # namespace for multi-domain setups
model_name = "claude-opus-4-6",
recal_every = 100, # re-calibrate GMM every N new chains
check_drift = True, # auto-detect distributional drift
)
3. Framework Integrations
LangChain
Drop a callback into any LangChain AgentExecutor — no code changes to your agent:
pip install "llm-guard-kit[langchain]"
from langchain.agents import AgentExecutor, create_react_agent
from qppg_service.integrations.langchain_callback import QppgLangChainCallback
from qppg_service import QppgMonitor
# With persistence
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="prod")
callback = QppgLangChainCallback(monitor=monitor, threshold=0.65)
# Attach to any AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools)
result = agent_executor.invoke(
{"input": "What year was the Eiffel Tower built?"},
config={"callbacks": [callback]},
)
# Check result
score = callback.get_last_result()
if score and score.needs_review:
print(f"HIGH RISK: {score.risk_score:.3f}")
print(f"Behavioral: {score.behavioral_score:.3f}")
What gets captured automatically:
- on_agent_action → thought + tool name + tool input → Search/Finish step
- on_tool_end → tool output → observation
- on_agent_finish → final answer + full chain scoring
Tool name mapping: tavily_search, duckduckgo_search, wikipedia, retriev* → "Search".
OpenAI Assistants API
pip install "llm-guard-kit[openai]"
from openai import OpenAI
from qppg_service.integrations.openai_adapter import score_assistants_run
from qppg_service import QppgMonitor
client = OpenAI()
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="assistants")
# Run your assistant normally
thread = client.beta.threads.create()
client.beta.threads.messages.create(thread.id, role="user", content="When was Python created?")
run = client.beta.threads.runs.create_and_poll(thread.id, assistant_id="asst_xxx")
# Score the completed run
result = score_assistants_run(
client,
thread_id = thread.id,
run_id = run.id,
monitor = monitor, # optional: persists to SQLite
question = "When was Python created?", # optional: auto-extracted from thread
)
print(f"Risk: {result.risk_score:.3f} Review: {result.needs_review}")
LlamaIndex
pip install "llm-guard-kit[llamaindex]"
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager
from qppg_service.integrations.llamaindex_callback import QppgLlamaIndexCallback
from qppg_service import QppgMonitor
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="llamaindex")
qppg_cb = QppgLlamaIndexCallback(monitor=monitor, threshold=0.65)
Settings.callback_manager = CallbackManager([qppg_cb])
# Your index and query engine work unchanged
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is RAG?")
result = qppg_cb.get_last_result()
if result and result.needs_review:
print(f"HIGH RISK: {result.risk_score:.3f}")
Haystack
pip install "llm-guard-kit[haystack]"
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.generators import OpenAIGenerator
from qppg_service.integrations.haystack_callback import QppgHaystackMonitor
from qppg_service import QppgMonitor
# Build your pipeline normally
pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=store))
pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o-mini"))
pipeline.connect("retriever", "generator")
monitor = QppgMonitor(db_path="~/.qppg/prod.db", domain="haystack")
qppg = QppgHaystackMonitor(pipeline, monitor=monitor)
# Run through the wrapper instead of pipeline.run()
outputs, result = qppg.run({"retriever": {"query": "Who invented Python?"}})
if result and result.needs_review:
print(f"HIGH RISK: {result.risk_score:.3f}")
4. Full Pipeline
Detect → Diagnose → Repair in one block:
from qppg_service import QppgMonitor, FailureTaxonomist, SelfHealer
monitor = QppgMonitor(threshold=0.65, db_path="~/.qppg/prod.db")
tx = FailureTaxonomist()
healer = SelfHealer()
alert = monitor.track(question, steps, final_answer, finished=True)
if alert:
# Diagnose WHY it failed
failure = tx.classify(question, steps, final_answer, finished=True)
print(failure.primary_mode) # "EXCESSIVE_SEARCH" | "RETRIEVAL_FAILURE" | ...
print(failure.explanation) # human-readable explanation
print(failure.confidence) # 0–1
# Get a repair prompt to inject back into the agent
action = healer.suggest(failure, question, steps, final_answer)
print(action.action_type) # "FORCE_FINISH" | "REPHRASE_QUERY" | ...
print(action.prompt_injection) # ready to inject as next agent message
print(action.urgency) # "HIGH" | "MEDIUM" | "LOW"
Failure modes detected:
Production-validated modes (FailureTaxonomist(), default):
| Mode | When triggered | Suggested repair | Validation |
|---|---|---|---|
| RETRIEVAL_FAILURE | mean cosine(obs, question) < 0.35 | REPHRASE_QUERY | F1=0.701 (exp_failure_taxonomy_llm) |
| EXCESSIVE_SEARCH | > 4 search steps | CONSOLIDATE or FORCE_FINISH | +20pp pass@1 (exp_selfhealer_ab) |
| LOW_RISK | No failure mode triggered | — (pass-through) | Production sentinel |
Experimental modes (FailureTaxonomist(experimental=True) — improved heuristics in v0.54, no published AUROC):
| Mode | When triggered |
|---|---|
| CONFLICTING_EVIDENCE | Cosine similarity drop between first/second-half observations |
| INSUFFICIENT_EVIDENCE | Low retrieval coverage + uncertainty density |
| ANSWER_UNSUPPORTED | High answer-gap AND low answer-chain similarity |
| PREMATURE_STOP | Multi-hop question with only 1 search step |
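Everything in the following sketch is confirmed by this section: the experimental=True flag, the classify() signature, and the primary_mode / explanation / confidence fields.
from qppg_service import FailureTaxonomist

question = "Which was released first, Alien or Blade Runner?"
steps = [
    {"thought": "One search should settle this.", "action_type": "Search",
     "action_arg": "Alien release year", "observation": "Alien (1979) ..."},
    {"thought": "Done.", "action_type": "Finish",
     "action_arg": "Alien", "observation": ""},
]

tx = FailureTaxonomist(experimental=True)  # unlocks the four modes above
failure = tx.classify(question, steps, "Alien", finished=True)
if failure.primary_mode == "PREMATURE_STOP":
    # Multi-hop comparison answered after a single search step.
    print(failure.explanation, failure.confidence)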
5. Persistence & Auto-Calibration
SQLite store (ChainStore)
from qppg_service import ChainStore
store = ChainStore("~/.qppg/chains.db")
# Query your stored data
domains = store.get_domains() # ["prod", "staging"]
pool = store.get_calibration_pool("prod", n=200) # last 200 chains as dicts
stats = store.get_domain_stats("prod")
# {"n_chains": 523, "n_alerts": 41, "avg_risk": 0.47, "last_auroc": 0.83}
# Export audit log
csv_str = store.export_audit("prod", fmt="csv")
json_str = store.export_audit("prod", fmt="json")
# Clear a domain (e.g. after model upgrade)
store.clear_domain("prod")
Mixed-domain warm-up
If you have an existing domain with chains, you can bootstrap a new domain from it:
from qppg_service import ChainStore, LabelFreeScorer
store = ChainStore("~/.qppg/chains.db")
scorer = LabelFreeScorer()
# Copy 25 chains from "staging" to "prod" as cross-domain calibration
# (boosts cross-domain AUROC from ~0.50 to ~0.81)
pool = store.get_calibration_pool("staging", n=25)
scorer.calibrate(pool)
# Now scorer works on "prod" questions with much better accuracy
6. CLI Reference
pip install "llm-guard-kit[server]"
Status — see all monitored domains
llm-guard-kit status
llm-guard-kit status --domain prod
Output:
DOMAIN CHAINS ALERTS AVG RISK AUROC DRIFT
--------------------------------------------------------------------
prod 523 41 0.467 0.883 OK
staging 89 7 0.512 n/a OK
Score — score a single chain from a JSON file
# Prepare a chain file
cat > chain.json << 'EOF'
{
"question": "Who invented the telephone?",
"steps": [
{"thought": "I should search.", "action_type": "Search",
"action_arg": "telephone inventor", "observation": "Alexander Graham Bell..."},
{"thought": "Found it.", "action_type": "Finish",
"action_arg": "Alexander Graham Bell", "observation": ""}
],
"final_answer": "Alexander Graham Bell",
"finished": true
}
EOF
llm-guard-kit score --steps-file chain.json --domain prod
Output:
Risk score : 0.312 (OK)
Behavioral : 0.287
GMM score : 0.291
Retrieval : mean_sim=0.612 min_sim=0.489 (GOOD)
Search steps : 1
Calibrate — warm up a domain
# From a JSON file of chain dicts
llm-guard-kit calibrate --domain prod --chains-file my_chains.json
# Mixed-domain warm-up (copy from another domain)
llm-guard-kit calibrate --domain prod --source-domain staging --chains 25
Recalibrate — after a model upgrade
# IMPORTANT: calibration is model-specific. Cross-model AUROC ≈ 0.508 (chance).
llm-guard-kit recalibrate --domain prod --new-model claude-opus-4-6
Export — audit log
llm-guard-kit export --domain prod --format csv > audit.csv
llm-guard-kit export --domain prod --start 2026-01-01 --end 2026-03-01 --format json
Serve — launch the API server
llm-guard-kit serve --domain prod --port 8000 --host 0.0.0.0
Dashboard — launch with monitoring UI
llm-guard-kit dashboard --domain prod --port 8000
# Open http://localhost:8000/dashboard
7. HTTP API Server
For multi-language / microservice deployments:
pip install "llm-guard-kit[server]"
llm-guard-kit serve --port 8000 --host 0.0.0.0
Interactive docs: http://localhost:8000/docs
Score a chain
curl -X POST http://localhost:8000/score \
-H "Content-Type: application/json" \
-d '{
"question": "Who invented Python?",
"steps": [
{"thought": "Search for it.", "action_type": "Search",
"action_arg": "Python creator", "observation": "Guido van Rossum..."},
{"thought": "Done.", "action_type": "Finish",
"action_arg": "Guido van Rossum", "observation": ""}
]
}'
Response:
{
"confidence": 0.72,
"needs_review": false,
"deployment_status": "DEPLOYED",
"risk_score": 0.31,
"behavioral_score": 0.29,
"gmm_score": 0.33
}
Calibrate (add a verified chain)
curl -X POST http://localhost:8000/calibrate \
-H "Content-Type: application/json" \
-d '{"question": "...", "steps": [...], "correct": true}'
Bulk calibrate (seed from existing logs)
curl -X POST http://localhost:8000/bulk-calibrate \
-H "Content-Type: application/json" \
-d '{"chains": [{"question":"...", "steps":[...], "correct":true}, ...]}'
Deployment status
curl http://localhost:8000/status
Reset (destructive — clears calibration)
curl -X POST "http://localhost:8000/reset?confirm=YES_RESET"
8. Monitoring Dashboard
Start the server with dashboard enabled:
llm-guard-kit serve --domain prod --port 8000 --db ~/.qppg/chains.db --dashboard
# or:
python -m qppg_service.server --domain prod --port 8000 --db ~/.qppg/chains.db --dashboard
Open http://localhost:8000/dashboard
Dashboard features:
- Deployment status banner — COLD START / WARMING / DEPLOYED + Est. AUROC
- Drift alert banner — fires when mean risk shifts > 0.10 over a 7-day window
- 5 KPI cards — Total queries, Alerts, Avg risk, Calibration pool size, Avg search steps
- Risk timeline — 24h / 7d / 30d selector, alert threshold line, color-coded points
- Failure mode breakdown — horizontal bar chart of classified failure types
- Chain log — paginated table, sortable, filter All/Alerts-only, search, model column
- Failure Modes section — detailed view with descriptions
- Domain switcher — switch between domains without restarting
- Export CSV — download the current view as CSV
9. Docker Deployment
Single command
git clone https://github.com/avighan/qppg
cd qppg
docker compose up -d
The server starts on http://localhost:8000.
docker-compose.yml (included)
version: "3.9"
services:
qppg:
build: .
ports:
- "8000:8000"
volumes:
- qppg_data:/data
environment:
QPPG_DOMAIN: "default"
QPPG_DB: "/data/chains.db"
QPPG_PORT: "8000"
QPPG_HOST: "0.0.0.0"
# QPPG_ADMIN_KEY: "your-secret-admin-key" # enable API key creation
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/status"]
interval: 30s
volumes:
qppg_data:
Environment variables
| Variable | Default | Description |
|---|---|---|
QPPG_DOMAIN |
default |
Default scoring domain |
QPPG_DB |
/data/chains.db |
SQLite database path |
QPPG_PORT |
8000 |
HTTP port |
QPPG_HOST |
0.0.0.0 |
Bind address |
QPPG_ADMIN_KEY |
(unset) | Secret to enable /api/keys endpoint |
Production with nginx reverse proxy
server {
listen 443 ssl;
server_name qppg.yourcompany.com;
location / {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
10. SaaS API Key Auth
For multi-tenant deployments where different customers score their own agents:
Step 1: Start server with admin key and SQLite
python -m qppg_service.server \
--port 8000 --db /data/chains.db \
--admin-key "your-secret-admin-key" \
--dashboard
Step 2: Create a customer API key
curl -X POST http://localhost:8000/api/keys \
-H "Content-Type: application/json" \
-H "X-Admin-Key: your-secret-admin-key" \
-d '{"customer_id": "acme-corp", "domain_prefix": "prod"}'
Response:
{
"api_key": "jN2xR...abc", ← Save this — shown once only
"key_id": "a3f9b2d1c0e4",
"customer_id": "acme-corp",
"domain_prefix": "prod"
}
Step 3: Score with Bearer token auth
curl -X POST http://localhost:8000/api/prod/score \
-H "Content-Type: application/json" \
-H "Authorization: Bearer jN2xR...abc" \
-d '{"question": "When was Python created?", "steps": [...]}'
Response:
{
"risk_score": 0.34,
"needs_review": false,
"behavioral_score": 0.31,
"gmm_score": 0.37,
"calibration_size": 47
}
Step 4: Export audit log
curl "http://localhost:8000/api/prod/audit-log?fmt=csv&start=1700000000" \
-H "Authorization: Bearer jN2xR...abc" \
> audit.csv
Rate limit: 1,000 requests/hour per API key (in-memory sliding window).
Domain isolation: Each customer's data is stored under {customer_id}:{domain} — fully isolated.
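For reference, the in-memory sliding-window rate limit above amounts to a deque of timestamps per key. The sketch below illustrates the mechanism only; it is not the package's internal implementation.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # 1 hour
LIMIT = 1000            # requests per key per window
_hits = defaultdict(deque)

def allow(api_key: str) -> bool:
    """Return True while the key has fewer than 1,000 hits in the trailing hour."""
    now = time.monotonic()
    window = _hits[api_key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # expire hits that left the window
    if len(window) >= LIMIT:
        return False
    window.append(now)
    return True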
11. Drift Detection
DriftDetector compares mean risk score in the current 7-day window vs the previous 7-day window and fires an alert if the shift exceeds a threshold:
from qppg_service import ChainStore, DriftDetector, DriftAlert
store = ChainStore("~/.qppg/chains.db")
detector = DriftDetector(threshold=0.10, window_days=7, min_samples=10)
alert = detector.check("prod", store)
if alert:
print(f"Drift detected!")
print(f" Direction: {alert.direction}") # "UP" or "DOWN"
print(f" Delta: {alert.delta:+.3f}") # +0.142
print(f" Current mean: {alert.current_mean:.3f}") # 0.612
print(f" Prev mean: {alert.previous_mean:.3f}") # 0.470
print(f" Samples: {alert.n_current}") # 47
print(alert.recommendation)
When drift fires: Run llm-guard-kit recalibrate if you recently upgraded your LLM model (cross-model AUROC ≈ 0.508 — calibration does not transfer across models).
Agent Step Format
Every integration expects steps in this format:
steps = [
{
"thought": "I need to find when the Eiffel Tower was built.",
"action_type": "Search", # "Search" | "Finish" | any custom tool name
"action_arg": "Eiffel Tower construction date",
"observation": "The Eiffel Tower was built between 1887 and 1889..."
},
{
"thought": "I now have the answer.",
"action_type": "Finish",
"action_arg": "1889",
"observation": ""
}
]
Framework adapters (LangChain, OpenAI, LlamaIndex, Haystack) build this format automatically.
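For frameworks without an adapter, the conversion is a small mapping into this format. In the sketch below the raw-trace field names are invented for illustration; QppgMonitor.track() and the target step keys are exactly as documented above.
from qppg_service import QppgMonitor

# Hypothetical custom-agent trace; only the target step format is prescribed.
raw_trace = [
    {"reasoning": "Look up the launch year.", "tool": "web_search",
     "tool_input": "Hubble telescope launch year",
     "tool_output": "Hubble was launched in 1990..."},
    {"reasoning": "Got it.", "tool": "finish",
     "tool_input": "1990", "tool_output": ""},
]

steps = [
    {"thought": r["reasoning"],
     "action_type": "Finish" if r["tool"] == "finish" else "Search",
     "action_arg": r["tool_input"],
     "observation": r["tool_output"]}
    for r in raw_trace
]

monitor = QppgMonitor(threshold=0.65)
alert = monitor.track(question="When was Hubble launched?", steps=steps,
                      final_answer="1990", finished=True)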
Retrieval Quality Diagnostics
A standalone signal you can use independently of the full monitor:
from qppg_service import LabelFreeScorer
scorer = LabelFreeScorer()
rq = scorer.retrieval_quality(question, steps)
# {
# "mean_sim": 0.41, # average cosine(observation, question)
# "min_sim": 0.22, # worst retrieval step
# "quality_label":"POOR", # "GOOD" | "OK" | "POOR"
# "per_step": [0.52, 0.22, ...]
# }
Correct agents: mean_sim = 0.554; wrong agents: 0.458 (Δ+0.096, p<0.01).
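Continuing the snippet above: RETRIEVAL_FAILURE triggers at mean cosine < 0.35 (see the failure-mode table in the Full Pipeline section), so the same output doubles as a standalone pre-answer gate.
# rq comes from scorer.retrieval_quality() above; 0.35 is the
# RETRIEVAL_FAILURE trigger from the failure-mode table.
if rq["mean_sim"] < 0.35 or rq["quality_label"] == "POOR":
    print(f"Weak retrieval (worst step {rq['min_sim']:.2f}): "
          "rephrase the query before trusting the answer.")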
12. MCP Server — Claude Desktop & Cursor Integration (v0.9.0)
llm-guard-kit ships a Model Context Protocol server that exposes all scoring tools directly inside Claude Desktop and Cursor — no code required.
Setup
pip install "llm-guard-kit[mcp]"
Start the server
# stdio transport — for Claude Desktop and Cursor (default)
llm-guard-mcp
# SSE transport — for web clients
llm-guard-mcp --sse --port 8765
Add to Claude Desktop
Open claude_desktop_config.json (on macOS this lives at ~/Library/Application Support/Claude/claude_desktop_config.json) and add:
{
"mcpServers": {
"llm-guard-kit": {
"command": "python3",
"args": ["-m", "llm_guard_mcp.server"],
"cwd": "/path/to/your/project",
"env": { "ANTHROPIC_API_KEY": "sk-ant-..." }
}
}
}
Restart Claude Desktop. Eight new tools appear in the tool panel.
Add to Cursor
Add the same block to ~/.cursor/mcp.json, then invoke with @llm-guard-kit in any Cursor chat.
Tool examples (all outputs verified)
score_chain — score a completed agent chain
# In Claude / Cursor, call the tool directly. In Python:
import asyncio
from llm_guard_mcp.server import _dispatch
steps = [
{"thought": "Search for Python creation date.",
"action_type": "Search", "action_arg": "Python creation year",
"observation": "Python was created by Guido van Rossum, released in 1991."},
{"thought": "Found the answer.",
"action_type": "Finish", "action_arg": "1991", "observation": ""},
]
result = asyncio.run(_dispatch("score_chain", {
"question": "What year was Python created?",
"steps": steps,
"final_answer": "1991",
}))
Response:
{
"chain_id": 1,
"risk_score": 0.2958,
"tier": "HIGH",
"needs_review": true,
"needs_alert": true,
"beh_score": 0.2958,
"ptrue_score": 0.0,
"rl_score": null,
"interpretation": "HIGH risk (0.296): Agent failed, contradicted itself, or gave up. Do not use without verification.",
"action": "Block answer. Alert human reviewer. Do not send to end user."
}
tier is behavioral-only when no ANTHROPIC_API_KEY is set. With a key, ptrue_score is populated and AUROC improves from 0.682 → 0.775.
stream_check — mid-chain abort (call after step 2)
# Check after 2 steps — abort early if chain is failing
result = asyncio.run(_dispatch("stream_check", {
"question": "What year was Python created?",
"steps_so_far": steps[:1], # just the first step
}))
Response (on-track chain):
{
"should_abort": false,
"risk_score": 0.5376,
"step": 1,
"chain_id": null,
"message": "CONTINUE: Chain looks on-track so far."
}
Response (failing chain — empty observations, repeated queries):
{
"should_abort": false,
"risk_score": 0.4964,
"step": 2,
"chain_id": null,
"message": "CONTINUE: Chain looks on-track so far."
}
should_abort=true fires when risk_score >= 0.65. Abort the agent immediately and surface a failure message to the user. Note: mid-chain abort uses SC_OLD at full depth (AUROC 0.683 at step 2). A dedicated partial-chain calibrator is in development and will ship when validated.
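Wired into an agent loop, the abort pattern looks like the sketch below. _dispatch() and the stream_check payload/response fields come from the examples above (the response is assumed to arrive as a dict, as those examples suggest); run_one_step() is a stand-in for your agent's own step function.
import asyncio
from llm_guard_mcp.server import _dispatch

def run_one_step(question, steps_so_far):
    # Stand-in for your agent: produce the next step dict.
    return {"thought": "Searching...", "action_type": "Search",
            "action_arg": question, "observation": "..."}

async def guarded_run(question: str, max_steps: int = 8):
    steps = []
    for _ in range(max_steps):
        steps.append(run_one_step(question, steps))
        check = await _dispatch("stream_check",
                                {"question": question, "steps_so_far": steps})
        if check["should_abort"]:  # fires at risk_score >= 0.65
            return {"aborted": True, "reason": check["message"]}
    return {"aborted": False, "steps": steps}

print(asyncio.run(guarded_run("What year was Python created?")))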
submit_feedback — human label → RL signal
# After a user confirms an answer is correct or wrong
result = asyncio.run(_dispatch("submit_feedback", {
"chain_id": 1,
"correct": True,
"note": "Verified via Wikipedia",
}))
Response:
{
"status": "feedback_recorded",
"chain_id": 1,
"label": "correct",
"n_labeled": 1,
"retrain_ready": false,
"retrain_reason": "Need 29 more labels before first retrain (have 1)",
"note": "Verified via Wikipedia"
}
After 30 labels (min 5 per class), retrain_ready becomes true. Call trigger_retrain or let score_chain auto-retrain.
get_metrics — summary stats for last N days
result = asyncio.run(_dispatch("get_metrics", {"days": 7}))
Response:
{
"summary": {
"days": 7.0,
"total": 5,
"aborted": 0,
"tier_LOW": 0,
"tier_MEDIUM": 0,
"tier_HIGH": 5,
"mean_risk": 0.2952,
"n_labeled": 2,
"n_wrong_labeled": 0,
"label_error_rate": 0.0
},
"recent_chains": [
{"id": 5, "question": "What year was Python created?...", "risk": 0.296, "tier": "HIGH", "labeled": true}
],
"rl_status": {
"model_trained": false,
"labels_since_last_train": 2
}
}
get_auroc — rolling AUROC from labeled feedback
result = asyncio.run(_dispatch("get_auroc", {"days": 30}))
Response (insufficient labels):
{
"auroc": null,
"window_days": 30.0,
"n_labeled": 2,
"drift_alert": null,
"auroc_history": [],
"baseline_ref": {
"behavioral_only": 0.682,
"ptrue_ensemble": 0.775,
"integration_target": 0.795
},
"status": "INSUFFICIENT DATA (need ≥10 labels)"
}
Once you have ≥10 labels, auroc is populated and drift_alert fires if AUROC drops ≥5pp from its peak.
trigger_retrain — retrain RL model from labels
result = asyncio.run(_dispatch("trigger_retrain", {}))
Response (not enough labels yet):
{
"status": "skipped",
"reason": "Need 28 more labels before first retrain (have 2)"
}
Response after 30+ labels (balanced classes):
{
"status": "trained",
"n_labels": 42,
"cv_auroc": 0.831,
"precision": 0.867,
"recall": 0.714,
"model": "LogisticRegression",
"features": ["risk_score", "beh_score", "ptrue_score", "n_steps_norm", "tier_num"]
}
get_pending_review — HIGH-risk chains awaiting labels
result = asyncio.run(_dispatch("get_pending_review", {"limit": 5}))
Response:
{
"pending_count": 3,
"pending": [
{
"chain_id": 3,
"question": "What year was Python created?",
"answer": "1991",
"risk_score": 0.295,
"tier": "HIGH",
"how_to_label": "Call submit_feedback with chain_id=3, correct=true/false"
}
],
"total_labeled": 2,
"retrain_ready": false,
"retrain_reason": "Need 28 more labels before first retrain (have 2)"
}
get_rl_status — full RL training history and drift
result = asyncio.run(_dispatch("get_rl_status", {}))
Response:
{
"model_trained": false,
"total_labels": 2,
"labels_wrong": 0,
"labels_correct": 2,
"labels_since_last_train": 2,
"retrain_ready": false,
"retrain_reason": "Need 28 more labels before first retrain (have 2)",
"training_history": [],
"auroc_trend": [],
"drift_alert": null,
"rl_loop_config": {
"retrain_every_n_labels": 20,
"min_labels_for_first_train": 30,
"min_per_class": 5,
"features": ["risk_score", "beh_score", "ptrue_score", "n_steps_norm", "tier_num"]
}
}
RL feedback loop summary
| Step | What to call | When |
|---|---|---|
| Agent finishes | score_chain | Every chain |
| User checks step 2 | stream_check | Optional — saves API cost |
| User marks answer | submit_feedback | Every manual review |
| After 30 labels | trigger_retrain | Auto or manual |
| Monitor quality | get_auroc | Daily / weekly |
| Review queue | get_pending_review | When assigning to human reviewers |
Labels are stored in SQLite at ~/.llm_guard_mcp/metrics.db and persist across restarts.
License
MIT — see LICENSE.
Research Background
Built on experiments exp18–53 validating against HotpotQA, NaturalQuestions, TriviaQA, and GSM8K.
Paper draft: docs/research_paper.md