llm-guard-kit
Real-time reliability monitoring, failure diagnosis, and self-repair for LLM agents. Validated AUROC 0.879–0.895.
What it does
llm-guard-kit wraps any ReAct / tool-calling LLM agent with a four-tier reliability stack — no labels required on day one:
| Tier | Component | What it does |
|---|---|---|
| 0 | LabelFreeScorer | Risk score per query in <15 ms. Zero cold start, using behavioral signals only. |
| 1 | QppgMonitor | Drop-in agent monitor. Auto-calibrates, fires alerts, exports reports. |
| 3 | FailureTaxonomist | Diagnoses why a chain failed (retrieval failure, excessive search, hallucination, …). |
| 4 | SelfHealer | Converts a failure diagnosis into prompt injections that repair the agent mid-run. |
Validated AUROC (HotpotQA multi-hop QA, 200 chains):
| Condition | Within-domain | Cross-domain |
|---|---|---|
| n=0 chains — behavioral signals only (SC2) | 0.879 | 0.570 |
| n≥5 chains — + GMM density (SC8) | 0.883 | 0.664 |
| n≥50 labeled — + QARA obs-pool adapter | 0.742 | 0.675 |
| + LLM judge (gpt-4o-mini, J2=SC8+judge) | 0.895 | 0.660 |
Install
pip install llm-guard-kit # core (no API key needed)
pip install "llm-guard-kit[qara]" # + QARA supervised adapter (torch)
pip install "llm-guard-kit[server]" # + FastAPI HTTP server
Requires Python 3.9+. No API key required for zero-label monitoring.
Quick start — zero labels, zero cold-start
from qppg_service import QppgMonitor
monitor = QppgMonitor(threshold=0.65) # fires above this risk score
# Call after every agent run
alert = monitor.track(
    question="Which city is older, Rome or Athens?",
    steps=agent_steps,               # list of {thought, action_type, action_arg, observation}
    final_answer=agent.final_answer,
    finished=True,
)
if alert:
    print(f"HIGH RISK ({alert.risk_score:.2f}): {alert.recommendation}")

# Export a full monitoring report
print(monitor.export_report())
monitor.export_csv("agent_risk_log.csv")
Works on query 1. No training. No labels. AUROC 0.879 within-domain.
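In a long-running service the same monitor is typically called after every run and exported periodically. A minimal sketch; run_agent, questions, and the logger name are placeholders for your own code, not part of llm-guard-kit:

import logging

log = logging.getLogger("agent-monitor")

# `run_agent` and `questions` are placeholders for your own agent loop.
for question in questions:
    answer, steps, finished = run_agent(question)
    alert = monitor.track(question=question, steps=steps, final_answer=answer, finished=finished)
    if alert:
        log.warning("risky chain (%.2f): %s", alert.risk_score, alert.recommendation)

monitor.export_csv("agent_risk_log.csv")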
Full pipeline — detect → diagnose → repair
from qppg_service import QppgMonitor, FailureTaxonomist, SelfHealer
monitor = QppgMonitor(threshold=0.65)
tx = FailureTaxonomist()
healer = SelfHealer()
alert = monitor.track(question, steps, final_answer, finished=True)
if alert:
    # Diagnose WHY it failed
    failure = tx.classify(question, steps, final_answer, finished=True)
    print(failure.primary_mode)      # "EXCESSIVE_SEARCH" | "RETRIEVAL_FAILURE" | ...
    print(failure.explanation)       # human-readable explanation

    # Get a repair prompt to inject into the agent
    action = healer.suggest(failure, question, steps, final_answer)
    print(action.action_type)        # "FORCE_FINISH" | "REPHRASE_QUERY" | ...
    print(action.prompt_injection)   # ready to inject as the agent's next message
    print(action.urgency)            # "HIGH" | "MEDIUM" | "LOW"
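One way to close the loop and apply the repair mid-run is sketched below. The agent object and its inject_message / resume methods are hypothetical stand-ins for your own ReAct implementation; only the monitor, taxonomist, and healer calls come from llm-guard-kit:

# Hypothetical mid-run repair loop; `agent` is your own ReAct / tool-calling agent.
for _ in range(2):                                   # allow at most one repair pass
    alert = monitor.track(question, agent.steps, agent.final_answer, finished=agent.done)
    if not alert:
        break                                        # chain looks healthy
    failure = tx.classify(question, agent.steps, agent.final_answer, finished=agent.done)
    action = healer.suggest(failure, question, agent.steps, agent.final_answer)
    if action.action_type == "NO_ACTION":
        break
    agent.inject_message(action.prompt_injection)    # hand the repair prompt to the agent
    agent.resume()                                   # let it continue with the hint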
Failure modes detected:
| Mode | Trigger | Suggested repair |
|---|---|---|
| RETRIEVAL_FAILURE | mean cosine(obs, question) < 0.35 | REPHRASE_QUERY |
| EXCESSIVE_SEARCH | > 4 search steps | CONSOLIDATE or FORCE_FINISH |
| CONFLICTING_EVIDENCE | high thought variance + high query diversity | CONSOLIDATE |
| INSUFFICIENT_EVIDENCE | weak retrieval + ≥ 2 searches | ADDITIONAL_SEARCH |
| ANSWER_UNSUPPORTED | answer words absent from reasoning | VERIFY_ANSWER |
| PREMATURE_STOP | ≤ 1 search, no clean finish | ADDITIONAL_SEARCH (urgent) |
| LOW_RISK | no flags | NO_ACTION |
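To see which of these modes dominates your own traffic, you can run the taxonomist over a batch of logged chains. A sketch; chains uses the same dict format as monitor.calibrate in the next section:

from collections import Counter
from qppg_service import FailureTaxonomist

tx = FailureTaxonomist()

# `chains` holds your logged runs: dicts of {question, steps, final_answer, finished}.
modes = Counter(
    tx.classify(c["question"], c["steps"], c["final_answer"], finished=c["finished"]).primary_mode
    for c in chains
)
for mode, count in modes.most_common():
    print(f"{mode:<22} {count}")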
Progressive calibration
As you accumulate agent logs, the scorer automatically improves:
# After 5+ chains (any domain) — activates GMM density estimation
monitor.calibrate(chains) # list of {question, steps, final_answer, finished}
# After 50+ labeled chains — activates QARA supervised obs-pool adapter
monitor.calibrate(chains, labeled=True) # chains must have "correct": True/False
# Check current status and expected AUROC
print(monitor.scorer.status())
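A minimal calibration workflow, assuming you persist each run as one JSON object per line; the file name and log format are illustrative, not part of the package:

import json
from qppg_service import QppgMonitor

monitor = QppgMonitor(threshold=0.65)

# Each line: {"question": ..., "steps": [...], "final_answer": ..., "finished": true, "correct": true}
with open("agent_runs.jsonl") as f:
    chains = [json.loads(line) for line in f]

monitor.calibrate(chains)                       # 5+ chains: GMM density estimation activates
if len(chains) >= 50 and all("correct" in c for c in chains):
    monitor.calibrate(chains, labeled=True)     # 50+ labeled chains: QARA adapter activates
print(monitor.scorer.status())                  # current tier and expected AUROC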
Retrieval quality diagnostics
A standalone signal that tells you which search steps are failing:
from qppg_service import LabelFreeScorer
scorer = LabelFreeScorer()
rq = scorer.retrieval_quality(question, steps)
# {"mean_sim": 0.41, "min_sim": 0.22, "quality_label": "POOR", "per_step": [...]}
Correct agents average mean_sim = 0.554; incorrect agents average 0.458 (Δ = +0.096, p < 0.01).
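For example, mean_sim can gate answering before the agent finishes. A sketch of a hypothetical helper that reuses the 0.35 RETRIEVAL_FAILURE threshold from the table above:

from qppg_service import LabelFreeScorer

scorer = LabelFreeScorer()

def retrieval_looks_weak(question, steps, threshold=0.35):
    # Flags chains whose observations are, on average, poorly aligned with the question.
    # 0.35 mirrors the RETRIEVAL_FAILURE trigger above; tune it for your corpus.
    rq = scorer.retrieval_quality(question, steps)
    return rq["mean_sim"] < threshold

if retrieval_looks_weak(question, steps):
    print("Weak retrieval; consider a REPHRASE_QUERY before finishing.")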
HTTP API (for multi-language / microservice deployments)
pip install "llm-guard-kit[server]"
python -m qppg_service.server --host 0.0.0.0 --port 8080
curl -X POST http://localhost:8080/score \
-H "Content-Type: application/json" \
-d '{"question": "...", "steps": [...], "final_answer": "..."}'
Legacy API (v0.1.x — still supported)
from llm_guard import LLMGuard
guard = LLMGuard(api_key="sk-ant-...")
guard.fit(correct_questions=["What is the capital of France?", ...])
result = guard.query("What is 15% of 240?")
print(result.risk_score) # 0.12 (lower = lower failure risk)
The original LLMGuard class (exp21–23; within-domain AUROC 0.966–1.000 on MATH/HumanEval/TriviaQA) remains fully functional.
Agent step format
steps = [
    {
        "thought": "I need to find when the Eiffel Tower was built.",
        "action_type": "Search",
        "action_arg": "Eiffel Tower construction date",
        "observation": "The Eiffel Tower was built between 1887 and 1889..."
    },
    {
        "thought": "I now have the answer.",
        "action_type": "Finish",
        "action_arg": "1889",
        "observation": ""
    }
]
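If your framework records traces in its own shape, a small adapter is enough. This sketch assumes a hypothetical trace of (thought, tool_name, tool_input, tool_output) tuples:

def to_guard_steps(trace):
    # `trace`: iterable of (thought, tool_name, tool_input, tool_output) tuples
    # produced by your own agent framework (hypothetical format).
    return [
        {
            "thought": thought,
            "action_type": tool_name,     # e.g. "Search" or "Finish"
            "action_arg": tool_input,
            "observation": tool_output or "",
        }
        for thought, tool_name, tool_input, tool_output in trace
    ]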
Research background
Built on experiments exp18–45, validated on HotpotQA, NaturalQuestions, TriviaQA, and GSM8K:
- Behavioral signals (step count, completion, answer gap): AUROC 0.879 — zero calibration
- GMM density estimation on chain embeddings: +0.004 within, +0.094 cross-domain
- QARA supervised adapter on observation-pool embeddings: best cross-domain (0.675, p=0.025)
- LLM-as-judge (gpt-4o-mini, exp41): 0.895 within-domain when combined with SC8
Paper draft: docs/research_paper.md
License
MIT