
llm-guard-kit

Real-time reliability monitoring, failure diagnosis, and self-repair for LLM agents.



What it does

llm-guard-kit wraps any ReAct / tool-calling LLM agent with a four-tier reliability stack — no labels required on day one:

| Tier | Component         | What it does                                                                          |
|------|-------------------|---------------------------------------------------------------------------------------|
| 0    | LabelFreeScorer   | Risk score per query in <15 ms. Zero cold-start using behavioral signals.              |
| 1    | QppgMonitor       | Drop-in agent monitor. Auto-calibrates, fires alerts, exports reports.                 |
| 3    | FailureTaxonomist | Diagnoses why a chain failed (retrieval failure, excessive search, hallucination, …).  |
| 4    | SelfHealer        | Converts failure diagnosis into prompt injections that repair the agent mid-run.       |

Validated AUROC (HotpotQA multi-hop QA, 200 chains):

| Condition                                    | Within-domain | Cross-domain |
|----------------------------------------------|---------------|--------------|
| n=0 chains: behavioral signals only (SC2)    | 0.879         | 0.570        |
| n≥5 chains: + GMM density (SC8)              | 0.883         | 0.664        |
| n≥50 labeled: + QARA obs-pool adapter        | 0.742         | 0.675        |
| + LLM judge (gpt-4o-mini, J2 = SC8 + judge)  | 0.895         | 0.660        |

Install

pip install llm-guard-kit                    # core (no API key needed)
pip install "llm-guard-kit[qara]"            # + QARA supervised adapter (torch)
pip install "llm-guard-kit[server]"          # + FastAPI HTTP server

Requires Python 3.9+. No API key required for zero-label monitoring.


Quick start — zero labels, zero cold-start

from qppg_service import QppgMonitor

monitor = QppgMonitor(threshold=0.65)   # fires above this risk score

# Call after every agent run
alert = monitor.track(
    question    = "Which city is older, Rome or Athens?",
    steps       = agent_steps,           # list of {thought, action_type, action_arg, observation}
    final_answer= agent.final_answer,
    finished    = True,
)

if alert:
    print(f"HIGH RISK ({alert.risk_score:.2f}): {alert.recommendation}")

# Export a full monitoring report
print(monitor.export_report())
monitor.export_csv("agent_risk_log.csv")

Works on query 1. No training. No labels. AUROC 0.879 within-domain.


Full pipeline — detect → diagnose → repair

from qppg_service import QppgMonitor, FailureTaxonomist, SelfHealer

monitor = QppgMonitor(threshold=0.65)
tx      = FailureTaxonomist()
healer  = SelfHealer()

alert = monitor.track(question, steps, final_answer, finished=True)

if alert:
    # Diagnose WHY it failed
    failure = tx.classify(question, steps, final_answer, finished=True)
    print(failure.primary_mode)   # "EXCESSIVE_SEARCH" | "RETRIEVAL_FAILURE" | ...
    print(failure.explanation)    # human-readable explanation

    # Get a repair prompt to inject into the agent
    action = healer.suggest(failure, question, steps, final_answer)
    print(action.action_type)       # "FORCE_FINISH" | "REPHRASE_QUERY" | ...
    print(action.prompt_injection)  # ready to inject as next agent message
    print(action.urgency)           # "HIGH" | "MEDIUM" | "LOW"
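
The prompt_injection string is meant to go back to the agent as its next input. Exactly how depends on your agent framework; continuing from the block above, the sketch below assumes a hypothetical agent object with a messages list and a run_step() method (neither is part of llm-guard-kit), then re-scores the chain after the repair attempt.

# Hypothetical agent interface -- adapt to whatever framework drives your agent.
if alert and action.urgency == "HIGH":
    # Inject the repair prompt as the next message the agent sees,
    # then let it take one more step with the corrected guidance.
    agent.messages.append({"role": "user", "content": action.prompt_injection})
    steps.append(agent.run_step())

    # Re-score the repaired chain.
    alert = monitor.track(question, steps, agent.final_answer, finished=True)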

Failure modes detected:

| Mode                  | Trigger                                       | Suggested repair            |
|-----------------------|-----------------------------------------------|-----------------------------|
| RETRIEVAL_FAILURE     | mean cosine(obs, question) < 0.35             | REPHRASE_QUERY              |
| EXCESSIVE_SEARCH      | > 4 search steps                              | CONSOLIDATE or FORCE_FINISH |
| CONFLICTING_EVIDENCE  | high thought variance + high query diversity  | CONSOLIDATE                 |
| INSUFFICIENT_EVIDENCE | weak retrieval + ≥ 2 searches                 | ADDITIONAL_SEARCH           |
| ANSWER_UNSUPPORTED    | answer words absent from reasoning            | VERIFY_ANSWER               |
| PREMATURE_STOP        | ≤ 1 search, no clean finish                   | ADDITIONAL_SEARCH (urgent)  |
| LOW_RISK              | no flags                                      | NO_ACTION                   |
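
Each trigger is a cheap heuristic over the chain itself. As an illustration only (not the library's implementation, and the embedding model llm-guard-kit uses internally is not specified here), the RETRIEVAL_FAILURE rule can be reproduced with any sentence-embedding model. The sketch below uses sentence-transformers and reuses the steps list from the examples above.

# Illustration of the RETRIEVAL_FAILURE rule: mean cosine(observation, question) < 0.35.
# The model name "all-MiniLM-L6-v2" is an assumption, not the library's internal choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

observations = [s["observation"] for s in steps
                if s["action_type"] == "Search" and s["observation"]]

if observations:
    q_emb    = model.encode(question, convert_to_tensor=True)
    obs_emb  = model.encode(observations, convert_to_tensor=True)
    mean_sim = util.cos_sim(q_emb, obs_emb).mean().item()
    if mean_sim < 0.35:
        print(f"Likely RETRIEVAL_FAILURE: mean cosine similarity {mean_sim:.2f}")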

Progressive calibration

As you accumulate agent logs, the scorer automatically improves:

# After 5+ chains (any domain) — activates GMM density estimation
monitor.calibrate(chains)                    # list of {question, steps, final_answer, finished}

# After 50+ labeled chains — activates QARA supervised obs-pool adapter
monitor.calibrate(chains, labeled=True)      # chains must have "correct": True/False

# Check current status and expected AUROC
print(monitor.scorer.status())
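
Each entry in chains is the same dict you pass to track(), and the labeled stage additionally needs a correct flag, as noted in the comments above. A minimal sketch of building that list from your own run log (agent_run_log is a placeholder for wherever you store past runs, not part of llm-guard-kit):

# Accumulate chains in the shape calibrate() expects (per the comments above).
chains = []
for record in agent_run_log:                 # your own storage of past runs (placeholder)
    chains.append({
        "question":     record["question"],
        "steps":        record["steps"],
        "final_answer": record["final_answer"],
        "finished":     record["finished"],
        "correct":      record["correct"],   # only required for labeled=True
    })

if len(chains) >= 50:
    monitor.calibrate(chains, labeled=True)  # QARA supervised adapter
elif len(chains) >= 5:
    monitor.calibrate(chains)                # GMM density estimation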

Retrieval quality diagnostics

A standalone signal that tells you which search steps are failing:

from qppg_service import LabelFreeScorer

scorer = LabelFreeScorer()
rq = scorer.retrieval_quality(question, steps)
# {"mean_sim": 0.41, "min_sim": 0.22, "quality_label": "POOR", "per_step": [...]}

Correct agents average mean_sim = 0.554; wrong agents 0.458 (Δ+0.096, p<0.01).
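
That gap suggests a simple gate: treat chains whose mean_sim sits near or below the wrong-agent average as candidates for a re-query. Continuing from the block above, a minimal sketch using only the fields shown there (the 0.46 cutoff is illustrative, not a library default):

# Use the retrieval-quality signal as a cheap gate before trusting an answer.
rq = scorer.retrieval_quality(question, steps)

if rq["quality_label"] == "POOR" or rq["mean_sim"] < 0.46:   # illustrative cutoff
    print(f"Weak retrieval (mean_sim={rq['mean_sim']:.2f}, min_sim={rq['min_sim']:.2f}); "
          "consider rephrasing the search queries before accepting the answer.")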


HTTP API (for multi-language / microservice deployments)

pip install "llm-guard-kit[server]"
python -m qppg_service.server --host 0.0.0.0 --port 8080
curl -X POST http://localhost:8080/score \
  -H "Content-Type: application/json" \
  -d '{"question": "...", "steps": [...], "final_answer": "..."}'

Legacy API (v0.1.x — still supported)

from llm_guard import LLMGuard

guard = LLMGuard(api_key="sk-ant-...")
guard.fit(correct_questions=["What is the capital of France?", ...])
result = guard.query("What is 15% of 240?")
print(result.risk_score)   # 0.12 (lower = lower failure risk)

The original LLMGuard class (exp21–23, within-domain AUROC 0.966–1.000 on MATH/HumanEval/TriviaQA) remains fully functional.


Agent step format

steps = [
    {
        "thought":      "I need to find when the Eiffel Tower was built.",
        "action_type":  "Search",
        "action_arg":   "Eiffel Tower construction date",
        "observation":  "The Eiffel Tower was built between 1887 and 1889..."
    },
    {
        "thought":      "I now have the answer.",
        "action_type":  "Finish",
        "action_arg":   "1889",
        "observation":  ""
    }
]
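
If your framework records traces in some other shape (tuples, objects, events), a thin adapter is all that is needed. The sketch below assumes a hypothetical trace of (thought, action, argument, observation) tuples; to_guard_steps and my_agent_trace are placeholders, not part of llm-guard-kit.

# Hypothetical adapter: raw (thought, action, argument, observation) tuples -> llm-guard-kit step dicts.
def to_guard_steps(raw_trace):
    return [
        {
            "thought":     thought,
            "action_type": action,       # e.g. "Search" or "Finish"
            "action_arg":  argument,
            "observation": observation or "",
        }
        for (thought, action, argument, observation) in raw_trace
    ]

steps = to_guard_steps(my_agent_trace)   # my_agent_trace: your framework's own log (placeholder)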

Research background

Built on experiments exp18–45, validated against HotpotQA, NaturalQuestions, TriviaQA, and GSM8K:

  • Behavioral signals (step count, completion, answer gap): AUROC 0.879 — zero calibration
  • GMM density estimation on chain embeddings: +0.004 within, +0.094 cross-domain
  • QARA supervised adapter on observation-pool embeddings: best cross-domain (0.675, p=0.025)
  • LLM-as-judge (gpt-4o-mini, exp41): 0.895 within-domain when combined with SC8

Paper draft: docs/research_paper.md


License

MIT
