
llm-guard

Predict, diagnose, and repair LLM failures automatically.



What it does

llm-guard wraps any LLM call with a three-stage reliability layer:

  1. Predict — scores every query for failure risk in <15ms before the LLM responds
  2. Diagnose — clusters accumulated failures into a labeled error taxonomy
  3. Heal — synthesises targeted repair instructions from failure patterns; applies them automatically on future queries

Validated results (Claude Haiku, internal benchmarks):

Benchmark   Task type    AUROC   Precision@10
MATH-500    Math         0.966   100%
HumanEval   Code         0.993   100%
TriviaQA    Factual QA   0.992   100%

Cost: <$0.25 to validate on 664 benchmark problems.


Install

pip install llm-guard-kit

Requires Python 3.9+ and an Anthropic API key.


Quick start — three calibration paths

Path A: You have labeled correct examples

from llm_guard import LLMGuard

guard = LLMGuard(api_key="sk-ant-...")

# Fit on questions your LLM is known to handle correctly
guard.fit(correct_questions=[
    "What is the capital of France?",
    "What is 12 * 15?",
    # ... 50+ examples recommended
])

result = guard.query("What is 15% of 240?")
print(result.answer)      # "36"
print(result.confidence)  # "high" | "medium" | "low"
print(result.risk_score)  # 0.12  (lower = more familiar = lower failure risk)
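Once calibrated, the risk fields can drive routing decisions. A minimal sketch of one way to act on them (the `route` helper and its thresholds are illustrative, not part of llm-guard's API):

```python
# Illustrative routing on GuardResult-style fields; `route` is a
# hypothetical helper, not part of the library.
def route(confidence: str, risk_score: float) -> str:
    """Decide how to handle an answer based on guard output."""
    if confidence == "high" and risk_score < 0.3:
        return "auto_accept"        # trust the answer directly
    if confidence == "low":
        return "human_review"       # flag for a person to check
    return "verify"                 # e.g. re-ask or run a checker

print(route("high", 0.12))  # auto_accept
print(route("low", 0.85))   # human_review
```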

Path B: No labels — use self-consistency

guard = LLMGuard(api_key="sk-ant-...")

# Runs each question 5 times; those with 80%+ agreement are "probably correct"
guard.fit_from_consistency(
    questions=my_question_pool,  # 100–500 questions
    n_samples=5,
    agreement_threshold=0.8,
)

result = guard.query("Explain the water cycle.")
print(result.confidence)  # "high"
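The "agreement" in Path B can be understood as the fraction of sampled answers that match the modal answer. A rough sketch of that computation (assumed; the library's internal answer normalisation may differ):

```python
from collections import Counter

def agreement(samples: list[str]) -> float:
    """Fraction of sampled answers matching the most common answer."""
    if not samples:
        return 0.0
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)

# 4 of 5 samples agree -> 0.8, exactly at the default threshold
print(agreement(["Paris", "Paris", "paris", "Lyon", "Paris"]))  # 0.8
```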

Path C: Automated verifier (code, math, SQL, schema)

def python_verifier(question, response):
    """Return True if the code response executes without raising."""
    try:
        exec(compile(response, "<llm>", "exec"), {})
        return True
    except Exception:
        return False

guard = LLMGuard(api_key="sk-ant-...")
guard.fit_from_execution(
    questions=coding_questions,
    verifier_fn=python_verifier,
)

result = guard.query("Write a function that reverses a string.")
print(result.answer)
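Because `verifier_fn` only needs to return True or False, any programmatic check works. For example, a numeric-answer verifier for math questions; this helper is illustrative (the `(question, response)` signature follows the example above, and you would bind the expected answer per question, e.g. with `functools.partial`):

```python
import re
from functools import partial

def numeric_verifier(question, response, expected, tol=1e-6):
    """True if the last number in the response matches the expected value."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return False
    return abs(float(numbers[-1]) - expected) < tol

# Bind the expected answer before handing the callable to a fitting routine
check_36 = partial(numeric_verifier, expected=36.0)
print(check_36("What is 15% of 240?", "The answer is 36."))  # True
```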

Error Autopsy

Cluster accumulated failures into a labeled taxonomy (read-only, does not modify guard state):

clusters = guard.diagnose(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,   # optional but enables suggested_fix
)

for c in clusters:
    print(f"Cluster {c['cluster_id']} ({c['size']} failures): {c['label']}")
    print(f"  Fix: {c.get('suggested_fix', 'n/a')}")

Example output:

Cluster 0 (12 failures): The model misreads multi-step word problems,
  computing intermediate values correctly but applying them to the wrong sub-question.
  Fix: Explicitly label each sub-goal before computing.
Cluster 1 (8 failures): Off-by-one errors in loop boundary conditions.
  Fix: Always verify that loop indices match the stated range inclusivity.

Prompt Healer

Learn from failures and auto-apply targeted repairs on future queries in the same error cluster:

guard.learn_from_errors(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,
)

# Future queries near a known failure cluster get the repair instruction injected automatically
result = guard.query("If a train travels 60 mph for 2.5 hours, how far does it go?")
print(result.tool_used)   # "error_fix_0"  ← repair tool was applied
print(result.confidence)  # "medium"

GuardResult fields

Field         Type         Description
answer        str          LLM response text
risk_score    float        Mean KNN distance; higher = more likely to fail
confidence    str          "high" / "medium" / "low"
tool_used     str | None   Repair tool ID if applied
cluster_id    int | None   Error cluster ID if matched
was_retried   bool         True if a resource-failure retry fired
raw_response  str          Full LLM response (currently identical to answer)

Constructor parameters

guard = LLMGuard(
    api_key="sk-ant-...",           # Anthropic key (or set ANTHROPIC_API_KEY)
    model="claude-haiku-4-5-20251001",  # any Claude model
    embedding_model="all-MiniLM-L6-v2", # sentence-transformers model
    n_neighbors=5,                  # k for KNN scoring
)

How it works

The failure predictor uses KNN anomaly scoring on sentence-transformer embeddings:

  1. During calibration, embed all known-correct questions → build a KNN index
  2. At query time, embed the new question → compute mean distance to k nearest correct examples
  3. High distance = unfamiliar territory = high failure risk (AUROC 0.966–0.993)

Risk thresholds are auto-calibrated from the training distribution (75th and 95th percentile), so they work across any domain without manual tuning.
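The mechanism above can be sketched with plain NumPy on toy embeddings (illustrative only; the library uses sentence-transformer embeddings, and this sketch skips leave-one-out handling during calibration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings of known-correct questions (the calibration set)
calib = rng.normal(0.0, 1.0, size=(200, 8))

def risk_score(query_emb, index, k=5):
    """Mean Euclidean distance to the k nearest calibration embeddings."""
    dists = np.linalg.norm(index - query_emb, axis=1)
    return float(np.sort(dists)[:k].mean())

# Auto-calibrate thresholds from the training distribution itself
train_scores = np.array([risk_score(e, calib) for e in calib])
med_thresh, high_thresh = np.percentile(train_scores, [75, 95])

in_dist = risk_score(rng.normal(0.0, 1.0, 8), calib)   # familiar query
out_dist = risk_score(rng.normal(5.0, 1.0, 8), calib)  # unfamiliar query
print(out_dist > in_dist)  # True: unfamiliar queries score higher
```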

Failure-type detection (applied at medium/high risk):

  • stop_reason == "max_tokens" → resource failure → retry with 2x tokens (no tool)
  • Otherwise → reasoning failure → apply synthesised cluster repair tool
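This dispatch is simple enough to sketch directly (function and field names here are illustrative; the library's internals may differ):

```python
def handle_failure(stop_reason, max_tokens):
    """Route a medium/high-risk failure to a retry or a repair tool."""
    if stop_reason == "max_tokens":
        # Resource failure: the answer was cut off, so retry with more room
        return {"action": "retry", "max_tokens": max_tokens * 2, "tool": None}
    # Reasoning failure: inject the matched cluster's repair instruction
    return {"action": "repair", "max_tokens": max_tokens, "tool": "error_fix_0"}

print(handle_failure("max_tokens", 1024))
# {'action': 'retry', 'max_tokens': 2048, 'tool': None}
```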

Limitations

  • Calibration quality matters. fit() requires ≥6 correct examples; fit_from_consistency() works best when baseline accuracy is >70%. With very low baseline accuracy, few questions will agree across samples.
  • Embeddings are language-level. The predictor detects unfamiliar phrasing, not unfamiliar reasoning steps. Two questions that look similar but require different reasoning may get similar scores.
  • Repair tools are heuristic. learn_from_errors() synthesises prompt additions using the LLM; they help on average but are not guaranteed to fix every instance of a cluster.
  • Currently Anthropic-only. OpenAI/other provider support is on the roadmap.
  • Not a security filter. This tool predicts factual/reasoning failures, not prompt injection or jailbreaks.

Roadmap

  • OpenAI and Ollama provider support
  • Async/streaming API
  • Save/load guard state (.save() / .load())
  • Score-only mode (no LLM call required)
  • Dashboard for failure cluster visualization

License

MIT. See LICENSE.


Citation

If you use this in research:

Majumder, A. (2025). LLM Reliability Guard: KNN-based failure prediction
for large language models. AUROC 0.966–0.993 on math, code, and factual QA.
https://github.com/avighan/qppg
