
llm-guard

Predict, diagnose, and repair LLM failures automatically.



What it does

llm-guard wraps any LLM call with a three-stage reliability layer:

  1. Predict — scores every query for failure risk in <15ms before the LLM responds
  2. Diagnose — clusters accumulated failures into a labeled error taxonomy
  3. Heal — synthesises targeted repair instructions from failure patterns; applies them automatically on future queries

Validated results (Claude Haiku, internal benchmarks):

Benchmark   Task type    AUROC   Precision@10
MATH-500    Math         0.966   100%
HumanEval   Code         0.993   100%
TriviaQA    Factual QA   0.992   100%

Cost: <$0.25 to validate on 664 benchmark problems.


Install

pip install llm-guard

Requires Python 3.9+ and an Anthropic API key.


Quick start — three calibration paths

Path A: You have labeled correct examples

from llm_guard import LLMGuard

guard = LLMGuard(api_key="sk-ant-...")

# Fit on questions your LLM is known to handle correctly
guard.fit(correct_questions=[
    "What is the capital of France?",
    "What is 12 * 15?",
    # ... 50+ examples recommended
])

result = guard.query("What is 15% of 240?")
print(result.answer)      # "36"
print(result.confidence)  # "high" | "medium" | "low"
print(result.risk_score)  # 0.12  (lower = more familiar = lower failure risk)

Path B: No labels — use self-consistency

guard = LLMGuard(api_key="sk-ant-...")

# Runs each question 5 times; those with 80%+ agreement are "probably correct"
guard.fit_from_consistency(
    questions=my_question_pool,  # 100–500 questions
    n_samples=5,
    agreement_threshold=0.8,
)

result = guard.query("Explain the water cycle.")
print(result.confidence)  # "high"
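
The labeling rule itself is straightforward. A minimal sketch, assuming exact-match comparison of normalized answers (the library may compare answers more loosely):

from collections import Counter

def agreement(answers, threshold=0.8):
    """Keep a question as "probably correct" when the modal answer
    reaches the agreement threshold across samples."""
    counts = Counter(a.strip().lower() for a in answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers) >= threshold

agreement(["36", "36", "36", "36", "35"])  # 4/5 = 0.8 agreement -> True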

Path C: Automated verifier (code, math, SQL, schema)

def python_verifier(question, response):
    """Return True if the generated code compiles and executes without raising."""
    # Note: exec'ing untrusted model output should be sandboxed in production.
    try:
        exec(compile(response, "<llm>", "exec"), {})
        return True
    except Exception:
        return False

guard = LLMGuard(api_key="sk-ant-...")
guard.fit_from_execution(
    questions=coding_questions,
    verifier_fn=python_verifier,
)

result = guard.query("Write a function that reverses a string.")
print(result.answer)
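
Because verifier_fn is just a callable taking (question, response), the same path covers other checkable outputs. A sketch of a SQL verifier, assuming SCHEMA_DDL holds your schema's CREATE statements (the name is illustrative, not part of the library):

import sqlite3

def sql_verifier(question, response):
    """Return True if the generated SQL runs against an in-memory copy of the schema."""
    try:
        conn = sqlite3.connect(":memory:")
        conn.executescript(SCHEMA_DDL)  # illustrative: your schema's DDL
        conn.execute(response)
        return True
    except sqlite3.Error:
        return False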

Error Autopsy

Cluster accumulated failures into a labeled taxonomy (read-only, does not modify guard state):

clusters = guard.diagnose(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,   # optional but enables suggested_fix
)

for c in clusters:
    print(f"Cluster {c['cluster_id']} ({c['size']} failures): {c['label']}")
    print(f"  Fix: {c.get('suggested_fix', 'n/a')}")

Example output:

Cluster 0 (12 failures): The model misreads multi-step word problems,
  computing intermediate values correctly but applying them to the wrong sub-question.
  Fix: Explicitly label each sub-goal before computing.
Cluster 1 (8 failures): Off-by-one errors in loop boundary conditions.
  Fix: Always verify that loop indices match the stated range inclusivity.

Prompt Healer

Learn from failures and auto-apply targeted repairs on future queries in the same error cluster:

guard.learn_from_errors(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,
)

# Future queries near a known failure cluster get the repair instruction injected automatically
result = guard.query("If a train travels 60 mph for 2.5 hours, how far does it go?")
print(result.tool_used)   # "error_fix_0"  ← repair tool was applied
print(result.confidence)  # "medium"

GuardResult fields

Field         Type         Description
answer        str          LLM response text
risk_score    float        Mean KNN distance; higher = more likely to fail
confidence    str          "high" / "medium" / "low"
tool_used     str | None   Repair tool ID if applied
cluster_id    int | None   Error cluster ID if matched
was_retried   bool         True if a resource-failure retry fired
raw_response  str          Full LLM response (currently identical to answer)
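
One way an application might consume these fields (the routing policy below is illustrative, not part of the library):

result = guard.query(user_question)
if result.confidence == "low":
    answer = escalate_to_stronger_model(user_question)  # hypothetical fallback
else:
    answer = result.answer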

Constructor parameters

guard = LLMGuard(
    api_key="sk-ant-...",           # Anthropic key (or set ANTHROPIC_API_KEY)
    model="claude-haiku-4-5-20251001",  # any Claude model
    embedding_model="all-MiniLM-L6-v2", # sentence-transformers model
    n_neighbors=5,                  # k for KNN scoring
)

How it works

The failure predictor uses KNN anomaly scoring on sentence-transformer embeddings:

  1. During calibration, embed all known-correct questions → build a KNN index
  2. At query time, embed the new question → compute mean distance to k nearest correct examples
  3. High distance = unfamiliar territory = high failure risk (AUROC 0.966–0.993)

Risk thresholds are auto-calibrated from the training distribution (75th and 95th percentile), so they work across any domain without manual tuning.
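
Stripped of the library plumbing, the scorer amounts to a few lines. An illustrative sketch using sentence-transformers and scikit-learn directly (variable names are ours, not the library's internals):

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Calibration: embed known-correct questions and index them
train = embedder.encode(correct_questions)        # assumed list[str]
knn = NearestNeighbors(n_neighbors=5).fit(train)

# Auto-calibrate thresholds from the training distribution
# (self-distances are included here; acceptable for a sketch)
scores = knn.kneighbors(train)[0].mean(axis=1)
medium_cut, high_cut = np.percentile(scores, [75, 95])

# Query time: mean distance to the k nearest correct examples
risk = knn.kneighbors(embedder.encode(["What is 15% of 240?"]))[0].mean()
confidence = "high" if risk < medium_cut else "medium" if risk < high_cut else "low"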

Failure-type detection (applied at medium/high risk; see the sketch after this list):

  • stop_reason == "max_tokens" → resource failure → retry with 2x tokens (no tool)
  • Otherwise → reasoning failure → apply synthesised cluster repair tool
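
A rough sketch of that branch against the Anthropic Python SDK (the repair-tool application is elided; only the retry path is shown):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer_with_retry(prompt, max_tokens=1024):
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    if resp.stop_reason == "max_tokens" and max_tokens < 8192:
        # Resource failure: retry with double the token budget
        return answer_with_retry(prompt, max_tokens * 2)
    # Otherwise a reasoning failure at medium/high risk would get the
    # synthesised cluster repair instruction injected before the call.
    return resp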

Limitations

  • Calibration quality matters. fit() requires ≥6 correct examples; fit_from_consistency() works best when baseline accuracy is >70%. With very low baseline accuracy, few questions will agree across samples.
  • Embeddings are language-level. The predictor detects unfamiliar phrasing, not unfamiliar reasoning steps. Two questions that look similar but require different reasoning may get similar scores.
  • Repair tools are heuristic. learn_from_errors() synthesises prompt additions using the LLM — they help on average but are not guaranteed to fix every instance of a cluster.
  • Currently Anthropic-only. OpenAI/other provider support is on the roadmap.
  • Not a security filter. This tool predicts factual/reasoning failures, not prompt injection or jailbreaks.

Roadmap

  • OpenAI and Ollama provider support
  • Async/streaming API
  • Save/load guard state (.save() / .load())
  • Score-only mode (no LLM call required)
  • Dashboard for failure cluster visualization

License

MIT. See LICENSE.


Citation

If you use this in research:

Majumder, A. (2025). LLM Reliability Guard: KNN-based failure prediction
for large language models. AUROC 0.966–0.993 on math, code, and factual QA.
https://github.com/avighan/qppg
