# llm-guard

Predict, diagnose, and repair LLM failures automatically. AUROC 0.966–0.993.
## What it does
llm-guard wraps any LLM call with a three-stage reliability layer:
- Predict — scores every query for failure risk in <15ms before the LLM responds
- Diagnose — clusters accumulated failures into a labeled error taxonomy
- Heal — synthesises targeted repair instructions from failure patterns; applies them automatically on future queries
Validated results (Claude Haiku, internal benchmarks):
| Benchmark | Task type | AUROC | Precision@10 |
|---|---|---|---|
| MATH-500 | Math | 0.966 | 100% |
| HumanEval | Code | 0.993 | 100% |
| TriviaQA | Factual QA | 0.992 | 100% |
Cost: <$0.25 to validate on 664 benchmark problems.
## Install

```bash
pip install llm-guard-kit
```

Requires Python 3.9+ and an Anthropic API key.
## Quick start — three calibration paths
### Path A: You have labeled correct examples

```python
from llm_guard import LLMGuard

guard = LLMGuard(api_key="sk-ant-...")

# Fit on questions your LLM is known to handle correctly
guard.fit(correct_questions=[
    "What is the capital of France?",
    "What is 12 * 15?",
    # ... 50+ examples recommended
])

result = guard.query("What is 15% of 240?")
print(result.answer)      # "36"
print(result.confidence)  # "high" | "medium" | "low"
print(result.risk_score)  # 0.12 (lower = more familiar = lower failure risk)
```
### Path B: No labels — use self-consistency

```python
guard = LLMGuard(api_key="sk-ant-...")

# Runs each question 5 times; those with 80%+ agreement are "probably correct"
guard.fit_from_consistency(
    questions=my_question_pool,  # 100–500 questions
    n_samples=5,
    agreement_threshold=0.8,
)

result = guard.query("Explain the water cycle.")
print(result.confidence)  # "high"
```
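Conceptually, `fit_from_consistency` amounts to majority voting over repeated samples. A minimal sketch of the idea (illustrative only; `ask_llm` is a placeholder for your own completion call, not a library function):

```python
from collections import Counter

def probably_correct(question, ask_llm, n_samples=5, agreement_threshold=0.8):
    """Keep a question for calibration only if one answer dominates the samples."""
    answers = [ask_llm(question) for _ in range(n_samples)]
    top_count = Counter(answers).most_common(1)[0][1]  # size of the majority bloc
    return top_count / n_samples >= agreement_threshold
```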
### Path C: Automated verifier (code, math, SQL, schema)

```python
def python_verifier(question, response):
    """Return True if the code response compiles and executes without raising."""
    try:
        exec(compile(response, "<llm>", "exec"), {})
        return True
    except Exception:
        return False

guard = LLMGuard(api_key="sk-ant-...")
guard.fit_from_execution(
    questions=coding_questions,
    verifier_fn=python_verifier,
)

result = guard.query("Write a function that reverses a string.")
print(result.answer)
```
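Note that `python_verifier` executes model-generated code in the current process, which is fine for trusted benchmarks but risky otherwise. A more defensive variant (our suggestion, not part of the library) runs the code in a subprocess with a timeout:

```python
import subprocess
import sys

def sandboxed_verifier(question, response, timeout=5):
    """Return True if the generated code exits cleanly in a separate process."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", response],
            capture_output=True,
            timeout=timeout,  # kill infinite loops
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

For genuinely untrusted code, run the subprocess inside a container or jail; a timeout alone is not a sandbox.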
## Error Autopsy

Cluster accumulated failures into a labeled taxonomy (read-only, does not modify guard state):
```python
clusters = guard.diagnose(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,  # optional but enables suggested_fix
)

for c in clusters:
    print(f"Cluster {c['cluster_id']} ({c['size']} failures): {c['label']}")
    print(f"  Fix: {c.get('suggested_fix', 'n/a')}")
```
Example output:

```text
Cluster 0 (12 failures): The model misreads multi-step word problems,
computing intermediate values correctly but applying them to the wrong sub-question.
  Fix: Explicitly label each sub-goal before computing.
Cluster 1 (8 failures): Off-by-one errors in loop boundary conditions.
  Fix: Always verify that loop indices match the stated range inclusivity.
```
## Prompt Healer

Learn from failures and auto-apply targeted repairs to future queries in the same error cluster:
```python
guard.learn_from_errors(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,
)

# Future queries near a known failure cluster get the repair instruction injected automatically
result = guard.query("If a train travels 60 mph for 2.5 hours, how far does it go?")
print(result.tool_used)   # "error_fix_0" ← repair tool was applied
print(result.confidence)  # "medium"
```
## GuardResult fields

| Field | Type | Description |
|---|---|---|
| `answer` | `str` | LLM response text |
| `risk_score` | `float` | Mean KNN distance; higher = more likely to fail |
| `confidence` | `str` | `"high"` / `"medium"` / `"low"` |
| `tool_used` | `str \| None` | Repair tool ID if applied |
| `cluster_id` | `int \| None` | Error cluster ID if matched |
| `was_retried` | `bool` | `True` if a resource-failure retry fired |
| `raw_response` | `str` | Full LLM response (currently identical to `answer`) |
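These fields make it straightforward to act on predicted risk. A small usage sketch (the escalation policy here is our example, not a library feature):

```python
result = guard.query("What is the GDP of Atlantis?")  # deliberately unfamiliar

if result.confidence == "low":
    # High KNN distance: flag the answer instead of trusting it
    print(f"Needs review (risk={result.risk_score:.2f}): {result.answer}")
elif result.tool_used is not None:
    print(f"Repair {result.tool_used} applied for cluster {result.cluster_id}")
else:
    print(result.answer)
```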
## Constructor parameters

```python
guard = LLMGuard(
    api_key="sk-ant-...",                  # Anthropic key (or set ANTHROPIC_API_KEY)
    model="claude-haiku-4-5-20251001",     # any Claude model
    embedding_model="all-MiniLM-L6-v2",    # sentence-transformers model
    n_neighbors=5,                         # k for KNN scoring
)
```
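As the `api_key` comment notes, the key can also come from the environment, so the constructor should work without an explicit key:

```python
import os

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # or export it in your shell
guard = LLMGuard(model="claude-haiku-4-5-20251001")  # key picked up from the environment
```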
## How it works
The failure predictor uses KNN anomaly scoring on sentence-transformer embeddings:
- During calibration, embed all known-correct questions → build a KNN index
- At query time, embed the new question → compute mean distance to k nearest correct examples
- High distance = unfamiliar territory = high failure risk (AUROC 0.966–0.993)
Risk thresholds are auto-calibrated from the training distribution (75th and 95th percentile), so they work across any domain without manual tuning.
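A minimal sketch of that pipeline using sentence-transformers and scikit-learn directly; this illustrates the approach described above rather than the package's internal code, and `correct_questions` is the calibration list from Path A:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Calibration: embed known-correct questions and build the KNN index
train_vecs = embedder.encode(correct_questions)
knn = NearestNeighbors(n_neighbors=5).fit(train_vecs)

# Auto-calibrate confidence thresholds from the training distribution
# (a training point's nearest neighbor is itself; fine for a rough sketch)
train_scores = knn.kneighbors(train_vecs)[0].mean(axis=1)
medium_cut, high_cut = np.percentile(train_scores, [75, 95])

# Query time: mean distance to the k nearest correct examples
risk = knn.kneighbors(embedder.encode(["What is 15% of 240?"]))[0].mean()
confidence = "high" if risk < medium_cut else "medium" if risk < high_cut else "low"
```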
Failure-type detection (applied at medium/high risk):

- `stop_reason == "max_tokens"` → resource failure → retry with 2× tokens (no tool)
- Otherwise → reasoning failure → apply synthesised cluster repair tool
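Sketched as code, that dispatch looks roughly like this; `retry_query` and `apply_cluster_repair` are hypothetical stand-ins, not library functions:

```python
def route_failure(response, max_tokens):
    """Dispatch applied only when predicted risk is medium or high."""
    if response.stop_reason == "max_tokens":
        # Resource failure: the answer was truncated, so rerun with more room
        return retry_query(max_tokens=2 * max_tokens)   # hypothetical helper
    # Otherwise treat it as a reasoning failure and inject the cluster's repair
    return apply_cluster_repair(response)               # hypothetical helper
```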
## Limitations

- **Calibration quality matters.** `fit()` requires ≥6 correct examples; `fit_from_consistency()` works best when baseline accuracy is >70%. With very low baseline accuracy, few questions will agree across samples.
- **Embeddings are language-level.** The predictor detects unfamiliar phrasing, not unfamiliar reasoning steps. Two questions that look similar but require different reasoning may get similar scores.
- **Repair tools are heuristic.** `learn_from_errors()` synthesises prompt additions using the LLM; they help on average but are not guaranteed to fix every instance of a cluster.
- **Currently Anthropic-only.** OpenAI/other provider support is on the roadmap.
- **Not a security filter.** This tool predicts factual/reasoning failures, not prompt injection or jailbreaks.
## Roadmap

- OpenAI and Ollama provider support
- Async/streaming API
- Save/load guard state (`.save()` / `.load()`)
- Score-only mode (no LLM call required)
- Dashboard for failure cluster visualization
## License
MIT. See LICENSE.
## Citation

If you use this in research:

```text
Majumder, A. (2025). LLM Reliability Guard: KNN-based failure prediction
for large language models. AUROC 0.966–0.993 on math, code, and factual QA.
https://github.com/avighan/qppg
```