Monitor, diagnose, and auto-correct LLM failures — with XGBoost failure classification, question-type routing, auto-calibrating thresholds, Wikidata/Serper ground truth, and production analytics
Project description
Failure Intelligence Engine (FIE)
Real-time LLM failure detection, root cause diagnosis, and automatic correction.
What's New in v1.1.0
XGBoost Failure Classifier (AUC 0.728)
The rule-based POET algorithm is now backed by a trained XGBoost model. Every inference runs through the classifier post-pipeline and the high_failure_risk flag is set by the model, not just rules. AUC-ROC improved from 0.663 → 0.728 on TruthfulQA.
Question-Type Routing
FIE now classifies every prompt into one of five types before running the pipeline:
| Type | External GT used | Example |
|---|---|---|
FACTUAL |
Wikidata + Serper | "Who invented the telephone?" |
TEMPORAL |
Serper only | "What is Bitcoin's price today?" |
REASONING |
None (internal only) | "Why does entropy increase?" |
CODE |
None (internal only) | "Write a Python function to sort a list" |
OPINION |
None (skipped entirely) | "Should I use Python or JavaScript?" |
This eliminates false positives on code/opinion questions where Wikidata lookups produce wrong "corrections."
Auto-Calibrating Thresholds
XGBoost uses a different decision threshold per question type (FACTUAL=0.40, CODE=0.52, OPINION=0.60, etc.). After every 50 new feedback submissions the thresholds are automatically recalculated from real labeled data using sklearn.precision_recall_curve. No manual threshold tuning required.
Production Analytics (5 New API Endpoints)
See API Endpoints below. Includes daily request volume, XGBoost vs POET agreement, confidence calibration curves (ECE), per-type breakdown, and a single "paper-metrics" endpoint that returns everything needed for the results section of a research paper.
Opt-In SDK Telemetry
When users set FIE_TELEMETRY=true, the SDK sends anonymized pings (no prompts, no API keys) back to the server after each call. Admins can see field failure rates, SDK version distribution, and question-type breakdown from real users via GET /api/v1/analytics/sdk-telemetry.
MonitorResponse Enhancements
Every /monitor response now includes:
classifier_probability— raw XGBoost score (0–1)model_version— which model made the decision (xgboost-v2)config_version— which threshold config was active
FIE sits between your LLM and your users. When the model gives a wrong answer, FIE catches it, finds the correct answer from a trusted source, and returns the correction — before the user ever sees the mistake.
What It Does
LLMs hallucinate. They say "Thomas Edison invented the telephone" with the same confidence as correct answers. There is no built-in signal. The wrong answer simply goes out to the user.
FIE solves this in real time:
- Detect — runs the same prompt through 3 independent shadow models, computes an ensemble signal
- Diagnose — a jury of 3 specialist agents votes on the root cause (hallucination, injection, temporal cutoff, etc.)
- Verify — queries Wikidata or Google Search to find the correct answer
- Correct — returns the verified answer to the user instead of the wrong one
The integration is one decorator:
from fie import monitor
@monitor(fie_url="http://localhost:8000", api_key="fie-xxx", mode="correct")
def ask_ai(prompt: str) -> str:
return your_llm(prompt)
How It Works — Pipeline Overview
User Prompt
│
▼
Your LLM → primary answer → FIE
│
┌───────────┴────────────┐
│ │
Phase 1: Shadow Ensemble Phase 2: FSV
3 models answer in Agreement, entropy,
parallel (Groq) outlier detection
│
▼
Phase 3: Diagnostic Jury
3 agents vote on root cause
(AdversarialSpecialist, LinguisticAuditor, DomainCritic)
│
▼
Phase 4: Ground Truth Pipeline
Cache → Wikidata → Serper → Shadow consensus
│
▼
Phase 5: Fix Engine
Return corrected answer to user
Shadow Model Ensemble
Three shadow models from different families run in parallel on every query:
| Model | Provider | Why |
|---|---|---|
llama-3.3-70b-versatile |
Meta | Strong general knowledge |
deepseek-r1-distill-llama-70b |
DeepSeek | Reasoning-focused, different RLHF |
qwen-qwq-32b |
Alibaba | Different pretraining corpus |
Different families reduce correlated failure — if one model is wrong, the others are unlikely to make the same mistake.
Each shadow model self-reports its certainty. FIE weights votes by confidence:
| Model reports | Vote weight |
|---|---|
| CONFIDENCE: HIGH | 3.0 |
| CONFIDENCE: MEDIUM | 2.0 |
| CONFIDENCE: LOW | 1.0 |
Failure Signal Vector (FSV)
After collecting all 4 answers (1 primary + 3 shadows), FIE computes:
| Signal | What it measures |
|---|---|
agreement_score |
Fraction of models that gave the same answer |
fsd_score |
Gap between the top-2 answer clusters |
entropy_score |
Normalized Shannon entropy of the answer distribution |
ensemble_disagreement |
Embedding-based pairwise disagreement flag |
ensemble_similarity |
Cosine similarity between primary and secondary |
high_failure_risk |
Final risk flag (set by XGBoost v1.1+) |
question_type |
Classified prompt type: FACTUAL / TEMPORAL / REASONING / CODE / OPINION |
Shannon Entropy
H = -Σ p(x) × log₂(p(x))
H_normalized = H / log₂(total_outputs)
- All 4 models agree → entropy = 0.0 (no uncertainty)
- All 4 models differ → entropy = 1.0 (maximum uncertainty)
- 3 agree, 1 differs → entropy ≈ 0.41
Entropy is used alongside agreement because a 2-vs-2 split (entropy=1.0) is far more alarming than a 3-vs-1 split (entropy=0.41), even though both have low agreement.
Primary-Outlier Detection — POET Algorithm
high_failure_risk is set by POET (Primary Outlier Ensemble Test) — the core novel algorithm in FIE. It does not check overall ensemble agreement. It specifically checks whether the primary model is the one disagreeing with the shadow majority:
shadow_agreement = agreement among shadows only (primary excluded)
if shadow_agreement < 0.60 → can't blame primary (shadows confused)
else:
majority = most common shadow answer cluster
if primary semantically matches majority → NOT an outlier
if primary is far from majority (cosine sim < 0.72) → IS an outlier → high_failure_risk = True
This dropped the false positive rate from 80% to 20% compared to threshold-based ensemble agreement.
Archetype Classification
7 failure archetypes based on the FSV:
| Archetype | When it fires |
|---|---|
HALLUCINATION_RISK |
Ensemble disagrees AND high entropy |
OVERCONFIDENT_FAILURE |
High risk but very low entropy (confident but wrong) |
MODEL_BLIND_SPOT |
Systematic knowledge gap in a domain |
UNSTABLE_OUTPUT |
High entropy alone (genuine ambiguity) |
LOW_CONFIDENCE |
Low agreement without high entropy |
RESOURCE_CONSTRAINT |
High latency AND high entropy |
STABLE |
All signals within normal range |
Diagnostic Jury
Three agents independently analyze the failure and vote on the root cause:
AdversarialSpecialist — engine/agents/adversarial_specialist.py
Detects intentional attacks using 3 detection layers:
- Regex — patterns for PROMPT_INJECTION, JAILBREAK_ATTEMPT, INSTRUCTION_OVERRIDE, TOKEN_SMUGGLING
- Prompt Guard — statistical heuristic scorer
- FAISS semantic search — finds novel attacks similar to known attack vectors
Priority: TOKEN_SMUGGLING > PROMPT_INJECTION > JAILBREAK > OVERRIDE
LinguisticAuditor — engine/agents/linguistic_auditor.py
Detects structural response problems: excessive hedging, truncation, format inconsistency, length anomalies, repetition loops.
DomainCritic — engine/agents/domain_critic.py
Detects factual and temporal failures using 5 weighted layers:
| Layer | Weight | What it checks |
|---|---|---|
| Contradiction signal | 0.40 | FSV entropy + agreement vs thresholds |
| Self-contradiction | 0.35 | Cosine similarity between primary and secondary |
| Hedge detection | 0.15 | Uncertainty phrases in model outputs |
| Temporal detection | 0.10 | Time-sensitive keywords in prompt |
| External verification | 0.45 | Wikipedia/RAG fact check |
Permanent fact guard: Chemical formulas, math identities, and physical constants are never routed to temporal (Serper) verification — they are verified via Wikidata only.
Jury Aggregation
Priority 1: Adversarial verdict (if any agent detected an attack)
Priority 2: Temporal verdict (routes to live search)
Default: Highest confidence verdict wins
Ground Truth Pipeline
Runs only when both gates pass:
- Gate 1:
high_failure_risk = True - Gate 2:
jury_confidence >= 0.45
Pipeline steps:
1. Cache lookup (MongoDB ground_truth_cache)
→ HIT: return verified answer immediately
2. Permanent fact check
→ chemical formula / math / physics constant → Wikidata only (no Serper)
3. Temporal routing
→ root_cause = TEMPORAL_KNOWLEDGE_CUTOFF → Serper (Google Search)
→ all other root causes → Wikidata
4. Wikidata (SPARQL)
→ Extract claim: subject / property / value
→ Search Wikidata with enriched query
→ contradiction + confidence ≥ 0.75 → OVERRIDE
→ confirmation + confidence ≥ 0.60 → CONFIRM
5. Serper (Google Search)
→ contradicts primary → OVERRIDE with search answer
→ confirms primary → CONFIRM
6. Shadow consensus fallback
→ weighted shadow agreement ≥ 0.60 → use majority shadow answer
→ below 0.60 → ESCALATE to human review
7. Write-through cache
→ verified answer with confidence ≥ 0.90 → saved to cache
Fix Strategies
| Strategy | When used |
|---|---|
WIKIDATA_OVERRIDE |
Wikidata contradicts the primary answer |
SERPER_OVERRIDE |
Google Search contradicts the primary answer |
SHADOW_CONSENSUS |
External sources exhausted, shadows agree |
SANITIZE_AND_RERUN |
Adversarial attack detected |
CONTEXT_INJECTION |
Temporal failure, search result available |
PROMPT_DECOMPOSITION |
Question too complex |
HUMAN_ESCALATION |
No reliable ground truth, shadow consensus too weak |
NO_FIX |
Output is stable |
Root Causes
| Root Cause | Meaning |
|---|---|
FACTUAL_HALLUCINATION |
Model stated a wrong fact |
TEMPORAL_KNOWLEDGE_CUTOFF |
Model's training data is outdated |
KNOWLEDGE_BOUNDARY_FAILURE |
Model uncertain at edge of training data |
PROMPT_INJECTION |
User attempting to override system prompt |
JAILBREAK_ATTEMPT |
User attempting to bypass safety guidelines |
INSTRUCTION_OVERRIDE |
User claiming fake authority |
TOKEN_SMUGGLING |
Special model tokens embedded in user input |
PROMPT_COMPLEXITY_OOD |
Question out-of-distribution / too complex |
Signal Logging
Every inference is logged to MongoDB signal_logs with 30+ fields including agreement, entropy, archetype, root cause, jury confidence, GT source, fix applied, and latency. Human feedback can be submitted via POST /api/v1/feedback/{request_id} to label signal logs as correct or incorrect — building a labeled dataset for future classifier training.
SDK Modes
# mode="monitor" — async, no latency added
# FIE checks in background, original answer returned immediately
@monitor(fie_url="...", api_key="...", mode="monitor")
def ask_ai(prompt: str) -> str:
return your_llm(prompt)
# mode="correct" — synchronous, real-time correction
# FIE verifies and returns corrected answer if wrong
@monitor(fie_url="...", api_key="...", mode="correct")
def ask_ai(prompt: str) -> str:
return your_llm(prompt)
API Endpoints
| Method | Path | What it does |
|---|---|---|
| POST | /api/v1/monitor |
Main production endpoint — full pipeline |
| POST | /api/v1/diagnose |
Run diagnostic jury on provided outputs |
| POST | /api/v1/analyze |
Phase 1 signal extraction only |
| POST | /api/v1/feedback/{id} |
Submit human feedback on an inference |
| GET | /api/v1/inferences |
List stored inferences |
| GET | /api/v1/trend |
EMA-based degradation trend |
| GET | /api/v1/clusters |
Archetype cluster summary |
| GET | /api/v1/monitor/signal-logs |
Raw signal logs (admin) |
| GET | /api/v1/monitor/calibration |
Per-confidence-bucket accuracy stats (admin) |
| GET | /api/v1/analytics/usage |
Request volume, failure rate, daily breakdown |
| GET | /api/v1/analytics/model-performance |
XGBoost accuracy, per-question-type stats |
| GET | /api/v1/analytics/calibration |
Confidence calibration curves + ECE score |
| GET | /api/v1/analytics/question-breakdown |
Failure/fix/escalation rate per question type |
| GET | /api/v1/analytics/paper-metrics |
All research paper metrics in one call |
| GET | /api/v1/analytics/sdk-telemetry |
Field usage data from opted-in SDK users |
| GET | /health |
Server health check |
File Structure
Failure_Intelligence_System/
│
├── app/
│ ├── main.py FastAPI app entry point
│ ├── routes.py All API endpoints
│ ├── schemas.py Pydantic schemas (FSV, JuryVerdict, FixResult)
│ ├── auth.py / auth_guard.py API key authentication + tenant isolation
│ └── auth_routes.py Google OAuth routes
│
├── engine/
│ ├── groq_service.py Shadow model fan-out + confidence weighting
│ ├── encoder.py SentenceTransformer singleton (all-MiniLM-L6-v2)
│ ├── fix_engine.py Fix strategy selection and execution
│ ├── claim_extractor.py Extract subject/property/value from model output
│ ├── prompt_guard.py Statistical adversarial prompt scorer
│ ├── rag_grounder.py Wikipedia RAG for external verification
│ ├── question_classifier.py Rule-based question-type classifier (5 types)
│ ├── fie_config.py Auto-calibrating thresholds + MongoDB-backed config
│ └── failure_classifier.py XGBoost v2 failure classifier (AUC 0.728)
│ │
│ ├── detector/
│ │ ├── consistency.py compute_consistency(), is_primary_outlier()
│ │ ├── entropy.py Shannon entropy computation
│ │ ├── ensemble.py Pairwise embedding disagreement
│ │ └── embedding.py compute_embedding_distance()
│ │
│ ├── archetypes/
│ │ ├── labeling.py 7-archetype classification rules
│ │ ├── clustering.py Adaptive archetype cluster registry
│ │ └── registry.py FAISS index for adversarial pattern search
│ │
│ ├── agents/
│ │ ├── base_agent.py BaseJuryAgent, DiagnosticContext
│ │ ├── failure_agent.py DiagnosticJury + FailureAgent singletons
│ │ ├── adversarial_specialist.py 3-layer adversarial attack detection
│ │ ├── domain_critic.py 5-layer factual/temporal failure detection
│ │ └── linguistic_auditor.py Response structure and quality analysis
│ │
│ ├── verifier/
│ │ ├── ground_truth_pipeline.py GT pipeline orchestrator
│ │ ├── wikidata_verifier.py SPARQL queries against Wikidata
│ │ └── serper_verifier.py Google Search via Serper.dev
│ │
│ ├── evolution/
│ │ └── tracker.py EMA-based model degradation tracking
│ │
│ └── explainability/
│ └── explanation_builder.py Human-readable XAI explanation builder
│
├── fie/ Python SDK (pip install fie-sdk)
│ ├── monitor.py @monitor decorator
│ ├── client.py HTTP client for FIE server
│ └── config.py FIEConfig
│
├── storage/
│ ├── database.py MongoDB connection + inference CRUD
│ ├── signal_logger.py 30-field signal logging + feedback wiring
│ └── ground_truth_cache.py Verified answer cache (write-through)
│
├── Frontend/ React dashboard (Vite)
│
├── data/
│ ├── download_datasets.py TruthfulQA download (817 examples)
│ └── synthetic_generator.py Synthetic failure data generator
│
├── config.py Settings (thresholds, model names, flags)
├── test_local.py Group A/B recall + FPR benchmark test
├── test_ground_truth.py Ground truth pipeline isolation test
├── demo.py Interactive demo (chatbot with FIE)
└── FIE_COMPLETE_TECHNICAL_STORY.md Full technical documentation
Local Setup
Requirements
- Python 3.11+
- MongoDB Atlas URI
- Groq API key (free at console.groq.com)
- Node.js 18+ (for dashboard only)
1. Backend
git clone https://github.com/AyushSingh110/Failure_Intelligence_System.git
cd Failure_Intelligence_System
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS/Linux
pip install -r requirements.txt
2. Environment
Create .env in the project root:
MONGODB_URI=your_mongodb_atlas_uri
MONGODB_DB_NAME=fie_database
GROQ_API_KEY=gsk_your_groq_key
GROQ_ENABLED=true
WIKIDATA_ENABLED=true
GROUND_TRUTH_CACHE_ENABLED=true
# Optional — needed for temporal question verification
SERPER_API_KEY=your_serper_key
SERPER_ENABLED=true
OLLAMA_ENABLED=false
GOOGLE_CLIENT_ID=your-google-client-id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-google-client-secret
GOOGLE_REDIRECT_URI=http://localhost:5173
JWT_SECRET_KEY=replace-with-a-long-random-secret
JWT_ALGORITHM=HS256
JWT_EXPIRE_HOURS=24
ADMIN_EMAIL=your-admin-email@example.com
3. Start Server
uvicorn app.main:app --reload
# Server: http://localhost:8000
# API docs: http://localhost:8000/docs
4. Dashboard (optional)
cd Frontend
npm install
npm run dev
# Dashboard: http://localhost:5173
5. Run Demo
python demo.py
6. Run Tests
# Full recall + FPR benchmark
python test_local.py
# Ground truth pipeline isolation
python test_ground_truth.py
# Backend unit tests
pytest
Required APIs
| Service | Required | Purpose | Free tier |
|---|---|---|---|
| Groq | Yes | Shadow models | 14,400 req/day per model |
| MongoDB Atlas | Yes | Storage | 512MB free |
| Wikidata | Yes | Factual verification | No key needed |
| Serper.dev | Optional | Temporal verification | 2,500 searches/month |
Example Request
curl -X POST http://localhost:8000/api/v1/monitor \
-H "Content-Type: application/json" \
-H "X-API-Key: fie-your-key" \
-d '{
"prompt": "Who invented the telephone?",
"primary_output": "Thomas Edison invented the telephone.",
"primary_model_name": "gpt-4",
"run_full_jury": true
}'
Example response (trimmed):
{
"high_failure_risk": true,
"archetype": "MODEL_BLIND_SPOT",
"failure_signal_vector": {
"agreement_score": 0.75,
"entropy_score": 0.406,
"high_failure_risk": true
},
"jury": {
"primary_verdict": {
"root_cause": "FACTUAL_HALLUCINATION",
"confidence_score": 0.62
}
},
"ground_truth": {
"verified_answer": "Alexander Graham Bell",
"confidence": 0.85,
"source": "wikidata",
"from_cache": false
},
"fix_result": {
"fix_applied": true,
"fix_strategy": "WIKIDATA_OVERRIDE",
"fixed_output": "Alexander Graham Bell",
"original_output": "Thomas Edison"
}
}
Key Thresholds
| Parameter | Value | File |
|---|---|---|
| High entropy threshold | 0.75 | config.py |
| Low agreement threshold | 0.80 | config.py |
| Primary-outlier cosine threshold | 0.72 | engine/detector/consistency.py |
| Shadow agreement minimum | 0.60 | engine/detector/consistency.py |
| GT Gate — jury confidence minimum | 0.45 | app/routes.py |
| Wikidata override confidence | 0.75 | engine/verifier/ground_truth_pipeline.py |
| Cache write confidence | 0.90 | engine/verifier/ground_truth_pipeline.py |
| Shadow consensus minimum | 0.60 | engine/verifier/ground_truth_pipeline.py |
| Embedding dimensions | 384 | engine/encoder.py |
Technology Stack
- Backend: FastAPI, Pydantic, Python 3.11
- Failure Classifier: XGBoost v2 with auto-calibrating per-type thresholds
- Question Routing: Rule-based classifier (5 types: FACTUAL/TEMPORAL/REASONING/CODE/OPINION)
- Storage: MongoDB Atlas
- Shadow Models: Groq API (Llama, DeepSeek, Qwen)
- Semantic Encoder: SentenceTransformers
all-MiniLM-L6-v2 - Vector Search: FAISS
- Fact Verification: Wikidata SPARQL, Serper.dev
- Frontend: React, Vite
- Auth: Google OAuth, JWT
- SDK Telemetry: Opt-in anonymized usage pings (no PII)
- Deployment: Docker, Google Cloud Run, Vercel
Benchmark Results
Evaluated on TruthfulQA (817 adversarial questions designed to trigger LLM hallucinations). 869 labeled examples generated via the synthetic pipeline.
| Method | Recall | FPR | F1 | AUC-ROC |
|---|---|---|---|---|
| POET rule-based (baseline) | 56.4% | 38.7% | 58.7% | — |
| XGBoost v1 (equal FPR) | 65.5% | 40.2% | 63.7% | 0.663 |
| XGBoost v1 (best F1) | 80.5% | 50.6% | 69.7% | 0.663 |
| XGBoost v2 with GT features | — | — | — | 0.728 |
Cross-validation (5-fold): Recall = 63.7% ± 4.0%
v1.1.0 improvement: XGBoost v2 (AUC 0.728) incorporates ground truth pipeline outputs (gt_source, fix_strategy, gt_confidence) as features — the 47 post-GT features account for the majority of AUC gain. Per-question-type thresholds (auto-calibrated) further reduce false positives on CODE and OPINION queries to near-zero.
Key finding: The Diagnostic Jury verdict remains the strongest individual predictor — confirming that the 3-agent jury adds meaningful signal beyond ensemble disagreement alone.
For Full Technical Documentation
See README_files/FIE_COMPLETE_TECHNICAL_STORY.md — covers every algorithm, formula, pipeline decision, benchmark result, and file in detail.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fie_sdk-1.1.0.tar.gz.
File metadata
- Download URL: fie_sdk-1.1.0.tar.gz
- Upload date:
- Size: 36.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
94e83f0aca14c3c8a561223ceddb7bd1ccbeebe6bd4eab1693ec27d31f1696c5
|
|
| MD5 |
8d97a8fcda5db3b112cbb939d514977c
|
|
| BLAKE2b-256 |
a450cbd4f9bbfcc87dd4cfc3696338d49f56883358833e86102bf8d14a5c1210
|
File details
Details for the file fie_sdk-1.1.0-py3-none-any.whl.
File metadata
- Download URL: fie_sdk-1.1.0-py3-none-any.whl
- Upload date:
- Size: 19.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f1be7cb8281ab30cd385821ff23797cd601295240bd4e5f83ad7f3fe882136dd
|
|
| MD5 |
583c07f216778cf33bff1d5872dd1db8
|
|
| BLAKE2b-256 |
83345620911f9a2e03a8411b7d8b9fd9eadd6754f40997da8a18d47e5e5368de
|