Monitor, diagnose, and auto-correct LLM failures — with XGBoost failure classification, question-type routing, auto-calibrating thresholds, Wikidata/Serper ground truth, and production analytics

Failure Intelligence Engine (FIE)

Real-time LLM failure detection, root cause diagnosis, and automatic correction.

What's New in v1.1.0

XGBoost Failure Classifier (AUC 0.728)

The rule-based POET algorithm is now backed by a trained XGBoost model. Every inference runs through the classifier post-pipeline and the high_failure_risk flag is set by the model, not just rules. AUC-ROC improved from 0.663 → 0.728 on TruthfulQA.

Question-Type Routing

FIE now classifies every prompt into one of five types before running the pipeline:

Type	External GT used	Example
`FACTUAL`	Wikidata + Serper	"Who invented the telephone?"
`TEMPORAL`	Serper only	"What is Bitcoin's price today?"
`REASONING`	None (internal only)	"Why does entropy increase?"
`CODE`	None (internal only)	"Write a Python function to sort a list"
`OPINION`	None (skipped entirely)	"Should I use Python or JavaScript?"

This eliminates false positives on code/opinion questions where Wikidata lookups produce wrong "corrections."
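The routing table above can be read as a lookup from question type to the external sources that get consulted. A minimal sketch, assuming hypothetical names (`QuestionType`, `EXTERNAL_SOURCES`, `sources_for`); the real classifier in `engine/question_classifier.py` may be organized differently:

```python
from enum import Enum


class QuestionType(Enum):
    FACTUAL = "FACTUAL"
    TEMPORAL = "TEMPORAL"
    REASONING = "REASONING"
    CODE = "CODE"
    OPINION = "OPINION"


# External ground-truth sources per type, copied from the table above.
# The source names are illustrative strings, not FIE identifiers.
EXTERNAL_SOURCES = {
    QuestionType.FACTUAL: ("wikidata", "serper"),
    QuestionType.TEMPORAL: ("serper",),
    QuestionType.REASONING: (),
    QuestionType.CODE: (),
    QuestionType.OPINION: (),
}


def sources_for(question_type: QuestionType) -> tuple:
    """Return the external sources to consult for a question type."""
    return EXTERNAL_SOURCES[question_type]


print(sources_for(QuestionType.FACTUAL))  # ('wikidata', 'serper')
print(sources_for(QuestionType.CODE))     # ()
```

An empty tuple means no external lookup is attempted, which is how CODE and OPINION prompts avoid spurious Wikidata "corrections".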

Auto-Calibrating Thresholds

XGBoost uses a different decision threshold per question type (FACTUAL=0.40, CODE=0.52, OPINION=0.60, etc.). After every 50 new feedback submissions, the thresholds are recalculated automatically from real labeled data using sklearn.metrics.precision_recall_curve. No manual threshold tuning is required.
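To illustrate what recalibration from labeled feedback can look like, here is a dependency-free sketch. It sweeps candidate thresholds by hand rather than calling scikit-learn's `precision_recall_curve`, and it assumes the selection criterion is maximum F1; the README does not state which criterion FIE uses. The function name `best_f1_threshold` is hypothetical.

```python
def best_f1_threshold(labels, scores, default=0.5):
    """Pick the score threshold that maximizes F1 on labeled feedback.

    labels: 1 = confirmed failure, 0 = confirmed correct answer.
    scores: classifier probabilities in [0, 1], same order as labels.
    """
    positives = sum(labels)
    if positives == 0 or positives == len(labels):
        return default  # both classes are needed to calibrate

    best_threshold, best_f1 = default, -1.0
    for threshold in sorted(set(scores)):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        if tp == 0:
            continue  # F1 is zero or undefined at this threshold
        precision = tp / (tp + fp)
        recall = tp / positives
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold


# Toy feedback batch for one question type.
print(best_f1_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.35
```

Running this once per question type on that type's feedback would yield a per-type threshold table like the one described above.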

Production Analytics (5 New API Endpoints)

See API Endpoints below. Includes daily request volume, XGBoost vs POET agreement, confidence calibration curves (ECE), per-type breakdown, and a single "paper-metrics" endpoint that returns everything needed for the results section of a research paper.
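For readers unfamiliar with ECE (expected calibration error), the standard definition is the weighted average, over confidence bins, of the gap between mean confidence and observed accuracy. The sketch below assumes ten equal-width bins; the binning FIE actually uses is not stated here, and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin_size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Four predictions at 0.9 confidence, three of them correct.
print(round(expected_calibration_error([0.9] * 4, [True, True, True, False]), 2))  # 0.15
```

A lower ECE means the reported confidence tracks real accuracy more closely.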

Opt-In SDK Telemetry

When users set FIE_TELEMETRY=true, the SDK sends anonymized pings (no prompts, no API keys) back to the server after each call. Admins can see field failure rates, SDK version distribution, and question-type breakdown from real users via GET /api/v1/analytics/sdk-telemetry.

MonitorResponse Enhancements

Every /monitor response now includes:

classifier_probability — raw XGBoost score (0–1)
model_version — which model made the decision (xgboost-v2)
config_version — which threshold config was active

FIE sits between your LLM and your users. When the model gives a wrong answer, FIE catches it, finds the correct answer from a trusted source, and returns the correction — before the user ever sees the mistake.

What It Does

LLMs hallucinate. They say "Thomas Edison invented the telephone" with the same confidence as correct answers. There is no built-in signal. The wrong answer simply goes out to the user.

FIE solves this in real time:

Detect — runs the same prompt through 3 independent shadow models, computes an ensemble signal
Diagnose — a jury of 3 specialist agents votes on the root cause (hallucination, injection, temporal cutoff, etc.)
Verify — queries Wikidata or Google Search to find the correct answer
Correct — returns the verified answer to the user instead of the wrong one

The integration is one decorator:

from fie import monitor

# your_llm is a placeholder for your own model call.
@monitor(fie_url="http://localhost:8000", api_key="fie-xxx", mode="correct")
def ask_ai(prompt: str) -> str:
    return your_llm(prompt)

How It Works — Pipeline Overview

User Prompt
     │
     ▼
Your LLM  →  primary answer  →  FIE
                                 │
                     ┌───────────┴────────────┐
                     │                        │
              Phase 1: Shadow Ensemble    Phase 2: FSV
              3 models answer in         Agreement, entropy,
              parallel (Groq)            outlier detection
                     │
                     ▼
              Phase 3: Diagnostic Jury
              3 agents vote on root cause
              (AdversarialSpecialist, LinguisticAuditor, DomainCritic)
                     │
                     ▼
              Phase 4: Ground Truth Pipeline
              Cache → Wikidata → Serper → Shadow consensus
                     │
                     ▼
              Phase 5: Fix Engine
              Return corrected answer to user

Shadow Model Ensemble

Three shadow models from different families run in parallel on every query:

Model	Provider	Why
`llama-3.3-70b-versatile`	Meta	Strong general knowledge
`deepseek-r1-distill-llama-70b`	DeepSeek	Reasoning-focused, different RLHF
`qwen-qwq-32b`	Alibaba	Different pretraining corpus

Different families reduce correlated failure — if one model is wrong, the others are unlikely to make the same mistake.

Each shadow model self-reports its certainty. FIE weights votes by confidence:

Model reports	Vote weight
CONFIDENCE: HIGH	3.0
CONFIDENCE: MEDIUM	2.0
CONFIDENCE: LOW	1.0
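The weighting table can be sketched as a confidence-weighted vote. This example assumes the shadow answers have already been grouped into clusters (plain string labels here; FIE compares answers semantically), and `weighted_majority` is a hypothetical helper name.

```python
from collections import defaultdict

# Vote weights from the table above.
VOTE_WEIGHTS = {"HIGH": 3.0, "MEDIUM": 2.0, "LOW": 1.0}


def weighted_majority(votes):
    """Return (winning_cluster, share_of_total_weight).

    votes: list of (answer_cluster, confidence_label) pairs, one per shadow model.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    totals = defaultdict(float)
    for cluster, confidence in votes:
        totals[cluster] += VOTE_WEIGHTS[confidence]
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())


winner, share = weighted_majority(
    [("bell", "HIGH"), ("bell", "MEDIUM"), ("edison", "LOW")]
)
print(f"{winner} {share:.2f}")  # bell 0.83
```

A single low-confidence dissenter therefore moves the weighted share much less than a high-confidence one would.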

Failure Signal Vector (FSV)

After collecting all 4 answers (1 primary + 3 shadows), FIE computes:

Signal	What it measures
`agreement_score`	Fraction of models that gave the same answer
`fsd_score`	Gap between the top-2 answer clusters
`entropy_score`	Normalized Shannon entropy of the answer distribution
`ensemble_disagreement`	Embedding-based pairwise disagreement flag
`ensemble_similarity`	Cosine similarity between primary and secondary
`high_failure_risk`	Final risk flag (set by XGBoost v1.1+)
`question_type`	Classified prompt type: FACTUAL / TEMPORAL / REASONING / CODE / OPINION

Shannon Entropy

H = -Σ p(x) × log₂(p(x))
H_normalized = H / log₂(total_outputs)

All 4 models agree → entropy = 0.0 (no uncertainty)
All 4 models differ → entropy = 1.0 (maximum uncertainty)
3 agree, 1 differs → entropy ≈ 0.41

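Entropy is used alongside agreement because it separates cases that agreement alone lumps together. A 2-vs-2 split and a 3-vs-1 split both fall below the low-agreement threshold, yet the 2-vs-2 split (raw H = 1.0 bit, normalized entropy = 0.50) is more alarming than the 3-vs-1 split (normalized entropy ≈ 0.41).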
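The normalized entropy formula can be checked with a few lines of Python. This is a sketch under the assumption that answers are already assigned cluster labels; `normalized_entropy` is a hypothetical name, not necessarily the function in `engine/detector/entropy.py`.

```python
import math
from collections import Counter


def normalized_entropy(cluster_labels):
    """Shannon entropy of the answer distribution, divided by log2(number of outputs)."""
    total = len(cluster_labels)
    counts = Counter(cluster_labels)
    if total <= 1 or len(counts) == 1:
        return 0.0  # a single answer or full agreement carries no uncertainty
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy / math.log2(total)


print(round(normalized_entropy(["a", "a", "a", "a"]), 2))  # 0.0
print(round(normalized_entropy(["a", "b", "c", "d"]), 2))  # 1.0
print(round(normalized_entropy(["a", "a", "a", "b"]), 2))  # 0.41
print(round(normalized_entropy(["a", "a", "b", "b"]), 2))  # 0.5 (2-vs-2 split)
```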

Primary-Outlier Detection — POET Algorithm

POET (Primary Outlier Ensemble Test) is the core novel algorithm in FIE and the rule-based source of high_failure_risk; from v1.1 the final flag is set by the XGBoost classifier. POET does not check overall ensemble agreement. It checks specifically whether the primary model is the one disagreeing with the shadow majority:

shadow_agreement = agreement among shadows only (primary excluded)
if shadow_agreement < 0.60 → can't blame primary (shadows confused)
else:
    majority = most common shadow answer cluster
    if primary semantically matches majority → NOT an outlier
    if primary is far from majority (cosine sim < 0.72) → IS an outlier → high_failure_risk = True

This dropped the false positive rate from 80% to 20% compared to threshold-based ensemble agreement.
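The POET pseudocode can be sketched in runnable form as follows. The function name, its signature, and the use of one centroid vector per answer cluster are assumptions made for illustration; only the two thresholds (0.60 and 0.72) come from this README.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_primary_outlier_sketch(primary_vec, shadow_clusters, cluster_centroids,
                              min_shadow_agreement=0.60, cosine_threshold=0.72):
    """Return True when the shadows agree with each other and the primary is far from them.

    shadow_clusters: cluster label of each shadow answer (primary excluded).
    cluster_centroids: mapping from cluster label to an embedding vector.
    """
    if not shadow_clusters:
        return False
    label, count = Counter(shadow_clusters).most_common(1)[0]
    shadow_agreement = count / len(shadow_clusters)
    if shadow_agreement < min_shadow_agreement:
        return False  # shadows are split among themselves; the primary cannot be blamed
    similarity = cosine_similarity(primary_vec, cluster_centroids[label])
    return similarity < cosine_threshold


centroids = {"bell": [1.0, 0.0]}
print(is_primary_outlier_sketch([0.0, 1.0], ["bell", "bell", "bell"], centroids))  # True
print(is_primary_outlier_sketch([1.0, 0.1], ["bell", "bell", "bell"], centroids))  # False
```

The toy 2-dimensional vectors stand in for the 384-dimensional sentence embeddings listed under Key Thresholds.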

Archetype Classification

7 failure archetypes based on the FSV:

Archetype	When it fires
`HALLUCINATION_RISK`	Ensemble disagrees AND high entropy
`OVERCONFIDENT_FAILURE`	High risk but very low entropy (confident but wrong)
`MODEL_BLIND_SPOT`	Systematic knowledge gap in a domain
`UNSTABLE_OUTPUT`	High entropy alone (genuine ambiguity)
`LOW_CONFIDENCE`	Low agreement without high entropy
`RESOURCE_CONSTRAINT`	High latency AND high entropy
`STABLE`	All signals within normal range

Diagnostic Jury

Three agents independently analyze the failure and vote on the root cause:

AdversarialSpecialist — `engine/agents/adversarial_specialist.py`

Detects intentional attacks using 3 detection layers:

Regex — patterns for PROMPT_INJECTION, JAILBREAK_ATTEMPT, INSTRUCTION_OVERRIDE, TOKEN_SMUGGLING
Prompt Guard — statistical heuristic scorer
FAISS semantic search — finds novel attacks similar to known attack vectors

Priority: TOKEN_SMUGGLING > PROMPT_INJECTION > JAILBREAK_ATTEMPT > INSTRUCTION_OVERRIDE

LinguisticAuditor — `engine/agents/linguistic_auditor.py`

Detects structural response problems: excessive hedging, truncation, format inconsistency, length anomalies, repetition loops.

DomainCritic — `engine/agents/domain_critic.py`

Detects factual and temporal failures using 5 weighted layers:

Layer	Weight	What it checks
Contradiction signal	0.40	FSV entropy + agreement vs thresholds
Self-contradiction	0.35	Cosine similarity between primary and secondary
Hedge detection	0.15	Uncertainty phrases in model outputs
Temporal detection	0.10	Time-sensitive keywords in prompt
External verification	0.45	Wikipedia/RAG fact check

Permanent fact guard: Chemical formulas, math identities, and physical constants are never routed to temporal (Serper) verification — they are verified via Wikidata only.

Jury Aggregation

Priority 1: Adversarial verdict (if any agent detected an attack)
Priority 2: Temporal verdict (routes to live search)
Default:    Highest confidence verdict wins
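A minimal sketch of these aggregation rules, assuming a hypothetical `Verdict` record and assuming that ties within a priority level are broken by confidence (the README does not specify this):

```python
from dataclasses import dataclass

# Root causes treated as adversarial, taken from the Root Causes table.
ADVERSARIAL_CAUSES = {
    "PROMPT_INJECTION",
    "JAILBREAK_ATTEMPT",
    "INSTRUCTION_OVERRIDE",
    "TOKEN_SMUGGLING",
}
TEMPORAL_CAUSE = "TEMPORAL_KNOWLEDGE_CUTOFF"


@dataclass
class Verdict:
    agent: str
    root_cause: str
    confidence: float


def aggregate(verdicts):
    """Pick the winning verdict: adversarial first, then temporal, then highest confidence."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    adversarial = [v for v in verdicts if v.root_cause in ADVERSARIAL_CAUSES]
    if adversarial:
        return max(adversarial, key=lambda v: v.confidence)
    temporal = [v for v in verdicts if v.root_cause == TEMPORAL_CAUSE]
    if temporal:
        return max(temporal, key=lambda v: v.confidence)
    return max(verdicts, key=lambda v: v.confidence)


winner = aggregate([
    Verdict("DomainCritic", "FACTUAL_HALLUCINATION", 0.90),
    Verdict("AdversarialSpecialist", "PROMPT_INJECTION", 0.50),
])
print(winner.root_cause)  # PROMPT_INJECTION
```

Note that the adversarial verdict wins here even though its confidence is lower, which is the point of the priority ordering.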

Ground Truth Pipeline

Runs only when both gates pass:

Gate 1: high_failure_risk = True
Gate 2: jury_confidence >= 0.45

Pipeline steps:

1. Cache lookup (MongoDB ground_truth_cache)
   → HIT: return verified answer immediately

2. Permanent fact check
   → chemical formula / math / physics constant → Wikidata only (no Serper)

3. Temporal routing
   → root_cause = TEMPORAL_KNOWLEDGE_CUTOFF → Serper (Google Search)
   → all other root causes → Wikidata

4. Wikidata (SPARQL)
   → Extract claim: subject / property / value
   → Search Wikidata with enriched query
   → contradiction + confidence ≥ 0.75 → OVERRIDE
   → confirmation + confidence ≥ 0.60 → CONFIRM

5. Serper (Google Search)
   → contradicts primary → OVERRIDE with search answer
   → confirms primary → CONFIRM

6. Shadow consensus fallback
   → weighted shadow agreement ≥ 0.60 → use majority shadow answer
   → below 0.60 → ESCALATE to human review

7. Write-through cache
   → verified answer with confidence ≥ 0.90 → saved to cache
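The decision points in the steps above can be sketched as small pure functions. All function names and the `INCONCLUSIVE` label are hypothetical; the thresholds are the ones listed in the steps. The sketch leaves out the actual cache, Wikidata, and Serper calls.

```python
def should_run_ground_truth(high_failure_risk, jury_confidence, min_jury_confidence=0.45):
    """Both gates must pass before the pipeline runs."""
    return bool(high_failure_risk) and jury_confidence >= min_jury_confidence


def choose_verifier(root_cause, is_permanent_fact):
    """Steps 2 and 3: permanent facts never go to Serper; temporal failures do."""
    if is_permanent_fact:
        return "wikidata"
    if root_cause == "TEMPORAL_KNOWLEDGE_CUTOFF":
        return "serper"
    return "wikidata"


def wikidata_decision(contradicts_primary, confidence):
    """Step 4: override or confirm only above the stated confidence levels."""
    if contradicts_primary and confidence >= 0.75:
        return "OVERRIDE"
    if not contradicts_primary and confidence >= 0.60:
        return "CONFIRM"
    return "INCONCLUSIVE"  # hypothetical label: fall through to the next step


def shadow_fallback(weighted_agreement, min_agreement=0.60):
    """Step 6: use the shadow majority if it is strong enough, else escalate."""
    return "SHADOW_CONSENSUS" if weighted_agreement >= min_agreement else "HUMAN_ESCALATION"


def should_cache(confidence, min_cache_confidence=0.90):
    """Step 7: only high-confidence verified answers are written to the cache."""
    return confidence >= min_cache_confidence


print(should_run_ground_truth(True, 0.62))           # True
print(choose_verifier("FACTUAL_HALLUCINATION", False))  # wikidata
print(wikidata_decision(True, 0.85))                 # OVERRIDE
print(should_cache(0.85))                            # False
```

With the values from the telephone example (jury confidence 0.62, Wikidata confidence 0.85), the pipeline runs and overrides the answer, but the result stays below the 0.90 cache-write threshold.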

Fix Strategies

Strategy	When used
`WIKIDATA_OVERRIDE`	Wikidata contradicts the primary answer
`SERPER_OVERRIDE`	Google Search contradicts the primary answer
`SHADOW_CONSENSUS`	External sources exhausted, shadows agree
`SANITIZE_AND_RERUN`	Adversarial attack detected
`CONTEXT_INJECTION`	Temporal failure, search result available
`PROMPT_DECOMPOSITION`	Question too complex
`HUMAN_ESCALATION`	No reliable ground truth, shadow consensus too weak
`NO_FIX`	Output is stable

Root Causes

Root Cause	Meaning
`FACTUAL_HALLUCINATION`	Model stated a wrong fact
`TEMPORAL_KNOWLEDGE_CUTOFF`	Model's training data is outdated
`KNOWLEDGE_BOUNDARY_FAILURE`	Model uncertain at edge of training data
`PROMPT_INJECTION`	User attempting to override system prompt
`JAILBREAK_ATTEMPT`	User attempting to bypass safety guidelines
`INSTRUCTION_OVERRIDE`	User claiming fake authority
`TOKEN_SMUGGLING`	Special model tokens embedded in user input
`PROMPT_COMPLEXITY_OOD`	Question out-of-distribution / too complex

Signal Logging

Every inference is logged to MongoDB signal_logs with 30+ fields including agreement, entropy, archetype, root cause, jury confidence, GT source, fix applied, and latency. Human feedback can be submitted via POST /api/v1/feedback/{request_id} to label signal logs as correct or incorrect — building a labeled dataset for future classifier training.

SDK Modes

# mode="monitor" — async, no latency added
# FIE checks in background, original answer returned immediately
@monitor(fie_url="...", api_key="...", mode="monitor")
def ask_ai(prompt: str) -> str:
    return your_llm(prompt)

# mode="correct" — synchronous, real-time correction
# FIE verifies and returns corrected answer if wrong
@monitor(fie_url="...", api_key="...", mode="correct")
def ask_ai(prompt: str) -> str:
    return your_llm(prompt)

API Endpoints

Method	Path	What it does
POST	`/api/v1/monitor`	Main production endpoint — full pipeline
POST	`/api/v1/diagnose`	Run diagnostic jury on provided outputs
POST	`/api/v1/analyze`	Phase 1 signal extraction only
POST	`/api/v1/feedback/{request_id}`	Submit human feedback on an inference
GET	`/api/v1/inferences`	List stored inferences
GET	`/api/v1/trend`	EMA-based degradation trend
GET	`/api/v1/clusters`	Archetype cluster summary
GET	`/api/v1/monitor/signal-logs`	Raw signal logs (admin)
GET	`/api/v1/monitor/calibration`	Per-confidence-bucket accuracy stats (admin)
GET	`/api/v1/analytics/usage`	Request volume, failure rate, daily breakdown
GET	`/api/v1/analytics/model-performance`	XGBoost accuracy, per-question-type stats
GET	`/api/v1/analytics/calibration`	Confidence calibration curves + ECE score
GET	`/api/v1/analytics/question-breakdown`	Failure/fix/escalation rate per question type
GET	`/api/v1/analytics/paper-metrics`	All research paper metrics in one call
GET	`/api/v1/analytics/sdk-telemetry`	Field usage data from opted-in SDK users
GET	`/health`	Server health check

File Structure

Failure_Intelligence_System/
│
├── app/
│   ├── main.py                    FastAPI app entry point
│   ├── routes.py                  All API endpoints
│   ├── schemas.py                 Pydantic schemas (FSV, JuryVerdict, FixResult)
│   ├── auth.py / auth_guard.py    API key authentication + tenant isolation
│   └── auth_routes.py             Google OAuth routes
│
├── engine/
│   ├── groq_service.py            Shadow model fan-out + confidence weighting
│   ├── encoder.py                 SentenceTransformer singleton (all-MiniLM-L6-v2)
│   ├── fix_engine.py              Fix strategy selection and execution
│   ├── claim_extractor.py         Extract subject/property/value from model output
│   ├── prompt_guard.py            Statistical adversarial prompt scorer
│   ├── rag_grounder.py            Wikipedia RAG for external verification
│   ├── question_classifier.py     Rule-based question-type classifier (5 types)
│   ├── fie_config.py              Auto-calibrating thresholds + MongoDB-backed config
│   ├── failure_classifier.py      XGBoost v2 failure classifier (AUC 0.728)
│   │
│   ├── detector/
│   │   ├── consistency.py         compute_consistency(), is_primary_outlier()
│   │   ├── entropy.py             Shannon entropy computation
│   │   ├── ensemble.py            Pairwise embedding disagreement
│   │   └── embedding.py           compute_embedding_distance()
│   │
│   ├── archetypes/
│   │   ├── labeling.py            7-archetype classification rules
│   │   ├── clustering.py          Adaptive archetype cluster registry
│   │   └── registry.py            FAISS index for adversarial pattern search
│   │
│   ├── agents/
│   │   ├── base_agent.py          BaseJuryAgent, DiagnosticContext
│   │   ├── failure_agent.py       DiagnosticJury + FailureAgent singletons
│   │   ├── adversarial_specialist.py  3-layer adversarial attack detection
│   │   ├── domain_critic.py       5-layer factual/temporal failure detection
│   │   └── linguistic_auditor.py  Response structure and quality analysis
│   │
│   ├── verifier/
│   │   ├── ground_truth_pipeline.py  GT pipeline orchestrator
│   │   ├── wikidata_verifier.py      SPARQL queries against Wikidata
│   │   └── serper_verifier.py        Google Search via Serper.dev
│   │
│   ├── evolution/
│   │   └── tracker.py             EMA-based model degradation tracking
│   │
│   └── explainability/
│       └── explanation_builder.py Human-readable XAI explanation builder
│
├── fie/                           Python SDK (pip install fie-sdk)
│   ├── monitor.py                 @monitor decorator
│   ├── client.py                  HTTP client for FIE server
│   └── config.py                  FIEConfig
│
├── storage/
│   ├── database.py                MongoDB connection + inference CRUD
│   ├── signal_logger.py           30-field signal logging + feedback wiring
│   └── ground_truth_cache.py      Verified answer cache (write-through)
│
├── Frontend/                      React dashboard (Vite)
│
├── data/
│   ├── download_datasets.py       TruthfulQA download (817 examples)
│   └── synthetic_generator.py     Synthetic failure data generator
│
├── config.py                      Settings (thresholds, model names, flags)
├── test_local.py                  Group A/B recall + FPR benchmark test
├── test_ground_truth.py           Ground truth pipeline isolation test
├── demo.py                        Interactive demo (chatbot with FIE)
└── FIE_COMPLETE_TECHNICAL_STORY.md  Full technical documentation

Local Setup

Requirements

Python 3.11+
MongoDB Atlas URI
Groq API key (free at console.groq.com)
Node.js 18+ (for dashboard only)

1. Backend

git clone https://github.com/AyushSingh110/Failure_Intelligence_System.git
cd Failure_Intelligence_System
python -m venv .venv
.venv\Scripts\activate       # Windows
# source .venv/bin/activate  # macOS/Linux
pip install -r requirements.txt

2. Environment

Create .env in the project root:

MONGODB_URI=your_mongodb_atlas_uri
MONGODB_DB_NAME=fie_database

GROQ_API_KEY=gsk_your_groq_key
GROQ_ENABLED=true

WIKIDATA_ENABLED=true
GROUND_TRUTH_CACHE_ENABLED=true

# Optional — needed for temporal question verification
SERPER_API_KEY=your_serper_key
SERPER_ENABLED=true

OLLAMA_ENABLED=false

GOOGLE_CLIENT_ID=your-google-client-id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-google-client-secret
GOOGLE_REDIRECT_URI=http://localhost:5173

JWT_SECRET_KEY=replace-with-a-long-random-secret
JWT_ALGORITHM=HS256
JWT_EXPIRE_HOURS=24
ADMIN_EMAIL=your-admin-email@example.com

3. Start Server

uvicorn app.main:app --reload
# Server: http://localhost:8000
# API docs: http://localhost:8000/docs

4. Dashboard (optional)

cd Frontend
npm install
npm run dev
# Dashboard: http://localhost:5173

5. Run Demo

python demo.py

6. Run Tests

# Full recall + FPR benchmark
python test_local.py

# Ground truth pipeline isolation
python test_ground_truth.py

# Backend unit tests
pytest

Required APIs

Service	Required	Purpose	Free tier
Groq	Yes	Shadow models	14,400 req/day per model
MongoDB Atlas	Yes	Storage	512MB free
Wikidata	Yes	Factual verification	No key needed
Serper.dev	Optional	Temporal verification	2,500 searches/month

Example Request

curl -X POST http://localhost:8000/api/v1/monitor \
  -H "Content-Type: application/json" \
  -H "X-API-Key: fie-your-key" \
  -d '{
    "prompt": "Who invented the telephone?",
    "primary_output": "Thomas Edison invented the telephone.",
    "primary_model_name": "gpt-4",
    "run_full_jury": true
  }'

Example response (trimmed):

{
  "high_failure_risk": true,
  "archetype": "MODEL_BLIND_SPOT",
  "failure_signal_vector": {
    "agreement_score": 0.75,
    "entropy_score": 0.406,
    "high_failure_risk": true
  },
  "jury": {
    "primary_verdict": {
      "root_cause": "FACTUAL_HALLUCINATION",
      "confidence_score": 0.62
    }
  },
  "ground_truth": {
    "verified_answer": "Alexander Graham Bell",
    "confidence": 0.85,
    "source": "wikidata",
    "from_cache": false
  },
  "fix_result": {
    "fix_applied": true,
    "fix_strategy": "WIKIDATA_OVERRIDE",
    "fixed_output": "Alexander Graham Bell",
    "original_output": "Thomas Edison"
  }
}
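The same request can be built from Python with only the standard library. This is a sketch: the helper names are hypothetical, the request fields and response keys are taken from the curl example and the trimmed response above, and falling back to the original output when no fix is applied is an assumption about how a caller would use the response.

```python
import json
import urllib.request


def build_monitor_request(base_url, api_key, prompt, primary_output,
                          primary_model_name, run_full_jury=True):
    """Build (but do not send) a POST request for /api/v1/monitor."""
    payload = {
        "prompt": prompt,
        "primary_output": primary_output,
        "primary_model_name": primary_model_name,
        "run_full_jury": run_full_jury,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/monitor",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON body (needs a running FIE server)."""
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


def final_answer(response, original_output):
    """Return the corrected output if a fix was applied, otherwise the original."""
    fix = response.get("fix_result") or {}
    if fix.get("fix_applied") and fix.get("fixed_output"):
        return fix["fixed_output"]
    return original_output


request = build_monitor_request(
    "http://localhost:8000", "fie-your-key",
    "Who invented the telephone?",
    "Thomas Edison invented the telephone.",
    "gpt-4",
)
print(request.full_url)  # http://localhost:8000/api/v1/monitor
```

Calling `send(request)` against a local server and passing the result to `final_answer` would yield "Alexander Graham Bell" for a response like the one shown above.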

Key Thresholds

Parameter	Value	File
High entropy threshold	0.75	`config.py`
Low agreement threshold	0.80	`config.py`
Primary-outlier cosine threshold	0.72	`engine/detector/consistency.py`
Shadow agreement minimum	0.60	`engine/detector/consistency.py`
GT Gate — jury confidence minimum	0.45	`app/routes.py`
Wikidata override confidence	0.75	`engine/verifier/ground_truth_pipeline.py`
Cache write confidence	0.90	`engine/verifier/ground_truth_pipeline.py`
Shadow consensus minimum	0.60	`engine/verifier/ground_truth_pipeline.py`
Embedding dimensions	384	`engine/encoder.py`

Technology Stack

Backend: FastAPI, Pydantic, Python 3.11
Failure Classifier: XGBoost v2 with auto-calibrating per-type thresholds
Question Routing: Rule-based classifier (5 types: FACTUAL/TEMPORAL/REASONING/CODE/OPINION)
Storage: MongoDB Atlas
Shadow Models: Groq API (Llama, DeepSeek, Qwen)
Semantic Encoder: SentenceTransformers all-MiniLM-L6-v2
Vector Search: FAISS
Fact Verification: Wikidata SPARQL, Serper.dev
Frontend: React, Vite
Auth: Google OAuth, JWT
SDK Telemetry: Opt-in anonymized usage pings (no PII)
Deployment: Docker, Google Cloud Run, Vercel

Benchmark Results

Evaluated on TruthfulQA (817 adversarial questions designed to trigger LLM hallucinations). The synthetic pipeline generated 869 labeled examples from it.

Method	Recall	FPR	F1	AUC-ROC
POET rule-based (baseline)	56.4%	38.7%	58.7%	—
XGBoost v1 (equal FPR)	65.5%	40.2%	63.7%	0.663
XGBoost v1 (best F1)	80.5%	50.6%	69.7%	0.663
XGBoost v2 with GT features	—	—	—	0.728

Cross-validation (5-fold): Recall = 63.7% ± 4.0%

v1.1.0 improvement: XGBoost v2 (AUC 0.728) incorporates ground truth pipeline outputs (gt_source, fix_strategy, gt_confidence) as features — the 47 post-GT features account for the majority of AUC gain. Per-question-type thresholds (auto-calibrated) further reduce false positives on CODE and OPINION queries to near-zero.

Key finding: The Diagnostic Jury verdict remains the strongest individual predictor — confirming that the 3-agent jury adds meaningful signal beyond ensemble disagreement alone.

For Full Technical Documentation

See README_files/FIE_COMPLETE_TECHNICAL_STORY.md — covers every algorithm, formula, pipeline decision, benchmark result, and file in detail.

License

MIT
