Production AI pipeline monitoring — root cause detection, anomaly alerts, regression guard
Failure Forensics
Production AI pipeline monitoring — root cause detection, anomaly alerts, regression guard, and Gemini-powered recommendations.
Installation
pip install failure-forensics
Quick Start
from failure_forensics import trace


@trace(step="retrieval", version="v1")
def my_retrieval_function(query):
    # your code here
    pass
Features 🔬
A self-hosted, zero-cost LLM pipeline observability tool that gives you root cause detection, anomaly alerts, A/B reporting, and a live terminal dashboard — without sending your data to any third-party service.
🆚 Why Not LangSmith or Braintrust?
| | Failure Forensics | LangSmith | Braintrust |
|---|---|---|---|
| Cost | Free | Paid tiers | Paid tiers |
| Data privacy | Stays on your machine | Sent to cloud | Sent to cloud |
| Customization | Full control | Limited | Limited |
| Slack alerts | Built-in | Premium only | Premium only |
| A/B reporting | Built-in | Basic | Basic |
| Circuit breaker / trend | Built-in | ❌ | ❌ |
Failure Forensics is designed for teams who need production-grade observability without vendor lock-in.
✨ What It Does
Every pipeline run passes through a structured logging and analysis layer:
Pipeline Step → logger.py → requests.jsonl
                                 ↓
                ┌────────────────┴────────────────┐
                │                                 │
          forensics.py                       pattern.py
     (root cause detection)           (time series + anomaly)
                │                                 │
          versioning.py                      baseline.py
     (v1 vs v2 comparison)            (7-day moving average)
                │                                 │
          ab_report.py                       alerts.py
     (A/B comparison table)           (Slack / console alert)
                └────────────────┬────────────────┘
                                 ↓
                           dashboard.py
                    (ASCII terminal dashboard)
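The logging layer at the top of the diagram can be pictured as a decorator that appends one JSON record per step call. The following is a minimal sketch, not the package's implementation: `trace_sketch`, the `log_path` parameter, and the record fields (`ts`, `status`, `error`, `duration_ms`) are all assumptions made for illustration.

```python
import functools
import json
import time
from pathlib import Path


def trace_sketch(step, version, log_path="requests.jsonl"):
    """Hypothetical stand-in for a tracing decorator (not the library's code)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumed record layout: one JSON object per step invocation.
            record = {"step": step, "version": version, "ts": time.time()}
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                record["status"] = "ok"
                return result
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise  # re-raise so the pipeline still sees the failure
            finally:
                elapsed = time.perf_counter() - start
                record["duration_ms"] = round(elapsed * 1000, 2)
                # Append-only JSONL: one line per record.
                with Path(log_path).open("a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record) + "\n")

        return wrapper

    return decorator
```

Because each record is a single line, downstream analysis can stream the file line by line without loading it whole.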
📁 Project Structure
failure-forensics/
├── src/
│   ├── logger.py          # Logs every pipeline step to JSONL
│   ├── forensics.py       # Root cause detection (5 categories)
│   ├── pattern.py         # Time-series failure rate + anomaly detection
│   ├── baseline.py        # 7-day moving average + trend (IMPROVING/STABLE/DEGRADING)
│   ├── alerts.py          # Slack webhook + console alerts
│   ├── versioning.py      # Per-version failure rate stats
│   ├── ab_report.py       # A/B comparison report (table + JSON)
│   └── dashboard.py       # ASCII bar chart terminal dashboard
├── data/
│   └── logs/
│       └── requests.jsonl # All pipeline logs (gitignored)
├── tests/
│   └── test_forensics.py  # 8 unit tests
├── config.py              # Thresholds, Slack URL, step limits
├── main.py                # 5-scenario demo runner
├── simulate.py            # Realistic test data generator (100 runs, anomaly day)
└── requirements.txt
🚀 Getting Started
1. Clone & Install
git clone https://github.com/jasstt/failure-forensics.git
cd failure-forensics
pip install -r requirements.txt
2. (Optional) Configure Slack Alerts
Edit config.py:
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
If left empty, all alerts print to the console.
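The console fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the code in alerts.py: `send_alert` is a hypothetical name, and the sketch uses the standard library's `urllib.request` in place of `requests` so it has no dependencies. The `{"text": ...}` payload is the standard body for Slack incoming webhooks.

```python
import json
import urllib.request


def send_alert(message, webhook_url=""):
    """Send to Slack if a webhook URL is set, otherwise print to the console.

    Hypothetical helper; returns the channel used ("console" or "slack").
    """
    if not webhook_url:
        print(f"[ALERT] {message}")
        return "console"

    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # A short timeout keeps a slow webhook from stalling the pipeline.
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()
    return "slack"
```

Calling `send_alert("failure rate above threshold")` with no URL prints the message locally, which mirrors the empty `SLACK_WEBHOOK_URL` behaviour.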
3. Run the Full Demo
python main.py
This runs 5 scenarios:
- Simulation: generates 100 realistic pipeline runs (2 prompt versions, anomaly day)
- Root cause analysis: detects the failing step and assigns a category
- 7-day pattern report: failure rate per day + step breakdown + anomaly check
- A/B report: `prompt_v1` vs `prompt_v2` with per-step improvement table
- Terminal dashboard: live ASCII bar charts, trend, top 5 failed runs
4. Run Unit Tests
python tests/test_forensics.py
python tests/test_advanced.py
🚀 Advanced Features (New in v2)
| Layer | Feature | Technology |
|---|---|---|
| 1 | Automatic recommendation engine | Rule-based |
| 2 | AI-assisted failure analysis | Gemini 2.5 Pro |
| 3 | Automatic eval set expansion | Frequency analysis |
| 4 | Prompt optimization explanation | Gemini 2.5 Pro |
| 5 | Regression guard | Baseline comparison |
Scenario 6: Regression Guard
Runs an automatic regression check before a new prompt (v3) is deployed:
REGRESSION CHECK — v3
Baseline (v2): 11.0% failure rate
New (v3): 24.5% failure rate
Delta: +13.5pp → REGRESSION_DETECTED ❌
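The arithmetic behind this check is a difference of two failure rates, expressed in percentage points. Below is a minimal sketch of such a guard. The verdict labels (`PASS`, `WARNING`, `REGRESSION_DETECTED`) appear in this README, but the function name and the `warn_pp` and `fail_pp` cut-offs are assumptions, since the actual thresholds are not documented here.

```python
def check_regression(baseline_rate, candidate_rate, warn_pp=5.0, fail_pp=10.0):
    """Compare a candidate's failure rate to the baseline.

    Rates are fractions (0.11 means 11%). warn_pp and fail_pp are assumed
    cut-offs in percentage points, chosen only for this illustration.
    """
    delta_pp = round((candidate_rate - baseline_rate) * 100, 1)
    if delta_pp >= fail_pp:
        verdict = "REGRESSION_DETECTED"
    elif delta_pp >= warn_pp:
        verdict = "WARNING"
    else:
        verdict = "PASS"
    return delta_pp, verdict


# Baseline v2 at 11.0% vs candidate v3 at 24.5%, as in the output above.
delta, verdict = check_regression(0.110, 0.245)
print(f"Delta: {delta:+.1f}pp -> {verdict}")
```

Working in percentage points rather than relative change keeps the guard stable when the baseline rate is already low.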
Test Results
| Layer | Test | Result |
|---|---|---|
| 1 — Recommender | Category → recommendation mapping | ✅ PASS |
| 2 — LLM Analyzer | Gemini fallback | ✅ PASS |
| 3 — Eval Collector | Duplicate prevention | ✅ PASS |
| 4 — Prompt Optimizer | A/B explanation (v2: +10pp) | ✅ PASS |
| 5 — Regression Guard | DETECTED + PASS scenarios | ✅ PASS |
Key Results
- A/B: prompt_v2 improves on v1 by 10pp
- Regression Guard: blocked the v3 deploy as a WARNING with a +6pp delta
- Eval Collector: 5 new eval candidates collected automatically
- LLM Analyzer: falls back cleanly to the rule-based engine when Gemini is unavailable
📊 Results
| Feature | Result |
|---|---|
| Unit Tests | 8/8 PASS ✅ |
| Root cause categories | 5 types (RETRIEVAL_QUALITY, RERANKER_FAILURE, LLM_HALLUCINATION, CITATION_MISS, API_ERROR) |
| Anomaly detection | 20pp delta threshold: flags when today's failure rate exceeds the 7-day average by more than 20 percentage points |
| A/B comparison | v2: 11.5pp improvement over v1 (22.5% → 11.0% failure rate) |
| Trend analysis | IMPROVING / STABLE / DEGRADING based on 7-day moving average |
| Slack integration | Webhook ready — fires on rate threshold, anomaly, or 3 consecutive failures |
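The A/B row above reduces to a per-version failure rate and a difference in percentage points. A rough sketch of that calculation follows. The run record layout (`version` and `failed` keys) and both function names are assumptions for illustration, not the schema used by versioning.py or ab_report.py.

```python
from collections import defaultdict


def failure_rate_by_version(runs):
    """Return {version: failure rate} from run records (assumed layout)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in runs:
        totals[run["version"]] += 1
        if run["failed"]:
            failures[run["version"]] += 1
    return {version: failures[version] / totals[version] for version in totals}


def improvement_pp(rates, old, new):
    """Drop in failure rate from `old` to `new`, in percentage points."""
    return round((rates[old] - rates[new]) * 100, 1)


# Synthetic data shaped to match the rates quoted in the table (22.5% and 11.0%).
runs = (
    [{"version": "prompt_v1", "failed": i < 45} for i in range(200)]
    + [{"version": "prompt_v2", "failed": i < 22} for i in range(200)]
)
rates = failure_rate_by_version(runs)
print(improvement_pp(rates, "prompt_v1", "prompt_v2"))
```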
⚙️ Configuration (config.py)
| Parameter | Default | Description |
|---|---|---|
| `FAILURE_RATE_THRESHOLD` | `0.25` | Alert fires above this failure rate |
| `ANOMALY_THRESHOLD` | `0.20` | Flag if today exceeds 7-day avg by this delta |
| `SLACK_WEBHOOK_URL` | `""` | Empty = console output |
| `CONSECUTIVE_FAILURE_THRESHOLD` | `3` | Alert after N consecutive step failures |
| `STEP_THRESHOLDS` | see config | Per-step max acceptable failure rate |
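To make the three numeric thresholds concrete, here is a small sketch of how they might combine into alert decisions. The constant names and defaults come from the table above. The function names, the reason labels, and the input shapes (a list of daily rates and a list of per-run failure flags) are assumptions, not the logic in pattern.py or alerts.py.

```python
FAILURE_RATE_THRESHOLD = 0.25
ANOMALY_THRESHOLD = 0.20
CONSECUTIVE_FAILURE_THRESHOLD = 3


def is_anomaly(today_rate, last_7_rates):
    """True when today's rate exceeds the 7-day average by more than the delta."""
    if not last_7_rates:
        return False  # no history yet, so nothing to compare against
    average = sum(last_7_rates) / len(last_7_rates)
    return (today_rate - average) > ANOMALY_THRESHOLD


def max_consecutive_failures(failed_flags):
    """Length of the longest run of True values in a list of failure flags."""
    longest = current = 0
    for failed in failed_flags:
        current = current + 1 if failed else 0
        longest = max(longest, current)
    return longest


def alert_reasons(today_rate, last_7_rates, failed_flags):
    """Collect every alert condition that currently holds (labels are assumed)."""
    reasons = []
    if today_rate > FAILURE_RATE_THRESHOLD:
        reasons.append("rate_threshold")
    if is_anomaly(today_rate, last_7_rates):
        reasons.append("anomaly")
    if max_consecutive_failures(failed_flags) >= CONSECUTIVE_FAILURE_THRESHOLD:
        reasons.append("consecutive_failures")
    return reasons
```

Note that the anomaly check is relative to recent history while the rate threshold is absolute, so a pipeline with a steadily high failure rate triggers the second but not the first.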
🧪 Root Cause Categories
| Category | Trigger |
|---|---|
| `RETRIEVAL_QUALITY` | Retrieval step fails — no results, low score |
| `RERANKER_FAILURE` | Reranker can't parse LLM response or times out |
| `LLM_HALLUCINATION` | Generation returns empty or uncited response |
| `CITATION_MISS` | Answer produced but no source citations found |
| `API_ERROR` | Timeout, 429 rate limit, 503 service unavailable |
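One way to read the table is as a small rule set keyed on the failing step and the error signal. The sketch below is a simplified guess at that mapping, not the rules in forensics.py. The five category names are from the table; the record fields (`step`, `error`, `http_status`), the rule ordering, and the `UNCLASSIFIED` fallback are assumptions.

```python
# Category names come from the table above; the rest is assumed.
STEP_TO_CATEGORY = {
    "retrieval": "RETRIEVAL_QUALITY",
    "reranking": "RERANKER_FAILURE",
    "generation": "LLM_HALLUCINATION",
    "citation": "CITATION_MISS",
}
API_STATUS_CODES = {429, 503}


def classify_failure(record):
    """Map a failed step record to a root cause category (illustrative rules)."""
    # Rate limits and service outages are API errors regardless of step.
    if record.get("http_status") in API_STATUS_CODES:
        return "API_ERROR"
    step = record.get("step")
    # Per the table, a reranker timeout counts as a reranker failure;
    # timeouts on any other step are treated here as API errors.
    if record.get("error") == "timeout" and step != "reranking":
        return "API_ERROR"
    # UNCLASSIFIED is a hypothetical fallback, not one of the five categories.
    return STEP_TO_CATEGORY.get(step, "UNCLASSIFIED")
```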
📈 Terminal Dashboard (Sample Output)
═════════════════════════════════════════════════════════════
🔬 FAILURE FORENSICS — Terminal Dashboard
═════════════════════════════════════════════════════════════
📅 FAILURE RATE CHART, LAST 7 DAYS
2026-06-03 [███░░░░░░░░░░░░░░░░░░░░░░░░░░░] 13.0%
2026-06-07 [████████░░░░░░░░░░░░░░░░░░░░░░] 27.3% ⚠️
2026-06-10 [███░░░░░░░░░░░░░░░░░░░░░░░░░░░] 12.0%
🔍 FAILURE DISTRIBUTION BY STEP
retrieval [███████░░░░░░░░░░░░░] 38.0% (38/100 failed)
reranking [██░░░░░░░░░░░░░░░░░░] 13.0% (13/100 failed)
generation [██░░░░░░░░░░░░░░░░░░] 10.0% (10/100 failed)
citation [█░░░░░░░░░░░░░░░░░░░] 6.0% (6/100 failed)
⚡ ANOMALY: ✅ Normal: Today (12.0%) ≈ 7d avg (16.2%)
📊 TREND: ➡️ STABLE - Moving Avg: 16.0%
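The bars in the sample output are consistent with filling `int(rate * width)` cells, using a width of 30 for the daily chart and 20 for the step breakdown. A minimal sketch of that rendering follows; `render_bar` and `render_row` are hypothetical names, and truncation (rather than rounding) is inferred from the sample, not confirmed from dashboard.py.

```python
def render_bar(rate, width=30):
    """Render a failure rate (0.0 to 1.0) as a fixed-width ASCII bar."""
    rate = min(max(rate, 0.0), 1.0)  # clamp so the bar never overflows
    filled = int(rate * width)       # truncate, as the sample output suggests
    return "[" + "█" * filled + "░" * (width - filled) + "]"


def render_row(label, rate, width=30):
    """One dashboard line: label, bar, and the rate as a percentage."""
    return f"{label} {render_bar(rate, width)} {rate * 100:.1f}%"


print(render_row("2026-06-03", 0.13))
print(render_row("retrieval", 0.38, width=20))
```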
🛠 Technologies Used
- Python standard library: `json`, `collections`, `datetime`, `threading`
- requests: Slack webhook HTTP calls
- python-dotenv: environment variable management
No heavy dependencies. No cloud. No API keys required.
🔭 Roadmap
- FastAPI REST endpoint for remote log ingestion
- HTML report export
- PostgreSQL backend for large-scale log storage
- Multi-pipeline support (compare RAG vs fine-tuned model)
- Email alerts as alternative to Slack