fitz-gov: Comprehensive RAG Governance Benchmark
fitz-gov is a benchmark for evaluating RAG system governance - the ability to know when to abstain, dispute, qualify, or confidently answer questions.
Why fitz-gov?
Most RAG benchmarks focus on retrieval quality (BEIR) or answer correctness (RAGAS). But real-world RAG systems need epistemic honesty - knowing what they don't know.
fitz-gov measures:
| Category | What it Tests | Maps to |
|---|---|---|
| Abstention | Refuses when context is insufficient | ABSTAIN mode |
| Dispute | Flags conflicting sources | DISPUTED mode |
| Qualification | Hedges uncertain claims | QUALIFIED mode |
| Confidence | Answers confidently when evidence is clear | CONFIDENT mode |
| Grounding | Answers are grounded in context (no hallucination) | Answer quality |
| Relevance | Answers address the actual question | Answer quality |
Installation
pip install fitz-gov
Or install from local path during development:
pip install -e path/to/fitz-gov
Quick Start
Tiered Evaluation (Recommended)
fitz-gov uses a two-tier evaluation system:
- Tier 0 (Sanity): 60 easy cases with 95% pass threshold - gates Tier 1
- Tier 1 (Core): 271 discriminative cases with gradient scoring
from fitz_gov import FitzGovEvaluator, load_tier, Tier, AnswerMode
# Load tiered cases
tier0_cases = load_tier(Tier.SANITY) # 60 cases
tier1_cases = load_tier(Tier.CORE) # 271 cases
# Your RAG system generates responses and modes for each tier
tier0_responses, tier0_modes = your_rag_system.evaluate(tier0_cases)
tier1_responses, tier1_modes = your_rag_system.evaluate(tier1_cases)
# Run tiered evaluation
evaluator = FitzGovEvaluator()
result = evaluator.evaluate_tiered(
    tier0_cases, tier0_responses, tier0_modes,
    tier1_cases, tier1_responses, tier1_modes,
)
print(result)
# fitz-gov Tiered Evaluation
# ==========================
#
# TIER 0 (Sanity Check): PASSED
# Threshold: 95% | Achieved: 98.3% (59/60)
#
# TIER 1 (Core Benchmark): 78.1%
# By Category:
# abstention: 26/30 (86.7%)
# dispute: 22/30 (73.3%)
# ...
#
# Summary: Tier 0 PASSED, Tier 1 Score: 78.1%
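The gating behaviour described above can be sketched independently of the package. The following is a minimal illustration, not the library's implementation: `TieredSummary`, `pass_rate`, and `summarize_tiers` are hypothetical names, and Tier 1 is scored here as a plain pass rate, whereas the real benchmark describes its Tier 1 scoring as "gradient scoring".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TieredSummary:
    """Hypothetical result container; not the package's TieredResult."""
    tier0_pass_rate: float
    tier0_passed: bool
    tier1_score: Optional[float]  # None when Tier 1 was skipped by gating


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty sequence."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def summarize_tiers(
    tier0_outcomes: Sequence[bool],
    tier1_outcomes: Sequence[bool],
    tier0_threshold: float = 0.95,
    gating_enabled: bool = True,
) -> TieredSummary:
    """Score Tier 1 only if Tier 0 clears the threshold (when gating is on)."""
    rate0 = pass_rate(tier0_outcomes)
    passed0 = rate0 >= tier0_threshold
    if gating_enabled and not passed0:
        return TieredSummary(rate0, passed0, None)
    return TieredSummary(rate0, passed0, pass_rate(tier1_outcomes))


# 59 of 60 sanity cases pass, as in the sample output above
summary = summarize_tiers([True] * 59 + [False], [True] * 3 + [False])
print(f"Tier 0: {summary.tier0_pass_rate:.1%}, passed={summary.tier0_passed}")
```

With gating enabled, a Tier 0 pass rate below the threshold leaves the Tier 1 score unset, so a system that fails the sanity tier does not report a misleading core score.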
With Fitz RAG Engine
from fitz_ai.evaluation.benchmarks import FitzGovBenchmark
# Create benchmark and evaluate your engine
benchmark = FitzGovBenchmark()
results = benchmark.evaluate(engine)
print(results)
Standalone Usage (Any RAG System)
The fitz-gov package contains all evaluation logic, so any RAG system can be evaluated:
from fitz_gov import FitzGovEvaluator, load_cases, FitzGovCategory, AnswerMode
# Load test cases
cases = load_cases()
# Create evaluator
evaluator = FitzGovEvaluator()
# Evaluate your RAG system's responses
responses = []
modes = []
for case in cases:
    # Your RAG system generates response
    response = your_rag_system.query(case.query, case.contexts)
    mode = your_rag_system.classify_mode(response)  # Your mode classification
    responses.append(response)
    modes.append(mode)
# Get comprehensive results
results = evaluator.evaluate_all(cases, responses, modes)
print(f"Overall accuracy: {results.overall_accuracy:.1%}")
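To make the idea of mode accuracy concrete, here is a small sketch of how expected and predicted modes could be compared and tallied into a confusion matrix. It is an illustration only: `mode_confusion` and `accuracy` are hypothetical helpers, the mode strings other than `"abstain"` are assumed, and the package's own scoring may also take answer-quality checks into account.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def mode_confusion(
    expected: Sequence[str], predicted: Sequence[str]
) -> Dict[Tuple[str, str], int]:
    """Count (expected, predicted) mode pairs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return dict(Counter(zip(expected, predicted)))


def accuracy(expected: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted mode equals the expected one."""
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected) if expected else 0.0


# Mode strings are assumptions for illustration
expected = ["abstain", "disputed", "qualified", "confident"]
predicted = ["abstain", "confident", "qualified", "confident"]

matrix = mode_confusion(expected, predicted)
print(f"Mode accuracy: {accuracy(expected, predicted):.1%}")
for (exp, pred), count in sorted(matrix.items()):
    print(f"expected={exp:<10} predicted={pred:<10} count={count}")
```

Off-diagonal entries such as `("disputed", "confident")` show which governance failure occurred, which a single accuracy number hides.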
Evaluating Individual Cases
from fitz_gov import AnswerMode, FitzGovEvaluator, load_case_by_id
evaluator = FitzGovEvaluator()
# Load specific test case
case = load_case_by_id("abstain_001")
# Your system's response
response = "Based on the context provided, I cannot find information about..."
mode = AnswerMode.ABSTAIN
# Evaluate
result = evaluator.evaluate_case(case, response, mode)
print(f"Passed: {result.passed}")
print(f"Expected: {case.expected_mode.value}, Got: {mode.value}")
Two-Pass Validation (Answer Quality Categories)
For grounding and relevance categories, fitz-gov uses two-pass validation to reduce false positives:
- Regex pass: Fast pattern matching catches obvious violations
- LLM pass: Semantic validation for flagged cases
Enable LLM Validation
from fitz_gov import FitzGovEvaluator
# Enable LLM validation with local Ollama
evaluator = FitzGovEvaluator(
    llm_validation=True,
    llm_model="qwen2.5:14b",  # or any Ollama model
    llm_base_url="http://localhost:11434"
)
# Responses flagged by regex are sent to LLM for semantic check
results = evaluator.evaluate_all(cases, responses, modes)
Validation Flow
Response contains forbidden_claim pattern?
    │
    ├─ No → PASS (no hallucination detected)
    │
    └─ Yes → LLM validates: "Is this an actual hallucination?"
        │
        ├─ LLM says no (e.g., "no revenue mentioned") → PASS
        │
        └─ LLM says yes (fabricated specific value) → FAIL
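The flow above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the package's code: `two_pass_check` is a hypothetical function, and `stub_validator` is a keyword-based stand-in for the LLM call so the example runs without Ollama.

```python
import re
from typing import Callable, Iterable


def two_pass_check(
    response: str,
    forbidden_claims: Iterable[str],
    llm_says_hallucination: Callable[[str, str], bool],
) -> bool:
    """Return True if the response passes, False if it fails.

    Pass 1: regex search for each forbidden pattern.
    Pass 2: only for matches, ask the (hypothetical) validator callback.
    """
    for pattern in forbidden_claims:
        match = re.search(pattern, response)
        if match is None:
            continue  # nothing flagged by this pattern
        if llm_says_hallucination(response, match.group(0)):
            return False  # validator confirmed a fabricated claim
    return True


def stub_validator(response: str, matched_text: str) -> bool:
    """Stand-in for an LLM call: treats negated mentions as harmless."""
    return ("no " + matched_text.lower()) not in response.lower()


patterns = [r"revenue", r"\$\d"]
for text in (
    "The company was founded in 2010.",
    "There is no revenue mentioned in the context.",
    "The revenue was $5 million.",
):
    verdict = "PASS" if two_pass_check(text, patterns, stub_validator) else "FAIL"
    print(f"{verdict}: {text}")
```

The second pass only runs on responses the regex flagged, so the expensive semantic check is skipped for the common case where no forbidden pattern appears.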
Caching
LLM validation results are cached for 7 days to speed up repeated evaluations:
- Cache location: ~/.cache/fitz_gov/
- Automatic cache cleanup on startup
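For readers curious how a time-limited validation cache of this kind might work, here is a rough sketch. Everything about the layout is an assumption: the one-JSON-file-per-key format, the SHA-256 key hashing, and the `cache_put` / `cache_get` helpers are hypothetical and are not taken from the package. Only the 7-day retention comes from the description above.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Optional

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, matching the retention stated above


def _entry_path(cache_dir: Path, key: str) -> Path:
    """Map a cache key to a file name via its SHA-256 digest (assumed layout)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.json"


def cache_put(cache_dir: Path, key: str, verdict: bool, now: Optional[float] = None) -> None:
    """Store a validation verdict together with the time it was written."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    payload = {"verdict": verdict, "stored_at": time.time() if now is None else now}
    _entry_path(cache_dir, key).write_text(json.dumps(payload), encoding="utf-8")


def cache_get(cache_dir: Path, key: str, now: Optional[float] = None) -> Optional[bool]:
    """Return the cached verdict, or None if it is missing or older than the TTL."""
    path = _entry_path(cache_dir, key)
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    current = time.time() if now is None else now
    if current - payload["stored_at"] > TTL_SECONDS:
        path.unlink()  # expired: drop the stale entry
        return None
    return payload["verdict"]


# Use a temporary directory here instead of the real cache location
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "fitz_gov"
    cache_put(cache_dir, "response-123", verdict=False, now=0.0)
    fresh = cache_get(cache_dir, "response-123", now=3600.0)
    stale = cache_get(cache_dir, "response-123", now=TTL_SECONDS + 1.0)
    print(fresh, stale)
```

The `now` parameter exists only so expiry can be exercised without waiting; a real cache would rely on the current time.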
API Reference
Core Classes
from fitz_gov import (
    # Evaluator
    FitzGovEvaluator,
    # Data loading
    load_cases,
    load_tier,
    load_case_by_id,
    get_category_info,
    get_tier_info,
    get_data_dir,
    get_tier_dir,
    Tier,
    # Models
    FitzGovCategory,
    AnswerMode,
    FitzGovCase,
    FitzGovCaseResult,
    FitzGovCategoryResult,
    FitzGovConfusionMatrix,
    FitzGovResult,
    # Tiered Results
    TieredResult,
    Tier0Result,
    Tier1Result,
    # LLM Validation
    OllamaValidator,
    ValidatorConfig,
    ValidationResult,
)
FitzGovEvaluator
evaluator = FitzGovEvaluator(
    llm_validation=False,  # Enable two-pass validation
    llm_model="qwen2.5:14b",  # Ollama model for validation
    llm_base_url="http://localhost:11434"
)
# Tiered evaluation (recommended)
result = evaluator.evaluate_tiered(
    tier0_cases, tier0_responses, tier0_modes,
    tier1_cases, tier1_responses, tier1_modes,
    tier0_threshold=0.95,  # Default: 95%
    gating_enabled=True,  # Skip Tier 1 if Tier 0 fails
)
# Flat evaluation (all cases together)
results = evaluator.evaluate_all(cases, responses, modes)
# Evaluate single case
result = evaluator.evaluate_case(case, response, mode)
Loading Test Cases
# Load by tier (recommended)
tier0_cases = load_tier(Tier.SANITY) # 60 sanity cases
tier1_cases = load_tier(Tier.CORE) # 271 core cases
# Load all cases (331 total)
all_cases = load_cases()
# Load specific categories from a tier
abstention_tier0 = load_tier(Tier.SANITY, [FitzGovCategory.ABSTENTION])
# Load specific categories across all tiers
governance_cases = load_cases([
    FitzGovCategory.ABSTENTION,
    FitzGovCategory.DISPUTE,
])
# Load single case by ID (IDs prefixed with t0_ or t1_)
case = load_case_by_id("t1_dispute_medium_005")
Data Format
Test cases are organized in a tiered structure:
data/
├── tier0_sanity/            # 60 cases - baseline verification (95% threshold)
│   ├── abstention.json      # 12 cases
│   ├── dispute.json         # 12 cases
│   ├── qualification.json   # 10 cases
│   ├── confidence.json      # 10 cases
│   ├── grounding.json       # 8 cases
│   └── relevance.json       # 8 cases
├── tier1_core/              # 271 cases - discriminative benchmark
│   ├── abstention.json      # 51 cases
│   ├── dispute.json         # 43 cases
│   ├── qualification.json   # 58 cases
│   ├── confidence.json      # 53 cases
│   ├── grounding.json       # 34 cases
│   └── relevance.json       # 32 cases
└── corpus/
    └── documents.jsonl      # 378 reference documents
Each case has:
{
  "id": "abstain_001",
  "query": "What is the company's revenue for 2024?",
  "contexts": ["The company was founded in 2010..."],
  "expected_mode": "abstain",
  "subcategory": "different_domain",
  "difficulty": "medium",
  "mode_rationale": "Context contains no financial data",
  "evaluation_config": {
    "forbidden_claims": ["\\$\\d"],
    "allowed_phrases": ["not specified", "cannot find"]
  }
}
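A case in this format can be read with the standard library alone. The sketch below parses the example above and compiles its `forbidden_claims` entries as regular expressions. `CaseSketch` and `parse_case` are hypothetical names used for illustration and are not the package's `FitzGovCase` model; treating `forbidden_claims` as regex patterns is inferred from the escaped `\\$\\d` value.

```python
import json
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaseSketch:
    """Hypothetical stand-in for the package's case model."""
    id: str
    query: str
    contexts: List[str]
    expected_mode: str
    forbidden_claims: List[re.Pattern] = field(default_factory=list)
    allowed_phrases: List[str] = field(default_factory=list)


def parse_case(raw: str) -> CaseSketch:
    """Parse one JSON case and compile its forbidden-claim patterns."""
    data = json.loads(raw)
    config = data.get("evaluation_config", {})
    return CaseSketch(
        id=data["id"],
        query=data["query"],
        contexts=list(data["contexts"]),
        expected_mode=data["expected_mode"],
        forbidden_claims=[re.compile(p) for p in config.get("forbidden_claims", [])],
        allowed_phrases=list(config.get("allowed_phrases", [])),
    )


# Raw string so the JSON escapes reach json.loads unchanged
RAW_CASE = r"""
{
  "id": "abstain_001",
  "query": "What is the company's revenue for 2024?",
  "contexts": ["The company was founded in 2010..."],
  "expected_mode": "abstain",
  "evaluation_config": {
    "forbidden_claims": ["\\$\\d"],
    "allowed_phrases": ["not specified", "cannot find"]
  }
}
"""

case = parse_case(RAW_CASE)
hits = [p.pattern for p in case.forbidden_claims if p.search("Revenue was $4 million")]
print(case.id, case.expected_mode, hits)
```

After JSON decoding, the pattern `\$\d` matches a dollar sign followed by a digit, so a response that invents a dollar figure is flagged while "not specified" style answers are not.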
Version
Current version: 2.0.0
See CHANGELOG.md for release history and docs/roadmap for implementation details.
Architecture Note
fitz-gov is designed as a standalone package so that:
- Any RAG system can benchmark against the same test cases
- Evaluation logic is consistent - all systems get identical evaluation
- Test data is versioned - reproducible benchmarks across releases
For Fitz RAG engine integration, see fitz_ai.evaluation.benchmarks.FitzGovBenchmark which wraps this package.
Contributing
We welcome contributions! To add new test cases:
- Fork this repo
- Add cases to the appropriate data/<category>/ directory
- Run validation: python scripts/validate.py
- Submit a PR
License
MIT License - see LICENSE for details.