fitz-gov: Comprehensive RAG Governance Benchmark

fitz-gov is a benchmark for evaluating RAG system governance: the ability to know when to abstain, dispute, qualify, or confidently answer questions.

Why fitz-gov?

Most RAG benchmarks focus on retrieval quality (BEIR) or answer correctness (RAGAS). But real-world RAG systems need epistemic honesty - knowing what they don't know.

fitz-gov measures:

| Category      | What it Tests                                      | Maps to        |
|---------------|----------------------------------------------------|----------------|
| Abstention    | Refuses when context is insufficient               | ABSTAIN mode   |
| Dispute       | Flags conflicting sources                          | DISPUTED mode  |
| Qualification | Hedges uncertain claims                            | QUALIFIED mode |
| Confidence    | Answers confidently when evidence is clear         | CONFIDENT mode |
| Grounding     | Answers are grounded in context (no hallucination) | Answer quality |
| Relevance     | Answers address the actual question                | Answer quality |

Installation

pip install fitz-gov

Or install from local path during development:

pip install -e path/to/fitz-gov

Quick Start

Tiered Evaluation (Recommended)

fitz-gov uses a two-tier evaluation system:

  • Tier 0 (Sanity): 60 easy cases with a 95% pass threshold; by default, Tier 1 is skipped if Tier 0 fails
  • Tier 1 (Core): 271 discriminative cases with gradient scoring

from fitz_gov import FitzGovEvaluator, load_tier, Tier, AnswerMode

# Load tiered cases
tier0_cases = load_tier(Tier.SANITY)  # 60 cases
tier1_cases = load_tier(Tier.CORE)    # 271 cases

# Your RAG system generates responses and modes for each tier
tier0_responses, tier0_modes = your_rag_system.evaluate(tier0_cases)
tier1_responses, tier1_modes = your_rag_system.evaluate(tier1_cases)

# Run tiered evaluation
evaluator = FitzGovEvaluator()
result = evaluator.evaluate_tiered(
    tier0_cases, tier0_responses, tier0_modes,
    tier1_cases, tier1_responses, tier1_modes,
)

print(result)
# fitz-gov Tiered Evaluation
# ==========================
#
# TIER 0 (Sanity Check): PASSED
#   Threshold: 95% | Achieved: 98.3% (59/60)
#
# TIER 1 (Core Benchmark): 78.1%
#   By Category:
#     abstention: 26/30 (86.7%)
#     dispute: 22/30 (73.3%)
#     ...
#
# Summary: Tier 0 PASSED, Tier 1 Score: 78.1%
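
The gating behaviour can be pictured with a small standalone sketch. This is not the package's implementation: `GateOutcome` and `run_gated` are hypothetical names, each case result is reduced to a boolean, and Tier 1 is scored as a plain pass fraction, which ignores the gradient scoring mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GateOutcome:
    """Hypothetical container; not the package's TieredResult."""
    tier0_pass_rate: float
    tier0_passed: bool
    tier1_score: Optional[float]  # None when Tier 1 was skipped


def run_gated(
    tier0_results: List[bool],
    tier1_results: List[bool],
    threshold: float = 0.95,
    gating_enabled: bool = True,
) -> GateOutcome:
    """Score Tier 1 only if Tier 0 reaches the threshold (True = case passed)."""
    tier0_rate = sum(tier0_results) / len(tier0_results)
    tier0_passed = tier0_rate >= threshold
    if gating_enabled and not tier0_passed:
        # Sanity tier failed: do not report a Tier 1 score at all.
        return GateOutcome(tier0_rate, False, None)
    tier1_score = sum(tier1_results) / len(tier1_results)
    return GateOutcome(tier0_rate, tier0_passed, tier1_score)


# 59 of 60 sanity cases pass, so Tier 1 is scored.
outcome = run_gated([True] * 59 + [False], [True] * 3 + [False])
print(f"{outcome.tier0_pass_rate:.1%}", outcome.tier0_passed, outcome.tier1_score)
# prints: 98.3% True 0.75
```

Returning `None` rather than 0.0 for a skipped Tier 1 keeps "not evaluated" distinct from "evaluated and failed everything".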

With Fitz RAG Engine

from fitz_ai.evaluation.benchmarks import FitzGovBenchmark

# Create benchmark and evaluate your engine
benchmark = FitzGovBenchmark()
results = benchmark.evaluate(engine)

print(results)

Standalone Usage (Any RAG System)

The fitz-gov package contains all evaluation logic, so any RAG system can be evaluated:

from fitz_gov import FitzGovEvaluator, load_cases, FitzGovCategory, AnswerMode

# Load test cases
cases = load_cases()

# Create evaluator
evaluator = FitzGovEvaluator()

# Evaluate your RAG system's responses
responses = []
modes = []

for case in cases:
    # Your RAG system generates response
    response = your_rag_system.query(case.query, case.contexts)
    mode = your_rag_system.classify_mode(response)  # Your mode classification

    responses.append(response)
    modes.append(mode)

# Get comprehensive results
results = evaluator.evaluate_all(cases, responses, modes)
print(f"Overall accuracy: {results.overall_accuracy:.1%}")
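
The loop above leaves `classify_mode` to your system. If your RAG system does not already emit a mode, a keyword heuristic is one rough starting point. The sketch below is an assumption-laden illustration: `Mode` is a local stand-in for `AnswerMode` (only the `"abstain"` value is confirmed by the data format in this README), and the cue phrases are invented for the example.

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for fitz_gov.AnswerMode; values other than 'abstain' are guesses."""
    ABSTAIN = "abstain"
    DISPUTED = "disputed"
    QUALIFIED = "qualified"
    CONFIDENT = "confident"


# Hypothetical cue phrases, checked in priority order.
CUES = [
    (Mode.ABSTAIN, ("cannot find", "not specified", "no information")),
    (Mode.DISPUTED, ("sources disagree", "conflicting", "contradict")),
    (Mode.QUALIFIED, ("may ", "might ", "appears to", "likely")),
]


def classify_mode(response: str) -> Mode:
    """Return the first mode whose cue phrase appears, else CONFIDENT."""
    text = response.lower()
    for mode, phrases in CUES:
        if any(phrase in text for phrase in phrases):
            return mode
    return Mode.CONFIDENT


print(classify_mode("Based on the context provided, I cannot find information about..."))
# prints: Mode.ABSTAIN
```

Substring matching like this is brittle (for example, `"may "` also matches inside other words), so treat it as a placeholder until your system reports its own mode.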

Evaluating Individual Cases

from fitz_gov import AnswerMode, FitzGovEvaluator, load_case_by_id

evaluator = FitzGovEvaluator()

# Load specific test case
case = load_case_by_id("abstain_001")

# Your system's response
response = "Based on the context provided, I cannot find information about..."
mode = AnswerMode.ABSTAIN

# Evaluate
result = evaluator.evaluate_case(case, response, mode)
print(f"Passed: {result.passed}")
print(f"Expected: {case.expected_mode.value}, Got: {mode.value}")

Two-Pass Validation (Answer Quality Categories)

For grounding and relevance categories, fitz-gov uses two-pass validation to reduce false positives:

  1. Regex pass: Fast pattern matching catches obvious violations
  2. LLM pass: Semantic validation for flagged cases

Enable LLM Validation

from fitz_gov import FitzGovEvaluator

# Enable LLM validation with local Ollama
evaluator = FitzGovEvaluator(
    llm_validation=True,
    llm_model="qwen2.5:14b",  # or any Ollama model
    llm_base_url="http://localhost:11434"
)

# Responses flagged by regex are sent to LLM for semantic check
results = evaluator.evaluate_all(cases, responses, modes)

Validation Flow

Response contains forbidden_claim pattern?
    │
    ├─ No  → PASS (no hallucination detected)
    │
    └─ Yes → LLM validates: "Is this an actual hallucination?"
                │
                ├─ LLM says no (e.g., "no revenue mentioned") → PASS
                │
                └─ LLM says yes (fabricated specific value) → FAIL
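
The flow above can be sketched in a few lines of standalone Python. This is a minimal illustration, not the package's code: `two_pass_check` and `stub_judge` are hypothetical names, the judge is a trivial stand-in for a real model call, and the `revenue` pattern is invented for the demo (only `\$\d` appears in this README).

```python
import re
from typing import Callable, Iterable


def two_pass_check(
    response: str,
    forbidden_claims: Iterable[str],
    llm_judge: Callable[[str, str], bool],
) -> bool:
    """Return True if the response passes, False if it fails.

    llm_judge(response, matched_text) should return True when the flagged
    text is an actual hallucination.
    """
    for pattern in forbidden_claims:
        match = re.search(pattern, response)
        if match is None:
            continue  # pass 1: nothing flagged for this pattern
        if llm_judge(response, match.group(0)):
            return False  # pass 2: judge confirmed a hallucination
    return True


def stub_judge(response: str, matched_text: str) -> bool:
    """Stand-in for a model call: treat negated mentions as harmless."""
    text = response.lower()
    return "no " not in text and "not " not in text


patterns = [r"\$\d", r"revenue"]
print(two_pass_check("The company was founded in 2010.", patterns, stub_judge))
print(two_pass_check("There is no revenue mentioned in the context.", patterns, stub_judge))
print(two_pass_check("Revenue for 2024 was $5 million.", patterns, stub_judge))
# prints: True, True, False (one per line)
```

The judge is only called for responses the regex pass flags, so unflagged responses never incur a model call.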

Caching

LLM validation results are cached for 7 days to speed up repeated evaluations:

  • Cache location: ~/.cache/fitz_gov/
  • Automatic cache cleanup on startup
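
The README does not describe how the cache is implemented. Purely as an illustration, a file-backed cache with a 7-day expiry and cleanup at start-up could look like the sketch below. The `TTLCache` class, its one-file-per-key layout, and the `now` parameter (added so expiry can be exercised without waiting) are all assumptions, not fitz-gov's code.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, as stated above


class TTLCache:
    """Hypothetical file-backed cache; not the package's implementation."""

    def __init__(self, directory, ttl=TTL_SECONDS):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl
        self.cleanup()  # drop expired entries on startup

    def _path(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        self._path(key).write_text(json.dumps({"stored_at": now, "value": value}))

    def get(self, key, now=None):
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        now = time.time() if now is None else now
        if now - entry["stored_at"] > self.ttl:
            path.unlink()  # expired: remove and report a miss
            return None
        return entry["value"]

    def cleanup(self, now=None):
        now = time.time() if now is None else now
        for path in self.directory.glob("*.json"):
            entry = json.loads(path.read_text())
            if now - entry["stored_at"] > self.ttl:
                path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    cache = TTLCache(tmp)
    cache.set("response-hash", {"hallucination": False}, now=0.0)
    fresh = cache.get("response-hash", now=3600.0)
    stale = cache.get("response-hash", now=TTL_SECONDS + 1.0)

print(fresh, stale)
# prints: {'hallucination': False} None
```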

API Reference

Core Classes

from fitz_gov import (
    # Evaluator
    FitzGovEvaluator,

    # Data loading
    load_cases,
    load_tier,
    load_case_by_id,
    get_category_info,
    get_tier_info,
    get_data_dir,
    get_tier_dir,
    Tier,

    # Models
    FitzGovCategory,
    AnswerMode,
    FitzGovCase,
    FitzGovCaseResult,
    FitzGovCategoryResult,
    FitzGovConfusionMatrix,
    FitzGovResult,

    # Tiered Results
    TieredResult,
    Tier0Result,
    Tier1Result,

    # LLM Validation
    OllamaValidator,
    ValidatorConfig,
    ValidationResult,
)
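
The fields of `FitzGovConfusionMatrix` are not documented in this README. As a rough picture of what a mode confusion matrix captures, the standalone sketch below counts (expected, predicted) pairs with the standard library; the function name and the mode strings other than `"abstain"` are assumptions.

```python
from collections import Counter
from typing import Sequence


def confusion_counts(expected: Sequence[str], predicted: Sequence[str]) -> Counter:
    """Count how often each expected mode was answered with each predicted mode."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return Counter(zip(expected, predicted))


expected = ["abstain", "abstain", "disputed", "confident"]
predicted = ["abstain", "confident", "disputed", "confident"]

matrix = confusion_counts(expected, predicted)
# Diagonal entries (expected == predicted) are the correct classifications.
accuracy = sum(n for (exp, pred), n in matrix.items() if exp == pred) / len(expected)

print(matrix[("abstain", "confident")], accuracy)
# prints: 1 0.75
```

The off-diagonal cell `("abstain", "confident")` is the failure a governance benchmark cares most about: answering confidently when the case called for abstention.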

FitzGovEvaluator

evaluator = FitzGovEvaluator(
    llm_validation=False,      # Enable two-pass validation
    llm_model="qwen2.5:14b",   # Ollama model for validation
    llm_base_url="http://localhost:11434"
)

# Tiered evaluation (recommended)
result = evaluator.evaluate_tiered(
    tier0_cases, tier0_responses, tier0_modes,
    tier1_cases, tier1_responses, tier1_modes,
    tier0_threshold=0.95,      # Default: 95%
    gating_enabled=True,       # Skip Tier 1 if Tier 0 fails
)

# Flat evaluation (all cases together)
results = evaluator.evaluate_all(cases, responses, modes)

# Evaluate single case
result = evaluator.evaluate_case(case, response, mode)

Loading Test Cases

# Load by tier (recommended)
tier0_cases = load_tier(Tier.SANITY)  # 60 sanity cases
tier1_cases = load_tier(Tier.CORE)    # 271 core cases

# Load all cases (331 total)
all_cases = load_cases()

# Load specific categories from a tier
abstention_tier0 = load_tier(Tier.SANITY, [FitzGovCategory.ABSTENTION])

# Load specific categories across all tiers
governance_cases = load_cases([
    FitzGovCategory.ABSTENTION,
    FitzGovCategory.DISPUTE,
])

# Load single case by ID (IDs prefixed with t0_ or t1_)
case = load_case_by_id("t1_dispute_medium_005")

Data Format

Test cases are organized in a tiered structure:

data/
├── tier0_sanity/          # 60 cases - baseline verification (95% threshold)
│   ├── abstention.json    # 12 cases
│   ├── dispute.json       # 12 cases
│   ├── qualification.json # 10 cases
│   ├── confidence.json    # 10 cases
│   ├── grounding.json     # 8 cases
│   └── relevance.json     # 8 cases
├── tier1_core/            # 271 cases - discriminative benchmark
│   ├── abstention.json    # 51 cases
│   ├── dispute.json       # 43 cases
│   ├── qualification.json # 58 cases
│   ├── confidence.json    # 53 cases
│   ├── grounding.json     # 34 cases
│   └── relevance.json     # 32 cases
└── corpus/
    └── documents.jsonl    # 378 reference documents

Each case has:

{
  "id": "abstain_001",
  "query": "What is the company's revenue for 2024?",
  "contexts": ["The company was founded in 2010..."],
  "expected_mode": "abstain",
  "subcategory": "different_domain",
  "difficulty": "medium",
  "mode_rationale": "Context contains no financial data",
  "evaluation_config": {
    "forbidden_claims": ["\\$\\d"],
    "allowed_phrases": ["not specified", "cannot find"]
  }
}
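
Note that `forbidden_claims` entries are regular expressions stored inside JSON strings, so backslashes are doubled in the file: `"\\$\\d"` decodes to the regex `\$\d` (a literal dollar sign followed by a digit). The sketch below parses the example case with the standard library to show this. How the evaluator itself applies `allowed_phrases` is not described here, so the `mentions_allowed` helper is only an assumed substring check.

```python
import json
import re

# The example case from above; a raw string keeps the JSON escapes intact.
raw_case = r'''
{
  "id": "abstain_001",
  "query": "What is the company's revenue for 2024?",
  "contexts": ["The company was founded in 2010..."],
  "expected_mode": "abstain",
  "evaluation_config": {
    "forbidden_claims": ["\\$\\d"],
    "allowed_phrases": ["not specified", "cannot find"]
  }
}
'''

case = json.loads(raw_case)
config = case["evaluation_config"]
forbidden = [re.compile(pattern) for pattern in config["forbidden_claims"]]


def flagged(response: str) -> bool:
    """True if any forbidden_claims pattern matches the response."""
    return any(pattern.search(response) for pattern in forbidden)


def mentions_allowed(response: str) -> bool:
    """Assumed helper: True if the response contains an allowed phrase."""
    text = response.lower()
    return any(phrase in text for phrase in config["allowed_phrases"])


print(config["forbidden_claims"][0])                 # prints: \$\d
print(flagged("Revenue was $4.2 billion in 2024."))  # prints: True
print(flagged("Revenue is not specified."))          # prints: False
```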

Version

Current version: 2.0.0

See CHANGELOG.md for release history and docs/roadmap for implementation details.

Architecture Note

fitz-gov is designed as a standalone package so that:

  1. Any RAG system can benchmark against the same test cases
  2. Evaluation logic is consistent - all systems get identical evaluation
  3. Test data is versioned - reproducible benchmarks across releases

For Fitz RAG engine integration, see fitz_ai.evaluation.benchmarks.FitzGovBenchmark which wraps this package.

Contributing

We welcome contributions! To add new test cases:

  1. Fork this repo
  2. Add cases to the appropriate data/<tier>/<category>.json file
  3. Run validation: python scripts/validate.py
  4. Submit a PR

License

MIT License - see LICENSE for details.
