fitz-gov: Comprehensive RAG Governance Benchmark

fitz-gov is a benchmark for evaluating RAG system governance - the ability to know when to abstain, dispute, qualify, or confidently answer questions.

Why fitz-gov?

Most RAG benchmarks focus on retrieval quality (BEIR) or answer correctness (RAGAS). But real-world RAG systems need epistemic honesty - knowing what they don't know.

fitz-gov measures:

| Category | What it Tests | Maps to |
|---|---|---|
| Abstention | Refuses when context is insufficient | `ABSTAIN` mode |
| Dispute | Flags conflicting sources | `DISPUTED` mode |
| Qualification | Hedges uncertain claims | `QUALIFIED` mode |
| Confidence | Answers confidently when evidence is clear | `CONFIDENT` mode |
| Grounding | Answers are grounded in context (no hallucination) | Answer quality |
| Relevance | Answers address the actual question | Answer quality |

Installation

pip install fitz-gov

Or install from local path during development:

pip install -e path/to/fitz-gov

Quick Start

Tiered Evaluation (Recommended)

fitz-gov uses a two-tier evaluation system:

Tier 0 (Sanity): 60 easy cases with a 95% pass threshold; by default, Tier 1 is skipped if Tier 0 fails
Tier 1 (Core): 271 discriminative cases with gradient scoring

from fitz_gov import FitzGovEvaluator, load_tier, Tier, AnswerMode

# Load tiered cases
tier0_cases = load_tier(Tier.SANITY)  # 60 cases
tier1_cases = load_tier(Tier.CORE)    # 271 cases

# Your RAG system generates responses and modes for each tier
tier0_responses, tier0_modes = your_rag_system.evaluate(tier0_cases)
tier1_responses, tier1_modes = your_rag_system.evaluate(tier1_cases)

# Run tiered evaluation
evaluator = FitzGovEvaluator()
result = evaluator.evaluate_tiered(
    tier0_cases, tier0_responses, tier0_modes,
    tier1_cases, tier1_responses, tier1_modes,
)

print(result)
# Example output (illustrative; per-category totals depend on the data version):
# fitz-gov Tiered Evaluation
# ==========================
#
# TIER 0 (Sanity Check): PASSED
#   Threshold: 95% | Achieved: 98.3% (59/60)
#
# TIER 1 (Core Benchmark): 78.1%
#   By Category:
#     abstention: 26/30 (86.7%)
#     dispute: 22/30 (73.3%)
#     ...
#
# Summary: Tier 0 PASSED, Tier 1 Score: 78.1%
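
The gating rule behind the tiered evaluation can be sketched in plain Python. This is a minimal illustration, not the package's implementation: `TieredScore`, `accuracy`, and `score_tiers` are hypothetical names, and the sketch assumes the 95% threshold is inclusive (a score of exactly 95% passes).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TieredScore:
    """Hypothetical container; not the package's TieredResult."""
    tier0_accuracy: float
    tier0_passed: bool
    tier1_accuracy: Optional[float]  # None when Tier 1 is skipped


def accuracy(passed: List[bool]) -> float:
    """Fraction of cases that passed (0.0 for an empty list)."""
    return sum(passed) / len(passed) if passed else 0.0


def score_tiers(
    tier0_passed: List[bool],
    tier1_passed: List[bool],
    tier0_threshold: float = 0.95,
    gating_enabled: bool = True,
) -> TieredScore:
    """Score Tier 0, then score Tier 1 unless gating blocks it."""
    t0 = accuracy(tier0_passed)
    ok = t0 >= tier0_threshold  # assumption: the threshold is inclusive
    if gating_enabled and not ok:
        return TieredScore(t0, False, None)
    return TieredScore(t0, ok, accuracy(tier1_passed))


# 59 of 60 sanity cases pass, so Tier 1 is scored.
print(score_tiers([True] * 59 + [False], [True, True, False, True]))
# 56 of 60 pass, which is below 95%, so Tier 1 is skipped.
print(score_tiers([True] * 56 + [False] * 4, [True, True, False, True]))
```

With `gating_enabled=False`, the same function scores Tier 1 even when Tier 0 falls short, which mirrors the `gating_enabled` flag shown in the API reference.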

With Fitz RAG Engine

from fitz_ai.evaluation.benchmarks import FitzGovBenchmark

# Create the benchmark and evaluate your engine
# (`engine` is your already-configured Fitz RAG engine instance)
benchmark = FitzGovBenchmark()
results = benchmark.evaluate(engine)

print(results)

Standalone Usage (Any RAG System)

The fitz-gov package contains all evaluation logic, so any RAG system can be evaluated:

from fitz_gov import FitzGovEvaluator, load_cases, FitzGovCategory, AnswerMode

# Load test cases
cases = load_cases()

# Create evaluator
evaluator = FitzGovEvaluator()

# Evaluate your RAG system's responses
responses = []
modes = []

for case in cases:
    # Your RAG system generates response
    response = your_rag_system.query(case.query, case.contexts)
    mode = your_rag_system.classify_mode(response)  # Your mode classification

    responses.append(response)
    modes.append(mode)

# Get comprehensive results
results = evaluator.evaluate_all(cases, responses, modes)
print(f"Overall accuracy: {results.overall_accuracy:.1%}")
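
The loop above leaves `classify_mode` to your system. If your RAG system does not emit a mode directly, one rough starting point is a keyword heuristic over the response text. The sketch below is an assumption-laden illustration: `Mode` is a hypothetical stand-in for the package's `AnswerMode` (only the `"abstain"` string value appears in the data format; the other values are guesses), and the cue phrases are examples, not part of fitz-gov.

```python
import re
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for AnswerMode; values other than 'abstain' are assumed."""
    ABSTAIN = "abstain"
    DISPUTED = "disputed"
    QUALIFIED = "qualified"
    CONFIDENT = "confident"


# Illustrative cue phrases only, checked in priority order.
CUES = [
    (Mode.ABSTAIN, r"\b(cannot find|not specified|no information|does not (say|mention))\b"),
    (Mode.DISPUTED, r"\b(sources (disagree|conflict)|conflicting|contradict\w*)\b"),
    (Mode.QUALIFIED, r"\b(may|might|possibly|appears to|suggests?|unclear)\b"),
]


def classify_mode(response: str) -> Mode:
    """Return the first mode whose cue pattern matches, else CONFIDENT."""
    text = response.lower()
    for mode, pattern in CUES:
        if re.search(pattern, text):
            return mode
    return Mode.CONFIDENT


print(classify_mode("I cannot find information about revenue.").value)  # abstain
```

A heuristic like this is brittle; a system that tracks its own answer mode internally should report that mode instead.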

Evaluating Individual Cases

from fitz_gov import AnswerMode, FitzGovEvaluator, load_case_by_id

evaluator = FitzGovEvaluator()

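# Load a specific test case (IDs in the tiered data set carry a t0_ or t1_
# prefix, so substitute an ID that exists in your installed version)
case = load_case_by_id("abstain_001")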

# Your system's response
response = "Based on the context provided, I cannot find information about..."
mode = AnswerMode.ABSTAIN

# Evaluate
result = evaluator.evaluate_case(case, response, mode)
print(f"Passed: {result.passed}")
print(f"Expected: {case.expected_mode.value}, Got: {mode.value}")
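
Collecting the expected and predicted mode for every case, as printed above for one case, gives a confusion matrix. The following sketch uses plain strings and `collections.Counter`; the helper names are hypothetical and it does not use the package's `FitzGovConfusionMatrix`.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def confusion_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count (expected_mode, predicted_mode) pairs."""
    return Counter(pairs)


def per_mode_accuracy(counts: Counter) -> Dict[str, float]:
    """Accuracy for each expected mode: correct predictions / total cases."""
    totals: Counter = Counter()
    correct: Counter = Counter()
    for (expected, predicted), n in counts.items():
        totals[expected] += n
        if expected == predicted:
            correct[expected] += n
    return {mode: correct[mode] / totals[mode] for mode in totals}


# Toy data: (expected, predicted) for four cases.
pairs = [
    ("abstain", "abstain"),
    ("abstain", "confident"),
    ("disputed", "disputed"),
    ("qualified", "confident"),
]
counts = confusion_counts(pairs)
print(counts[("abstain", "confident")])  # 1
print(per_mode_accuracy(counts))  # {'abstain': 0.5, 'disputed': 1.0, 'qualified': 0.0}
```

Off-diagonal cells such as `("abstain", "confident")` are the ones that matter most for governance, since they record confident answers where the benchmark expected a refusal.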

Two-Pass Validation (Answer Quality Categories)

For grounding and relevance categories, fitz-gov uses two-pass validation to reduce false positives:

Regex pass: Fast pattern matching catches obvious violations
LLM pass: Semantic validation for flagged cases

Enable LLM Validation

from fitz_gov import FitzGovEvaluator

# Enable LLM validation with local Ollama
evaluator = FitzGovEvaluator(
    llm_validation=True,
    llm_model="qwen2.5:14b",  # or any Ollama model
    llm_base_url="http://localhost:11434"
)

# Responses flagged by regex are sent to LLM for semantic check
results = evaluator.evaluate_all(cases, responses, modes)

Validation Flow

Response contains forbidden_claim pattern?
    │
    ├─ No  → PASS (no hallucination detected)
    │
    └─ Yes → LLM validates: "Is this an actual hallucination?"
                │
                ├─ LLM says no (e.g., "no revenue mentioned") → PASS
                │
                └─ LLM says yes (fabricated specific value) → FAIL
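
The flow above can be sketched with a regex pass followed by a pluggable semantic check. This is a simplified illustration under stated assumptions: `two_pass_check` and `toy_semantic_check` are hypothetical names, the semantic check is a stub rather than an Ollama call, and the sketch assumes that a regex match counts as a failure when no semantic check is configured.

```python
import re
from typing import Callable, Iterable, Optional

# Takes (response, matched_text); returns True if the match is a real hallucination.
SemanticCheck = Callable[[str, str], bool]


def first_forbidden_match(response: str, patterns: Iterable[str]) -> Optional[str]:
    """Return the first text matched by any forbidden pattern, or None."""
    for pattern in patterns:
        match = re.search(pattern, response)
        if match:
            return match.group(0)
    return None


def two_pass_check(
    response: str,
    forbidden_claims: Iterable[str],
    semantic_check: Optional[SemanticCheck] = None,
) -> bool:
    """Return True (pass) unless a flagged match is confirmed as a violation."""
    hit = first_forbidden_match(response, forbidden_claims)
    if hit is None:
        return True  # regex pass found nothing suspicious
    if semantic_check is None:
        return False  # assumption: in regex-only mode, any match fails
    return not semantic_check(response, hit)


def toy_semantic_check(response: str, hit: str) -> bool:
    """Stub for the LLM pass: treat negated statements as non-hallucinations."""
    return "not" not in response.lower()


patterns = [r"\$\d"]  # same pattern as the forbidden_claims example in Data Format
print(two_pass_check("I cannot find revenue figures.", patterns, toy_semantic_check))
print(two_pass_check("The context does not say revenue exceeded $1 million.", patterns, toy_semantic_check))
print(two_pass_check("Revenue was $5 million.", patterns, toy_semantic_check))
```

The second response shows why the second pass exists: it contains a dollar amount, so the regex flags it, but it makes no fabricated claim.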

Caching

LLM validation results are cached for 7 days to speed up repeated evaluations:

Cache location: ~/.cache/fitz_gov/
Automatic cache cleanup on startup
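
A time-limited cache of this kind could look roughly like the sketch below. It is not the package's cache: `VerdictCache`, its on-disk layout (one JSON file per key), and the key fields are all hypothetical, and the demo writes to a temporary directory rather than `~/.cache/fitz_gov/`.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Optional

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


class VerdictCache:
    """Hypothetical file cache: one JSON file per key, expiring after `ttl` seconds."""

    def __init__(self, root: Path, ttl: float = TTL_SECONDS) -> None:
        self.root = root
        self.ttl = ttl
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, *parts: str) -> Path:
        digest = hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def put(self, verdict: bool, *parts: str, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        payload = json.dumps({"verdict": verdict, "stored_at": stored_at})
        self._path(*parts).write_text(payload, encoding="utf-8")

    def get(self, *parts: str, now: Optional[float] = None) -> Optional[bool]:
        path = self._path(*parts)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        current = time.time() if now is None else now
        if current - entry["stored_at"] > self.ttl:
            path.unlink()  # expired entry
            return None
        return entry["verdict"]

    def cleanup(self, now: Optional[float] = None) -> int:
        """Delete expired entries and return how many were removed."""
        current = time.time() if now is None else now
        removed = 0
        for path in self.root.glob("*.json"):
            entry = json.loads(path.read_text(encoding="utf-8"))
            if current - entry["stored_at"] > self.ttl:
                path.unlink()
                removed += 1
        return removed


cache = VerdictCache(Path(tempfile.mkdtemp()))
cache.put(True, "qwen2.5:14b", "Revenue was $5 million.", "$5")
print(cache.get("qwen2.5:14b", "Revenue was $5 million.", "$5"))  # True
```

Including the model name in the key means that switching validation models does not reuse verdicts from a different model.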

API Reference

Core Classes

from fitz_gov import (
    # Evaluator
    FitzGovEvaluator,

    # Data loading
    load_cases,
    load_tier,
    load_case_by_id,
    get_category_info,
    get_tier_info,
    get_data_dir,
    get_tier_dir,
    Tier,

    # Models
    FitzGovCategory,
    AnswerMode,
    FitzGovCase,
    FitzGovCaseResult,
    FitzGovCategoryResult,
    FitzGovConfusionMatrix,
    FitzGovResult,

    # Tiered Results
    TieredResult,
    Tier0Result,
    Tier1Result,

    # LLM Validation
    OllamaValidator,
    ValidatorConfig,
    ValidationResult,
)

FitzGovEvaluator

evaluator = FitzGovEvaluator(
    llm_validation=False,      # Enable two-pass validation
    llm_model="qwen2.5:14b",   # Ollama model for validation
    llm_base_url="http://localhost:11434"
)

# Tiered evaluation (recommended)
result = evaluator.evaluate_tiered(
    tier0_cases, tier0_responses, tier0_modes,
    tier1_cases, tier1_responses, tier1_modes,
    tier0_threshold=0.95,      # Default: 95%
    gating_enabled=True,       # Skip Tier 1 if Tier 0 fails
)

# Flat evaluation (all cases together)
results = evaluator.evaluate_all(cases, responses, modes)

# Evaluate single case
result = evaluator.evaluate_case(case, response, mode)

Loading Test Cases

from fitz_gov import FitzGovCategory, Tier, load_case_by_id, load_cases, load_tier

# Load by tier (recommended)
tier0_cases = load_tier(Tier.SANITY)  # 60 sanity cases
tier1_cases = load_tier(Tier.CORE)    # 271 core cases

# Load all cases (331 total)
all_cases = load_cases()

# Load specific categories from a tier
abstention_tier0 = load_tier(Tier.SANITY, [FitzGovCategory.ABSTENTION])

# Load specific categories across all tiers
governance_cases = load_cases([
    FitzGovCategory.ABSTENTION,
    FitzGovCategory.DISPUTE,
])

# Load single case by ID (IDs prefixed with t0_ or t1_)
case = load_case_by_id("t1_dispute_medium_005")

Data Format

Test cases are organized in a tiered structure:

data/
├── tier0_sanity/          # 60 cases - baseline verification (95% threshold)
│   ├── abstention.json    # 12 cases
│   ├── dispute.json       # 12 cases
│   ├── qualification.json # 10 cases
│   ├── confidence.json    # 10 cases
│   ├── grounding.json     # 8 cases
│   └── relevance.json     # 8 cases
├── tier1_core/            # 271 cases - discriminative benchmark
│   ├── abstention.json    # 51 cases
│   ├── dispute.json       # 43 cases
│   ├── qualification.json # 58 cases
│   ├── confidence.json    # 53 cases
│   ├── grounding.json     # 34 cases
│   └── relevance.json     # 32 cases
└── corpus/
    └── documents.jsonl    # 378 reference documents
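
As a quick consistency check, the per-file counts in the tree above add up to the stated tier sizes. The dictionaries below simply transcribe those counts; the minimum-pass figure assumes the 95% threshold is inclusive.

```python
# Case counts per category file, transcribed from the directory tree.
TIER0 = {
    "abstention": 12, "dispute": 12, "qualification": 10,
    "confidence": 10, "grounding": 8, "relevance": 8,
}
TIER1 = {
    "abstention": 51, "dispute": 43, "qualification": 58,
    "confidence": 53, "grounding": 34, "relevance": 32,
}

tier0_total = sum(TIER0.values())
tier1_total = sum(TIER1.values())
print(tier0_total, tier1_total, tier0_total + tier1_total)  # 60 271 331

# Combined cases per category across both tiers.
per_category = {name: TIER0[name] + TIER1[name] for name in TIER0}
print(per_category["abstention"])  # 63

# Smallest number of Tier 0 passes that reaches 95% (integer ceiling of 0.95 * 60).
min_tier0_passes = (95 * tier0_total + 99) // 100
print(min_tier0_passes)  # 57
```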

Each case has:

{
  "id": "abstain_001",
  "query": "What is the company's revenue for 2024?",
  "contexts": ["The company was founded in 2010..."],
  "expected_mode": "abstain",
  "subcategory": "different_domain",
  "difficulty": "medium",
  "mode_rationale": "Context contains no financial data",
  "evaluation_config": {
    "forbidden_claims": ["\\$\\d"],
    "allowed_phrases": ["not specified", "cannot find"]
  }
}
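
To show how these fields might be consumed, the sketch below parses the example case with the standard library and compiles its `forbidden_claims` patterns. `Case` and `parse_case` are hypothetical helpers that mirror a subset of the JSON fields; they are not the package's `FitzGovCase` or loader.

```python
import json
import re
from dataclasses import dataclass, field
from typing import List

# The example case from above; a raw string keeps the JSON escapes intact.
RAW = r'''{
  "id": "abstain_001",
  "query": "What is the company's revenue for 2024?",
  "contexts": ["The company was founded in 2010..."],
  "expected_mode": "abstain",
  "subcategory": "different_domain",
  "difficulty": "medium",
  "mode_rationale": "Context contains no financial data",
  "evaluation_config": {
    "forbidden_claims": ["\\$\\d"],
    "allowed_phrases": ["not specified", "cannot find"]
  }
}'''


@dataclass
class Case:
    """Hypothetical mirror of selected JSON fields."""
    id: str
    query: str
    contexts: List[str]
    expected_mode: str
    forbidden: List[re.Pattern] = field(default_factory=list)
    allowed_phrases: List[str] = field(default_factory=list)


def parse_case(raw: str) -> Case:
    """Parse one case and compile its forbidden-claim regexes."""
    data = json.loads(raw)
    config = data.get("evaluation_config", {})
    return Case(
        id=data["id"],
        query=data["query"],
        contexts=list(data["contexts"]),
        expected_mode=data["expected_mode"],
        forbidden=[re.compile(p) for p in config.get("forbidden_claims", [])],
        allowed_phrases=list(config.get("allowed_phrases", [])),
    )


case = parse_case(RAW)
print(case.expected_mode)  # abstain
print(bool(case.forbidden[0].search("Revenue was $5 million")))  # True
print(bool(case.forbidden[0].search("Revenue is not specified")))  # False
```

Note that `"\\$\\d"` in JSON decodes to the regex `\$\d`, which matches a literal dollar sign followed by a digit.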

Version

Current version: 2.0.0

See CHANGELOG.md for release history and docs/roadmap for implementation details.

Architecture Note

fitz-gov is designed as a standalone package so that:

Any RAG system can benchmark against the same test cases
Evaluation logic is consistent - all systems get identical evaluation
Test data is versioned - reproducible benchmarks across releases

For Fitz RAG engine integration, see fitz_ai.evaluation.benchmarks.FitzGovBenchmark which wraps this package.

Contributing

We welcome contributions! To add new test cases:

Fork this repo
Add cases to the appropriate category file under data/tier0_sanity/ or data/tier1_core/ (for example, data/tier1_core/abstention.json)
Run validation: python scripts/validate.py
Submit a PR

License

MIT License - see LICENSE for details.

Release history

4.1.0 (Feb 15, 2026)
4.0.0 (Feb 14, 2026)
2.0.0 (Feb 5, 2026), this version
1.1.0 (Feb 4, 2026)

