Benchmark for structured information retrieval from financial documents using graph-verifiable questions

knowlytix-benchmark

Benchmark for Structured Retrieval from Financial Documents. Auto-generates questions from document-graph topology and scores LLM predictions against a provably correct graph-traversal baseline. (Internally referred to as the FinStructBench module — knowlytix.benchmark.*.)

knowlytix-benchmark is one of four packages in the Geometric Memory Systems family. Use it to answer questions like "how close does my RAG pipeline get to the graph-verified ground truth on this financial report?" — with metrics that resist gaming because the ground truth comes from graph operations, not a held-out human-labeled test set.
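To make "graph-verified ground truth" concrete, here is a toy sketch of the idea, not the package's implementation. The graph, the node names, and the helpers `reachable_from` and `make_question` are all hypothetical. The point it illustrates is that the expected answer is computed by traversing the graph, so no human labeling is involved.

```python
from collections import deque

# Toy document graph (hypothetical): each node lists the nodes it references.
DOC_GRAPH = {
    "section_1": ["table_a", "section_2"],
    "section_2": ["table_b"],
    "section_3": ["table_a"],
    "table_a": [],
    "table_b": [],
}


def reachable_from(graph, start):
    """Return every node reachable from `start` via breadth-first traversal."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def make_question(graph, start):
    """Build a question whose answer is fixed by the graph itself."""
    return {
        "id": f"reach-{start}",
        "text": f"Which items does {start} reference, directly or indirectly?",
        "ground_truth": sorted(reachable_from(graph, start)),
    }


question = make_question(DOC_GRAPH, "section_1")
print(question["text"])
print(question["ground_truth"])
```

Because the answer is derived from the traversal, changing the graph changes the ground truth automatically, which is what makes this style of benchmark hard to overfit.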

  • Package: knowlytix-benchmark
  • License: Apache-2.0
  • Python: 3.12+
  • Status: alpha (v0.x)

Install

pip install knowlytix-benchmark

Depends on knowlytix-core (pinned ~=0.1.0). LLM-mode scoring routes through LiteLLM: set GMS_LLM_MODEL plus your provider's API key, and any model LiteLLM supports should work.

Quickstart — score a prediction set

import json
from importlib.resources import files

from knowlytix.benchmark import score_answer

# Smoke fixtures shipped with the wheel:
questions = json.loads((files("knowlytix.benchmark.fixtures.smoke") / "questions.json").read_text())
predictions = json.loads((files("knowlytix.benchmark.fixtures.smoke") / "predictions.json").read_text())

by_id = {p["id"]: p["answer"] for p in predictions["predictions"]}
for q in questions["questions"]:
    result = score_answer(by_id[q["id"]], q["ground_truth"])
    mark = "correct" if result.correct else "wrong"
    print(f"{q['id']}: {mark}  partial={result.partial_score:.2f}  ({result.detail})")
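The loop above prints one line per question. If you also want a summary, a small helper along these lines could aggregate the results. This is a sketch: `summarize` and `StandInScore` are hypothetical names, and the stand-in class only mirrors the two attributes the quickstart reads (`correct` and `partial_score`), so the example runs without the package installed.

```python
from dataclasses import dataclass


@dataclass
class StandInScore:
    """Hypothetical stand-in exposing the two fields the quickstart reads."""
    correct: bool
    partial_score: float


def summarize(results):
    """Return (accuracy, mean partial score) over a collection of results."""
    results = list(results)
    if not results:
        return 0.0, 0.0
    accuracy = sum(1 for r in results if r.correct) / len(results)
    mean_partial = sum(r.partial_score for r in results) / len(results)
    return accuracy, mean_partial


# Demo data only; in practice, collect the objects returned by score_answer.
demo = [StandInScore(True, 1.0), StandInScore(False, 0.5), StandInScore(False, 0.0)]
accuracy, mean_partial = summarize(demo)
print(f"accuracy={accuracy:.2f}  mean_partial={mean_partial:.2f}")
```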

Quickstart — run the full benchmark

from knowlytix.benchmark import Benchmark, get_instance_path

bench = Benchmark(get_instance_path("model_validation"))
result = bench.run()          # graph-only mode (no LLM, no API key needed)
bench.print_results(result)

To evaluate an LLM against the same ground-truth graph:

from knowlytix.benchmark.llm_caller import create_client

client = create_client()       # reads GMS_LLM_MODEL_SCORER, falling back to GMS_LLM_MODEL
result = bench.run(llm_client=client)
bench.print_results(result)

CLI

benchmark --instance model_validation
benchmark --instance credit_portfolio --llm-model anthropic/claude-opus-4-6

Configuration

FINSTRUCTBENCH_* — scoring tolerances

Variable                          Default  Meaning
FINSTRUCTBENCH_FLOAT_TOL          1e-6     Absolute tolerance for float comparisons.
FINSTRUCTBENCH_CLOSE_THRESHOLD    0.01     Relative tolerance for "close enough" financial values.
FINSTRUCTBENCH_TUPLE_ELEMENT_TOL  1e-3     Tolerance per element inside tuple answers.
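The sketch below shows what absolute and relative tolerances mean in practice, using the defaults from the table above. It is an assumption about how such checks could be written, not the package's scoring code: the helper names are hypothetical, and treating the tuple tolerance as absolute is a guess.

```python
import math
import os

# Defaults copied from the table above; the comparison logic is an assumption.
FLOAT_TOL = float(os.environ.get("FINSTRUCTBENCH_FLOAT_TOL", "1e-6"))
CLOSE_THRESHOLD = float(os.environ.get("FINSTRUCTBENCH_CLOSE_THRESHOLD", "0.01"))
TUPLE_ELEMENT_TOL = float(os.environ.get("FINSTRUCTBENCH_TUPLE_ELEMENT_TOL", "1e-3"))


def is_exact(predicted, expected, tol=FLOAT_TOL):
    """Absolute check: |predicted - expected| <= tol."""
    return math.isclose(predicted, expected, rel_tol=0.0, abs_tol=tol)


def is_close(predicted, expected, threshold=CLOSE_THRESHOLD):
    """Relative check: difference within `threshold` of the larger magnitude."""
    return math.isclose(predicted, expected, rel_tol=threshold, abs_tol=0.0)


def tuple_matches(predicted, expected, tol=TUPLE_ELEMENT_TOL):
    """Element-wise absolute check (assumed); lengths must agree."""
    if len(predicted) != len(expected):
        return False
    return all(
        math.isclose(p, e, rel_tol=0.0, abs_tol=tol)
        for p, e in zip(predicted, expected)
    )


print(is_exact(12.5, 12.5))
print(is_close(101.0, 100.0))
print(tuple_matches((1.0, 2.0005), (1.0, 2.0)))
```

A relative tolerance scales with the size of the value, which suits financial figures that range from basis points to billions; an absolute tolerance suits values of known scale.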

GMS_LLM_* — LLM routing (only needed for LLM-mode scoring)

Variable                 Meaning
GMS_LLM_MODEL            Base LiteLLM model string.
GMS_LLM_MODEL_SCORER     Override for scoring calls (recommended).
GMS_LLM_TIMEOUT_SECONDS  Per-call timeout in seconds. Default 60.

See .env.example in the source repo for the full provider key reference.
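As an illustration of the precedence these variables imply, the following sketch resolves a model string and timeout from an environment mapping. `resolve_scorer_settings` is a hypothetical helper, not part of the package, and the fallback order (scorer override first, then the base model) is an assumption based on the table's description of GMS_LLM_MODEL_SCORER as an override.

```python
import os


def resolve_scorer_settings(env=None):
    """Return (model, timeout_seconds) using the assumed precedence."""
    env = os.environ if env is None else env
    # Assumed order: the scorer-specific override wins over the base model.
    model = env.get("GMS_LLM_MODEL_SCORER") or env.get("GMS_LLM_MODEL")
    if not model:
        raise RuntimeError(
            "Set GMS_LLM_MODEL (or GMS_LLM_MODEL_SCORER) for LLM-mode scoring"
        )
    timeout = float(env.get("GMS_LLM_TIMEOUT_SECONDS", "60"))
    return model, timeout


# Passing a plain dict keeps the demo independent of the real environment.
example_env = {"GMS_LLM_MODEL": "anthropic/claude-opus-4-6"}
print(resolve_scorer_settings(example_env))
```

Graph-only runs need none of these variables, so a check like this only matters before an LLM-mode run.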

Included benchmark instances

Five synthetic financial-domain instances ship with the wheel:

Instance          Topic
basel_capital     Bank capital adequacy under Basel III
credit_portfolio  Credit risk portfolio analysis
fair_lending      Fair lending compliance testing
model_validation  Model validation report (largest)
stress_test       Stress testing scenarios

All synthetic — no real institution, person, or market event is depicted.

Public API

from knowlytix.benchmark import (
    Benchmark, BenchmarkResult,
    DocumentGraph, ENMEntry, ENMKey, PhaseEncoder,
    FinStructBenchSettings,
    GeneratedQuestion, ScoreResult, score_answer,
    default_generators, get_instance_path, ingest_markdown, list_instances,
)

GeneratedQuestion is a stable contract consumed by knowlytix.harness.testing.bridge — don't rename without coordinating (see CLAUDE.md §Coding standards).

Related packages

Package              Role
knowlytix-core       Geometric memory engine (required runtime dep)
knowlytix-knowledge  Document-graph ingest + query front-end
knowlytix-harness    DOE-driven testing + runtime governance (consumes GeneratedQuestion)

Links

  • Source: knowlytix/gms
  • Paper: FinStructBench: Benchmarking Structured Retrieval from Financial Documents

Built Distribution

knowlytix_benchmark-0.0.2-py3-none-any.whl (3.8 kB)

Hashes for knowlytix_benchmark-0.0.2-py3-none-any.whl

Algorithm    Hash digest
SHA256       d16c4fa111ccfd81fbecd0e90b83185ca2331116c77d2a8ccafefe0727eaf390
MD5          c59a0a8075645901579f756154668d83
BLAKE2b-256  3a250dad6fcd92a9f73b188dc4cbfacdf1cd83b7ae47cc085ef23586c19ad9b0
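If you download the wheel manually, you can check it against the SHA256 digest listed above before installing. The sketch below uses only the standard library; `sha256_of` and `verify_wheel` are hypothetical helper names, and the wheel path assumes the file sits in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the hash listing above.
EXPECTED_SHA256 = "d16c4fa111ccfd81fbecd0e90b83185ca2331116c77d2a8ccafefe0727eaf390"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 matches the expected digest."""
    return sha256_of(path) == expected.lower()


# Assumed location: the wheel was downloaded into the working directory.
wheel = Path("knowlytix_benchmark-0.0.2-py3-none-any.whl")
if wheel.exists():
    print("hash OK" if verify_wheel(wheel) else "hash MISMATCH")
```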
