Benchmark for structured information retrieval from financial documents using graph-verifiable questions
knowlytix-benchmark
Benchmark for Structured Retrieval from Financial Documents. Auto-generates questions from document-graph topology and scores LLM predictions against a provably correct graph-traversal baseline. (Internally referred to as the FinStructBench module, knowlytix.benchmark.*.)
knowlytix-benchmark is one of four packages in the Geometric Memory Systems
family. Use it to answer questions like "how close does my RAG pipeline get
to the graph-verified ground truth on this financial report?" Its metrics
resist gaming because the ground truth is computed by graph operations rather
than taken from a held-out, human-labeled test set.
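To make "ground truth from graph operations" concrete, here is a minimal sketch on a toy graph. The graph, the node names, and both helper functions are hypothetical and are not part of knowlytix; the only point is that the reference answer is computed by traversal instead of being read from a label file.

```python
from collections import deque

# Hypothetical toy document graph: each section maps to its child nodes.
# None of these names come from knowlytix; they only illustrate the idea.
EDGES = {
    "report": ["capital", "risk"],
    "capital": ["cet1", "tier2"],
    "risk": [],
}
# Leaf nodes carry the numeric values extracted from the document.
VALUES = {"cet1": 120.0, "tier2": 30.0}


def descendants(graph, root):
    """Breadth-first traversal returning every node reachable from root."""
    seen, queue, order = {root}, deque([root]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def ground_truth_total(graph, values, root):
    """Answer 'what is the total under <root>?' purely by traversal."""
    return sum(values.get(node, 0.0) for node in descendants(graph, root))


answer = ground_truth_total(EDGES, VALUES, "capital")
print(answer)
```

A prediction can then be compared against `answer` without any human labeling step, which is the property the paragraph above relies on.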
- Package: knowlytix-benchmark
- License: Apache-2.0
- Python: 3.12+
- Status: alpha (v0.x)
Install
pip install knowlytix-benchmark
Depends on knowlytix-core (pinned ~=0.1.0). LLM-mode
scoring routes through LiteLLM: set GMS_LLM_MODEL plus your
provider's API key, and any model that LiteLLM supports should work.
Quickstart — score a prediction set
import json
from importlib.resources import files

from knowlytix.benchmark import score_answer

# Smoke fixtures shipped with the wheel:
fixtures = files("knowlytix.benchmark.fixtures.smoke")
questions = json.loads((fixtures / "questions.json").read_text())
predictions = json.loads((fixtures / "predictions.json").read_text())

# Index predictions by question id so each question can look up its answer.
by_id = {p["id"]: p["answer"] for p in predictions["predictions"]}
for q in questions["questions"]:
    result = score_answer(by_id[q["id"]], q["ground_truth"])
    mark = "correct" if result.correct else "wrong"
    print(f"{q['id']}: {mark} partial={result.partial_score:.2f} ({result.detail})")
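The loop above prints one line per question. If you also want a single summary, a sketch like the following may help. It uses a stand-in `StubResult` class (hypothetical, so the example runs without the package installed) that mirrors only the two attributes the quickstart reads, `correct` and `partial_score`; `summarize` is likewise an illustrative helper, not part of the public API.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in exposing the two attributes the quickstart reads."""
    correct: bool
    partial_score: float


def summarize(results):
    """Return (accuracy, mean partial score) for a non-empty list of results."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    accuracy = sum(1 for r in results if r.correct) / n
    mean_partial = sum(r.partial_score for r in results) / n
    return accuracy, mean_partial


# Four made-up results, purely for illustration.
results = [
    StubResult(True, 1.0),
    StubResult(False, 0.5),
    StubResult(False, 0.0),
    StubResult(True, 1.0),
]
accuracy, mean_partial = summarize(results)
print(f"accuracy={accuracy:.3f} mean_partial={mean_partial:.3f}")
```

In real use you would collect the objects returned by score_answer in a list and pass that list to `summarize`, assuming they expose the same two attributes as in the quickstart.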
Quickstart — run the full benchmark
from knowlytix.benchmark import Benchmark, get_instance_path
bench = Benchmark(get_instance_path("model_validation"))
result = bench.run() # graph-only mode (no LLM, no API key needed)
bench.print_results(result)
To evaluate an LLM against the same ground-truth graph:
from knowlytix.benchmark.llm_caller import create_client
client = create_client() # reads GMS_LLM_MODEL_SCORER → GMS_LLM_MODEL
result = bench.run(llm_client=client)
bench.print_results(result)
CLI
benchmark --instance model_validation
benchmark --instance credit_portfolio --llm-model anthropic/claude-opus-4-6
Configuration
FINSTRUCTBENCH_* — scoring tolerances
| Variable | Default | Meaning |
|---|---|---|
| FINSTRUCTBENCH_FLOAT_TOL | 1e-6 | Absolute tolerance for float comparisons. |
| FINSTRUCTBENCH_CLOSE_THRESHOLD | 0.01 | Relative tolerance for "close enough" financial values. |
| FINSTRUCTBENCH_TUPLE_ELEMENT_TOL | 1e-3 | Tolerance per element inside tuple answers. |
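The difference between the three tolerances is easier to see in code. The sketch below is not the package's scorer: the function names are hypothetical, and it assumes the tuple tolerance is absolute, which the table does not state. Only the variable names and defaults are taken from the table.

```python
import math
import os


def load_tolerances(env=None):
    """Read the three tolerances, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "float_tol": float(env.get("FINSTRUCTBENCH_FLOAT_TOL", "1e-6")),
        "close_threshold": float(env.get("FINSTRUCTBENCH_CLOSE_THRESHOLD", "0.01")),
        "tuple_element_tol": float(env.get("FINSTRUCTBENCH_TUPLE_ELEMENT_TOL", "1e-3")),
    }


def is_exact(pred, truth, tol):
    """Absolute comparison: |pred - truth| <= float_tol."""
    return math.isclose(pred, truth, rel_tol=0.0, abs_tol=tol["float_tol"])


def is_close(pred, truth, tol):
    """Relative comparison: difference within close_threshold of the larger magnitude."""
    return math.isclose(pred, truth, rel_tol=tol["close_threshold"], abs_tol=0.0)


def tuple_matches(pred, truth, tol):
    """Element-wise comparison; ASSUMES an absolute per-element tolerance."""
    return len(pred) == len(truth) and all(
        math.isclose(p, t, rel_tol=0.0, abs_tol=tol["tuple_element_tol"])
        for p, t in zip(pred, truth)
    )


tol = load_tolerances(env={})  # empty mapping, so the defaults apply
print(is_exact(1.0, 1.0000005, tol), is_close(100.0, 100.5, tol))
```

With the defaults, a prediction of 100.5 against a truth of 100.0 fails the absolute check but passes the relative one, which is the practical reason for keeping a separate "close enough" threshold for financial values.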
GMS_LLM_* — LLM routing (only needed for LLM-mode scoring)
| Variable | Meaning |
|---|---|
| GMS_LLM_MODEL | Base LiteLLM model string. |
| GMS_LLM_MODEL_SCORER | Override for scoring calls (recommended). |
| GMS_LLM_TIMEOUT_SECONDS | Per-call timeout. Default 60. |
See .env.example in the source repo for the full provider key reference.
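The fallback order for the scorer model (GMS_LLM_MODEL_SCORER first, then GMS_LLM_MODEL) can be sketched as below. This illustrates the lookup order only and is not the code inside create_client; `resolve_scorer_model` and `resolve_timeout` are hypothetical names, and `provider/base-model` is a placeholder string.

```python
import os


def resolve_scorer_model(env=None):
    """Return the first non-empty value of GMS_LLM_MODEL_SCORER, then GMS_LLM_MODEL."""
    env = os.environ if env is None else env
    for name in ("GMS_LLM_MODEL_SCORER", "GMS_LLM_MODEL"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("set GMS_LLM_MODEL_SCORER or GMS_LLM_MODEL")


def resolve_timeout(env=None):
    """Return the per-call timeout in seconds, using the documented default of 60."""
    env = os.environ if env is None else env
    return float(env.get("GMS_LLM_TIMEOUT_SECONDS", "60"))


# Example environment: the scorer override wins over the base model.
example_env = {
    "GMS_LLM_MODEL": "provider/base-model",  # placeholder model string
    "GMS_LLM_MODEL_SCORER": "anthropic/claude-opus-4-6",
}
print(resolve_scorer_model(example_env), resolve_timeout(example_env))
```

Passing a plain dict instead of reading os.environ directly keeps the lookup easy to test.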
Included benchmark instances
Five synthetic financial-domain instances ship with the wheel:
| Instance | Topic |
|---|---|
| basel_capital | Bank capital adequacy under Basel III |
| credit_portfolio | Credit risk portfolio analysis |
| fair_lending | Fair lending compliance testing |
| model_validation | Model validation report (largest) |
| stress_test | Stress testing scenarios |
All synthetic — no real institution, person, or market event is depicted.
Public API
from knowlytix.benchmark import (
    Benchmark, BenchmarkResult,
    DocumentGraph, ENMEntry, ENMKey, PhaseEncoder,
    FinStructBenchSettings,
    GeneratedQuestion, ScoreResult, score_answer,
    default_generators, get_instance_path, ingest_markdown, list_instances,
)
GeneratedQuestion is a stable contract consumed by knowlytix.harness.testing.bridge.
Do not rename it without coordinating first (see CLAUDE.md §Coding standards).
Related packages
| Package | Role |
|---|---|
| knowlytix-core | Geometric memory engine (required runtime dep) |
| knowlytix-knowledge | Document-graph ingest + query front-end |
| knowlytix-harness | DOE-driven testing + runtime governance (consumes GeneratedQuestion) |
Links
- Source: knowlytix/gms
- Paper: FinStructBench: Benchmarking Structured Retrieval from Financial Documents