A benchmark and CLI measuring whether analytics agents are business-correct, not merely execution-correct.
LedgerBench
LedgerBench measures whether analytics agents are business-correct, not merely execution-correct.
An AI analyst can write SQL that runs cleanly and returns a confident number that is business-wrong: the wrong metric definition, silent double-counting from a fan-out join, answering an ambiguous question instead of clarifying, answering an unanswerable question instead of refusing, or explaining assumptions that do not match the SQL it actually ran. Existing benchmarks (Spider, BIRD) score execution accuracy, which is saturating and no longer discriminates. LedgerBench scores the gap between "the query ran fine" and "the answer was right" across five axes — and ships the chart that shows it.
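To make the fan-out failure concrete, here is a minimal sketch using Python's built-in `sqlite3`. The `orders` and `order_items` tables are hypothetical and are not part of LedgerBench's bundled worlds. Both queries run without error, but the second joins order-grain amounts to item-grain rows and counts the two-item order twice.

```python
import sqlite3

# Hypothetical toy schema: one row per order, one row per order item.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (10, 1, 'A'), (11, 1, 'B'), (12, 2, 'C');
    """
)

# Correct: revenue summed at the grain where `amount` is declared (one row per order).
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Fan-out: the join repeats order 1 once per item, so its amount is summed twice.
fanned_out_revenue = conn.execute(
    """
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN order_items AS i ON i.order_id = o.order_id
    """
).fetchone()[0]

print(true_revenue)        # 150.0
print(fanned_out_revenue)  # 250.0: executes cleanly, but is business-wrong
```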
Five scoring axes
- Definitional correctness — numeric reconciliation to gold within tolerance.
- Grain safety — static analysis of the agent's SQL against declared grains; catches fan-out double-counting.
- Ambiguity handling — the agent must clarify when the question is underspecified.
- Refusal correctness — the agent must refuse when the question is unanswerable, naming what is missing.
- Explanation faithfulness — stated assumptions must match the executed SQL.
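As an illustration of the first axis, numeric reconciliation within a tolerance can be sketched as below. The function name `reconciles` and the default tolerances are assumptions chosen for the example; the README does not state the tolerance LedgerBench actually uses.

```python
import math


def reconciles(agent_value: float, gold_value: float,
               rel_tol: float = 1e-6, abs_tol: float = 0.005) -> bool:
    """Return True when the agent's number matches gold within tolerance.

    Hypothetical helper: the tolerances here are illustrative defaults,
    not LedgerBench's published settings.
    """
    return math.isclose(agent_value, gold_value, rel_tol=rel_tol, abs_tol=abs_tol)


print(reconciles(150.0, 150.0))    # True: exact match
print(reconciles(150.004, 150.0))  # True: within the half-cent absolute tolerance
print(reconciles(250.0, 150.0))    # False: a fan-out inflated total does not reconcile
```

An absolute tolerance absorbs rounding in currency-like values, while the relative tolerance keeps the check meaningful for very large totals.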
Two modes, one engine
- Demo / benchmark — a bundled deterministic fake company where every true answer is known by construction. The public benchmark.
- BYO — point the engine at a real dbt project, auto-generate the adversarial suite from your declared semantics, compute gold read-only, and grade your agent.
The finding
Every agent tested executes flawlessly; none is reliably business-correct — and the business rulebook helps without coming close to closing the gap (committed manifests):
| agent | ran fine | business-correct (closed book) | business-correct (open book) |
|---|---|---|---|
| naive floor | 100% | 9.3% | 9.3% |
| claude-haiku-4-5 ¹ | 100% | 38.0% | 44.0% |
| gpt-4o-mini | 100% | 42.0% | 59.3% |
The open-book residual is the argument for verification beyond documentation: with the rulebook in hand, at least two in five answers from both model agents are still wrong (40.7% for gpt-4o-mini, 56.0% for claude-haiku-4-5), on queries that all ran cleanly.
¹ Single seed (credit-constrained); see the report for the contract-binding analysis of haiku's open-book malformed cluster.
Leaderboard: https://kartikeyamandhar.github.io/ledgerbench/ · Technical report: docs/report.md
Status
v1.0.0 — all eight phases complete: deterministic worlds, frozen contracts, the golden-tested five-axis scorer, the fail-closed grain checker (TPR 1.000 / FPR 0.000 on its published corpus), the SELECT-only sandboxed runner with kill-tests, the 150-item bank with recipe-derived gold, the five-minute demo, BYO/dbt mode (guide), and release packaging.
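One way a SELECT-only guard can be built is with SQLite's authorizer hook, sketched here with Python's `sqlite3`. This is an illustration of the idea only and makes no claim about how LedgerBench's sandboxed runner is implemented; the helper names `open_select_only` and `try_query` and the toy `orders` table are hypothetical.

```python
import sqlite3

# Only plain reads are allowed. Simplification for this sketch: aggregate and
# scalar functions use a separate action code and would also be denied here.
_ALLOWED_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}


def _select_only(action, arg1, arg2, db_name, source):
    """Authorizer callback: deny every action that is not a plain read."""
    return sqlite3.SQLITE_OK if action in _ALLOWED_ACTIONS else sqlite3.SQLITE_DENY


def open_select_only(seed_sql: str) -> sqlite3.Connection:
    """Seed an in-memory database, then lock it down to SELECT statements."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(seed_sql)        # setup happens before the guard is installed
    conn.set_authorizer(_select_only)   # from here on, writes and DDL are rejected
    return conn


def try_query(conn: sqlite3.Connection, sql: str):
    """Return rows for an allowed query, or None when the guard rejects it."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        return None


conn = open_select_only(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);"
    "INSERT INTO orders VALUES (1, 100.0), (2, 50.0);"
)
rows = try_query(conn, "SELECT amount FROM orders ORDER BY order_id")
blocked_delete = try_query(conn, "DELETE FROM orders")
blocked_drop = try_query(conn, "DROP TABLE orders")
rows_after = try_query(conn, "SELECT amount FROM orders ORDER BY order_id")
```

The last four lines are kill-tests in miniature: the destructive statements must be rejected, and the data must be unchanged afterwards.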
Quickstart
From a checkout:
git clone https://github.com/kartikeyamandhar/ledgerbench && cd ledgerbench
python3.11 -m venv agentic_flow && source agentic_flow/bin/activate
pip install -e .
ledgerbench demo # ~35s: builds both worlds, runs the offline baseline, opens the report
No API keys, no network. The demo runs the deterministic naive baseline over all 150 items and renders the headline finding: on our machine, 100% of its queries ran fine and 9% of its answers were business-correct. That gap is the benchmark's point.
Other commands:
- ledgerbench run -c ledgerbench.yaml: config-driven; exits with code 1 on an axis-threshold breach (the CI gate).
- ledgerbench report: re-renders and re-scores from traces, with no model calls.
- ledgerbench validate: lints the item bank and recomputes gold.
- ledgerbench world build.
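The CI-gate behaviour described for `ledgerbench run` (exit code 1 on an axis-threshold breach) can be pictured with the sketch below. It is not LedgerBench's code: the function `gate_exit_code`, the axis keys, and the shape of the score and threshold mappings are all hypothetical, and treating a missing axis as a breach is an assumption made here to keep the gate fail-closed.

```python
def gate_exit_code(scores: dict, thresholds: dict) -> int:
    """Return 1 if any axis scores below its minimum, else 0.

    Hypothetical sketch: an axis with a threshold but no score counts
    as a breach (fail-closed).
    """
    breaches = {
        axis: (scores.get(axis), minimum)
        for axis, minimum in thresholds.items()
        if scores.get(axis) is None or scores[axis] < minimum
    }
    for axis, (score, minimum) in sorted(breaches.items()):
        print(f"breach: {axis} scored {score}, minimum is {minimum}")
    return 1 if breaches else 0


# Illustrative numbers only; these are not LedgerBench results.
thresholds = {"definitional": 0.80, "grain_safety": 0.95}
passing = gate_exit_code({"definitional": 0.85, "grain_safety": 0.97}, thresholds)
failing = gate_exit_code({"definitional": 0.85, "grain_safety": 0.60}, thresholds)
missing = gate_exit_code({"definitional": 0.85}, thresholds)
```

In a CI job, a wrapper script would pass the returned value to `sys.exit` so that a breach fails the pipeline.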
Develop
python3.11 -m venv agentic_flow
source agentic_flow/bin/activate
pip install -e ".[dev]"
pre-commit install
make check # format check + lint + type + tests with coverage gate
License
Apache-2.0. Copyright © 2026 Kartikeya Mandhar. See LICENSE.