A benchmark and CLI measuring whether analytics agents are business-correct, not merely execution-correct.

Maintainers

kart0511

LedgerBench

LedgerBench measures whether analytics agents are business-correct, not merely execution-correct.

An AI analyst can write SQL that runs cleanly and returns a confident number that is business-wrong: the wrong metric definition, silent double-counting from a fan-out join, answering an ambiguous question instead of clarifying, answering an unanswerable question instead of refusing, or explaining assumptions that do not match the SQL it actually ran. Existing benchmarks (Spider, BIRD) score execution accuracy, which is saturating and no longer discriminates. LedgerBench scores the gap between "the query ran fine" and "the answer was right" across five axes — and ships the chart that shows it.

Five scoring axes

Definitional correctness — numeric reconciliation to gold within tolerance.
Grain safety — static analysis of the agent's SQL against declared grains; catches fan-out double-counting.
Ambiguity handling — the agent must clarify when the question is underspecified.
Refusal correctness — the agent must refuse when the question is unanswerable, naming what is missing.
Explanation faithfulness — stated assumptions must match the executed SQL.
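The first two axes can be illustrated with a toy case. This is a minimal sketch, not LedgerBench's scorer or grain checker (which the list above describes as static analysis of the SQL): it runs a fan-out join against an in-memory SQLite database to show how a query can execute cleanly yet fail numeric reconciliation. The schema, the `reconciles` helper, and the 0.1% relative tolerance are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: orders has one row per order; order_items has many rows per order.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
    """
)

# Gold answer: revenue summed at the order grain.
gold = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Fan-out: joining down to the item grain repeats each order's amount once per item.
fanned = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN order_items i ON i.order_id = o.order_id"
).fetchone()[0]


def reconciles(value, gold_value, rel_tol=0.001):
    """Illustrative tolerance check: True if value is within rel_tol of gold_value."""
    return abs(value - gold_value) <= rel_tol * abs(gold_value)


print(gold, fanned, reconciles(fanned, gold))
```

Both queries run without error; only the comparison against gold reveals that the joined version counts order 1 twice.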

Two modes, one engine

Demo / benchmark — a bundled deterministic fake company where every true answer is known by construction. The public benchmark.
BYO — point the engine at a real dbt project, auto-generate the adversarial suite from your declared semantics, compute gold read-only, and grade your agent.

The finding

Every agent tested executes flawlessly; none is reliably business-correct — and the business rulebook helps without coming close to closing the gap (committed manifests):

agent	ran fine	business-correct (closed book)	business-correct (open book)
naive floor	100%	9.3%	9.3%
claude-haiku-4-5 ¹	100%	38.0%	44.0%
gpt-4o-mini	100%	42.0%	59.3%
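The gap in the table above can be recomputed directly. The sketch below copies the published percentages and derives two quantities per agent: the lift from giving the agent the rulebook, and the share of answers that ran fine but were still wrong with the rulebook in hand. The dictionary keys and the `summarize` helper are names chosen for this example, not part of LedgerBench.

```python
# Percentages copied from the table above.
results = {
    "naive floor": {"ran_fine": 100.0, "closed_book": 9.3, "open_book": 9.3},
    "claude-haiku-4-5": {"ran_fine": 100.0, "closed_book": 38.0, "open_book": 44.0},
    "gpt-4o-mini": {"ran_fine": 100.0, "closed_book": 42.0, "open_book": 59.3},
}


def summarize(row):
    """Derive the rulebook lift and the open-book residual, in percentage points."""
    return {
        "rulebook_lift": round(row["open_book"] - row["closed_book"], 1),
        "open_book_residual": round(row["ran_fine"] - row["open_book"], 1),
    }


for agent, row in results.items():
    print(agent, summarize(row))
```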

The open-book residual is the argument for verification beyond documentation: even the best open-book result leaves two in five answers wrong with the rulebook in hand, on queries that all ran cleanly.

¹ Single seed (credit-constrained); see the report for the contract-binding analysis of haiku's open-book malformed cluster.

Leaderboard: https://kartikeyamandhar.github.io/ledgerbench/
Technical report: docs/report.md

Status

v1.0.0 — all eight phases complete: deterministic worlds, frozen contracts, the golden-tested five-axis scorer, the fail-closed grain checker (TPR 1.000 / FPR 0.000 on its published corpus), the SELECT-only sandboxed runner with kill-tests, the 150-item bank with recipe-derived gold, the five-minute demo, BYO/dbt mode (guide), and release packaging.

Quickstart

From a checkout (the package is also published on PyPI as ledgerbench):

git clone https://github.com/kartikeyamandhar/ledgerbench && cd ledgerbench
python3.11 -m venv agentic_flow && source agentic_flow/bin/activate
pip install -e .
ledgerbench demo          # ~35s: builds both worlds, runs the offline baseline, opens the report

No API keys, no network. The demo runs the deterministic naive baseline over all 150 items and renders the headline finding: on our machine, 100% of its queries ran fine and 9.3% of its answers were business-correct. That gap is the benchmark's point.

Other commands:

ledgerbench run -c ledgerbench.yaml: config-driven; exits with code 1 on an axis-threshold breach (the CI gate).
ledgerbench report: re-renders and re-scores from traces, with no model calls.
ledgerbench validate: lints the item bank and recomputes gold.
ledgerbench world build.
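The CI-gate behavior of ledgerbench run (exit code 1 on an axis-threshold breach) follows a common pattern that can be sketched in a few lines. This is an illustration of the idea only: the `gate` function, the axis keys, and the threshold values are hypothetical, and the actual schema of ledgerbench.yaml is not shown here.

```python
import sys


def gate(scores, thresholds):
    """Return 1 if any axis score is below its threshold, else 0 (hypothetical helper)."""
    breaches = {
        axis: (scores.get(axis, 0.0), minimum)
        for axis, minimum in thresholds.items()
        if scores.get(axis, 0.0) < minimum  # a missing axis counts as a breach
    }
    for axis, (score, minimum) in sorted(breaches.items()):
        print(f"FAIL {axis}: {score:.3f} < {minimum:.3f}", file=sys.stderr)
    return 1 if breaches else 0


# Hypothetical axis names and values, for illustration only.
example_scores = {"grain_safety": 0.95, "refusal_correctness": 0.40}
example_thresholds = {"grain_safety": 0.90, "refusal_correctness": 0.80}
exit_code = gate(example_scores, example_thresholds)
# In a CI script one would finish with: sys.exit(exit_code)
```

Returning the code instead of calling sys.exit inside the function keeps the check easy to unit-test; the caller decides when to terminate the process.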

Develop

python3.11 -m venv agentic_flow
source agentic_flow/bin/activate
pip install -e ".[dev]"
pre-commit install
make check                # format check + lint + type + tests with coverage gate

License

Release history

This version

1.1.0

Jun 12, 2026

Download files

Source Distribution

ledgerbench-1.1.0.tar.gz (76.8 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

ledgerbench-1.1.0-py3-none-any.whl (92.7 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file ledgerbench-1.1.0.tar.gz.

File metadata

Download URL: ledgerbench-1.1.0.tar.gz
Upload date: Jun 12, 2026
Size: 76.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ledgerbench-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a7244b9b18e50027965428b596fe6f274b910f12b028124a4cd5f2ae63180d5f`
MD5	`0c763a3c8deafc39f7722013b03d9d59`
BLAKE2b-256	`2e86e89b4f3c4ca928cc12d61881f51c0356eab2cac2001decfd6807239aa510`
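A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. The sketch below is one way to do it; the `sha256_of` and `matches` helpers are names chosen for this example, and the commented usage line assumes the sdist was saved to the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest of ledgerbench-1.1.0.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = "a7244b9b18e50027965428b596fe6f274b910f12b028124a4cd5f2ae63180d5f"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the file is in the current directory:
# matches("ledgerbench-1.1.0.tar.gz", EXPECTED_SDIST_SHA256)
```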

Provenance

The following attestation bundles were made for ledgerbench-1.1.0.tar.gz:

Publisher: release.yml on kartikeyamandhar/ledgerbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ledgerbench-1.1.0.tar.gz
- Subject digest: a7244b9b18e50027965428b596fe6f274b910f12b028124a4cd5f2ae63180d5f
- Sigstore transparency entry: 1806405312
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: kartikeyamandhar/ledgerbench@71524ed4ddd1bcd4b657393206b581eaee81fe6a
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/kartikeyamandhar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@71524ed4ddd1bcd4b657393206b581eaee81fe6a
- Trigger Event: push

File details

Details for the file ledgerbench-1.1.0-py3-none-any.whl.

File metadata

Download URL: ledgerbench-1.1.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 92.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ledgerbench-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`819103ca774221746bb9ce67f89150c2850e4dca87cbd9755a140b7a88fb11e7`
MD5	`0e4d3e3d59259fd92ef19824c244d336`
BLAKE2b-256	`e3d0d554864d2f2a88647161c481f4378018aa149759c53a166c29894db151dc`

Provenance

The following attestation bundles were made for ledgerbench-1.1.0-py3-none-any.whl:

Publisher: release.yml on kartikeyamandhar/ledgerbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ledgerbench-1.1.0-py3-none-any.whl
- Subject digest: 819103ca774221746bb9ce67f89150c2850e4dca87cbd9755a140b7a88fb11e7
- Sigstore transparency entry: 1806405380
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: kartikeyamandhar/ledgerbench@71524ed4ddd1bcd4b657393206b581eaee81fe6a
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/kartikeyamandhar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@71524ed4ddd1bcd4b657393206b581eaee81fe6a
- Trigger Event: push
