
A benchmark and CLI measuring whether analytics agents are business-correct, not merely execution-correct.


LedgerBench


LedgerBench measures whether analytics agents are business-correct, not merely execution-correct.

An AI analyst can write SQL that runs cleanly and returns a confident number that is business-wrong: the wrong metric definition, silent double-counting from a fan-out join, answering an ambiguous question instead of clarifying, answering an unanswerable question instead of refusing, or explaining assumptions that do not match the SQL it actually ran. Existing benchmarks (Spider, BIRD) score execution accuracy, which is saturating and no longer discriminates. LedgerBench scores the gap between "the query ran fine" and "the answer was right" across five axes — and ships the chart that shows it.
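
To make the fan-out failure concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders and order_items tables and their values are hypothetical illustrations, not part of the LedgerBench worlds. Both queries run cleanly; only one returns the right number.

```python
import sqlite3

# Hypothetical two-table schema: one row per order, many items per order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 1, 'C'), (4, 2, 'A');
""")

# Correct: revenue summed at the order grain.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Fan-out: the join repeats each order row once per item, so the
# order-level amount is counted once per line item.
fanned_revenue = conn.execute(
    "SELECT SUM(o.amount) "
    "FROM orders o JOIN order_items i ON i.order_id = o.order_id"
).fetchone()[0]

print(true_revenue)    # 150.0
print(fanned_revenue)  # 350.0
```

Order 1 has three items, so its 100.0 is counted three times. The second query is execution-correct and business-wrong.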

Five scoring axes

  1. Definitional correctness — numeric reconciliation to gold within tolerance.
  2. Grain safety — static analysis of the agent's SQL against declared grains; catches fan-out double-counting.
  3. Ambiguity handling — the agent must clarify when the question is underspecified.
  4. Refusal correctness — the agent must refuse when the question is unanswerable, naming what is missing.
  5. Explanation faithfulness — stated assumptions must match the executed SQL.
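
As an illustration of the first axis, numeric reconciliation within tolerance might look like the sketch below. The function name reconciles and the tolerance defaults are assumptions made for this example; the actual scorer's tolerances and interface are defined by the project, not here.

```python
import math


def reconciles(agent_value: float, gold_value: float,
               rel_tol: float = 1e-6, abs_tol: float = 0.005) -> bool:
    """Return True if the agent's number matches gold within tolerance.

    Hypothetical helper: both tolerance defaults are illustrative only.
    """
    return math.isclose(agent_value, gold_value, rel_tol=rel_tol, abs_tol=abs_tol)


print(reconciles(150.004, 150.0))  # True: within the absolute tolerance
print(reconciles(350.0, 150.0))    # False: a fan-out-sized miss
```

A tolerance check of this kind absorbs rounding noise while still failing answers that are wrong by definition or by grain.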

Two modes, one engine

  • Demo / benchmark — a bundled deterministic fake company where every true answer is known by construction. The public benchmark.
  • BYO — point the engine at a real dbt project, auto-generate the adversarial suite from your declared semantics, compute gold read-only, and grade your agent.

The finding

Every agent tested executes flawlessly; none is reliably business-correct — and the business rulebook helps without coming close to closing the gap (committed manifests):

agent                 ran fine   business-correct (closed book)   business-correct (open book)
naive floor           100%       9.3%                             9.3%
claude-haiku-4-5 ¹    100%       38.0%                            44.0%
gpt-4o-mini           100%       42.0%                            59.3%
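
The gap in the table can be recomputed with a few lines of arithmetic. The sketch below copies the published percentages by hand; the names residual and uplift are labels chosen for this example, not terms from the report.

```python
# Percentages copied from the table: (ran fine, closed book, open book).
results = {
    "naive floor": (100.0, 9.3, 9.3),
    "claude-haiku-4-5": (100.0, 38.0, 44.0),
    "gpt-4o-mini": (100.0, 42.0, 59.3),
}

residual = {}  # share of cleanly executed answers still wrong with the rulebook
uplift = {}    # points gained by giving the agent the rulebook

for agent, (ran_fine, closed_book, open_book) in results.items():
    residual[agent] = round(ran_fine - open_book, 1)
    uplift[agent] = round(open_book - closed_book, 1)
    print(f"{agent}: residual {residual[agent]} pts, uplift {uplift[agent]} pts")
```

For gpt-4o-mini this gives a residual of 40.7 points and an uplift of 17.3 points; for claude-haiku-4-5, 56.0 and 6.0.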

The open-book residual is the argument for verification beyond documentation: with the rulebook in hand, at least two in five answers are still wrong (40.7% for gpt-4o-mini, 56.0% for claude-haiku-4-5), on queries that all ran cleanly.

¹ Single seed (credit-constrained); see the report for the contract-binding analysis of haiku's open-book malformed cluster.

Leaderboard: https://kartikeyamandhar.github.io/ledgerbench/ · Technical report: docs/report.md

Status

v1.0.0 — all eight phases complete: deterministic worlds, frozen contracts, the golden-tested five-axis scorer, the fail-closed grain checker (TPR 1.000 / FPR 0.000 on its published corpus), the SELECT-only sandboxed runner with kill-tests, the 150-item bank with recipe-derived gold, the five-minute demo, BYO/dbt mode (guide), and release packaging.

Quickstart

From a checkout:

git clone https://github.com/kartikeyamandhar/ledgerbench && cd ledgerbench
python3.11 -m venv agentic_flow && source agentic_flow/bin/activate
pip install -e .
ledgerbench demo          # ~35s: builds both worlds, runs the offline baseline, opens the report

No API keys, no network. The demo runs the deterministic naive baseline over all 150 items and renders the headline finding: on our machine, 100% of its queries ran fine and 9% of its answers were business-correct. That gap is the benchmark's point.

Other commands: ledgerbench run -c ledgerbench.yaml (config-driven, exit code 1 on axis-threshold breach — the CI gate), ledgerbench report (re-render/re-score from traces, no model calls), ledgerbench validate (lint the item bank, recompute gold), ledgerbench world build.
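
Because ledgerbench run signals an axis-threshold breach through its exit code, a CI step can gate on it from Python with the standard subprocess module. This is a minimal sketch: gate_passed is a hypothetical helper, and the two stand-in commands only simulate exit codes 0 and 1 so the example runs without LedgerBench installed.

```python
import subprocess
import sys


def gate_passed(cmd: list[str]) -> bool:
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(cmd)
    return completed.returncode == 0


# Stand-ins that simulate a passing run and a breaching run.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
breaching = [sys.executable, "-c", "raise SystemExit(1)"]

print(gate_passed(passing))    # True
print(gate_passed(breaching))  # False
```

In a real pipeline the command would be ["ledgerbench", "run", "-c", "ledgerbench.yaml"], and a False result would fail the build.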

Develop

python3.11 -m venv agentic_flow
source agentic_flow/bin/activate
pip install -e ".[dev]"
pre-commit install
make check                # format check + lint + type + tests with coverage gate

License

Apache-2.0. Copyright © 2026 Kartikeya Mandhar. See LICENSE.
