Evidence-preserving LLM agent benchmark harness with a live mobile-first SSH TUI.

BenchDeck

Evidence-preserving LLM-agent benchmark harness with a live terminal dashboard built for narrow SSH sessions — including Termius on iPhone.

BenchDeck turns one or two Markdown agent files into a benchmark plan, runs each case in isolation with one clarification turn, judges responses on a fixed 0–4 scale, and writes atomically checkpointed artifacts you can watch in real time.

Screenshots

Overview — progress bar, rating distribution, per-family scores, policy blocks, token usage

Case list — per-agent ratings, blocked cases, pending items, status marks

Case detail — purpose, judgment, gate check, agent output

Help — phone-keyboard-friendly controls

Captured from a live benchmark run (gpt-4o-mini, 8 cases, repository-integrity-agent). Regenerate with scripts/generate_demo_screens.py --run-dir benchmark_out/<run_id>.

Benchmark Results

A live benchmark of the included repository-integrity-agent against gpt-4o-mini:

Metric	Value
Cases planned	8
Cases judged	8
Excellent (4)	2
Strong (3)	1
Weak (1)	1
Fail (0)	4
Gate failures	4
Total tokens	37,463
API requests	32
Wall-clock time	~2 min 20 s
Status	`completed_with_failures`

Run: benchdeck run --agent-a examples/repository-integrity-agent.md --model gpt-4o-mini --judge-model gpt-4o-mini -o benchmark_out
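
As a quick sanity check on the table above, the rating counts can be tallied in a few lines of Python. This is an illustrative sketch only: the `ratings` mapping is transcribed by hand from the table, Acceptable (2) is assumed to be zero because the table does not list it, and the mean score is plain arithmetic rather than a metric BenchDeck is stated to report.

```python
# Rating counts transcribed from the results table (score -> number of cases).
# Acceptable (2) does not appear in the table, so zero is an assumption here.
ratings = {4: 2, 3: 1, 2: 0, 1: 1, 0: 4}

cases_judged = sum(ratings.values())
total_points = sum(score * count for score, count in ratings.items())
mean_score = total_points / cases_judged

print(f"cases judged: {cases_judged}")
print(f"mean score:   {mean_score:.2f} / 4")
```

The counts add up to the 8 judged cases, and the four Fail (0) ratings match the four gate failures reported in the table.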

Why BenchDeck

Benchmarks are prone to silent ambiguity. BenchDeck makes state explicit:

Ambiguous situation	BenchDeck handling
Empty model response	Retried up to 3x; recorded with response ID, status, and raw payload
Policy-blocked response	Logged as a policy block — not an agent failure
Infrastructure failure	Recorded separately from agent failures
Inconsistent scoring scale	Fixed 0–4 scale (Fail, Weak, Acceptable, Strong, Excellent)
Judge transcript duplicates candidate output	Stored in separate fields; never commingled
Half-written checkpoint crash	Atomic file replacement — the TUI never reads a partial write
Run status vs. real coverage	Reported as `inconclusive`, `completed_with_failures`, `infrastructure_failed`, or `aborted` when not all cases are judged
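
The atomic-replacement row can be illustrated with a minimal sketch. It assumes the usual pattern of writing to a temporary file in the same directory and then calling `os.replace`; the function name `write_json_atomic` is hypothetical, and the actual code in storage.py may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write payload so a reader sees the old file or the new one, never a mix."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic rename over the previous checkpoint
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens the checkpoint at any moment gets a complete JSON document, because the rename either has or has not happened. This protects readers only; it does not coordinate two writers targeting the same file.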

Quick Start

Prerequisites: Python 3.11+, an OpenAI API key

python -m venv .venv && source .venv/bin/activate
pip install -e .                    # user install (pip install -e '.[dev]' for development)
export OPENAI_API_KEY='sk-...'      # required — the run command checks this

Run a benchmark:

benchdeck run \
  --agent-a examples/repository-integrity-agent.md \
  --model gpt-4o-mini \
  --judge-model gpt-4o-mini \
  --output-dir benchmark_out

Watch it live (second SSH session):

benchdeck tui benchmark_out

Inspect the results:

benchdeck inspect benchmark_out

TUI Controls

The TUI targets 32-column terminals. Arrow keys and letter keys both work — no mouse or modifier chords needed:

Key	Action
`1` `2` `3` `4`	Open overview, cases, detail, or help screen
`h` / `l` or `←` / `→`	Previous / next screen
`j` / `k` or `↓` / `↑`	Move selection or scroll
`Enter`	Open selected case
`e`	Export case as Markdown
`n`	Launch a new benchmark run (subprocess)
`x`	Cancel running benchmark (press twice to confirm)
`r`	Reload artifacts
`q` / `Esc`	Quit

Recommended Termius settings: UTF-8, monospace font, extra keyboard row with Escape and arrow keys.

CLI Reference

Global flags

benchdeck [--config <file.toml>] [--log-level DEBUG|INFO|WARNING|ERROR|CRITICAL] [--log-file <path>] {run,tui,inspect}

Flag	Description
`--config`	Path to a TOML configuration file. Configuration is layered from `~/.config/benchdeck/config.toml`, then `./benchdeck.toml`, then the file given with this flag
`--log-level`	Logging level (default: `WARNING`)
`--log-file`	Write JSON-structured logs to a file

`benchdeck run`

benchdeck run \
  --agent-a <agent.md>              # required: first agent Markdown file
  --agent-b <agent.md>              # optional: second agent for comparison mode
  --model gpt-4o-mini               # model for agent (default: gpt-4o-mini)
  --planner-model gpt-4o-mini       # model for plan generation (defaults to --model)
  --judge-model gpt-4o-mini         # model for judge (default: gpt-4o-mini)
  --plan benchmark_plan.json        # optional: use a frozen plan instead of generating one
  --output-dir benchmark_out        # output directory for artifacts (short: -o)
  --timeout 90                      # API timeout in seconds (default: 90)
  --max-retries 3                   # max retry attempts per call (default: 3)
  --judges 1                        # number of independent judge calls per case (default: 1)
  --capture-level full              # response capture detail: minimal, standard, or full
  --resume <run_dir>                # resume an interrupted run from the given directory
  --overwrite                       # overwrite if a prior run exists at the exact output path
  --max-output-tokens-planner N     # budget: max output tokens for the planner
  --max-output-tokens-agent N       # budget: max output tokens for the agent
  --max-output-tokens-judge N       # budget: max output tokens for the judge
  --max-logical-requests N          # budget: max logical (API) requests
  --max-http-attempts N             # budget: max HTTP attempts (incl. retries)
  --max-total-input-tokens N        # budget: max total input tokens
  --max-total-output-tokens N       # budget: max total output tokens

`benchdeck tui`

benchdeck tui benchmark_out                     # watch a live run
benchdeck tui fixtures/original_run.zip          # open the bundled run

`benchdeck inspect`

benchdeck inspect fixtures/original_run.zip

The inspector detects incomplete coverage, empty outputs, duplicated judge transcripts, undeclared scoring scales, and misleading run status. It also validates per-agent tallies against schemas/summary_tally.schema.json.
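
A few of these checks can be sketched over plain dictionaries. The record layout below (`case_id`, `agent_output`, `judge_transcript`, `rating`) is hypothetical and chosen only for illustration; the real artifact schema and the logic in inspect.py are not shown in this document.

```python
def find_problems(planned_ids: list[str], cases: list[dict]) -> list[str]:
    """Return human-readable findings for a list of hypothetical case records."""
    findings: list[str] = []

    # Coverage: every planned case should have a rating.
    judged = {c["case_id"] for c in cases if c.get("rating") is not None}
    for case_id in planned_ids:
        if case_id not in judged:
            findings.append(f"{case_id}: planned but not judged")

    for case in cases:
        output = (case.get("agent_output") or "").strip()
        transcript = (case.get("judge_transcript") or "").strip()
        if not output:
            findings.append(f"{case['case_id']}: empty agent output")
        elif transcript == output:
            findings.append(f"{case['case_id']}: judge transcript duplicates agent output")
        rating = case.get("rating")
        if rating is not None and rating not in range(5):
            findings.append(f"{case['case_id']}: rating {rating!r} outside 0-4 scale")
    return findings
```

Returning a list of findings instead of raising on the first problem lets a single pass report every issue in a run.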

Using a frozen plan

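python - <<'PY'
import json
from pathlib import Path

from benchdeck.loader import load_snapshot

# Load the bundled run and take its benchmark plan.
plan = load_snapshot(Path('fixtures/original_run.zip')).plan
# Save the plan as pretty-printed JSON so later runs can reuse it unchanged.
Path('/tmp/benchmark_plan.json').write_text(json.dumps(plan, indent=2) + '\n')
PY
# Run against the frozen plan instead of generating a new one.
benchdeck run --agent-a examples/repository-integrity-agent.md --plan /tmp/benchmark_plan.json -o benchmark_out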

Architecture

Agent.md ──► Plan ──► Execute ──► Judge ──► Artifacts ──► Loader ──► TUI
              (planner     (agent         (judge        (atomic     (ZIP/dir
               gateway)     gateway)        gateway)      writes)      reader)
                                     │
                               Gate check (0-4)
                               Typed rubric (8 dims)
                               Policy block log
                               Infra failure log

Eight modules:

Planning (prompts.py, openai_gateway.py) — generate or load a versioned benchmark plan from agent Markdown
Execution (runner.py) — run each case with one clarification turn; retry empty responses; classify failures; budget enforcement; resume interrupted runs
Judging (runner.py, models/) — evaluate output independently; 8-dimension typed rubric; multi-judge with disagreement detection
Artifacts (storage.py) — atomically checkpoint JSON; concurrent-reader-safe writes
Loader / UI (loader.py, tui.py) — safe ZIP/directory artifact loading; 32-column curses TUI with optional color, per-agent views, run-launch and cancel controls
Configuration (config.py) — TOML config with 3-layer merge (~/.config/benchdeck/, ./benchdeck.toml, --config)
Budget (budget.py) — 7-dimension budget tracker; preflight warning; mid-run enforcement
Logging (logging_config.py) — JSON-structured log output with configurable level and file destination
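
The execution step's retry and classification behavior can be sketched as follows. This is a hedged illustration, not the code in runner.py: the `Attempt` type, the `run_case` function, and the status strings are hypothetical, and "retried up to 3x" is read here as one initial call plus up to three retries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    text: str              # model output; may be empty
    blocked: bool = False  # True when the provider refused on policy grounds


def run_case(
    call_model: Callable[[], Attempt], max_retries: int = 3
) -> tuple[str, list[Attempt]]:
    """Call the model, retry empty responses, and classify the outcome.

    Returns (status, attempts), where status is one of "ok", "policy_block",
    "empty_response", or "infrastructure_failure".
    """
    attempts: list[Attempt] = []
    for _ in range(1 + max_retries):
        try:
            attempt = call_model()
        except (TimeoutError, ConnectionError):
            # Transport problems are kept apart from agent behavior.
            return "infrastructure_failure", attempts
        attempts.append(attempt)  # every attempt is kept as evidence
        if attempt.blocked:
            return "policy_block", attempts
        if attempt.text.strip():
            return "ok", attempts
    return "empty_response", attempts
```

Keeping the outcome as an explicit status, with all attempts attached, is what lets a policy block or an infrastructure failure be reported separately from a genuine agent failure.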

See docs/architecture.md, docs/benchmark-contract.md, and docs/mobile-tui.md for details.

Limitations

No PyPI release or signed artifacts. CI workflows for publish (publish.yml, supports both PYPI_API_TOKEN and OIDC Trusted Publishing — see docs/publish.md) and release with SBOM (release.yml) exist; no tag has produced a successful publish yet.
Inspector hardening partial. inspect.py validates schema and manifest checksums (via manifest.verify()); referential integrity and counter consistency checks remain pending.
No cross-process run lock. storage.py uses atomic writes (os.replace), but concurrent writers to the same output directory could race.
No Windows testing. Developed and tested on Linux only.
No dependency lock file. requirements.txt provides reproducible pins; no requirements.lock or uv.lock exists.
dist/ artifacts stale. (Built 2026-06-11; source has changed since.) Not committed — dist/ is gitignored.

See REMAINING_ISSUES.md for the full list of known limitations.

Known Issues

The CHANGELOG lists issues resolved since the v0.1.0 release. For current limitations, see REMAINING_ISSUES.md.

Development

ruff check .                              # lint
ruff format --check .                     # formatting
mypy src/benchdeck/                       # type checking (strict; requires types-jsonschema in dev deps)
pytest --cov=src/benchdeck --cov-report=term-missing  # 408 tests (2 skipped — live API only)

Or use the Makefile:

make install   # pip install -e '.[dev]'
make test      # pytest --cov=src/benchdeck --cov-report=term-missing
make lint      # ruff check .
make fixture   # benchdeck inspect fixtures/original_run.zip

Project details

This version

0.1.3

Jun 16, 2026

Download files

Source Distribution

benchdeck-0.1.3.tar.gz (112.2 kB)

Uploaded Jun 16, 2026 Source

Built Distribution

benchdeck-0.1.3-py3-none-any.whl (62.0 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file benchdeck-0.1.3.tar.gz.

File metadata

Download URL: benchdeck-0.1.3.tar.gz
Upload date: Jun 16, 2026
Size: 112.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for benchdeck-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`9855a0597612d65dc2a272387d54573d21fe21b60123e39843727fa772dd29af`
MD5	`004dc6ba467b10985c59179c22efb1d6`
BLAKE2b-256	`5b0bf01c40a0b4fc6c21b3e0578c9a61a4244dcb76111c264a5c1945b833bd22`

File details

Details for the file benchdeck-0.1.3-py3-none-any.whl.

File metadata

Download URL: benchdeck-0.1.3-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 62.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for benchdeck-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fc616b0dd9117f421ba91f17842f4279301e20d4addbffde7b6dd3011c92f60a`
MD5	`d6d44f6e019b88ae8fada3876a3490d9`
BLAKE2b-256	`81cb04d6697f77bbca4089d1d55efab8827e43028031bc80046bd6f14a0aa5e7`