Evidence-preserving LLM agent benchmark harness with a live mobile-first SSH TUI.
BenchDeck
Evidence-preserving LLM-agent benchmark harness with a live terminal dashboard built for narrow SSH sessions — including Termius on iPhone.
BenchDeck turns one or two Markdown agent files into a benchmark plan, runs each case in isolation with one clarification turn, judges responses on a fixed 0–4 scale, and writes atomically checkpointed artifacts that you can watch in real time.
Screenshots
Overview — progress bar, rating distribution, per-family scores, policy blocks, token usage
Case list — per-agent ratings, blocked cases, pending items, status marks
Case detail — purpose, judgment, gate check, agent output
Help — phone-keyboard-friendly controls
Captured from a live benchmark run (gpt-4o-mini, 8 cases, repository-integrity-agent). Regenerate with scripts/generate_demo_screens.py --run-dir benchmark_out/<run_id>.
Benchmark Results
A live benchmark of the included repository-integrity-agent against gpt-4o-mini:
| Metric | Value |
|---|---|
| Cases planned | 8 |
| Cases judged | 8 |
| Excellent (4) | 2 |
| Strong (3) | 1 |
| Weak (1) | 1 |
| Fail (0) | 4 |
| Gate failures | 4 |
| Total tokens | 37,463 |
| API requests | 32 |
| Wall-clock time | ~2 min 20 s |
| Status | completed_with_failures |
Run: benchdeck run --agent-a examples/repository-integrity-agent.md --model gpt-4o-mini --judge-model gpt-4o-mini -o benchmark_out
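The rating rows in the table above can be cross-checked against the judged total with a few lines of Python. This is a minimal sketch, not BenchDeck's own code: the `Rating` enum and the `check_tally` helper are hypothetical names, and only the 0–4 labels and the counts are taken from this README.

```python
from enum import IntEnum


class Rating(IntEnum):
    """The fixed 0-4 scale named in this README (enum itself is hypothetical)."""
    FAIL = 0
    WEAK = 1
    ACCEPTABLE = 2
    STRONG = 3
    EXCELLENT = 4


def check_tally(counts: dict[Rating, int], judged: int) -> bool:
    """Return True when the per-rating counts add up to the judged total."""
    return sum(counts.values()) == judged


# Counts copied from the results table; no case was rated Acceptable.
counts = {
    Rating.EXCELLENT: 2,
    Rating.STRONG: 1,
    Rating.ACCEPTABLE: 0,
    Rating.WEAK: 1,
    Rating.FAIL: 4,
}
print(check_tally(counts, judged=8))
```

A check of this shape is one way to catch a summary whose tallies silently drift from the number of judged cases.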
Why BenchDeck
Benchmarks are prone to silent ambiguity. BenchDeck makes state explicit:
| Ambiguous situation | BenchDeck handling |
|---|---|
| Empty model response | Retried up to 3x; recorded with response ID, status, and raw payload |
| Policy-blocked response | Logged as a policy block — not an agent failure |
| Infrastructure failure | Recorded separately from agent failures |
| Inconsistent scoring scale | Fixed 0–4 scale (Fail, Weak, Acceptable, Strong, Excellent) |
| Judge transcript duplicates candidate output | Stored in separate fields; never commingled |
| Half-written checkpoint crash | Atomic file replacement — the TUI never reads a partial write |
| Run status vs. real coverage | Reported as inconclusive, completed_with_failures, infrastructure_failed, or aborted when not every case is judged |
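The "atomic file replacement" row can be illustrated with a short sketch. It assumes the usual write-to-temp-then-rename pattern built on `os.replace`; the function name `write_json_atomic`, the `.tmp` suffix, and the `fsync` step are illustrative assumptions and not a copy of BenchDeck's `storage.py`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so a reader sees either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the temp file in the destination directory so the rename
    # never has to cross a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # swaps the finished file into place
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because a reader only ever opens the final path, a crash in the middle of a write leaves the previous checkpoint untouched.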
Quick Start
Prerequisites: Python 3.11+, an OpenAI API key
python -m venv .venv && source .venv/bin/activate
pip install -e . # user install (pip install -e '.[dev]' for development)
export OPENAI_API_KEY='sk-...' # required — the run command checks this
Run a benchmark:
benchdeck run \
--agent-a examples/repository-integrity-agent.md \
--model gpt-4o-mini \
--judge-model gpt-4o-mini \
--output-dir benchmark_out
Watch it live (second SSH session):
benchdeck tui benchmark_out
Inspect the results:
benchdeck inspect benchmark_out
TUI Controls
The TUI targets 32-column terminals. Arrow keys and letter keys both work — no mouse or modifier chords needed:
| Key | Action |
|---|---|
| 1 2 3 4 | Open overview, cases, detail, or help screen |
| h / l or ← / → | Previous / next screen |
| j / k or ↓ / ↑ | Move selection or scroll |
| Enter | Open selected case |
| e | Export case as Markdown |
| n | Launch a new benchmark run (subprocess) |
| x | Cancel running benchmark (press twice to confirm) |
| r | Reload artifacts |
| q / Esc | Quit |
Recommended Termius settings: UTF-8, monospace font, extra keyboard row with Escape and arrow keys.
CLI Reference
Global flags
benchdeck [--config <file.toml>] [--log-level DEBUG|INFO|WARNING|ERROR|CRITICAL] [--log-file <path>] {run,tui,inspect}
| Flag | Description |
|---|---|
| --config | Path to a TOML configuration file (searched in ~/.config/benchdeck/config.toml, ./benchdeck.toml, then explicit path) |
| --log-level | Logging level (default: WARNING) |
| --log-file | Write JSON-structured logs to a file |
benchdeck run
benchdeck run [options]
  --agent-a <agent.md>             # required: first agent Markdown file
  --agent-b <agent.md>             # optional: second agent for comparison mode
  --model gpt-4o-mini              # model for agent (default: gpt-4o-mini)
  --planner-model gpt-4o-mini      # model for plan generation (defaults to --model)
  --judge-model gpt-4o-mini        # model for judge (default: gpt-4o-mini)
  --plan benchmark_plan.json       # optional: use a frozen plan instead of generating one
  --output-dir benchmark_out       # output directory for artifacts (short: -o)
  --timeout 90                     # API timeout in seconds (default: 90)
  --max-retries 3                  # max retry attempts per call (default: 3)
  --judges 1                       # number of independent judge calls per case (default: 1)
  --capture-level full             # response capture detail: minimal, standard, or full
  --resume <run_dir>               # resume an interrupted run from the given directory
  --overwrite                      # overwrite if a prior run exists at the exact output path
  --max-output-tokens-planner N    # budget: max output tokens for the planner
  --max-output-tokens-agent N      # budget: max output tokens for the agent
  --max-output-tokens-judge N      # budget: max output tokens for the judge
  --max-logical-requests N         # budget: max logical (API) requests
  --max-http-attempts N            # budget: max HTTP attempts (incl. retries)
  --max-total-input-tokens N       # budget: max total input tokens
  --max-total-output-tokens N      # budget: max total output tokens
benchdeck tui
benchdeck tui benchmark_out # watch a live run
benchdeck tui fixtures/original_run.zip # open the bundled run
benchdeck inspect
benchdeck inspect fixtures/original_run.zip
Detects incomplete coverage, empty outputs, duplicated judge transcripts, undeclared scoring scales, misleading run status, and validates per-agent tallies against schemas/summary_tally.schema.json.
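Two of the checks listed above, empty outputs and duplicated judge transcripts, are simple to picture in code. The sketch below assumes each case is a dict with hypothetical `id`, `agent_output`, and `judge_transcript` fields; the real artifact schema and the logic in `inspect.py` may differ.

```python
def find_case_problems(cases: list[dict]) -> list[str]:
    """Flag empty agent outputs and judge transcripts that copy the agent output."""
    problems = []
    for case in cases:
        case_id = case.get("id", "?")
        output = (case.get("agent_output") or "").strip()
        transcript = (case.get("judge_transcript") or "").strip()
        if not output:
            problems.append(f"{case_id}: empty agent output")
        elif transcript == output:
            problems.append(f"{case_id}: judge transcript duplicates agent output")
    return problems


# Made-up sample records, for illustration only.
sample = [
    {"id": "c1", "agent_output": "Checked the repo.", "judge_transcript": "Rated 3."},
    {"id": "c2", "agent_output": "", "judge_transcript": "Rated 0."},
    {"id": "c3", "agent_output": "Same text", "judge_transcript": "Same text"},
]
print(find_case_problems(sample))
```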
Using a frozen plan
python - <<'PY'
import json
from pathlib import Path
from benchdeck.loader import load_snapshot
plan = load_snapshot(Path('fixtures/original_run.zip')).plan
Path('/tmp/benchmark_plan.json').write_text(json.dumps(plan, indent=2) + '\n')
PY
benchdeck run --agent-a examples/repository-integrity-agent.md --plan /tmp/benchmark_plan.json -o benchmark_out
Architecture
    Agent.md ──► Plan ──► Execute ──► Judge ──► Artifacts ──► Loader ──► TUI
                 (planner (agent      (judge    (atomic       (ZIP/dir
                 gateway) gateway)    gateway)  writes)       reader)
                                        │
                                      Gate check (0-4)
                                      Typed rubric (8 dims)
                                      Policy block log
                                      Infra failure log
Eight modules:
- Planning (prompts.py, openai_gateway.py): generate or load a versioned benchmark plan from agent Markdown
- Execution (runner.py): run each case with one clarification turn; retry empty responses; classify failures; enforce budgets; resume interrupted runs
- Judging (runner.py, models/): evaluate output independently; 8-dimension typed rubric; multi-judge with disagreement detection
- Artifacts (storage.py): atomically checkpoint JSON; concurrent-reader-safe writes
- Loader / UI (loader.py, tui.py): safe ZIP/directory artifact loading; 32-column curses TUI with optional color, per-agent views, run-launch and cancel controls
- Configuration (config.py): TOML config with 3-layer merge (~/.config/benchdeck/, ./benchdeck.toml, --config)
- Budget (budget.py): 7-dimension budget tracker; preflight warning; mid-run enforcement
- Logging (logging_config.py): JSON-structured log output with configurable level and file destination
See docs/architecture.md, docs/benchmark-contract.md, and docs/mobile-tui.md for details.
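The "retry empty responses" step in the Execution module can be pictured as a small loop that keeps every attempt as evidence. This is a hedged sketch only: `call_with_empty_retry` is a hypothetical helper, the model call is abstracted as a plain callable returning text, and treating "up to 3" as three retries after the first attempt is an assumption.

```python
from typing import Callable


def call_with_empty_retry(
    call: Callable[[], str], max_retries: int = 3
) -> tuple[str, list[str]]:
    """Call until a non-empty response arrives; return it with all attempts."""
    attempts: list[str] = []
    for _ in range(1 + max_retries):  # first attempt plus the retries
        response = call()
        attempts.append(response)  # keep empty responses too
        if response.strip():
            return response, attempts
    return "", attempts


# Usage with a stand-in for the model call: two empty replies, then text.
replies = iter(["", "  ", "final answer"])
result, history = call_with_empty_retry(lambda: next(replies))
print(result, len(history))
```

Returning the full attempt list alongside the result is what lets a harness record an empty response as its own outcome instead of folding it into an agent failure.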
Limitations
- No PyPI release or signed artifacts. CI workflows for publish (publish.yml, supports both PYPI_API_TOKEN and OIDC Trusted Publishing; see docs/publish.md) and release with SBOM (release.yml) exist; no tag has produced a successful publish yet.
- Inspector hardening partial. inspect.py validates schema and manifest checksums (via manifest.verify()); referential integrity and counter consistency checks remain pending.
- No cross-process run lock. storage.py uses atomic writes (os.replace), but concurrent writers to the same output directory could race.
- No Windows testing. Developed and tested on Linux only.
- No dependency lock file. requirements.txt provides reproducible pins; no requirements.lock or uv.lock exists.
- dist/ artifacts stale. (Built 2026-06-11; source has changed since.) Not committed; dist/ is gitignored.
See REMAINING_ISSUES.md for the full list of known limitations.
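For readers unfamiliar with manifest checksum validation, the general technique looks roughly like the sketch below. It assumes a manifest that maps relative file paths to SHA-256 hex digests; that format and the `verify_manifest` name are hypothetical, and this is not the implementation behind `manifest.verify()`.

```python
import hashlib
from pathlib import Path


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths whose SHA-256 digest is missing or mismatched."""
    bad = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            bad.append(rel_path)  # listed in the manifest but absent on disk
            continue
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        if digest != expected:
            bad.append(rel_path)  # contents changed since the manifest was written
    return bad
```

A checksum pass of this kind confirms that files are intact; it does not establish that records refer to each other correctly or that counters agree, which matches the pending checks noted in the list above.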
Known Issues
The CHANGELOG lists issues resolved since the v0.1.0 release. For current limitations, see REMAINING_ISSUES.md.
Development
ruff check . # lint
ruff format --check . # formatting
mypy src/benchdeck/ # type checking (strict; requires types-jsonschema in dev deps)
pytest --cov=src/benchdeck --cov-report=term-missing # 408 tests (2 skipped — live API only)
Or use the Makefile:
make install # pip install -e '.[dev]'
make test # pytest --cov=src/benchdeck --cov-report=term-missing
make lint # ruff check .
make fixture # benchdeck inspect fixtures/original_run.zip