YAML-driven benchmark sweeps: generate env-file combinations, execute a tool across each, and query DuckDB-backed aggregate stats.
abench-speckz
Generate Docker env-file combinations from a YAML benchmark spec, execute a benchmark tool across every combination, and query the results.
Install
Requires Python 3.10+.
python -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'
Note on examples: files under examples/ reference paths like python examples/sample_bench.py. Those paths are relative to the repo root, so the examples run only from a checkout, not from an arbitrary working directory after pip install. Clone the repo and cd into it to follow the examples verbatim.
Workflow
spec.yaml → abench-speckz gen → out/ (env-files + manifest.json)
↓
abench-speckz run → results/ (runs.jsonl + aggregates.jsonl)
↓
abench-speckz stats → table / JSON / TSV
Commands
gen — generate env-file combinations
abench-speckz gen spec.yaml --out out/ # write env-files
abench-speckz gen spec.yaml --dry-run # print summary table
abench-speckz gen spec.yaml --list # print TSV
abench-speckz gen spec.yaml --profile smoke --out out/
abench-speckz gen spec.yaml --tag stress --out out/
abench-speckz gen spec.yaml --exclude-tag slow --out out/
Each combination is written as a Docker env-file (KEY=value per line). A manifest.json in the output directory maps each filename back to its full variable assignment and tags.
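To make the env-file format concrete, here is a minimal sketch that reads KEY=value lines back into a dict. It is not part of abench-speckz: parse_env_file and the sample content are hypothetical, and skipping blank lines and # comments follows Docker's env-file conventions rather than anything this tool documents.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse Docker env-file text: one KEY=value entry per line."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and '#' comments carry no assignment.
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition("=")
        if not sep:
            continue  # bare KEY without '=' is skipped in this sketch
        # Only the first '=' splits; the value may itself contain '='.
        env[key.strip()] = value
    return env


# Hypothetical env-file content for one combination.
sample = "IMAGE=myapp:latest\nworkload=read\nconcurrency=8\n"
print(parse_env_file(sample))
# {'IMAGE': 'myapp:latest', 'workload': 'read', 'concurrency': '8'}
```

Values stay strings here, since an env-file carries no type information.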
run — execute a tool across every combination
abench-speckz run out/ --tool oha.tool.yaml
abench-speckz run out/ --tool oha.tool.yaml --repeat 5 --warmup 1
abench-speckz run out/ --tool oha.tool.yaml --filter workload=read
abench-speckz run out/ --tool oha.tool.yaml --filter-tag stress
abench-speckz run out/ --tool oha.tool.yaml --filter-exclude-tag slow
abench-speckz run out/ --tool oha.tool.yaml --skip-existing --keep-raw
abench-speckz run out/ --tool oha.tool.yaml --dry-run # print planned commands
Results are written to results/ (configurable with --results).
stats — aggregate and display results
abench-speckz stats results/
abench-speckz stats results/ --group-by workload --group-by concurrency
abench-speckz stats results/ --metric requests_per_sec --metric p50_ms
abench-speckz stats results/ --where workload=read
abench-speckz stats results/ --filter-tag stress
abench-speckz stats results/ --filter-exclude-tag slow
abench-speckz stats results/ --format json
abench-speckz stats results/ --format tsv
abench-speckz stats results/ --pretty # use display names from tool YAML
abench-speckz stats results/ --from-raw # recompute from runs.jsonl
abench-speckz stats results/ --report report.html # self-contained Chart.js HTML
abench-speckz stats results/ --report report.html --plots plots.yaml # override tool YAML plots
--report writes a self-contained HTML file with Chart.js plots. Plot definitions come from the tool YAML's plots: list (see below), or from a separate YAML file via --plots. When no plots are defined, a default per-metric bar chart is rendered.
rebuild-aggregates — regenerate aggregates from raw runs
abench-speckz rebuild-aggregates results/
Spec format
static:
IMAGE: myapp:latest
REGION: us-east-1
variables:
workload: [read, write, mixed]
concurrency: [1, 8, 64]
backend: [postgres, mysql]
# conditional overrides and tagging
when:
- if: { workload: write, backend: mysql }
set: { LOCK_TIMEOUT: "30s" }
tag: [slow, write-heavy]
- if: { concurrency: 64 }
set: { THREAD_POOL: "${concurrency}" }
tag: [stress]
# combos to drop entirely
exclude:
- { backend: mysql, concurrency: 1 }
# tags applied to every combo
tags: [bench]
profiles:
smoke:
variables:
concurrency: [1]
workload: [read]
full: {}
default_profile: smoke
Interpolation: use ${var} to reference other variables and ${env:VAR} to read from the process environment. Use $$ for a literal $.
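The three interpolation forms can be sketched with a single regular expression. This is an illustration of the syntax described above, not the tool's own implementation: interpolate is a hypothetical helper, and raising KeyError for an unknown name is an assumption.

```python
import os
import re

# '$$' is tried first so that it is never mistaken for the start of '${...}'.
_TOKEN = re.compile(r"\$\$|\$\{(env:)?([^}]+)\}")


def interpolate(template: str, variables: dict[str, object]) -> str:
    """Expand ${var}, ${env:VAR}, and $$ in a template string."""

    def replace(match: re.Match) -> str:
        if match.group(0) == "$$":
            return "$"  # literal dollar sign
        is_env, name = match.group(1), match.group(2)
        if is_env:
            return os.environ[name]  # assumption: missing names raise KeyError
        return str(variables[name])

    return _TOKEN.sub(replace, template)


# ABENCH_DEMO_REGION is a made-up environment variable for this example.
os.environ["ABENCH_DEMO_REGION"] = "us-east-1"
print(interpolate("pool=${concurrency} region=${env:ABENCH_DEMO_REGION} cost=$$5",
                  {"concurrency": 64}))
# pool=64 region=us-east-1 cost=$5
```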
Variable names starting with _ are reserved and will be rejected at load time. Built-in synthetic variables:
| Variable | Available in | Description |
|---|---|---|
| ${_envfile} | command, setup, teardown, post_run, monitor, output_file, setup_per_sweep, teardown_per_sweep | Absolute path to the current combo's env file. In per_sweep phases, resolves to the first entry's env file in the group. |
| ${_run_id} | command, setup, teardown, post_run, monitor, setup_per_sweep, teardown_per_sweep | UUID for this rep, the same value written to runs.jsonl. In per_sweep phases, one UUID is generated per group and shared between setup_per_sweep and teardown_per_sweep. |
| ${_exit_code} | post_run | Benchmark exit code |
| ${_started_at} | post_run | ISO timestamp when the benchmark started |
| ${_finished_at} | post_run | ISO timestamp when the benchmark finished |
| ${_duration_ms} | post_run | Wall-clock duration in milliseconds |
Profiles overlay the base spec — variables, static, when, and exclude lists are merged. The default_profile is used when --profile is not specified.
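One way to read the variables and exclude keys of the spec is as a filtered Cartesian product. The sketch below uses the values from the example spec; expand is a hypothetical helper, and treating an exclude rule as "drop the combo when every listed key matches" is an assumption about the semantics. The when rules, tags, and profiles are left out.

```python
from itertools import product


def expand(variables: dict[str, list], exclude: list[dict]) -> list[dict]:
    """Build every variable combination, dropping those matched by an exclude rule."""
    names = list(variables)
    combos = []
    for values in product(*(variables[name] for name in names)):
        combo = dict(zip(names, values))
        # Assumption: a rule matches when all of its key/value pairs match.
        if any(all(combo.get(k) == v for k, v in rule.items()) for rule in exclude):
            continue
        combos.append(combo)
    return combos


variables = {
    "workload": ["read", "write", "mixed"],
    "concurrency": [1, 8, 64],
    "backend": ["postgres", "mysql"],
}
exclude = [{"backend": "mysql", "concurrency": 1}]

combos = expand(variables, exclude)
print(len(combos))  # 3 * 3 * 2 = 18 combinations, minus 3 excluded = 15
```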
Tool YAML format
name: oha
command: "oha ${URL} -n ${REQUESTS} -c ${concurrency} --json"
# ${_envfile} is a built-in variable: absolute path to the current combo's env file.
# Example: docker run --env-file ${_envfile} myimage
timeout_seconds: 300
version_command: "oha --version"
# extract metrics from JSON stdout via JSONPath
capture:
requests_per_sec: "$.summary.requestsPerSec"
p50_ms: "$.latencyPercentiles.p50"
errors[]: "$.errors[*].message" # trailing [] collects all matches as a list
# alternative: a custom Python parser function
# parser: "mymodule:parse_fn" # fn(stdout: str) -> dict
# read extraction input from a file the tool writes, instead of stdout
# output_file: "results.json" # interpolates ${var} / ${env:VAR}
# output_format: jsonl # "json" (default) or "jsonl" for one JSON object per line
pretty_names:
requests_per_sec: "Requests/s"
p50_ms: "p50 latency"
units:
p50_ms: ms
higher_is_better:
requests_per_sec: true
p50_ms: false
# optional: run once at the start of the sweep; output captured into env.snapshot.json
# under a "probes" key. Non-zero exit or missing command stores null for that key.
env_probes:
kernel: "uname -r"
cpu: "sysctl -n machdep.cpu.brand_string"
redis_version: "redis-cli --version"
# optional: run once per unique config hash (before the first rep for that combo);
# commands may reference combo vars. Results stored in combo_probes.json and
# embedded in every runs.jsonl row under "combo_probes".
combo_probes:
effective_maxmemory: "redis-cli -p ${PORT} CONFIG GET maxmemory"
row_count: "psql ${DSN} -tAc 'SELECT count(*) FROM events'"
# optional: shell steps run around every rep (warmup and measured)
setup:
- "docker compose up -d redis"
- "sleep 1"
teardown:
- "docker compose down -v"
setup_timeout_seconds: 120 # per-step timeout for setup/teardown/post_run (default 120)
# optional: shell steps run after teardown, always (even on benchmark failure)
# receives run-result vars: ${_run_id}, ${_exit_code}, ${_started_at}, ${_finished_at}, ${_duration_ms}
post_run:
- "prom-query.sh ${_started_at} ${_finished_at} ${_run_id}"
# optional: background processes launched before the benchmark command and
# terminated after it completes (SIGTERM, then SIGKILL after 5s).
# Interpolates combo vars including ${_envfile} and ${_run_id}.
monitor:
- "python collect-metrics.py --run-id ${_run_id}"
- "perf stat -p $(cat service.pid)"
# optional: declarative plots used by `stats --report`
plots:
- id: rps_by_workload
type: bar # bar | stacked-bar | line | scatter
title: "Throughput by workload"
x: workload
y: requests_per_sec
- id: latency_breakdown
type: stacked-bar
title: "Latency percentiles"
x: workload
y: [p50_ms, p95_ms, p99_ms]
- id: rps_vs_concurrency
type: line
title: "Throughput scaling"
x: concurrency # combo variable on x-axis
y: requests_per_sec
group_by: workload # one line per workload value
- id: rps_vs_concurrency_multi
type: line
title: "Throughput scaling by workload + backend"
x: concurrency
y: requests_per_sec
group_by: [workload, backend] # one line per workload+backend combo
- id: throughput_vs_latency
type: scatter
title: "Throughput / latency tradeoff"
x: requests_per_sec # metric on x-axis (not a variable)
y: p95_ms # metric on y-axis
group_by: workload # one labeled point per workload value
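The commented-out parser: option in the tool YAML above names a function with the signature fn(stdout: str) -> dict. A minimal sketch of such a function follows, assuming the tool prints a single JSON document whose fields match the capture example; the module name mymodule and the sample payload are illustrative only.

```python
import json


def parse_fn(stdout: str) -> dict:
    """Turn the tool's JSON stdout into a flat metric-name -> value mapping."""
    data = json.loads(stdout)
    return {
        # Same fields the capture: example reads via JSONPath.
        "requests_per_sec": data["summary"]["requestsPerSec"],
        "p50_ms": data["latencyPercentiles"]["p50"],
    }


# Made-up payload shaped like the paths in the capture: example.
sample_stdout = '{"summary": {"requestsPerSec": 1234.5}, "latencyPercentiles": {"p50": 3.2}}'
print(parse_fn(sample_stdout))
# {'requests_per_sec': 1234.5, 'p50_ms': 3.2}
```

Saved as mymodule.py on the import path, this would be referenced as parser: "mymodule:parse_fn".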
group_by in plots. Splits data into multiple series based on combo variable values. Accepts a single variable name or a list; multiple keys are joined with / in the legend label.
- bar / stacked-bar / line: without group_by, each y metric becomes one series. With group_by, you get one series per (metric, group-value) pair.
- scatter: x and y are both metric names (not variables). Each unique combination of group_by values becomes its own labeled point. Without group_by, all points collapse into a single "all" series.
Negation in group_by. Prefix a variable name with ! to mean "all variables except this one". Useful when you have many variables and don't want to list them all:
- id: rps_all_configs
type: line
x: concurrency
y: rps
group_by: "!concurrency" # one line per every other variable combination
- id: rps_except_region
type: line
x: concurrency
y: rps
group_by: ["!concurrency", "!region"] # exclude multiple vars; keep the rest
Negation is resolved at report time against the actual variable names in aggregates.jsonl. Unknown excluded names are silently ignored.
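The negation rule can be sketched as a small resolver. resolve_group_by is a hypothetical helper, not the tool's code. How a list that mixes plain and negated names is resolved is an assumption here: the result is every variable not excluded, plus any explicitly named ones.

```python
def resolve_group_by(spec: "str | list[str]", all_variables: list[str]) -> list[str]:
    """Expand '!name' entries against the known variable names."""
    entries = [spec] if isinstance(spec, str) else list(spec)
    excluded = {entry[1:] for entry in entries if entry.startswith("!")}
    explicit = [entry for entry in entries if not entry.startswith("!")]
    if not excluded:
        return explicit  # no negation: use the names as given
    # Unknown excluded names simply never match, so they are ignored.
    resolved = [name for name in all_variables if name not in excluded]
    for name in explicit:
        if name not in resolved:
            resolved.append(name)
    return resolved


known = ["workload", "concurrency", "backend"]
print(resolve_group_by("!concurrency", known))               # ['workload', 'backend']
print(resolve_group_by(["!concurrency", "!region"], known))  # ['workload', 'backend']
```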
Report layout. By default, every plot in plots: is rendered automatically in the order it is defined. To control layout, interleave prose, and publish only a subset of your plots, add a sections: list alongside plots:.
report_title: "Database benchmark — Q3 2025"
report_description: |
Throughput and latency sweeps across three workload types and four concurrency
levels. All runs used PostgreSQL 16 on a c5.4xlarge instance.
plots:
- id: rps_by_workload
type: bar
title: "Throughput by workload"
x: workload
y: requests_per_sec
- id: latency_breakdown
type: stacked-bar
title: "Latency percentile breakdown"
x: workload
y: [p50_ms, p95_ms, p99_ms]
- id: rps_vs_concurrency
type: line
title: "Throughput scaling"
x: concurrency
y: requests_per_sec
group_by: workload
- id: throughput_vs_latency # defined but not referenced below → omitted from report
type: scatter
x: requests_per_sec
y: p95_ms
group_by: workload
sections:
- title: "Throughput"
description: |
Requests per second across all three workload types.
blocks:
- plot_id: rps_by_workload
- plot_id: rps_vs_concurrency
- text: |
**Key finding:** write throughput degrades sharply above 16 connections
due to lock contention in the storage layer.
- title: "Latency"
description: Percentile breakdown of response time by workload.
blocks:
- plot_id: latency_breakdown
- html: <p class="note">Numbers above exclude connection setup time.</p>
- include_html: methodology-table.html # path relative to the YAML file
When sections: is present:
- Plots are not auto-rendered; only plots referenced by a plot_id block appear in the report.
- report_description is rendered as Markdown below the report_title heading.
- The same plot_id can appear in multiple blocks; each reference renders an independent canvas.
- Plots not referenced by any block are silently omitted, so you can maintain a plot library and selectively publish a subset.
Each block has exactly one key:
| Key | Value | Renders as |
|---|---|---|
| plot_id | id of a plot in the top-level plots: list | Chart.js chart |
| text | Markdown string | Formatted prose |
| html | Raw HTML string | Inlined verbatim |
| include_html | File path relative to the YAML file | Contents of that file, inlined verbatim |
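Before generating a report, it can help to lint a sections: list against the "exactly one key" rule. The sketch below works on already-loaded data (plain dicts and lists) and is purely illustrative: check_blocks is a hypothetical helper, and nothing here reflects how abench-speckz itself validates or reports errors.

```python
ALLOWED_KEYS = {"plot_id", "text", "html", "include_html"}


def check_blocks(sections: list[dict], plots: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in the section blocks."""
    plot_ids = {plot["id"] for plot in plots}
    problems = []
    for section in sections:
        for index, block in enumerate(section.get("blocks", [])):
            where = f"{section.get('title', '?')}.blocks[{index}]"
            keys = set(block)
            if len(keys) != 1 or not keys <= ALLOWED_KEYS:
                problems.append(f"{where}: expected exactly one of {sorted(ALLOWED_KEYS)}")
            elif "plot_id" in block and block["plot_id"] not in plot_ids:
                problems.append(f"{where}: unknown plot_id {block['plot_id']!r}")
    return problems


plots = [{"id": "rps_by_workload"}, {"id": "latency_breakdown"}]
sections = [
    {"title": "Throughput", "blocks": [{"plot_id": "rps_by_workload"}, {"text": "**Key finding:** ..."}]},
    {"title": "Latency", "blocks": [{"plot_id": "latency_breakdown", "text": "two keys"}]},
]
for problem in check_blocks(sections, plots):
    print(problem)
```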
Using a separate plots file. Pass --plots plots.yaml to supply plot definitions from a standalone file instead of the tool YAML. The standalone file uses the same format — a mapping with plots:, sections:, report_title:, report_description: — or just a bare list of plot entries for the minimal case:
# Plots come from the tool YAML (default)
abench-speckz stats results/ --report report.html
# Override with a standalone plots file
abench-speckz stats results/ --report report.html --plots editorial.yaml
A bare-list standalone file (no sections, no title):
# editorial.yaml — just a list
- id: rps_by_workload
type: bar
x: workload
y: requests_per_sec
- id: rps_vs_concurrency
type: line
x: concurrency
y: requests_per_sec
group_by: workload
env_probes. A mapping of key → shell command run once at the very start of the sweep (before any rep). The trimmed stdout of each command is stored in env.snapshot.json under "probes". A non-zero exit code or missing command stores null for that key — probes never abort a sweep.
combo_probes. A mapping of key → command template run once per unique config hash, before the first rep for that combo (after per-sweep setup). Commands interpolate combo vars (${var}, ${_envfile}, ${env:VAR}). Results are stored in two places: combo_probes.json (keyed by config hash) and embedded in every runs.jsonl row under "combo_probes". Non-zero exit, missing command, timeout, or interpolation error stores null — probes never abort a sweep. Useful for capturing system or service state that varies per combo (e.g. effective DB config after per-sweep setup seeded a different dataset, kernel tuning parameters set per workload).
// env.snapshot.json (excerpt)
{
"host": "...",
"probes": {
"kernel": "24.2.0",
"redis_version": "Redis server v=7.2.3 sha=...",
"cpu": null
}
}
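The "never abort" behavior of probes can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: run_probes is a hypothetical helper, commands are split with shlex and run without a shell, and the 10-second timeout is made up.

```python
import shlex
import subprocess
import sys


def run_probes(probes: dict[str, str], timeout: float = 10.0) -> dict[str, "str | None"]:
    """Run each probe once; store trimmed stdout, or None on any failure."""
    results: dict[str, "str | None"] = {}
    for key, command in probes.items():
        try:
            proc = subprocess.run(
                shlex.split(command), capture_output=True, text=True, timeout=timeout
            )
        except (OSError, ValueError, subprocess.TimeoutExpired):
            # Missing binary, unparsable command, or timeout: record null, keep going.
            results[key] = None
            continue
        results[key] = proc.stdout.strip() if proc.returncode == 0 else None
    return results


py = shlex.quote(sys.executable)
print(run_probes({
    "python": f"{py} -c \"print(' hello ')\"",
    "missing": "no-such-probe-binary-xyz --version",
}))
# {'python': 'hello', 'missing': None}
```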
Setup / teardown / post_run / monitor. The full per-rep lifecycle is:
setup → [monitor start] → command → [monitor stop] → teardown → post_run
Teardown runs in a finally block, so it fires even on benchmark failure or Ctrl-C. Combo vars (${var}), ${_envfile}, ${_run_id}, and ${env:VAR} interpolate in all phases. Steps are split with shlex.split and executed without a shell, so chain via multiple list entries rather than &&.
- Setup failure → the command is skipped, monitor is not started, teardown still runs best-effort, post_run is skipped, and failure_reason is recorded as setup[i]: ….
- Teardown failure → the benchmark's exit_code and metrics are preserved, but teardown[i]: … is appended to failure_reason.
- post_run → runs after teardown completes, including when the benchmark exits non-zero; the one exception is a setup failure, which skips it as described above. In addition to combo vars, it receives ${_exit_code}, ${_started_at}, ${_finished_at}, and ${_duration_ms}. Useful for collecting time-windowed metrics from external systems (Prometheus, InfluxDB, etc.) keyed to the exact run via ${_run_id}. A post_run failure is appended to failure_reason but does not suppress the benchmark result or its metrics.
- monitor → each command is launched as a background process immediately after setup succeeds. Processes receive SIGTERM once the benchmark command finishes; any that don't exit within 5 seconds receive SIGKILL. A monitor that fails to start is recorded but never aborts the run. Interpolates combo vars including ${_run_id} and ${_envfile}. Start and stop records are written to raw/{run_id}.json under --keep-raw or when any monitor fails to start.
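The lifecycle and failure rules above can be sketched with try/finally. This is a simplified illustration for POSIX systems, not the tool's code: run_steps, stop_monitor, and run_rep are hypothetical names, interpolation and raw records are omitted, and failure_reason is kept as a plain list.

```python
import shlex
import subprocess
import sys


def run_steps(steps: list[str]) -> "int | None":
    """Run each step without a shell; return the index of the first failure."""
    for index, step in enumerate(steps):
        if subprocess.run(shlex.split(step)).returncode != 0:
            return index
    return None


def stop_monitor(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
    """SIGTERM first, then SIGKILL if the process outlives the grace period."""
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


def run_rep(setup, command, teardown, post_run, monitor=()) -> dict:
    result = {"exit_code": None, "failure_reason": []}
    setup_ok = False
    try:
        failed = run_steps(setup)
        if failed is not None:
            result["failure_reason"].append(f"setup[{failed}]")
        else:
            setup_ok = True
            monitors = []
            for cmd in monitor:
                try:
                    monitors.append(subprocess.Popen(shlex.split(cmd)))
                except OSError:
                    pass  # a monitor that fails to start never aborts the run
            try:
                result["exit_code"] = subprocess.run(shlex.split(command)).returncode
            finally:
                for proc in monitors:
                    stop_monitor(proc)
    finally:
        # Teardown runs even when setup or the command failed.
        failed = run_steps(teardown)
        if failed is not None:
            result["failure_reason"].append(f"teardown[{failed}]")
    if setup_ok:
        failed = run_steps(post_run)
        if failed is not None:
            result["failure_reason"].append(f"post_run[{failed}]")
    return result


py = shlex.quote(sys.executable)
ok = f'{py} -c "pass"'
boom = f'{py} -c "raise SystemExit(2)"'
print(run_rep(setup=[ok], command=boom, teardown=[ok], post_run=[ok]))
# {'exit_code': 2, 'failure_reason': []}
print(run_rep(setup=[boom], command=ok, teardown=[ok], post_run=[ok]))
# {'exit_code': None, 'failure_reason': ['setup[0]']}
```

Because each step goes through shlex.split rather than a shell, an entry such as "a && b" would pass && to a as a literal argument, which is why chaining needs separate list entries.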
Sweep-scoped setup / teardown. setup_per_sweep and teardown_per_sweep run outside the per-rep loop, which makes them a good fit for expensive prep such as seeding a database. By default each fires exactly once for the whole sweep. Set per_sweep_var to group combos and fire the phases once per distinct group.
per_sweep_var accepts the following forms:
# Single variable — one group per distinct value of 'backend'
per_sweep_var: backend
# List of variables — one group per unique combination of (backend, workload);
# 'concurrency' varies freely within each group
per_sweep_var: [backend, workload]
# Negation — group by all variables EXCEPT 'concurrency'
per_sweep_var: "!concurrency"
# Negation in a list — explicitly include 'backend', exclude 'concurrency'
per_sweep_var: [backend, "!concurrency"]
Negation entries (!var) are expanded against the full variable set in the manifest, mirroring the group_by negation syntax used in plots. The group slug in raw file names joins all variable values (e.g. postgres-read_heavy).
setup_per_sweep:
- "seed-db.sh ${backend} --mode ${workload}"
teardown_per_sweep:
- "drop-db.sh ${backend}"
per_sweep_var: [backend, workload] # concurrency varies freely within each group
- Without per_sweep_var: only ${_envfile}, ${_run_id}, and ${env:VAR} can be referenced; any other ${combo_var} is rejected at sweep start.
- With per_sweep_var: only the listed variables, ${_envfile}, ${_run_id}, and ${env:VAR} can be referenced in per_sweep steps; the current group's values are substituted.
- --skip-existing: if every rep in a group is already recorded, both phases are skipped for that group.
- Setup failure: all planned reps in that group get a failure row with failure_reason="per_sweep_setup[i]: …"; teardown still runs best-effort. Next group proceeds.
- Teardown failure: appended to the failure_reason of the last rep row in the group.
- Raw record: raw/sweep.json (no grouping) or raw/sweep-{slug}.json (grouped, slug joins all group variable values), with the same shape as per-rep raw files.
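Grouping by per_sweep_var can be pictured as bucketing combos by a tuple of the chosen variables. The sketch below is illustrative only: group_combos and slug are hypothetical helpers, and joining values with "-" is inferred from the postgres-read_heavy example rather than taken from the tool's source.

```python
def group_combos(combos: list[dict], group_vars: list[str]) -> dict[tuple, list[dict]]:
    """Bucket combos by the values of the grouping variables, in first-seen order."""
    groups: dict[tuple, list[dict]] = {}
    for combo in combos:
        key = tuple(combo[name] for name in group_vars)
        groups.setdefault(key, []).append(combo)
    return groups


def slug(key: tuple) -> str:
    """Join the group's variable values into a file-name fragment."""
    return "-".join(str(value) for value in key)


combos = [
    {"backend": "postgres", "workload": "read_heavy", "concurrency": 1},
    {"backend": "postgres", "workload": "read_heavy", "concurrency": 8},
    {"backend": "mysql", "workload": "read_heavy", "concurrency": 8},
]
for key, members in group_combos(combos, ["backend", "workload"]).items():
    # setup_per_sweep would fire once here, then every member's reps, then teardown_per_sweep.
    print(slug(key), len(members))
# postgres-read_heavy 2
# mysql-read_heavy 1
```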
Raw output records. When a raw record is written, raw/{run_id}.json is a JSON object with:
- stdout, stderr: the tool's own streams (always present).
- output_file: {path, content} when output_file is configured in the tool YAML, so the tool's stdout/stderr stay separate from the file content used for extraction.
- setup, teardown, post_run: one entry per step that ran, each with command, exit_code, stdout, stderr.
- monitor_start: one entry per monitor command with command, pid (or error if it failed to start).
- monitor_stop: one entry per running monitor process with pid, exit_code, stdout, stderr.
Results directory layout
results/
runs.jsonl # append-only log, one JSON object per run
aggregates.jsonl # per-combo stats (n, mean, stddev, p50/95/99, CI95)
manifest.snapshot.json # copy of the manifest used
tools/{name}.yaml # copy of the tool YAML used
env.snapshot.json # host info (OS, CPU, git SHA) + env_probes results under "probes"
combo_probes.json # combo_probes results keyed by config hash
pretty_names.json # merged metric display names
raw/{run_id}.json # structured raw record (see "Raw output records" above); written with
# --keep-raw, on extract failure, on tool failure,
# or when setup/teardown/post_run failed
raw/sweep[-{slug}].json # per_sweep setup/teardown records; written on
# --keep-raw or any per_sweep phase failure
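For orientation, the per-combo statistics listed for aggregates.jsonl (n, mean, stddev, p50/95/99, CI95) can be approximated in plain Python. This sketch does not reproduce the tool's DuckDB-backed computation: summarize is a hypothetical helper, percentiles use the nearest-rank method, and the CI95 half-width uses a normal approximation, all of which are assumptions.

```python
import math
import statistics


def summarize(samples: list[float]) -> dict[str, float]:
    """Summary statistics for one metric across the reps of one combo."""
    n = len(samples)
    ordered = sorted(samples)
    mean = statistics.fmean(samples)
    stddev = statistics.stdev(samples) if n > 1 else 0.0  # sample standard deviation

    def percentile(q: float) -> float:
        # Nearest-rank: smallest value with at least q% of samples at or below it.
        rank = max(1, math.ceil(q / 100 * n))
        return ordered[rank - 1]

    return {
        "n": n,
        "mean": mean,
        "stddev": stddev,
        "p50": percentile(50),
        "p95": percentile(95),
        "p99": percentile(99),
        # Assumption: normal-approximation half-width of the 95% interval.
        "ci95": 1.96 * stddev / math.sqrt(n),
    }


print(summarize([1.0, 2.0, 3.0, 4.0, 5.0]))
```

With only a handful of reps, a t-distribution interval would be wider than this normal approximation.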