YAML-driven benchmark sweeps: generate env-file combinations, execute a tool across each, and query DuckDB-backed aggregate stats.

These details have not been verified by PyPI

Project links

Project description

abench-speckz

Generate Docker env-file combinations from a YAML benchmark spec, execute a benchmark tool across every combination, and query the results.

Install

Requires Python 3.10+.

python -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'

Note on examples: files under examples/ reference paths like python examples/sample_bench.py. Those paths are relative to the repo root, so the examples run only from a checkout — not from an arbitrary working directory after pip install. Clone the repo and cd into it to follow the examples verbatim.

Workflow

spec.yaml  →  abench-speckz gen  →  out/ (env-files + manifest.json)
                                    ↓
              abench-speckz run  →  results/ (runs.jsonl + aggregates.jsonl)
                                    ↓
              abench-speckz stats →  table / JSON / TSV

Commands

`gen` — generate env-file combinations

abench-speckz gen spec.yaml --out out/           # write env-files
abench-speckz gen spec.yaml --dry-run            # print summary table
abench-speckz gen spec.yaml --list               # print TSV
abench-speckz gen spec.yaml --profile smoke --out out/
abench-speckz gen spec.yaml --tag stress --out out/
abench-speckz gen spec.yaml --exclude-tag slow --out out/

Each combination is written as a Docker env-file (KEY=value per line). A manifest.json in the output directory maps each filename back to its full variable assignment and tags.

`run` — execute a tool across every combination

abench-speckz run out/ --tool oha.tool.yaml
abench-speckz run out/ --tool oha.tool.yaml --repeat 5 --warmup 1
abench-speckz run out/ --tool oha.tool.yaml --filter workload=read
abench-speckz run out/ --tool oha.tool.yaml --filter-tag stress
abench-speckz run out/ --tool oha.tool.yaml --filter-exclude-tag slow
abench-speckz run out/ --tool oha.tool.yaml --skip-existing --keep-raw
abench-speckz run out/ --tool oha.tool.yaml --dry-run   # print planned commands

Results are written to results/ (configurable with --results).

`stats` — aggregate and display results

abench-speckz stats results/
abench-speckz stats results/ --group-by workload --group-by concurrency
abench-speckz stats results/ --metric requests_per_sec --metric p50_ms
abench-speckz stats results/ --where workload=read
abench-speckz stats results/ --filter-tag stress
abench-speckz stats results/ --filter-exclude-tag slow
abench-speckz stats results/ --format json
abench-speckz stats results/ --format tsv
abench-speckz stats results/ --pretty            # use display names from tool YAML
abench-speckz stats results/ --from-raw          # recompute from runs.jsonl
abench-speckz stats results/ --report report.html              # self-contained Chart.js HTML
abench-speckz stats results/ --report report.html --plots plots.yaml  # override tool YAML plots

--report writes a self-contained HTML file with Chart.js plots. Plot definitions come from the tool YAML's plots: list (see below), or from a separate YAML file via --plots. When no plots are defined, a default per-metric bar chart is rendered.

`rebuild-aggregates` — regenerate aggregates from raw runs

abench-speckz rebuild-aggregates results/

Spec format

static:
  IMAGE: myapp:latest
  REGION: us-east-1

variables:
  workload:    [read, write, mixed]
  concurrency: [1, 8, 64]
  backend:     [postgres, mysql]

# conditional overrides and tagging
when:
  - if:  { workload: write, backend: mysql }
    set: { LOCK_TIMEOUT: "30s" }
    tag: [slow, write-heavy]
  - if:  { concurrency: 64 }
    set: { THREAD_POOL: "${concurrency}" }
    tag: [stress]

# combos to drop entirely
exclude:
  - { backend: mysql, concurrency: 1 }

# tags applied to every combo
tags: [bench]

profiles:
  smoke:
    variables:
      concurrency: [1]
      workload: [read]
  full: {}

default_profile: smoke

Interpolation: use ${var} to reference other variables and ${env:VAR} to read from the process environment. Use $$ for a literal $.

Variable names starting with _ are reserved and will be rejected at load time. Built-in synthetic variables:

Variable	Available in	Description
`${_envfile}`	`command`, `setup`, `teardown`, `post_run`, `output_file`	Absolute path to the current combo's env file
`${_run_id}`	`command`, `setup`, `teardown`, `post_run`	UUID for this rep — same value written to `runs.jsonl`
`${_exit_code}`	`post_run`	Benchmark exit code
`${_started_at}`	`post_run`	ISO timestamp when the benchmark started
`${_finished_at}`	`post_run`	ISO timestamp when the benchmark finished
`${_duration_ms}`	`post_run`	Wall-clock duration in milliseconds

Profiles overlay the base spec — variables, static, when, and exclude lists are merged. The default_profile is used when --profile is not specified.

Tool YAML format

name: oha
command: "oha ${URL} -n ${REQUESTS} -c ${concurrency} --json"
# ${_envfile} is a built-in variable: absolute path to the current combo's env file.
# Example: docker run --env-file ${_envfile} myimage
timeout_seconds: 300
version_command: "oha --version"

# extract metrics from JSON stdout via JSONPath
capture:
  requests_per_sec: "$.summary.requestsPerSec"
  p50_ms: "$.latencyPercentiles.p50"
  errors[]: "$.errors[*].message"   # trailing [] collects all matches as a list

# alternative: a custom Python parser function
# parser: "mymodule:parse_fn"       # fn(stdout: str) -> dict

# read extraction input from a file the tool writes, instead of stdout
# output_file: "results.json"       # interpolates ${var} / ${env:VAR}
# output_format: jsonl              # "json" (default) or "jsonl" for one JSON object per line

pretty_names:
  requests_per_sec: "Requests/s"
  p50_ms: "p50 latency"
units:
  p50_ms: ms
higher_is_better:
  requests_per_sec: true
  p50_ms: false

# optional: run once at the start of the sweep; output captured into env.snapshot.json
# under a "probes" key. Non-zero exit or missing command stores null for that key.
env_probes:
  kernel:       "uname -r"
  cpu:          "sysctl -n machdep.cpu.brand_string"
  redis_version: "redis-cli --version"

# optional: run once per unique config hash (before the first rep for that combo);
# commands may reference combo vars. Results stored in combo_probes.json and
# embedded in every runs.jsonl row under "combo_probes".
combo_probes:
  effective_maxmemory: "redis-cli -p ${PORT} CONFIG GET maxmemory"
  row_count:           "psql ${DSN} -tAc 'SELECT count(*) FROM events'"

# optional: shell steps run around every rep (warmup and measured)
setup:
  - "docker compose up -d redis"
  - "sleep 1"
teardown:
  - "docker compose down -v"
setup_timeout_seconds: 120   # per-step timeout for setup/teardown/post_run (default 120)

# optional: shell steps run after teardown, always (even on benchmark failure)
# receives run-result vars: ${_run_id}, ${_exit_code}, ${_started_at}, ${_finished_at}, ${_duration_ms}
post_run:
  - "prom-query.sh ${_started_at} ${_finished_at} ${_run_id}"

# optional: declarative plots used by `stats --report`
plots:
  - id: rps_by_workload
    type: bar                        # bar | stacked-bar | line | scatter
    title: "Throughput by workload"
    x: workload
    y: requests_per_sec
  - id: latency_breakdown
    type: stacked-bar
    title: "Latency percentiles"
    x: workload
    y: [p50_ms, p95_ms, p99_ms]
  - id: rps_vs_concurrency
    type: line
    title: "Throughput scaling"
    x: concurrency                        # combo variable on x-axis
    y: requests_per_sec
    group_by: workload                    # one line per workload value
  - id: rps_vs_concurrency_multi
    type: line
    title: "Throughput scaling by workload + backend"
    x: concurrency
    y: requests_per_sec
    group_by: [workload, backend]         # one line per workload+backend combo
  - id: throughput_vs_latency
    type: scatter
    title: "Throughput / latency tradeoff"
    x: requests_per_sec                  # metric on x-axis (not a variable)
    y: p95_ms                            # metric on y-axis
    group_by: workload                   # one labeled point per workload value

group_by in plots. Splits data into multiple series based on combo variable values. Accepts a single variable name or a list; multiple keys are joined with / in the legend label.

bar / stacked-bar / line: without group_by, each y metric becomes one series. With group_by, you get one series per (metric, group-value) pair.
scatter: x and y are both metric names (not variables). Each unique combination of group_by values becomes its own labeled point. Without group_by, all points collapse into a single "all" series.

Negation in group_by. Prefix a variable name with ! to mean "all variables except this one". Useful when you have many variables and don't want to list them all:

- id: rps_all_configs
  type: line
  x: concurrency
  y: rps
  group_by: "!concurrency"       # one line per every other variable combination

- id: rps_except_region
  type: line
  x: concurrency
  y: rps
  group_by: ["!concurrency", "!region"]   # exclude multiple vars; keep the rest

Negation is resolved at report time against the actual variable names in aggregates.jsonl. Unknown excluded names are silently ignored.

env_probes. A mapping of key → shell command run once at the very start of the sweep (before any rep). The trimmed stdout of each command is stored in env.snapshot.json under "probes". A non-zero exit code or missing command stores null for that key — probes never abort a sweep.

combo_probes. A mapping of key → command template run once per unique config hash, before the first rep for that combo (after per-sweep setup). Commands interpolate combo vars (${var}, ${_envfile}, ${env:VAR}). Results are stored in two places: combo_probes.json (keyed by config hash) and embedded in every runs.jsonl row under "combo_probes". Non-zero exit, missing command, timeout, or interpolation error stores null — probes never abort a sweep. Useful for capturing system or service state that varies per combo (e.g. effective DB config after per-sweep setup seeded a different dataset, kernel tuning parameters set per workload).

// env.snapshot.json (excerpt)
{
  "host": "...",
  "probes": {
    "kernel": "24.2.0",
    "redis_version": "Redis server v=7.2.3 sha=...",
    "cpu": null
  }
}

Setup / teardown / post_run. Each rep follows the sequence setup → command → teardown → post_run. Teardown runs in a finally block, so it fires even on benchmark failure or Ctrl-C. Combo vars (${var}), ${_envfile}, ${_run_id}, and ${env:VAR} interpolate in all three phases. Steps are split with shlex.split and executed without a shell, so chain via multiple list entries rather than &&.

Setup failure → the command is skipped, teardown still runs best-effort, post_run is skipped, and failure_reason is recorded as setup[i]: ….
Teardown failure → the benchmark's exit_code and metrics are preserved, but teardown[i]: … is appended to failure_reason.
post_run → runs after teardown completes, always — including when the benchmark exits non-zero. In addition to combo vars, it receives ${_exit_code}, ${_started_at}, ${_finished_at}, and ${_duration_ms}. Useful for collecting time-windowed metrics from external systems (Prometheus, InfluxDB, etc.) keyed to the exact run via ${_run_id}. post_run failure is appended to failure_reason but does not suppress the benchmark result or its metrics.

Sweep-scoped setup / teardown. setup_per_sweep and teardown_per_sweep run outside the per-rep loop, useful for expensive prep like seeding a database. By default each fires exactly once for the whole sweep. Set per_sweep_var: <name> to scope each fire to a single variable: combos are stably grouped by that variable's value, and the phases fire once per distinct value (around the reps for that group).

setup_per_sweep:    ["seed-db.sh"]              # fires once before any rep
teardown_per_sweep: ["drop-db.sh"]
per_sweep_var:      workload                    # optional; one var only

Without per_sweep_var: only ${env:VAR} can be referenced; any ${combo_var} rejected at sweep start.
With per_sweep_var: X: only ${X} and ${env:VAR} can be referenced; the current group's value of X is substituted.
--skip-existing: if every rep in a group is already recorded, both phases are skipped for that group.
Setup failure: all planned reps in that group get a failure row with failure_reason="per_sweep_setup[i]: …"; teardown still runs best-effort. Next group proceeds.
Teardown failure: appended to the last rep row in the group's failure_reason.
Raw record: raw/sweep.json (Mode A) or raw/sweep-{slug(value)}.json (Mode B) — same shape as per-rep raw files.

Raw output records. When a raw record is written, raw/{run_id}.json is a JSON object with:

stdout, stderr — the tool's own streams (always present).
output_file — {path, content} when output_file is configured in the tool YAML, so the tool's stdout/stderr stay separate from the file content used for extraction.
setup, teardown, post_run — one entry per step that ran, each with command, exit_code, stdout, stderr.

Results directory layout

results/
  runs.jsonl              # append-only log, one JSON object per run
  aggregates.jsonl        # per-combo stats (n, mean, stddev, p50/95/99, CI95)
  manifest.snapshot.json  # copy of the manifest used
  tools/{name}.yaml       # copy of the tool YAML used
  env.snapshot.json       # host info (OS, CPU, git SHA) + env_probes results under "probes"
  combo_probes.json       # combo_probes results keyed by config hash
  pretty_names.json       # merged metric display names
  raw/{run_id}.json       # structured raw record (see below); written with
                          # --keep-raw, on extract failure, on tool failure,
                          # or when setup/teardown/post_run failed
  raw/sweep[-{slug}].json # per_sweep setup/teardown records; written on
                          # --keep-raw or any per_sweep phase failure

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.14.1

May 27, 2026

0.14.0

May 27, 2026

0.13.0

May 27, 2026

0.12.2

May 25, 2026

0.12.1

May 25, 2026

0.12.0

May 24, 2026

0.11.0

May 23, 2026

0.10.0

May 23, 2026

0.9.0

May 23, 2026

0.8.0

May 23, 2026

0.7.0

May 22, 2026

0.6.1

May 22, 2026

0.6.0

May 22, 2026

0.5.1

May 22, 2026

0.5.0

May 22, 2026

0.4.9

May 21, 2026

0.4.8

May 21, 2026

0.4.7

May 20, 2026

0.4.6

May 18, 2026

0.4.5

May 18, 2026

0.4.4

May 15, 2026

0.4.3

May 15, 2026

0.4.2

May 14, 2026

0.4.1

May 13, 2026

This version

0.4.0

May 13, 2026

0.3.0

May 13, 2026

0.2.1

May 12, 2026

0.2.0

May 11, 2026

0.1.1

May 11, 2026

0.1.0

May 11, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

abench_speckz-0.4.0.tar.gz (60.5 kB view details)

Uploaded May 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

abench_speckz-0.4.0-py3-none-any.whl (44.4 kB view details)

Uploaded May 13, 2026 Python 3

File details

Details for the file abench_speckz-0.4.0.tar.gz.

File metadata

Download URL: abench_speckz-0.4.0.tar.gz
Upload date: May 13, 2026
Size: 60.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for abench_speckz-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`6d33a1d6f1f8a015bbf0def566613e59d85cab9431432ffd4747e2082f0aae3e`
MD5	`2dbc138c6c9c3c6f1baa5b363eefb915`
BLAKE2b-256	`bde12369efa8b78fb811fc03e5bbad93e48ef3ac7ffe1520f265f18e76bb85a3`

See more details on using hashes here.

File details

Details for the file abench_speckz-0.4.0-py3-none-any.whl.

File metadata

Download URL: abench_speckz-0.4.0-py3-none-any.whl
Upload date: May 13, 2026
Size: 44.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for abench_speckz-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a1bbe7e53f7103ab977fd9c42a0dea701991d52e24edd6fc039b6747ab122f05`
MD5	`adbbb462968a4836cfea581a83517598`
BLAKE2b-256	`cc5d759db413b9292d9e0a442c316fd4f333137dddeb096bb72bb0ece384c02d`

See more details on using hashes here.

abench-speckz 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

abench-speckz

Install

Workflow

Commands

`gen` — generate env-file combinations

`run` — execute a tool across every combination

`stats` — aggregate and display results

`rebuild-aggregates` — regenerate aggregates from raw runs

Spec format

Tool YAML format

Results directory layout

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

abench-speckz 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

abench-speckz

Install

Workflow

Commands

gen — generate env-file combinations

run — execute a tool across every combination

stats — aggregate and display results

rebuild-aggregates — regenerate aggregates from raw runs

Spec format

Tool YAML format

Results directory layout

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`gen` — generate env-file combinations

`run` — execute a tool across every combination

`stats` — aggregate and display results

`rebuild-aggregates` — regenerate aggregates from raw runs