Skip to main content

YAML-driven benchmark sweeps: generate env-file combinations, execute a tool across each, and query DuckDB-backed aggregate stats.

Project description

abench-speckz

Generate Docker env-file combinations from a YAML benchmark spec, execute a benchmark tool across every combination, and query the results.

Install

Requires Python 3.10+.

python -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'

Note on examples: files under examples/ reference paths like python examples/sample_bench.py. Those paths are relative to the repo root, so the examples run only from a checkout — not from an arbitrary working directory after pip install. Clone the repo and cd into it to follow the examples verbatim.

Workflow

spec.yaml  →  abench-speckz gen  →  out/ (env-files + manifest.json)
                                    ↓
              abench-speckz run  →  results/ (runs.jsonl + aggregates.jsonl)
                                    ↓
              abench-speckz stats →  table / JSON / TSV

Commands

gen — generate env-file combinations

abench-speckz gen spec.yaml --out out/           # write env-files
abench-speckz gen spec.yaml --dry-run            # print summary table
abench-speckz gen spec.yaml --list               # print TSV
abench-speckz gen spec.yaml --profile smoke --out out/
abench-speckz gen spec.yaml --tag stress --out out/
abench-speckz gen spec.yaml --exclude-tag slow --out out/

Each combination is written as a Docker env-file (KEY=value per line). A manifest.json in the output directory maps each filename back to its full variable assignment and tags.

run — execute a tool across every combination

abench-speckz run out/ --tool oha.tool.yaml
abench-speckz run out/ --tool oha.tool.yaml --repeat 5 --warmup 1
abench-speckz run out/ --tool oha.tool.yaml --filter workload=read
abench-speckz run out/ --tool oha.tool.yaml --filter-tag stress
abench-speckz run out/ --tool oha.tool.yaml --filter-exclude-tag slow
abench-speckz run out/ --tool oha.tool.yaml --skip-existing --keep-raw
abench-speckz run out/ --tool oha.tool.yaml --dry-run   # print planned commands

Results are written to results/ (configurable with --results).

stats — aggregate and display results

abench-speckz stats results/
abench-speckz stats results/ --group-by workload --group-by concurrency
abench-speckz stats results/ --metric requests_per_sec --metric p50_ms
abench-speckz stats results/ --where workload=read
abench-speckz stats results/ --filter-tag stress
abench-speckz stats results/ --filter-exclude-tag slow
abench-speckz stats results/ --format json
abench-speckz stats results/ --format tsv
abench-speckz stats results/ --pretty            # use display names from tool YAML
abench-speckz stats results/ --from-raw          # recompute from runs.jsonl
abench-speckz stats results/ --report report.html              # self-contained Chart.js HTML
abench-speckz stats results/ --report report.html --plots plots.yaml  # override tool YAML plots

--report writes a self-contained HTML file with Chart.js plots. Plot definitions come from the tool YAML's plots: list (see below), or from a separate YAML file via --plots. When no plots are defined, a default per-metric bar chart is rendered.

rebuild-aggregates — regenerate aggregates from raw runs

abench-speckz rebuild-aggregates results/

Spec format

static:
  IMAGE: myapp:latest
  REGION: us-east-1

variables:
  workload:    [read, write, mixed]
  concurrency: [1, 8, 64]
  backend:     [postgres, mysql]

# conditional overrides and tagging
when:
  - if:  { workload: write, backend: mysql }
    set: { LOCK_TIMEOUT: "30s" }
    tag: [slow, write-heavy]
  - if:  { concurrency: 64 }
    set: { THREAD_POOL: "${concurrency}" }
    tag: [stress]

# combos to drop entirely
exclude:
  - { backend: mysql, concurrency: 1 }

# tags applied to every combo
tags: [bench]

profiles:
  smoke:
    variables:
      concurrency: [1]
      workload: [read]
  full: {}

default_profile: smoke

Interpolation: use ${var} to reference other variables and ${env:VAR} to read from the process environment. Use $$ for a literal $.

Profiles overlay the base spec — variables, static, when, and exclude lists are merged. The default_profile is used when --profile is not specified.

Tool YAML format

name: oha
command: "oha ${URL} -n ${REQUESTS} -c ${concurrency} --json"
timeout_seconds: 300
version_command: "oha --version"

# extract metrics from JSON stdout via JSONPath
capture:
  requests_per_sec: "$.summary.requestsPerSec"
  p50_ms: "$.latencyPercentiles.p50"
  errors[]: "$.errors[*].message"   # trailing [] collects all matches as a list

# alternative: a custom Python parser function
# parser: "mymodule:parse_fn"       # fn(stdout: str) -> dict

# read extraction input from a file the tool writes, instead of stdout
# output_file: "results.json"       # interpolates ${var} / ${env:VAR}
# output_format: jsonl              # "json" (default) or "jsonl" for one JSON object per line

pretty_names:
  requests_per_sec: "Requests/s"
  p50_ms: "p50 latency"
units:
  p50_ms: ms
higher_is_better:
  requests_per_sec: true
  p50_ms: false

# optional: shell steps run around every rep (warmup and measured)
setup:
  - "docker compose up -d redis"
  - "sleep 1"
teardown:
  - "docker compose down -v"
setup_timeout_seconds: 120   # per-step timeout for setup/teardown (default 120)

# optional: declarative plots used by `stats --report`
plots:
  - id: rps_by_workload
    type: bar                        # bar | stacked-bar | line | scatter
    title: "Throughput by workload"
    x: workload
    y: requests_per_sec
  - id: latency_breakdown
    type: stacked-bar
    title: "Latency percentiles"
    x: workload
    y: [p50_ms, p95_ms, p99_ms]
  - id: rps_vs_concurrency
    type: line
    title: "Throughput scaling"
    x: concurrency
    y: requests_per_sec
    group_by: workload

Setup / teardown. Each rep is wrapped setup → command → teardown. Teardown runs in a finally block, so it also fires on benchmark failure or Ctrl-C. Combo vars (${var}) and ${env:VAR} interpolate in setup/teardown commands. Steps are split with shlex.split and executed without a shell, so chain via multiple list entries rather than &&.

  • Setup failure → the command is skipped, teardown still runs best-effort, and failure_reason is recorded as setup[i]: ….
  • Teardown failure → the benchmark's exit_code and metrics are preserved, but teardown[i]: … is appended to failure_reason (so the run is counted as failed).

Results directory layout

results/
  runs.jsonl              # append-only log, one JSON object per run
  aggregates.jsonl        # per-combo stats (n, mean, stddev, p50/95/99, CI95)
  manifest.snapshot.json  # copy of the manifest used
  tools/{name}.yaml       # copy of the tool YAML used
  env.snapshot.json       # host info (OS, CPU, git SHA)
  pretty_names.json       # merged metric display names
  raw/{run_id}.json       # raw stdout (only with --keep-raw or on extract failure)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

abench_speckz-0.1.0.tar.gz (45.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

abench_speckz-0.1.0-py3-none-any.whl (36.3 kB view details)

Uploaded Python 3

File details

Details for the file abench_speckz-0.1.0.tar.gz.

File metadata

  • Download URL: abench_speckz-0.1.0.tar.gz
  • Upload date:
  • Size: 45.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for abench_speckz-0.1.0.tar.gz
Algorithm Hash digest
SHA256 cda74a047c5cfcc1b286b9fc364ac106545eedfb112ceee40af3a0349cc14eda
MD5 5dcf0c3592eeed266ff2d9b680f2e11d
BLAKE2b-256 08e0f9bec34043ff889ca3e7b55f10255a9fd08edfd48b08e8ef7f518b25bd86

See more details on using hashes here.

File details

Details for the file abench_speckz-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: abench_speckz-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 36.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for abench_speckz-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 411be5b909ba28684162d494a7a777de87171f7952a00e896b8e3cdea71967ca
MD5 dae8481b8599d206d1f35a7eb830fa74
BLAKE2b-256 516d27b98e28dc7660bc53e0ceece477f967116dee8c22b23ff29ee71cda5ded

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page