Async HTTP benchmarking utility with pluggable workloads and load models.

benchmaker

Async HTTP benchmarking with pluggable workload-types (protocols), workloads (datasets), load models, hooks, and optional periodic monitors.

+--------+   item   +---------------+   request   +-----------+   +---------+
|workload|--------->| workload-type |------------>| pre-hooks |-->| aiohttp |
|(dataset|          | (protocol)    |             +-----------+   +---------+
| / log) |          | make_request  |                                 |
+--------+          | make_sample   |              +------------+     v
   ^                +---------------+              | post-hooks |<----+
   |                                               +------------+
   +-- load model decides WHEN to fire ----+              v
                                           |        +----------+
              monitors run alongside ------+------->| metrics  |
              (Prometheus, NVML, ...)               |aggregator|
                                                    +----------+
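
The request path in the diagram can be sketched in plain Python. This is an illustration only, not benchmaker's implementation: run_one, fake_send, and the hook signatures are hypothetical, the HTTP client is replaced by a stub so the sketch runs offline, and the load model and monitors are left out.

```python
import asyncio
import time

def make_request(item):
    # Workload-type step: turn a workload item into a request description.
    return {"method": "GET", "url": item, "headers": {}}

def add_header(request):
    # Example pre-hook (hypothetical): mutate the request before it is sent.
    request["headers"]["x-bench"] = "1"
    return request

async def fake_send(request):
    # Stand-in for the HTTP client so the sketch needs no network.
    await asyncio.sleep(0)
    return {"status": 200, "url": request["url"]}

def tag_ok(sample):
    # Example post-hook (hypothetical): annotate the sample after the response.
    sample["ok"] = sample["status"] == 200
    return sample

async def run_one(item, pre_hooks, post_hooks, samples):
    request = make_request(item)
    for hook in pre_hooks:
        request = hook(request)
    start = time.perf_counter()
    response = await fake_send(request)
    sample = {"status": response["status"],
              "latency_s": time.perf_counter() - start}
    for hook in post_hooks:
        sample = hook(sample)
    samples.append(sample)  # the "metrics aggregator" here is just a list

async def run_all(items):
    samples = []
    await asyncio.gather(
        *(run_one(i, [add_header], [tag_ok], samples) for i in items)
    )
    return samples

samples = asyncio.run(
    run_all(["https://example.invalid/a", "https://example.invalid/b"])
)
```

Keeping hooks as plain callables that take and return the request or sample is one simple design; the real hook interface is described in the hooks documentation.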

Install

pip install -e .
pip install -e ".[dev]"   # for tests; quoted so shells like zsh do not expand the brackets

This installs the benchmaker Python package and the benchmaker CLI.

30-second tour

import asyncio
from benchmaker import BenchConfig, BenchRunner, ConstantRPS, HttpWorkloadType

async def main():
    cfg = BenchConfig(
        workload_type=HttpWorkloadType(url="https://httpbin.org/get"),
        load=ConstantRPS(rps=50, duration_s=10),
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)

asyncio.run(main())

Or use the CLI, where workload-specific benchmarks are exposed as recipes invoked as benchmaker <recipe> --args (http, llm, sandbox, swebench, swebench-replay, sglang, trajectory-replay):

benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
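
A spec such as poisson:50 with a 10-second duration presumably describes an open-loop Poisson arrival process averaging 50 requests per second. The sketch below shows how such a schedule can be generated from exponential inter-arrival gaps. It is not benchmaker's parser or scheduler: parse_spec and poisson_schedule are hypothetical names, and the "<kind>:<rate>" grammar is assumed from the example above.

```python
import random

def parse_spec(spec):
    # Assumed grammar "<kind>:<rate>", inferred from the example "poisson:50".
    kind, _, rate = spec.partition(":")
    return kind, float(rate)

def poisson_schedule(rps, duration_s, seed=0):
    # Poisson arrivals have exponentially distributed gaps with mean 1/rps.
    rng = random.Random(seed)
    t, fire_times = 0.0, []
    while True:
        t += rng.expovariate(rps)
        if t >= duration_s:
            return fire_times
        fire_times.append(t)

kind, rps = parse_spec("poisson:50")
fire_times = poisson_schedule(rps, duration_s=10.0)
```

Because the schedule is computed up front and does not wait for responses, a slow server cannot throttle the offered load, which is the defining property of an open-loop model.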

Walkthrough: benchmarking an LLM endpoint with ShareGPT

A realistic LLM benchmark needs a real prompt distribution. ShareGPT V3 is a common choice — multi-turn human/assistant conversations scraped from real ChatGPT users. A cleaned, benchmark-ready copy is published at researchcomputer/llmsys-bench (split="sharegpt"), with one row per conversation:

{"id": "...", "messages": [{"role": "user", "content": "..."},
                           {"role": "assistant", "content": "..."},
                           {"role": "user", "content": "..."}]}

messages is the only content field — it's everything a chat benchmark needs. Each row is truncated to end on a user turn, so it's a valid generation request: the server completes the final assistant reply given the prior history. Short source conversations collapse to a single user turn (a plain single-turn prompt); longer ones carry multi-turn context.
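
The truncation rule can be illustrated with a small helper. This is a sketch of the idea only, under the assumption that trailing non-user messages are simply dropped; truncate_to_user_turn is a hypothetical name and may not match what the preparation script does.

```python
def truncate_to_user_turn(messages):
    # Drop trailing non-user messages so the last turn is a user prompt.
    end = len(messages)
    while end > 0 and messages[end - 1]["role"] != "user":
        end -= 1
    return messages[:end]

conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Summarize TCP."},
    {"role": "assistant", "content": "TCP is ..."},
]
request_messages = truncate_to_user_turn(conversation)
```

Here the final assistant reply is removed, leaving a three-message history that ends on the user turn the server is asked to answer.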

Load it directly from the Hub

Pull the published split and feed each row's messages list straight into the chat workload-type. This needs the hf extra (pip install -e ".[hf]"):

import asyncio
from datasets import load_dataset
from benchmaker import (
    BenchConfig, BenchRunner, OpenAIChatWorkloadType,
    IterableWorkload, parse_rate_spec,
)

async def main():
    ds = load_dataset("researchcomputer/llmsys-bench", split="sharegpt")
    cfg = BenchConfig(
        workload_type=OpenAIChatWorkloadType(
            url="http://localhost:8000/v1/chat/completions",
            model="meta-llama/Llama-3.1-8B-Instruct",
            max_tokens=256,
        ),
        workload=IterableWorkload(row["messages"] for row in ds),
        load=parse_rate_spec("poisson:8", duration_s=60),
        timeout_s=600,
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)

asyncio.run(main())

OpenAIChatWorkloadType receives the message list as-is, so single-turn rows send one user message and multi-turn rows replay the full history before the server generates the final assistant turn. TTFT, inter-token latency, and tokens/sec are captured the same way in both cases. URL / model / API key can also come from .env via OpenAIChatWorkloadType.from_env(...).
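
For readers unfamiliar with the streaming metrics named above, the sketch below shows one common way to derive them from per-token arrival timestamps. The exact definitions benchmaker uses are not stated here, so treat this as an assumption: streaming_metrics is a hypothetical helper, and some tools compute tokens/sec over the decode phase only.

```python
def streaming_metrics(request_start, token_times):
    # token_times: arrival timestamp (seconds) of each streamed token or chunk.
    if not token_times:
        return None
    ttft = token_times[0] - request_start           # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0  # inter-token latency
    total = token_times[-1] - request_start
    tokens_per_s = len(token_times) / total if total > 0 else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl, "tokens_per_s": tokens_per_s}

m = streaming_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
```

In this example the first token arrives after 0.5 s, later tokens are 0.25 s apart, and five tokens over 1.5 s give roughly 3.3 tokens per second.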

Rebuild or customize it yourself

The published split is produced by tools/sharegpt/prepare.py, which downloads the upstream JSON once into .local/ (gitignored) and converts it to the JSONL shape above. Run it when you want a subset, different filtering, or a refresh:

# Defaults: .local/sharegpt_v3_raw.json  ->  .local/sharegpt_v3.jsonl
python tools/sharegpt/prepare.py

# A quick subset for smoke tests:
python tools/sharegpt/prepare.py --max-items 2000

The raw download is ~700 MB. Use --min-chars / --max-chars to drop empty or pathologically long conversations (measured over total message content per row). Point any workload at the local file with JsonlWorkload(path=..., field="messages"), or on the CLI:

benchmaker llm \
    --url   http://localhost:8000/v1/chat/completions \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --prompts-jsonl .local/sharegpt_v3.jsonl \
    --prompt-field  messages \
    --max-tokens 256 \
    --rate poisson:8 --duration 60s \
    --out-dir ./runs --label dataset=sharegpt
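
The length filter and the field selection described above can be sketched together. This is a minimal illustration, assuming inclusive bounds and a "content" string on every message; iter_field and total_chars are hypothetical helpers, not the code in prepare.py or JsonlWorkload.

```python
import io
import json

def total_chars(messages):
    # Length is measured over the total message content of one row.
    return sum(len(m["content"]) for m in messages)

def iter_field(lines, field, min_chars=0, max_chars=None):
    # Yield row[field] for each JSONL line whose total length is in range.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value = json.loads(line)[field]
        n = total_chars(value)
        if n < min_chars or (max_chars is not None and n > max_chars):
            continue
        yield value

# In-memory stand-in for a JSONL file, so the sketch needs no files on disk.
jsonl = io.StringIO(
    '{"id": "a", "messages": [{"role": "user", "content": ""}]}\n'
    '{"id": "b", "messages": [{"role": "user", "content": "Explain DNS."}]}\n'
)
kept = list(iter_field(jsonl, "messages", min_chars=1, max_chars=1000))
```

The empty conversation is dropped by min_chars=1, and only the second row's messages list is yielded.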

To re-publish after regenerating, tools/sharegpt/upload_hf.py pushes the JSONL back to the Hub (needs a write token).

Documentation

Full docs live in docs/:

Quickstart
Concepts — WorkloadType, Workload, LoadModel, Monitor
Load models — rate-spec syntax, open vs closed loop
Workloads & workload-types — built-ins and custom subclasses
Hooks — pre/post request processing
Monitors — vLLM /metrics, GPU telemetry, custom samplers
Metrics & output — summary structure, JSONL dumps
Correctness / accuracy eval — grade responses against references
CLI & YAML reference
ShareGPT benchmark — self-contained end-to-end walkthrough
benchmaker sglang — native SGLang /generate benchmark (see docs/sglang.md).
benchmaker trajectory-replay — multi-turn prefix-cache parity replay of trajectory datasets like SWE-smith (see docs/trajectory-replay.md).

Deterministic replay (swebench-replay)

Re-run a recorded SWE-bench job with the LLM mocked from its own logs. The real pi + sandbox + verifier pipeline still runs; only the model is served back from recorded outputs, so re-runs are deterministic and free of model cost and variance. Vary --concurrency (or --sweep) to study the rest of the pipeline without the model's stochasticity as a confound. FLASH_SANDBOX_URL must still be set, because the sandbox and verifier are real.

# 1) (optional) convert a job's pi logs to a replay store — the recipe can also
#    do this inline via --job.
python -m benchmaker.swebench.trajectory jobs/2026-06-08__05-24-01_b352cb \
    -o replay-trajectories.jsonl

# 2) replay (host mode, localhost) across a concurrency sweep
FLASH_SANDBOX_URL=http://localhost:8080 \
  benchmaker swebench-replay --trajectories replay-trajectories.jsonl \
    --mode pi-host --sweep 1,5,25

# container mode: bind 0.0.0.0 and tell the sandbox how to reach the server
FLASH_SANDBOX_URL=http://localhost:8080 \
  benchmaker swebench-replay --job jobs/2026-06-08__05-24-01_b352cb \
    --mode pi-container --host 0.0.0.0 --reachable-host "$(hostname -I | awk '{print $1}')"

The replay server is stateless: it picks each response by the task's identity (the # Task: line, falling back to a hash of the full prompt when the recorded run lacked an instance id) plus the count of assistant messages already in the request — so it is correct at any concurrency. A MISSES column in the summary flags any divergence (a request beyond the recorded turns).
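
The stateless lookup can be sketched as a pure function of the incoming request. This is an illustration under stated assumptions, not the replay server's code: task_key and lookup are hypothetical names, the store layout is invented for the example, the fallback hash is taken over the first message's content, and the task id "example-task-1" is made up.

```python
import hashlib

def task_key(messages):
    # Prefer the "# Task:" line; otherwise hash the full first prompt.
    prompt = messages[0]["content"]
    for line in prompt.splitlines():
        if line.startswith("# Task:"):
            return line[len("# Task:"):].strip()
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def lookup(store, messages):
    # Turn index = number of assistant messages already in the request.
    turn = sum(1 for m in messages if m["role"] == "assistant")
    recorded = store.get(task_key(messages), [])
    if turn >= len(recorded):
        return None  # a miss: the request goes beyond the recorded turns
    return recorded[turn]

store = {"example-task-1": ["run tests", "apply patch"]}
first = [{"role": "user", "content": "# Task: example-task-1\nFix the bug."}]
second = first + [{"role": "assistant", "content": "run tests"},
                  {"role": "user", "content": "tests failed"}]
third = second + [{"role": "assistant", "content": "apply patch"},
                  {"role": "user", "content": "ok"}]
```

Because the response depends only on the request contents, no per-client session state is needed, which is why concurrent requests for different tasks or turns cannot interfere with each other.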

Examples

Under examples/:

simple_get.py — minimal library usage
custom_hooks.py — request signing + response parsing
llm_chat.py — OpenAI-compatible LLM endpoint with streaming
vllm_with_monitor.py — LLM benchmark with concurrent vLLM /metrics scrape
sandbox_exec.py — Flash Sandbox /exec latency benchmark
sandbox_lifecycle.py — full create → exec → delete cold-start benchmark
llm_eval.py — LLM benchmark + accuracy grading (exact/regex/judge)
gsm8k_eval.py — GSM8K from HuggingFace + integer-match scorer
config.yaml — generic HTTP YAML config
config_llm.yaml — LLM YAML config with a Prometheus monitor

Helper tooling under tools/, grouped by purpose:

sharegpt/ — prepare.py (fetch ShareGPT V3 → JSONL) + upload_hf.py (push to the HF Hub with a write token)
swe_images/ — mirror SWE-bench/R2E-Gym container images to ghcr (publish.py) and list the published refs (pull.py)
agent_warmup/ — build the agent-warmup SFT dataset (python -m tools.agent_warmup.cli)
start_local_llm.sh — example local SGLang launch command

Project layout

benchmaker/          # library code
  __init__.py        #   public API (re-exports)
  cli.py             #   the benchmaker CLI
  config.py  env.py  #   YAML config loading + .env interpolation
  core/              #   engine: types, load models, runner, metrics, monitors, trace
  io/                #   run output: per-run bundle + cross-run collection
  workloads/         #   workload-types (http, llm, sandbox, agent, hf, eval)
  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay) + registry
  swebench/          #   SWE-bench coding agent + grading + harbor adapters
examples/            # runnable examples (incl. swebench/ coding-agent config)
tools/               # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
tests/               # pytest smoke tests
docs/                # reference docs

Run the tests

pytest -q

Release history

0.1.2 (this version): Jun 11, 2026
0.1.1: Jun 10, 2026
0.1.0: May 28, 2026

