Async HTTP benchmarking utility with pluggable workloads and load models.

benchmaker

Async HTTP benchmarking with pluggable workload-types (protocols), workloads (datasets), load models, hooks, and optional periodic monitors.

+--------+   item   +---------------+   request   +-----------+   +---------+
|workload|--------->| workload-type |------------>| pre-hooks |-->| aiohttp |
|(dataset|          | (protocol)    |             +-----------+   +---------+
| / log) |          | make_request  |                                 |
+--------+          | make_sample   |              +------------+     v
   ^                +---------------+              | post-hooks |<----+
   |                                               +------------+
   +-- load model decides WHEN to fire ----+              v
                                           |        +----------+
              monitors run alongside ------+------->| metrics  |
              (Prometheus, NVML, ...)               | aggregator|
                                                    +----------+

Install

pip install -e .
pip install -e .[dev]   # for tests

This installs the benchmaker Python package and the benchmaker CLI.

30-second tour

import asyncio
from benchmaker import BenchConfig, BenchRunner, ConstantRPS, HttpWorkloadType

async def main():
    cfg = BenchConfig(
        workload_type=HttpWorkloadType(url="https://httpbin.org/get"),
        load=ConstantRPS(rps=50, duration_s=10),
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)

asyncio.run(main())

Or via the CLI. Workload-specific benchmarks are exposed as recipes, invoked as benchmaker <recipe> --args (http, llm, sandbox, swebench, sglang, trajectory-replay):

benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
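
The poisson:50 rate spec asks for Poisson arrivals at an average of 50 requests per second. The sketch below shows one way such a firing schedule could be generated; it is an illustration under that assumption, not benchmaker's actual load model, and poisson_arrival_times is a hypothetical helper name.

```python
import random
from typing import List, Optional


def poisson_arrival_times(
    rate_per_s: float, duration_s: float, seed: Optional[int] = None
) -> List[float]:
    """Return fire times (seconds from start) for a Poisson arrival process.

    Hypothetical helper for illustration; not part of the benchmaker API.
    """
    rng = random.Random(seed)
    times: List[float] = []
    t = 0.0
    while True:
        # Gaps between events of a Poisson process are exponentially
        # distributed with mean 1 / rate.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            break
        times.append(t)
    return times


schedule = poisson_arrival_times(rate_per_s=50, duration_s=10, seed=0)
print(f"{len(schedule)} requests scheduled over 10 s")
```

Unlike a constant-rate schedule, the gaps here vary, so the target sees bursts and lulls around the same average rate.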

Walkthrough: benchmarking an LLM endpoint with ShareGPT

A realistic LLM benchmark needs a real prompt distribution. ShareGPT V3 is a common choice — multi-turn human/assistant conversations scraped from real ChatGPT users. A cleaned, benchmark-ready copy is published at researchcomputer/llmsys-bench (split="sharegpt"), with one row per conversation:

{"id": "...", "messages": [{"role": "user", "content": "..."},
                           {"role": "assistant", "content": "..."},
                           {"role": "user", "content": "..."}]}

messages is the only content field — it's everything a chat benchmark needs. Each row is truncated to end on a user turn, so it's a valid generation request: the server completes the final assistant reply given the prior history. Short source conversations collapse to a single user turn (a plain single-turn prompt); longer ones carry multi-turn context.
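
The "end on a user turn" rule can be sketched as follows. This is a minimal illustration of the idea, assuming rows shaped like the example above; truncate_to_user_turn is a hypothetical name and not necessarily how tools/sharegpt/prepare.py implements it.

```python
from typing import Dict, List

Message = Dict[str, str]


def truncate_to_user_turn(messages: List[Message]) -> List[Message]:
    """Drop trailing non-user messages so the list ends on a user turn.

    Hypothetical helper; returns an empty list if there is no user turn.
    """
    end = len(messages)
    while end > 0 and messages[end - 1]["role"] != "user":
        end -= 1
    return messages[:end]


conversation = [
    {"role": "user", "content": "What is a mutex?"},
    {"role": "assistant", "content": "A lock that guards shared state."},
    {"role": "user", "content": "And a semaphore?"},
    {"role": "assistant", "content": "A counter that limits concurrency."},
]
request_messages = truncate_to_user_turn(conversation)
print(request_messages[-1]["role"])
```

The trailing assistant reply is dropped, so the server is asked to generate it from the three preceding messages.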

Load it directly from the Hub

Pull the published split and feed each row's messages list straight into the chat workload-type (pip install -e .[hf]):

import asyncio
from datasets import load_dataset
from benchmaker import (
    BenchConfig, BenchRunner, OpenAIChatWorkloadType,
    IterableWorkload, parse_rate_spec,
)

async def main():
    ds = load_dataset("researchcomputer/llmsys-bench", split="sharegpt")
    cfg = BenchConfig(
        workload_type=OpenAIChatWorkloadType(
            url="http://localhost:8000/v1/chat/completions",
            model="meta-llama/Llama-3.1-8B-Instruct",
            max_tokens=256,
        ),
        workload=IterableWorkload(row["messages"] for row in ds),
        load=parse_rate_spec("poisson:8", duration_s=60),
        timeout_s=600,
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)

asyncio.run(main())

OpenAIChatWorkloadType receives the message list as-is, so single-turn rows send one user message and multi-turn rows replay the full history before the server generates the final assistant turn. TTFT, inter-token latency, and tokens/sec are captured the same way in both cases. URL / model / API key can also come from .env via OpenAIChatWorkloadType.from_env(...).
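
For readers unfamiliar with the streaming metrics named above, here is a small sketch of how they can be derived from per-token arrival timestamps. The definitions are common conventions and an assumption here; benchmaker's own aggregation may differ (for example in how it counts tokens), and streaming_metrics is a hypothetical function.

```python
from typing import Dict, List, Optional


def streaming_metrics(
    request_sent_s: float, token_times_s: List[float]
) -> Optional[Dict[str, float]]:
    """Compute TTFT, mean inter-token latency, and tokens/sec.

    Hypothetical helper; timestamps are seconds on one monotonic clock.
    """
    if not token_times_s:
        return None
    # Time to first token: delay between sending and the first streamed token.
    ttft = token_times_s[0] - request_sent_s
    # Inter-token latency: gaps between consecutive streamed tokens.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    # Throughput over the whole request, including the wait for the first token.
    total = token_times_s[-1] - request_sent_s
    tokens_per_s = len(token_times_s) / total if total > 0 else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl, "tokens_per_s": tokens_per_s}


metrics = streaming_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 1.0])
print(metrics)
```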

Rebuild or customize it yourself

The published split is produced by tools/sharegpt/prepare.py, which downloads the upstream JSON once into .local/ (gitignored) and converts it to the JSONL shape above. Run it when you want a subset, different filtering, or a refresh:

# Defaults: .local/sharegpt_v3_raw.json  ->  .local/sharegpt_v3.jsonl
python tools/sharegpt/prepare.py

# A quick subset for smoke tests:
python tools/sharegpt/prepare.py --max-items 2000

The raw download is ~700 MB. Use --min-chars / --max-chars to drop empty or pathologically long conversations (measured over total message content per row). Point any workload at the local file with JsonlWorkload(path=..., field="messages"), or on the CLI:

benchmaker llm \
    --url   http://localhost:8000/v1/chat/completions \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --prompts-jsonl .local/sharegpt_v3.jsonl \
    --prompt-field  messages \
    --max-tokens 256 \
    --rate poisson:8 --duration 60s \
    --out-dir ./runs --label dataset=sharegpt

To re-publish after regenerating, tools/sharegpt/upload_hf.py pushes the JSONL back to the Hub (needs a write token).
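
The --min-chars / --max-chars filter described above is measured over the total message content of a row. A rough sketch of that check, assuming inclusive bounds (an assumption, as are the helper names; the script's exact behavior may differ):

```python
from typing import Dict, List

Row = Dict[str, List[Dict[str, str]]]


def total_chars(row: Row) -> int:
    """Sum the content length of every message in a row."""
    return sum(len(message["content"]) for message in row["messages"])


def keep_row(row: Row, min_chars: int, max_chars: int) -> bool:
    """Keep rows whose total content length is within [min_chars, max_chars].

    Hypothetical helper; inclusive bounds are assumed for illustration.
    """
    return min_chars <= total_chars(row) <= max_chars


rows = [
    {"messages": [{"role": "user", "content": ""}]},
    {"messages": [{"role": "user", "content": "Explain TCP slow start."}]},
    {"messages": [{"role": "user", "content": "x" * 50_000}]},
]
kept = [row for row in rows if keep_row(row, min_chars=1, max_chars=10_000)]
print(len(kept))
```

Only the middle row survives: the first is empty and the third exceeds the upper bound.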

Documentation

Full docs live in docs/.

Deterministic replay (swebench-replay)

Re-run a recorded SWE-bench job with the LLM mocked from its own logs. The real pi + sandbox + verifier pipeline still runs; only the model is served back from recorded outputs, so re-runs are deterministic and free of model cost and variance. Vary --concurrency (or --sweep) to study the rest of the pipeline without the model's stochasticity as a confound. FLASH_SANDBOX_URL is still required, because the sandbox and verifier are real.

# 1) (optional) convert a job's pi logs to a replay store — the recipe can also
#    do this inline via --job.
python -m benchmaker.swebench.trajectory jobs/2026-06-08__05-24-01_b352cb \
    -o replay-trajectories.jsonl

# 2) replay (host mode, localhost) across a concurrency sweep
FLASH_SANDBOX_URL=http://localhost:8080 \
  benchmaker swebench-replay --trajectories replay-trajectories.jsonl \
    --mode pi-host --sweep 1,5,25

# container mode: bind 0.0.0.0 and tell the sandbox how to reach the server
FLASH_SANDBOX_URL=http://localhost:8080 \
  benchmaker swebench-replay --job jobs/2026-06-08__05-24-01_b352cb \
    --mode pi-container --host 0.0.0.0 --reachable-host "$(hostname -I | awk '{print $1}')"

The replay server is stateless: it picks each response by the task's identity (the # Task: line, falling back to a hash of the full prompt when the recorded run lacked an instance id) plus the count of assistant messages already in the request — so it is correct at any concurrency. A MISSES column in the summary flags any divergence (a request beyond the recorded turns).
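
The stateless lookup described above can be sketched as a pure function of the incoming request. This is an illustration only: the store layout, the use of the first user message as the "full prompt", SHA-256 as the hash, and the names replay_key and lookup are all assumptions, not the replay server's actual code.

```python
import hashlib
import re
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]
# Matches a "# Task: <id>" line anywhere in the prompt.
TASK_LINE = re.compile(r"^# Task:[ \t]*(.+)$", re.MULTILINE)


def replay_key(messages: List[Message]) -> Tuple[str, int]:
    """Derive (task identity, turn index) from the request alone."""
    prompt = next((m["content"] for m in messages if m["role"] == "user"), "")
    match = TASK_LINE.search(prompt)
    if match:
        task_id = match.group(1).strip()
    else:
        # Fallback when no task line is present: hash the prompt text.
        task_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    # The number of assistant messages already sent selects the recorded turn.
    turn = sum(1 for m in messages if m["role"] == "assistant")
    return task_id, turn


def lookup(store: Dict[Tuple[str, int], str], messages: List[Message]) -> Optional[str]:
    """Return the recorded response, or None for a miss."""
    return store.get(replay_key(messages))


store = {("django-123", 0): "first reply", ("django-123", 1): "second reply"}
history = [{"role": "user", "content": "# Task: django-123\nFix the bug."}]
print(lookup(store, history))
```

Because the key depends only on the request contents, concurrent requests for different tasks or turns cannot interfere with each other, and a request past the recorded turns simply returns no match.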

Examples

Under examples/:

  • simple_get.py — minimal library usage
  • custom_hooks.py — request signing + response parsing
  • llm_chat.py — OpenAI-compatible LLM endpoint with streaming
  • vllm_with_monitor.py — LLM benchmark with concurrent vLLM /metrics scrape
  • sandbox_exec.py — Flash Sandbox /exec latency benchmark
  • sandbox_lifecycle.py — full create → exec → delete cold-start benchmark
  • llm_eval.py — LLM benchmark + accuracy grading (exact/regex/judge)
  • gsm8k_eval.py — GSM8K from HuggingFace + integer-match scorer
  • config.yaml — generic HTTP YAML config
  • config_llm.yaml — LLM YAML config with a Prometheus monitor

Helper tooling under tools/, grouped by purpose:

  • sharegpt/prepare.py (fetch ShareGPT V3 → JSONL) + upload_hf.py (push to the HF Hub with a write token)
  • swe_images/ — mirror SWE-bench/R2E-Gym container images to ghcr (publish.py) and list the published refs (pull.py)
  • agent_warmup/ — build the agent-warmup SFT dataset (python -m tools.agent_warmup.cli)
  • start_local_llm.sh — example local SGLang launch command

Project layout

benchmaker/          # library code
  __init__.py        #   public API (re-exports); cli.py — the `benchmaker` CLI
  config.py  env.py  #   YAML config loading + .env interpolation
  core/              #   engine: types, load models, runner, metrics, monitors, trace
  io/                #   run output: per-run bundle + cross-run collection
  workloads/         #   workload-types (http, llm, sandbox, agent, hf, eval)
  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay) + registry
  swebench/          #   SWE-bench coding agent + grading + harbor adapters
examples/            # runnable examples (incl. swebench/ coding-agent config)
tools/               # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
tests/               # pytest smoke tests
docs/                # reference docs

Run the tests

pytest -q
