Async HTTP benchmarking utility with pluggable workloads and load models.
benchmaker
Async HTTP benchmarking with pluggable workload-types (protocols), workloads (datasets), load models, hooks, and optional periodic monitors.
+--------+   item   +---------------+   request   +-----------+   +---------+
|workload|--------->| workload-type |------------>| pre-hooks |-->| aiohttp |
|(dataset|          |  (protocol)   |             +-----------+   +---------+
| / log) |          | make_request  |                                |
+--------+          | make_sample   |             +------------+     v
    ^               +---------------+             | post-hooks |<----+
    |                                             +------------+
    +-- load model decides WHEN to fire ----+           v
                                            |     +------------+
    monitors run alongside -----------------+---->|  metrics   |
    (Prometheus, NVML, ...)                       | aggregator |
                                                  +------------+
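The per-request path in the diagram can be sketched in a few lines. This is only an illustration of the flow (item, request, pre-hooks, send, post-hooks), not benchmaker's actual API: every name below (`make_request`, `add_auth_header`, `fake_send`, `tag_ok`, `run_one`) is hypothetical, and the HTTP client is replaced by a stub so the sketch runs offline.

```python
import asyncio
from typing import Any, Awaitable, Callable

Request = dict[str, Any]
Response = dict[str, Any]
Hook = Callable[[dict], dict]


def make_request(item: str) -> Request:
    # Workload-type step: turn one workload item into a request description.
    return {"method": "GET", "url": item, "headers": {}}


def add_auth_header(request: Request) -> Request:
    # Example pre-hook: decorate the outgoing request (placeholder token).
    request["headers"]["Authorization"] = "Bearer <token>"
    return request


async def fake_send(request: Request) -> Response:
    # Stand-in for the real HTTP client; yields once and returns a canned reply.
    await asyncio.sleep(0)
    return {"status": 200, "url": request["url"]}


def tag_ok(response: Response) -> Response:
    # Example post-hook: derive a field from the raw response.
    response["ok"] = response["status"] < 400
    return response


async def run_one(
    item: str,
    pre_hooks: list[Hook],
    post_hooks: list[Hook],
    send: Callable[[Request], Awaitable[Response]],
) -> Response:
    request = make_request(item)
    for hook in pre_hooks:
        request = hook(request)
    response = await send(request)
    for hook in post_hooks:
        response = hook(response)
    return response


result = asyncio.run(
    run_one("http://example.test/a", [add_auth_header], [tag_ok], fake_send)
)
print(result)
```

In the real library the load model decides when each `run_one`-style call is started, and the resulting samples go to the metrics aggregator; both are left out here to keep the sketch short.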
Install
pip install -e .
pip install -e .[dev] # for tests
This installs the benchmaker Python package and the benchmaker CLI.
30-second tour
import asyncio

from benchmaker import BenchConfig, BenchRunner, ConstantRPS, HttpWorkloadType


async def main():
    cfg = BenchConfig(
        # What to send: plain HTTP GETs against one URL.
        workload_type=HttpWorkloadType(url="https://httpbin.org/get"),
        # When to send: 50 requests per second for 10 seconds.
        load=ConstantRPS(rps=50, duration_s=10),
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)


asyncio.run(main())
Or use the CLI. Workload-specific benchmarks are exposed as recipes, invoked as
benchmaker <recipe> --args (available recipes: http, llm, sandbox, swebench):
benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
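To give a feel for what a Poisson rate spec implies, here is a minimal stdlib sketch of an open-loop Poisson arrival schedule. It assumes that poisson:50 means a mean rate of 50 requests per second (see the load-models doc for the actual syntax); `poisson_arrival_times` is a hypothetical helper, not part of benchmaker.

```python
import random


def poisson_arrival_times(
    rate_per_s: float, duration_s: float, seed: int = 0
) -> list[float]:
    """Return send times (seconds from start) for a Poisson arrival process."""
    rng = random.Random(seed)
    times: list[float] = []
    # Gaps between consecutive arrivals are exponentially distributed
    # with mean 1 / rate_per_s.
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


schedule = poisson_arrival_times(rate_per_s=50, duration_s=10)
print(len(schedule), "requests scheduled")
```

Unlike a constant-rate schedule, the gaps vary from request to request, so short bursts and lulls appear even though the long-run average stays near the requested rate.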
Walkthrough: benchmarking an LLM endpoint with ShareGPT
A realistic LLM benchmark needs a real prompt distribution. ShareGPT V3 is a
common choice: multi-turn human/assistant conversations scraped from real
ChatGPT users. A cleaned, benchmark-ready copy is published at
researchcomputer/llmsys-bench (split="sharegpt"), with one row per conversation:
{"id": "...", "messages": [{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."},
{"role": "user", "content": "..."}]}
messages is the only content field — it's everything a chat benchmark needs.
Each row is truncated to end on a user turn, so it's a valid generation
request: the server completes the final assistant reply given the prior
history. Short source conversations collapse to a single user turn (a plain
single-turn prompt); longer ones carry multi-turn context.
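The "end on a user turn" rule can be sketched as follows. This is an illustration of the idea only, not the logic in tools/sharegpt/prepare.py; `truncate_to_last_user_turn` is a hypothetical name.

```python
def truncate_to_last_user_turn(messages: list[dict]) -> list[dict]:
    """Drop trailing messages until the conversation ends on a user turn."""
    end = len(messages)
    while end > 0 and messages[end - 1].get("role") != "user":
        end -= 1
    # An empty list means the conversation contained no user turn at all.
    return messages[:end]


conversation = [
    {"role": "user", "content": "What is HTTP?"},
    {"role": "assistant", "content": "A request/response protocol."},
    {"role": "user", "content": "And HTTPS?"},
    {"role": "assistant", "content": "HTTP over TLS."},
]
print(truncate_to_last_user_turn(conversation))
```

Here the final assistant reply is dropped, leaving three messages that end with the user's follow-up question, which is the turn the server under test is asked to answer.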
Load it directly from the Hub
Pull the published split and feed each row's messages list straight into the
chat workload-type (pip install -e .[hf]):
import asyncio

from datasets import load_dataset

from benchmaker import (
    BenchConfig, BenchRunner, OpenAIChatWorkloadType,
    IterableWorkload, parse_rate_spec,
)


async def main():
    # One row per conversation; each row carries a `messages` list.
    ds = load_dataset("researchcomputer/llmsys-bench", split="sharegpt")
    cfg = BenchConfig(
        workload_type=OpenAIChatWorkloadType(
            url="http://localhost:8000/v1/chat/completions",
            model="meta-llama/Llama-3.1-8B-Instruct",
            max_tokens=256,
        ),
        # Feed each conversation's message list to the chat workload-type.
        workload=IterableWorkload(row["messages"] for row in ds),
        load=parse_rate_spec("poisson:8", duration_s=60),
        timeout_s=600,
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)


asyncio.run(main())
OpenAIChatWorkloadType receives the message list as-is, so single-turn rows
send one user message and multi-turn rows replay the full history before the
server generates the final assistant turn. TTFT, inter-token latency, and
tokens/sec are captured the same way in both cases. URL / model / API key can
also come from .env via OpenAIChatWorkloadType.from_env(...).
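For readers new to these metrics, here is a minimal sketch of how they can be derived from token arrival timestamps. The definitions used (TTFT as time to the first token, inter-token latency as the mean gap between consecutive tokens, tokens/sec over the whole request) are one common convention and an assumption here; benchmaker's own definitions may differ, and `streaming_metrics` is a hypothetical helper.

```python
from statistics import mean


def streaming_metrics(request_start: float, token_times: list[float]) -> dict:
    """Summarize one streamed response from its token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Time to first token: delay between sending the request and the first token.
    ttft = token_times[0] - request_start
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = mean(gaps) if gaps else 0.0
    # Throughput over the full request, including the wait for the first token.
    total = token_times[-1] - request_start
    tokens_per_s = len(token_times) / total if total > 0 else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl, "tokens_per_s": tokens_per_s}


metrics = streaming_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
print(metrics)
```

With a request sent at t=0 and tokens arriving every half second from t=0.5, the sketch reports a TTFT of 0.5 s, a mean inter-token latency of 0.5 s, and 2 tokens per second.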
Rebuild or customize it yourself
The published split is produced by tools/sharegpt/prepare.py, which downloads
the upstream JSON once into .local/ (gitignored) and converts it to the JSONL
shape above. Run it when you want a subset, different filtering, or a refresh:
# Defaults: .local/sharegpt_v3_raw.json -> .local/sharegpt_v3.jsonl
python tools/sharegpt/prepare.py
# A quick subset for smoke tests:
python tools/sharegpt/prepare.py --max-items 2000
The raw download is ~700 MB. Use --min-chars / --max-chars to drop empty or
pathologically long conversations (measured over total message content per
row). Point any workload at the local file with JsonlWorkload(path=..., field="messages"), or on the CLI:
benchmaker llm \
--url http://localhost:8000/v1/chat/completions \
--model meta-llama/Llama-3.1-8B-Instruct \
--prompts-jsonl .local/sharegpt_v3.jsonl \
--prompt-field messages \
--max-tokens 256 \
--rate poisson:8 --duration 60s \
--out-dir ./runs --label dataset=sharegpt
To re-publish after regenerating, tools/sharegpt/upload_hf.py pushes the
JSONL back to the Hub (needs a write token).
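The character-count filter described above can be approximated with a short stdlib sketch. It assumes the JSONL row shape shown earlier and inclusive bounds; it is not the code in prepare.py, and `total_content_chars` and `filter_jsonl_lines` are hypothetical names.

```python
import json
from typing import Iterable, Iterator


def total_content_chars(row: dict) -> int:
    """Total characters across all message contents in one row."""
    return sum(len(m.get("content", "")) for m in row.get("messages", []))


def filter_jsonl_lines(
    lines: Iterable[str], min_chars: int, max_chars: int
) -> Iterator[dict]:
    """Yield parsed rows whose total content length is within the bounds."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        if min_chars <= total_content_chars(row) <= max_chars:
            yield row


sample_lines = [
    json.dumps({"id": "a", "messages": [{"role": "user", "content": ""}]}),
    json.dumps({"id": "b", "messages": [{"role": "user", "content": "hello"}]}),
    json.dumps({"id": "c", "messages": [{"role": "user", "content": "x" * 50}]}),
]
kept = list(filter_jsonl_lines(sample_lines, min_chars=1, max_chars=20))
print([row["id"] for row in kept])
```

The same generator works on an open file handle, for example `filter_jsonl_lines(open(path, encoding="utf-8"), 1, 20000)`, since it only needs an iterable of lines.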
Documentation
Full docs live in docs/:
- Quickstart
- Concepts — WorkloadType, Workload, LoadModel, Monitor
- Load models — rate-spec syntax, open vs closed loop
- Workloads & workload-types — built-ins and custom subclasses
- Hooks — pre/post request processing
- Monitors — vLLM /metrics, GPU telemetry, custom samplers
- Metrics & output — summary structure, JSONL dumps
- Correctness / accuracy eval — grade responses against references
- CLI & YAML reference
- ShareGPT benchmark — self-contained end-to-end walkthrough
Examples
Under examples/:
- simple_get.py — minimal library usage
- custom_hooks.py — request signing + response parsing
- llm_chat.py — OpenAI-compatible LLM endpoint with streaming
- vllm_with_monitor.py — LLM benchmark with concurrent vLLM /metrics scrape
- sandbox_exec.py — Flash Sandbox /exec latency benchmark
- sandbox_lifecycle.py — full create → exec → delete cold-start benchmark
- llm_eval.py — LLM benchmark + accuracy grading (exact/regex/judge)
- gsm8k_eval.py — GSM8K from HuggingFace + integer-match scorer
- config.yaml — generic HTTP YAML config
- config_llm.yaml — LLM YAML config with a Prometheus monitor
Helper tooling under tools/, grouped by purpose:
- sharegpt/ — prepare.py (fetch ShareGPT V3 → JSONL) + upload_hf.py (push to the HF Hub with a write token)
- swe_images/ — mirror SWE-bench/R2E-Gym container images to ghcr (publish.py) and list the published refs (pull.py)
- agent_warmup/ — build the agent-warmup SFT dataset (python -m tools.agent_warmup.cli)
- start_local_llm.sh — example local SGLang launch command
Project layout
benchmaker/            # library code
    __init__.py        # public API (re-exports); cli.py — the `benchmaker` CLI
    config.py env.py   # YAML config loading + .env interpolation
    core/              # engine: types, load models, runner, metrics, monitors, trace
    io/                # run output: per-run bundle + cross-run collection
    workloads/         # workload-types (http, llm, sandbox, agent, hf, eval)
    recipes/           # CLI recipes (http, llm, sandbox, swebench) + registry
    swebench/          # SWE-bench coding agent + grading + harbor adapters
examples/              # runnable examples (incl. swebench/ coding-agent config)
tools/                 # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
tests/                 # pytest smoke tests
docs/                  # reference docs
Run the tests
pytest -q