
Async HTTP benchmarking utility with pluggable workloads and load models.


benchmaker

Async HTTP benchmarking with pluggable workload-types (protocols), workloads (datasets), load models, hooks, and optional periodic monitors.

+--------+   item   +---------------+   request   +-----------+   +---------+
|workload|--------->| workload-type |------------>| pre-hooks |-->| aiohttp |
|(dataset|          | (protocol)    |             +-----------+   +---------+
| / log) |          | make_request  |                                 |
+--------+          | make_sample   |              +------------+     v
   ^                +---------------+              | post-hooks |<----+
   |                                               +------------+
   +-- load model decides WHEN to fire ----+              v
                                           |        +----------+
              monitors run alongside ------+------->| metrics  |
              (Prometheus, NVML, ...)               | aggregator|
                                                    +----------+

Install

pip install -e .
pip install -e ".[dev]"   # for tests; quoted so shells like zsh don't glob the brackets

This installs the benchmaker Python package and the benchmaker CLI.

30-second tour

import asyncio
from benchmaker import BenchConfig, BenchRunner, ConstantRPS, HttpWorkloadType

async def main():
    cfg = BenchConfig(
        workload_type=HttpWorkloadType(url="https://httpbin.org/get"),
        load=ConstantRPS(rps=50, duration_s=10),
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)

asyncio.run(main())

Or use the CLI. Workload-specific benchmarks are exposed as recipes, invoked as benchmaker <recipe> --args (available recipes: http, llm, sandbox, swebench):

benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
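The poisson:50 rate spec above presumably requests Poisson arrivals averaging 50 requests per second. The following is a minimal standalone sketch of how such a load model can decide when to fire; it is not benchmaker's implementation, and poisson_schedule is a hypothetical helper introduced only for illustration. Gaps between requests are drawn from an exponential distribution with mean 1/rate.

```python
import random


def poisson_schedule(rate_per_s: float, duration_s: float, seed: int = 0) -> list[float]:
    """Return send times (seconds from start) for a Poisson arrival process.

    Hypothetical helper for illustration; not part of the benchmaker API.
    """
    rng = random.Random(seed)
    t = 0.0
    times: list[float] = []
    while True:
        # Exponentially distributed gaps with mean 1/rate yield Poisson arrivals.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


schedule = poisson_schedule(rate_per_s=50, duration_s=10)
print(f"{len(schedule)} requests scheduled over 10 s")
```

Unlike a constant-rate schedule, the request count varies from run to run (here it is fixed only by the seed), and requests arrive in bursts, which is closer to real traffic.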

Walkthrough: benchmarking an LLM endpoint with ShareGPT

A realistic LLM benchmark needs a real prompt distribution. ShareGPT V3 is a common choice — multi-turn human/assistant conversations scraped from real ChatGPT users. A cleaned, benchmark-ready copy is published at researchcomputer/llmsys-bench (split="sharegpt"), with one row per conversation:

{"id": "...", "messages": [{"role": "user", "content": "..."},
                           {"role": "assistant", "content": "..."},
                           {"role": "user", "content": "..."}]}

messages is the only content field — it's everything a chat benchmark needs. Each row is truncated to end on a user turn, so it's a valid generation request: the server completes the final assistant reply given the prior history. Short source conversations collapse to a single user turn (a plain single-turn prompt); longer ones carry multi-turn context.
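The "end on a user turn" rule can be sketched as follows. This is an assumption about the shape of the rule, not the code in the preparation script, and truncate_to_user_turn is a hypothetical name: trailing non-user messages are dropped until the last message is a user turn.

```python
def truncate_to_user_turn(messages: list[dict]) -> list[dict]:
    """Drop trailing non-user messages so the conversation ends on a user turn.

    Hypothetical helper for illustration; the real filtering may differ.
    """
    end = len(messages)
    while end > 0 and messages[end - 1]["role"] != "user":
        end -= 1
    return messages[:end]


conversation = [
    {"role": "user", "content": "What is a mutex?"},
    {"role": "assistant", "content": "A lock that one thread holds at a time."},
    {"role": "user", "content": "And a semaphore?"},
    {"role": "assistant", "content": "A counter that limits concurrent access."},
]
trimmed = truncate_to_user_turn(conversation)
print(len(trimmed), trimmed[-1]["role"])
```

The final assistant reply is removed, so the server is asked to generate it from the three remaining messages.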

Load it directly from the Hub

Pull the published split and feed each row's messages list straight into the chat workload-type. This requires the hf extra (pip install -e ".[hf]"):

import asyncio
from datasets import load_dataset
from benchmaker import (
    BenchConfig, BenchRunner, OpenAIChatWorkloadType,
    IterableWorkload, parse_rate_spec,
)

async def main():
    ds = load_dataset("researchcomputer/llmsys-bench", split="sharegpt")
    cfg = BenchConfig(
        workload_type=OpenAIChatWorkloadType(
            url="http://localhost:8000/v1/chat/completions",
            model="meta-llama/Llama-3.1-8B-Instruct",
            max_tokens=256,
        ),
        workload=IterableWorkload(row["messages"] for row in ds),
        load=parse_rate_spec("poisson:8", duration_s=60),
        timeout_s=600,
    )
    result = await BenchRunner(cfg).run()
    print(result.summary)

asyncio.run(main())

OpenAIChatWorkloadType receives the message list as-is, so single-turn rows send one user message and multi-turn rows replay the full history before the server generates the final assistant turn. TTFT, inter-token latency, and tokens/sec are captured the same way in both cases. URL / model / API key can also come from .env via OpenAIChatWorkloadType.from_env(...).
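For readers unfamiliar with the streaming metrics named above, here is a minimal sketch of how they can be derived from token arrival timestamps. The definitions used (TTFT measured from send time, tokens/sec over the whole request) are common conventions assumed for illustration; the source does not state benchmaker's exact formulas, and streaming_metrics is a hypothetical helper.

```python
from statistics import mean


def streaming_metrics(sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, mean inter-token latency, and tokens/sec from timestamps.

    Hypothetical helper; all times are in seconds on the same clock.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # Time to first token: delay between sending and the first streamed token.
    ttft = token_times[0] - sent_at
    # Inter-token latency: gaps between consecutive streamed tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - sent_at
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "tokens_per_s": len(token_times) / total if total > 0 else 0.0,
    }


metrics = streaming_metrics(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Multi-turn rows tend to raise TTFT because the server must process a longer history before emitting the first token, while inter-token latency depends mostly on decode speed.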

Rebuild or customize it yourself

The published split is produced by tools/sharegpt/prepare.py, which downloads the upstream JSON once into .local/ (gitignored) and converts it to the JSONL shape above. Run it when you want a subset, different filtering, or a refresh:

# Defaults: .local/sharegpt_v3_raw.json  ->  .local/sharegpt_v3.jsonl
python tools/sharegpt/prepare.py

# A quick subset for smoke tests:
python tools/sharegpt/prepare.py --max-items 2000

The raw download is ~700 MB. Use --min-chars / --max-chars to drop empty or pathologically long conversations (measured over total message content per row). Point any workload at the local file with JsonlWorkload(path=..., field="messages"), or on the CLI:

benchmaker llm \
    --url   http://localhost:8000/v1/chat/completions \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --prompts-jsonl .local/sharegpt_v3.jsonl \
    --prompt-field  messages \
    --max-tokens 256 \
    --rate poisson:8 --duration 60s \
    --out-dir ./runs --label dataset=sharegpt
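The --min-chars / --max-chars filter described above is measured over the total message content per row. A minimal sketch of that measurement, assuming inclusive bounds (the source does not say whether the limits are inclusive) and using the hypothetical helpers total_chars and keep_row:

```python
def total_chars(row: dict) -> int:
    """Sum the content length of every message in one conversation row."""
    return sum(len(message["content"]) for message in row["messages"])


def keep_row(row: dict, min_chars: int, max_chars: int) -> bool:
    """Keep rows whose total content length falls within [min_chars, max_chars].

    Inclusive bounds are an assumption made for this illustration.
    """
    return min_chars <= total_chars(row) <= max_chars


rows = [
    {"id": "empty", "messages": [{"role": "user", "content": ""}]},
    {"id": "normal", "messages": [
        {"role": "user", "content": "hello"},
        {"role": "assistant", "content": "hi"},
        {"role": "user", "content": "bye"},
    ]},
    {"id": "huge", "messages": [{"role": "user", "content": "x" * 5000}]},
]
kept = [row["id"] for row in rows if keep_row(row, min_chars=1, max_chars=1000)]
print(kept)
```

Measuring the whole row rather than the last message matters for multi-turn data, since the full history is what the server has to process.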

To re-publish after regenerating, tools/sharegpt/upload_hf.py pushes the JSONL back to the Hub (needs a write token).

Documentation

Full docs live in docs/.

Examples

Under examples/:

  • simple_get.py — minimal library usage
  • custom_hooks.py — request signing + response parsing
  • llm_chat.py — OpenAI-compatible LLM endpoint with streaming
  • vllm_with_monitor.py — LLM benchmark with concurrent vLLM /metrics scrape
  • sandbox_exec.py — Flash Sandbox /exec latency benchmark
  • sandbox_lifecycle.py — full create → exec → delete cold-start benchmark
  • llm_eval.py — LLM benchmark + accuracy grading (exact/regex/judge)
  • gsm8k_eval.py — GSM8K from HuggingFace + integer-match scorer
  • config.yaml — generic HTTP YAML config
  • config_llm.yaml — LLM YAML config with a Prometheus monitor

Helper tooling under tools/, grouped by purpose:

  • sharegpt/prepare.py (fetch ShareGPT V3 → JSONL) + upload_hf.py (push to the HF Hub with a write token)
  • swe_images/ — mirror SWE-bench/R2E-Gym container images to ghcr (publish.py) and list the published refs (pull.py)
  • agent_warmup/ — build the agent-warmup SFT dataset (python -m tools.agent_warmup.cli)
  • start_local_llm.sh — example local SGLang launch command

Project layout

benchmaker/          # library code
  __init__.py        #   public API (re-exports); cli.py — the `benchmaker` CLI
  config.py  env.py  #   YAML config loading + .env interpolation
  core/              #   engine: types, load models, runner, metrics, monitors, trace
  io/                #   run output: per-run bundle + cross-run collection
  workloads/         #   workload-types (http, llm, sandbox, agent, hf, eval)
  recipes/           #   CLI recipes (http, llm, sandbox, swebench) + registry
  swebench/          #   SWE-bench coding agent + grading + harbor adapters
examples/            # runnable examples (incl. swebench/ coding-agent config)
tools/               # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
tests/               # pytest smoke tests
docs/                # reference docs

Run the tests

pytest -q

