4-dimensional LLM inference benchmark — multi-turn, multi-agent, parallel dispatch with tool calling

PawBench 🐾

Requires Python 3.10+. License: MIT.

    / \__
   (    @\___    PawBench
  /         O   4-dimensional LLM inference benchmark
 /   (_____/    "More bark than bite"
/_____/   U

Because your model deserves a benchmark with more bark than bite.

4-dimensional LLM inference benchmark for OpenAI-compatible endpoints. Multi-turn, multi-agent, parallel dispatch with tool calling.

PawBench tests your model with realistic coding-agent workloads rather than synthetic single-turn completions.

Meet Lola

PawBench is inspired by Lola (@_justlolathings) — the most fashionable pup on Instagram. The built-in scenarios revolve around building her boutique dog apparel store, PawStyle by Lola. Every product, every size guide, every "Lola's Pick" badge traces back to this style icon on four legs.

Follow Lola: https://www.instagram.com/_justlolathings/

Install

pip install pawbench
# or
uv pip install pawbench

Quick Start

# Benchmark your local vLLM
pawbench --endpoint http://localhost:8000

# Against any OpenAI-compatible endpoint
pawbench --endpoint https://api.openai.com/v1 --tag gpt4o

# Just throughput saturation (no scenarios)
pawbench --saturation-only --concurrency 1,2,4,8,16

# JSON output for CI/autoresearch
pawbench --json --output results/

# Custom scenario
pawbench --scenario my_scenario.json
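
For scripted runs (for example in CI), the same flags can be assembled from Python. This is a minimal sketch, not part of PawBench: `build_pawbench_cmd` and `run_pawbench` are hypothetical helpers, they use only flags that appear in the commands above, and they assume those flags can be combined in one invocation.

```python
import shlex
import subprocess


def build_pawbench_cmd(endpoint, tag=None, output_dir=None, json_output=False):
    """Assemble a pawbench command line from flags shown in Quick Start.

    Hypothetical helper: it is not part of the pawbench package.
    """
    cmd = ["pawbench", "--endpoint", endpoint]
    if tag:
        cmd += ["--tag", tag]
    if json_output:
        cmd.append("--json")
    if output_dir:
        cmd += ["--output", output_dir]
    return cmd


def run_pawbench(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_pawbench_cmd(
        "http://localhost:8000",
        tag="baseline",
        output_dir="results/",
        json_output=True,
    )
    # Print the command instead of running it, so the sketch is safe to execute.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the endpoint or tag contains special characters.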

What It Measures

4 Dimensions

Dimension     Metrics
Throughput    Single-agent tok/s, parallel saturation curve (1->N), TTFT, peak concurrency
Quality       Tool call accuracy, instruction following, format compliance, keyword matching
Efficiency    Useful token ratio (code in tool args vs filler preamble), tokens per turn
Adaptability  Steering event response, mid-conversation context injection, nudge quality delta
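
To make the throughput and efficiency terms concrete, here is a minimal sketch of how TTFT, tok/s, and a useful token ratio could be computed for one streamed response. The function names, the input shape, and the exact definitions are assumptions for illustration; PawBench's internal formulas may differ.

```python
def ttft_seconds(request_start, token_times):
    """Time to first token: delay between sending the request and the first token."""
    return token_times[0] - request_start


def decode_tok_per_s(token_times):
    """Decode throughput: tokens after the first, divided by the time they took."""
    if len(token_times) < 2:
        return 0.0
    elapsed = token_times[-1] - token_times[0]
    return (len(token_times) - 1) / elapsed if elapsed > 0 else 0.0


def useful_token_ratio(tool_arg_tokens, total_tokens):
    """Share of generated tokens that landed in tool-call arguments."""
    return tool_arg_tokens / total_tokens if total_tokens else 0.0


# Illustrative timestamps in seconds (made up, not measured).
start = 0.0
times = [0.25, 0.35, 0.45, 0.55, 0.65]
print(ttft_seconds(start, times))
print(round(decode_tok_per_s(times), 2))
print(useful_token_ratio(80, 100))
```

Separating TTFT from decode throughput keeps prefill latency from distorting the tok/s figure, which matters for long multi-turn prompts.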

Built-in Scenarios: PawStyle by Lola

In each built-in scenario, two parallel agents (frontend and backend) build Lola's boutique dog apparel e-commerce store, "Where every pup is a fashionista":

  • pawstyle-independent — Frontend and backend work independently on Lola's shop. Pure parallel throughput + quality baseline.
  • pawstyle — Backend gets a steering event mid-task ("frontend added a Size Guide button — implement Lola's breed-specific sizing endpoint").
  • pawstyle-nudge — Frontend adds Lola's Favorites (wishlist) and Compare features that require backend changes. Backend receives nudges and adapts.

Each scenario is 3 turns x 2 agents, with tool calls (write_file, read_file, run_command) and injected tool results. Products include Lola's Signature Bandana, Cozy Knit Sweater, Rainy Day Raincoat, Adventure Booties, Dapper Bow Tie, and Walk-in-Style Harness — with "Lola's Pick" badges on her personal favorites.
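
The 3 turns x 2 agents shape, including a mid-task steering event, can be pictured with the asyncio sketch below. Everything in it is hypothetical: `fake_model_call` stands in for a real request to an OpenAI-compatible endpoint, and the message layout is an assumption rather than PawBench's internal representation.

```python
import asyncio


async def fake_model_call(agent_id, messages):
    """Stand-in for a chat completion request; returns a canned reply."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{agent_id} reply to: {messages[-1]['content']}"


async def run_agent(agent_id, turns, steering=None):
    """Run one agent's turns in order, injecting a steering message if given.

    steering maps a turn number to extra user content added before that turn.
    """
    steering = steering or {}
    messages, replies = [], []
    for number, content in enumerate(turns, start=1):
        if number in steering:
            messages.append({"role": "user", "content": steering[number]})
        messages.append({"role": "user", "content": content})
        reply = await fake_model_call(agent_id, messages)
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies


async def run_scenario():
    """Dispatch both agents concurrently; each keeps its own conversation."""
    frontend = run_agent("frontend", ["turn 1", "turn 2", "turn 3"])
    backend = run_agent(
        "backend",
        ["turn 1", "turn 2", "turn 3"],
        steering={2: "frontend added a Size Guide button"},
    )
    return await asyncio.gather(frontend, backend)


if __name__ == "__main__":
    results = asyncio.run(run_scenario())
    print(len(results), [len(r) for r in results])
```

Turns stay sequential within an agent because each turn depends on the conversation so far, while the two agents run concurrently, which is what loads the server's batching.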

Server Metrics (optional)

If the endpoint exposes a Prometheus-style /metrics route (as vLLM and TGI do), PawBench also scrapes:

  • KV cache usage and prefix cache hit rate
  • Speculative decoding acceptance rate
  • GPU cache pressure
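
A /metrics route of this kind returns plain text in the Prometheus exposition format. The sketch below shows one simplified way to read such text into a dictionary. It is not PawBench's scraper: the parser drops labels and assumes no trailing timestamps, and the metric names in the sample are invented placeholders, since real names depend on the server and its version.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into {metric_name: value}.

    Simplified: labels are dropped, so the last sample per metric name wins,
    and lines are assumed to carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        try:
            samples[name] = float(value_part)
        except ValueError:
            continue  # ignore lines this simplified parser cannot read
    return samples


# Placeholder metric names and values, for illustration only.
SAMPLE = """\
# HELP demo_kv_cache_usage Fraction of KV cache in use.
# TYPE demo_kv_cache_usage gauge
demo_kv_cache_usage{model_name="demo"} 0.42
demo_spec_acceptance_rate 0.72
"""

print(parse_prometheus_text(SAMPLE))
```

In practice the text would come from an HTTP GET against the endpoint's /metrics route, sampled before and after a run so that counters can be turned into per-run deltas.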

Custom Scenarios

Scenarios are JSON files:

{
  "id": "my-scenario",
  "name": "My Custom Scenario",
  "agents": [
    {
      "id": "agent-1",
      "name": "My Agent",
      "turns": [
        {
          "turn": 1,
          "role": "user",
          "content": "Build a REST API with Flask...",
          "tools": ["write_file"],
          "expect": {
            "tool_calls_min": 1,
            "tool_name_any": ["write_file"],
            "output_mentions": ["flask", "api"]
          }
        }
      ]
    }
  ],
  "tools_schema": [...]
}
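
One plausible reading of the `expect` block is sketched below as a small checker. The semantics are assumptions, not confirmed PawBench behavior: `tool_calls_min` is treated as a lower bound on the number of tool calls, `tool_name_any` passes if at least one called tool is in the list, and `output_mentions` requires every keyword, matched case-insensitively. `check_expectations` is a hypothetical name.

```python
def check_expectations(expect, tool_calls, output_text):
    """Return a dict of pass/fail results for one turn.

    tool_calls is a list of tool names the model invoked, for example
    ["write_file"]; output_text is the model's text output.
    """
    results = {}
    if "tool_calls_min" in expect:
        results["tool_calls_min"] = len(tool_calls) >= expect["tool_calls_min"]
    if "tool_name_any" in expect:
        results["tool_name_any"] = any(
            name in expect["tool_name_any"] for name in tool_calls
        )
    if "output_mentions" in expect:
        lowered = output_text.lower()
        # Assumption: every listed keyword must appear (case-insensitive).
        results["output_mentions"] = all(
            keyword.lower() in lowered for keyword in expect["output_mentions"]
        )
    return results


expect = {
    "tool_calls_min": 1,
    "tool_name_any": ["write_file"],
    "output_mentions": ["flask", "api"],
}
print(check_expectations(expect, ["write_file"], "A Flask REST API in app.py"))
```

Running a checker like this against a hand-written reply is a quick way to sanity-check a new scenario's expectations before spending GPU time on it.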

Comparing Configs

pawbench --tag baseline --output results/
# ... change model config ...
pawbench --tag eagle3 --output results/

python -m pawbench.compare results/pawbench_baseline_*.json results/pawbench_eagle3_*.json
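
For custom reporting, two result files can also be compared by hand. The following is a rough sketch rather than the logic of `pawbench.compare`: the helper names are hypothetical, and the field names are taken from the sample output in the Output Format section.

```python
import json


def load_result(path):
    """Read one PawBench JSON result file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def compare_results(baseline, candidate):
    """Compare a few headline fields from two result dicts."""
    pairs = [
        ("dim1_throughput", "avg_single_tok_s"),
        ("dim1_throughput", "raw_peak_tok_s"),
        ("dim2_quality", "avg_quality"),
    ]
    return {
        f"{section}.{field}": percent_change(
            baseline[section][field], candidate[section][field]
        )
        for section, field in pairs
    }
```

A typical call would be `compare_results(load_result(a), load_result(b))`, where `a` and `b` are paths to two result files written with different `--tag` values.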

Output Format

JSON results include a full model card (architecture, quantization, GPU, serving params) for reproducibility:

{
  "tag": "fp8-eagle3-spec3",
  "model_card": {
    "model_name": "qwen3-coder",
    "model_config": {"architectures": ["Qwen3NextForCausalLM"], "num_experts": 512, "...": "..."},
    "tuning": {"kv_cache_dtype": "fp8_e4m3", "speculative_config": "eagle3", "...": "..."},
    "gpu": {"name": "NVIDIA GB10", "...": "..."}
  },
  "dim1_throughput": {"avg_single_tok_s": 69.0, "raw_peak_tok_s": 469.3, "...": "..."},
  "dim2_quality": {"avg_quality": 0.81, "tool_accuracy": 0.96, "...": "..."},
  "saturation_curve": [{"concurrency": 1, "tok_s": 69.3}, {"concurrency": 8, "tok_s": 469.3}],
  "server_metrics": {"spec_acceptance_rate": 0.72, "gpu_prefix_cache_hit_rate": 0.92}
}
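
One way to read the `saturation_curve` field is to ask how close throughput comes to linear scaling. The sketch below assumes the list-of-points shape shown in the sample and defines efficiency relative to the lowest-concurrency point; `scaling_efficiency` is a hypothetical helper, not a PawBench API.

```python
def scaling_efficiency(saturation_curve):
    """Return {concurrency: efficiency} relative to the lowest-concurrency point.

    Efficiency 1.0 means throughput scaled linearly with concurrency.
    """
    points = sorted(saturation_curve, key=lambda p: p["concurrency"])
    base = points[0]
    per_stream = base["tok_s"] / base["concurrency"]
    return {
        p["concurrency"]: p["tok_s"] / (per_stream * p["concurrency"])
        for p in points
    }


# Points copied from the sample output above.
curve = [{"concurrency": 1, "tok_s": 69.3}, {"concurrency": 8, "tok_s": 469.3}]
for concurrency, eff in scaling_efficiency(curve).items():
    print(concurrency, round(eff, 3))
```

For the sample values, 8 concurrent streams reach roughly 85% of ideal linear scaling (469.3 tok/s against 8 x 69.3 = 554.4 tok/s).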

License

MIT
