4-dimensional LLM inference benchmark — multi-turn, multi-agent, parallel dispatch with tool calling
PawBench 🐾
Because your model deserves a benchmark with more bark than bite.
4-dimensional LLM inference benchmark for OpenAI-compatible endpoints. Multi-turn, multi-agent, parallel dispatch with tool calling.
Tests your model with realistic coding agent workloads — not synthetic single-turn completions.
Meet Lola
PawBench is inspired by Lola (@_justlolathings) — the most fashionable pup on Instagram. The built-in scenarios revolve around building her boutique dog apparel store, PawStyle by Lola. Every product, every size guide, every "Lola's Pick" badge traces back to this style icon on four legs.
Follow Lola: https://www.instagram.com/_justlolathings/
Install
pip install pawbench
# or
uv pip install pawbench
Quick Start
# Benchmark your local vLLM
pawbench --endpoint http://localhost:8000
# Against any OpenAI-compatible endpoint
pawbench --endpoint https://api.openai.com/v1 --tag gpt4o
# Just throughput saturation (no scenarios)
pawbench --saturation-only --concurrency 1,2,4,8,16
# JSON output for CI/autoresearch
pawbench --json --output results/
# Custom scenario
pawbench --scenario my_scenario.json
What It Measures
4 Dimensions
| Dimension | Metrics |
|---|---|
| Throughput | Single-agent tok/s, parallel saturation curve (1->N), TTFT, peak concurrency |
| Quality | Tool call accuracy, instruction following, format compliance, keyword matching |
| Efficiency | Useful token ratio (code in tool args vs filler preamble), tokens per turn |
| Adaptability | Steering event response, mid-conversation context injection, nudge quality delta |
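To make a few of these metrics concrete, here is a minimal sketch of how TTFT, tok/s, and the useful token ratio could be computed from raw timings and token counts. The `StreamTiming` class, the function names, and the exact definitions (for example, measuring tok/s over the whole request) are assumptions for illustration, not PawBench's internal implementation.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Timestamps (seconds) and token count for one streamed completion (hypothetical type)."""
    request_start: float
    first_token_at: float
    last_token_at: float
    completion_tokens: int


def ttft(t: StreamTiming) -> float:
    """Time to first token: delay between sending the request and the first chunk."""
    return t.first_token_at - t.request_start


def tokens_per_second(t: StreamTiming) -> float:
    """Generation rate over the whole request (assumed definition)."""
    elapsed = t.last_token_at - t.request_start
    return t.completion_tokens / elapsed if elapsed > 0 else 0.0


def useful_token_ratio(tool_arg_tokens: int, total_tokens: int) -> float:
    """Share of output tokens that landed in tool-call arguments rather than preamble."""
    return tool_arg_tokens / total_tokens if total_tokens > 0 else 0.0


# Invented sample values, chosen only to exercise the functions.
timing = StreamTiming(request_start=0.0, first_token_at=0.25,
                      last_token_at=4.0, completion_tokens=280)
print(f"TTFT: {ttft(timing):.2f}s")
print(f"tok/s: {tokens_per_second(timing):.1f}")
print(f"useful ratio: {useful_token_ratio(210, 280):.2f}")
```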
Built-in Scenarios: PawStyle by Lola
Two parallel agents build Lola's boutique dog apparel e-commerce store — "Where every pup is a fashionista":
- pawstyle-independent: Frontend and backend work independently on Lola's shop. Pure parallel throughput + quality baseline.
- pawstyle: Backend gets a steering event mid-task ("frontend added a Size Guide button — implement Lola's breed-specific sizing endpoint").
- pawstyle-nudge: Frontend adds Lola's Favorites (wishlist) and Compare features that require backend changes. Backend receives nudges and adapts.
Each scenario is 3 turns x 2 agents, with tool calls (write_file, read_file, run_command) and injected tool results. Products include Lola's Signature Bandana, Cozy Knit Sweater, Rainy Day Raincoat, Adventure Booties, Dapper Bow Tie, and Walk-in-Style Harness — with "Lola's Pick" badges on her personal favorites.
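The "injected tool results" idea can be pictured with the standard OpenAI chat-completions message shape: the assistant's tool call is recorded, and the harness answers it with a scripted `tool` message instead of executing anything. The sketch below is an assumption about how such a history could be assembled; `append_tool_round`, the file path, and the result string are hypothetical and not taken from PawBench.

```python
import json


def append_tool_round(messages, call_id, tool_name, arguments, injected_result):
    """Append one assistant tool call plus its injected (pre-scripted) result.

    Uses the OpenAI chat-completions shape: an assistant message carrying
    `tool_calls`, followed by a `tool` message that answers it via `tool_call_id`.
    """
    messages.append({
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": call_id,
            "type": "function",
            # Arguments travel as a JSON-encoded string in this format.
            "function": {"name": tool_name, "arguments": json.dumps(arguments)},
        }],
    })
    messages.append({
        "role": "tool",
        "tool_call_id": call_id,
        "content": injected_result,
    })
    return messages


# Hypothetical turn: the model asked to write a file; the harness replies with a canned result.
history = [{"role": "user", "content": "Create the product listing page for PawStyle by Lola."}]
append_tool_round(
    history,
    call_id="call_1",
    tool_name="write_file",
    arguments={"path": "shop/products.html", "content": "<h1>PawStyle by Lola</h1>"},
    injected_result="OK: wrote shop/products.html",
)
# A later user turn (or a steering event) is appended to the same history.
history.append({"role": "user", "content": "Frontend added a Size Guide button."})
print([m["role"] for m in history])
```

Because the tool results are scripted, every run sees the same conversation prefix, which keeps timing and quality comparisons between configurations meaningful.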
Server Metrics (optional)
If the endpoint exposes /metrics (vLLM, TGI), PawBench scrapes:
- KV cache usage and prefix cache hit rate
- Speculative decoding acceptance rate
- GPU cache pressure
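A `/metrics` endpoint of this kind typically serves the Prometheus text exposition format. As a rough sketch of what scraping involves, the parser below turns that text into a dictionary. It assumes samples carry no trailing timestamp, and the metric names in `SAMPLE` are invented placeholders; real names depend on the server and its version. This is not PawBench's own scraper.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {metric_name: [(labels_str, value), ...]}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # Assumes no trailing timestamp, so the value is the last space-separated field.
        series, _, raw_value = line.rpartition(" ")
        name, brace, labels = series.partition("{")
        try:
            value = float(raw_value)
        except ValueError:
            continue  # ignore lines that do not end in a number
        samples.setdefault(name, []).append((labels.rstrip("}") if brace else "", value))
    return samples


# Invented sample payload; the metric names are hypothetical.
SAMPLE = """\
# HELP example_kv_cache_usage Fraction of KV cache in use.
# TYPE example_kv_cache_usage gauge
example_kv_cache_usage{model_name="qwen3-coder"} 0.37
example_requests_total 128
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["example_kv_cache_usage"])
print(metrics["example_requests_total"])
```

In practice the text would come from an HTTP GET against the endpoint's `/metrics` path; the payload is inlined here so the example runs offline.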
Custom Scenarios
Scenarios are JSON files:
{
  "id": "my-scenario",
  "name": "My Custom Scenario",
  "agents": [
    {
      "id": "agent-1",
      "name": "My Agent",
      "turns": [
        {
          "turn": 1,
          "role": "user",
          "content": "Build a REST API with Flask...",
          "tools": ["write_file"],
          "expect": {
            "tool_calls_min": 1,
            "tool_name_any": ["write_file"],
            "output_mentions": ["flask", "api"]
          }
        }
      ]
    }
  ],
  "tools_schema": [...]
}
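One way to read the `expect` block is as a set of per-turn checks. The sketch below evaluates it against a turn's tool calls and output text. The semantics in the docstring are assumptions inferred from the field names, and `check_expectations` is a hypothetical helper, not part of the PawBench API.

```python
def check_expectations(expect, tool_calls, output_text):
    """Return {check_name: passed} for one turn.

    Assumed semantics (inferred from the field names, not confirmed):
      tool_calls_min  - at least this many tool calls were made
      tool_name_any   - at least one call used one of these tool names
      output_mentions - every keyword appears (case-insensitive) in the output
    """
    results = {}
    names = [call["name"] for call in tool_calls]
    if "tool_calls_min" in expect:
        results["tool_calls_min"] = len(tool_calls) >= expect["tool_calls_min"]
    if "tool_name_any" in expect:
        results["tool_name_any"] = any(n in expect["tool_name_any"] for n in names)
    if "output_mentions" in expect:
        haystack = output_text.lower()
        results["output_mentions"] = all(
            keyword.lower() in haystack for keyword in expect["output_mentions"]
        )
    return results


expect = {
    "tool_calls_min": 1,
    "tool_name_any": ["write_file"],
    "output_mentions": ["flask", "api"],
}
# Hypothetical model response for the turn.
calls = [{"name": "write_file", "arguments": {"path": "app.py"}}]
text = "Here is a Flask app exposing a small REST API."

result = check_expectations(expect, calls, text)
print(result)
```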
Comparing Configs
pawbench --tag baseline --output results/
# ... change model config ...
pawbench --tag eagle3 --output results/
python -m pawbench.compare results/pawbench_baseline_*.json results/pawbench_eagle3_*.json
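For a quick ad hoc comparison outside the bundled tool, two result files can also be diffed by hand. The sketch below computes absolute and relative changes for a few metrics; `metric_deltas` is a hypothetical helper and does not describe how `pawbench.compare` works. The key names follow the result layout shown under Output Format, and the baseline numbers are invented for the example.

```python
def metric_deltas(baseline, candidate, paths):
    """Return (path, before, after, percent_change) for dotted metric paths."""
    def lookup(result, dotted):
        node = result
        for key in dotted.split("."):
            node = node[key]
        return node

    rows = []
    for path in paths:
        before, after = lookup(baseline, path), lookup(candidate, path)
        pct = (after - before) / before * 100 if before else float("nan")
        rows.append((path, before, after, pct))
    return rows


# In real use these dicts would come from json.load() on the two result files.
baseline = {  # invented values
    "dim1_throughput": {"avg_single_tok_s": 50.0},
    "dim2_quality": {"avg_quality": 0.80, "tool_accuracy": 0.96},
}
candidate = {
    "dim1_throughput": {"avg_single_tok_s": 69.0},
    "dim2_quality": {"avg_quality": 0.81, "tool_accuracy": 0.96},
}

rows = metric_deltas(
    baseline,
    candidate,
    ["dim1_throughput.avg_single_tok_s", "dim2_quality.avg_quality", "dim2_quality.tool_accuracy"],
)
for path, before, after, pct in rows:
    print(f"{path}: {before} -> {after} ({pct:+.1f}%)")
```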
Output Format
JSON results include a full model card (architecture, quantization, GPU, serving parameters) for reproducibility:
{
  "tag": "fp8-eagle3-spec3",
  "model_card": {
    "model_name": "qwen3-coder",
    "model_config": {"architectures": ["Qwen3NextForCausalLM"], "num_experts": 512, "...": "..."},
    "tuning": {"kv_cache_dtype": "fp8_e4m3", "speculative_config": "eagle3", "...": "..."},
    "gpu": {"name": "NVIDIA GB10", "...": "..."}
  },
  "dim1_throughput": {"avg_single_tok_s": 69.0, "raw_peak_tok_s": 469.3, "...": "..."},
  "dim2_quality": {"avg_quality": 0.81, "tool_accuracy": 0.96, "...": "..."},
  "saturation_curve": [{"concurrency": 1, "tok_s": 69.3}, {"concurrency": 8, "tok_s": 469.3}],
  "server_metrics": {"spec_acceptance_rate": 0.72, "gpu_prefix_cache_hit_rate": 0.92}
}
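The `saturation_curve` array lends itself to a simple derived number: how close aggregate throughput stays to ideal linear scaling from the single-request rate. The sketch below computes that ratio from the two points in the example above. `scaling_efficiency` is a hypothetical helper, and this ratio is not a metric PawBench is documented to report.

```python
def scaling_efficiency(curve):
    """Ratio of measured tok/s to ideal linear scaling from the concurrency-1 point."""
    base = next(p["tok_s"] for p in curve if p["concurrency"] == 1)
    return {p["concurrency"]: p["tok_s"] / (base * p["concurrency"]) for p in curve}


# The two points shown in the example result above.
curve = [{"concurrency": 1, "tok_s": 69.3}, {"concurrency": 8, "tok_s": 469.3}]

efficiency = scaling_efficiency(curve)
for n, eff in efficiency.items():
    print(f"concurrency {n}: {eff:.0%} of linear")
```

With these numbers, 8 concurrent requests deliver about 85% of what perfect linear scaling would give (469.3 against an ideal 8 × 69.3 = 554.4 tok/s).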
License
MIT