Open-source benchmark for LLM inference on agentic swarm workloads
Project description
The open-source benchmark for LLM inference under agentic swarm workloads
Created by — the AI-native cloud for agentic workloads
Quick Start • Why Agentic Swarm • Modes • Record & Replay • Tasks • Context Control • Reports • Docker
Why Agentic Swarm?
When Claude Code opens a file, reads 2,000 lines, edits three functions, runs tests, and reads the error output - that's 5+ LLM round-trips with 40-100K token contexts growing each turn. Every turn adds tool results, file contents, and error traces to the conversation.
No existing benchmark simulates this.
- SWE-bench measures model quality on GitHub issues. It doesn't measure inference speed.
- LMSys / Chatbot Arena measures chatbot throughput at ~2K context. Agentic swarm contexts are 20-80x larger.
- Generic LLM benchmarks send uniform requests. Agentic swarm workloads have system prompts with tool schemas, multi-turn history, code files, and growing context windows.
AgenticSwarmBench fills that gap - it benchmarks your LLM serving stack under the exact access patterns that Claude Code, Cursor, Windsurf, and Copilot generate.
| What makes it different | |
|---|---|
| Agentic swarm context | Pads requests with real-looking agentic sessions - system prompts with tool definitions, prior conversation turns, code files, tool call results, error traces |
| Growing context simulation | Profiles simulate how context grows during a real coding session: fresh (6K) → short (20K) → medium (40K) → long (70K) → full (100K) → xl (200K) → xxl (400K) |
| Prefix cache defeat | Unique per-request salt ensures you measure true cold-start inference, not cache hits |
| Cache impact measurement | --cache-mode both runs cold + warm to show exact prefix cache speedup (10x cost difference on Anthropic) |
| Reasoning token detection | Automatically detects thinking/reasoning tokens (DeepSeek R1, o3, Claude Extended Thinking) and reports thinking overhead vs visible output latency |
| 110 agentic swarm tasks | 5 difficulty tiers, 5 languages (Python, TypeScript, Rust, Go, SQL) - from single-function fixes to full-stack refactors |
| Record & replay | Capture real coding sessions as replayable workloads, then benchmark them against any endpoint |
| Five CLI modes | Speed, eval, agent, record, and replay - plus reporting and comparison |
| Docker one-liner | Point at any vLLM / SGLang / TGI / OpenAI-compatible endpoint and go |
Quick Start
Install
pip install agentic-swarm-bench
Or with proxy support (for agentic mode):
pip install "agentic-swarm-bench[proxy]"
Run against your own endpoint
# Quick speed test - 1 and 8 concurrent agents at fresh (6K) context
asb speed \
--endpoint http://localhost:8000 \
--model my-model \
--suite quick
# Full suite with report - sweeps all context sizes and concurrency levels
asb speed \
--endpoint http://localhost:8000 \
--model my-model \
--suite full \
--output report.md
asb is the short alias. agentic-swarm-bench also works.
Endpoint URL: Pass any URL. If it doesn't end with /v1/chat/completions, the path is appended automatically. Both of these work:
asb speed -e http://localhost:8000 -m my-model
asb speed -e https://api.example.com/v1/chat/completions -m my-model
Authentication: By default, --api-key is sent as Authorization: Bearer <key>. If your endpoint uses a different header:
asb speed -e URL -m MODEL -k MY_KEY --api-key-header X-API-Key
Dry run: Preview what will be sent without making requests:
asb speed -e URL -m MODEL --dry-run
Note: Some inference endpoints may not return detailed error messages on failure. Use
--dry-runto validate your configuration before running a full benchmark.
Docker
docker run --rm -v $(pwd)/results:/results \
swarmone/agentic-swarm-bench speed \
--endpoint http://host.docker.internal:8000 \
--model my-model \
--suite quick \
--output /results/report.md
Benchmark Modes
┌──────────────────────────────────────────────────────────────────────────────────────┐
│ AgenticSwarmBench │
├──────────────┬──────────────────┬──────────────────┬─────────────────────────────────┤
│ speed │ eval │ record / replay │ agent │
│ │ │ │ │
│ Direct to │ Send swarm │ Capture real │ Recording proxy between │
│ endpoint │ tasks, validate │ agentic sessions│ agent and endpoint │
│ │ generated code │ as JSONL, then │ │
│ Measures: │ Measures: │ replay anywhere │ Measures: │
│ TTFT │ Syntax pass │ │ Real agentic session metrics │
│ tok/s │ Execution pass │ Measures: │ Multi-turn latency growth │
│ ITL │ Functional │ Same as speed, │ Tool call overhead │
│ Prefill │ correctness │ from real data │ Context window scaling │
└──────────────┴──────────────────┴──────────────────┴─────────────────────────────────┘
asb speed - Inference Speed Under Agentic Load
Sends streaming requests with realistic agentic swarm context (system prompts, tool schemas, file contents, conversation history) directly to any OpenAI-compatible endpoint.
# Default: realistic context sweep simulating a coding session growing over time
asb speed -e http://localhost:8000 -m my-model
# Specific concurrency (32 concurrent agents) at long context
asb speed -e http://localhost:8000 -m my-model -u 32 -p long
# Fixed token count - stress test at exactly 50K tokens
asb speed -e http://localhost:8000 -m my-model -c 50000 -u 16
# Cap max users - run a full suite but limit concurrency to 16
asb speed -e http://localhost:8000 -m my-model --suite full --max-users 16
# Measure prefix cache impact - runs cold then warm
asb speed -e http://localhost:8000 -m my-model --cache-mode both
# JSON-only output (for CI/CD pipelines)
asb speed -e http://localhost:8000 -m my-model --format json -o results.json
# Randomize context per request (tests diverse prefill patterns)
asb speed -e http://localhost:8000 -m my-model --random-context
Metrics: TTFT, decode tok/s per user, prefill tok/s, ITL (p50/p95/p99), aggregate throughput, reasoning token overhead. When the endpoint returns prompt_tokens in the response, actual token counts are shown alongside estimates.
asb eval - Code Correctness
Sends agentic swarm tasks and validates the generated code at three levels:
# Syntax validation (does it parse?)
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax
# Execution validation (does it run?)
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution
# Functional validation (does it produce correct output?)
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v functional
asb agent - Full Agentic Session Benchmark
Runs a recording proxy between a real agent (Claude Code) and your endpoint, measuring actual multi-turn agentic sessions:
asb agent \
-e http://localhost:8000 \
-m my-model \
-t p1-p10
The proxy translates Anthropic Messages API → OpenAI Chat Completions API and records per-request timing, context growth, and tool call patterns.
asb list-tasks - Browse Available Tasks
asb list-tasks # Show all 110 tasks
asb list-tasks -t trivial # Filter by tier
asb list-tasks --tags typescript,rust # Filter by language
asb list-tasks --format json # JSON output
Workload Recording & Replay
Synthetic benchmarks are useful, but nothing beats measuring with your actual coding sessions. Record a real session, then replay it against any endpoint.
asb record - Capture a Real Session
Starts a recording proxy between your agent and your LLM endpoint. Every request/response pair is saved as a JSONL line:
# Record with an OpenAI-compatible upstream
asb record \
-e http://your-gpu-server:8000 \
-m your-model
# Record with Anthropic (auto-detected from URL)
asb record \
-e https://api.anthropic.com \
-m claude-sonnet-4-20250514 \
-k $ANTHROPIC_API_KEY \
--api-key-header x-api-key \
-o my-session.jsonl
# Custom output file and port
asb record \
-e http://your-gpu-server:8000 \
-m your-model \
-o my-session.jsonl \
-P 9000
Then point Claude Code at the proxy:
ANTHROPIC_BASE_URL=http://localhost:19000 claude
The recorder supports two upstream modes:
- OpenAI-compatible (default): translates Anthropic Messages API → OpenAI format before forwarding
- Anthropic passthrough: forwards requests natively to Anthropic's API - no translation, full fidelity. Auto-detected when the endpoint is
api.anthropic.com, or set explicitly with--upstream-api anthropic.
Both modes save the workload in OpenAI format for replay. Stop with Ctrl+C when done.
asb replay - Replay Against Any Endpoint
Take a recorded workload and replay it against a different endpoint, hardware, or configuration:
# Replay a session against a new endpoint
asb replay \
-e http://new-server:8000 \
-m my-model \
-w my-session.jsonl
# Generate a full report
asb replay \
-e http://new-server:8000 \
-m my-model \
-w my-session.jsonl \
-o report.md
# Preview without sending requests
asb replay -e URL -m MODEL -w session.jsonl --dry-run
# Replay just the beginning of a session (up to 1M cumulative prompt tokens)
asb replay -e URL -m MODEL -w session.jsonl --slice-tokens 1000000
Slicing workloads: Real sessions grow from small contexts to large ones. --slice-tokens N replays requests from the start until cumulative prompt tokens reach N - preserving the natural context growth while capping how much you send through the endpoint. Useful for targeting specific model context limits or keeping replay costs down.
Requests are grouped by context size and produce the same metrics as asb speed - TTFT, tok/s, ITL, and aggregate throughput.
asb list-workloads - Browse Built-in Workloads
asb list-workloads
asb list-workloads --format json
The 110 Tasks
Tasks simulate real agentic coding scenarios across 5 difficulty tiers and 5 languages:
| Tier | Range | What it simulates |
|---|---|---|
| 1 - Trivial | P1-P10 | Quick fixes: rename a variable, add a type hint, write a one-liner |
| 2 - Easy | P11-P25 | Single-file tasks: implement a function, write a CLI tool, parse a file |
| 3 - Medium | P26-P50 | Multi-function work: build an API endpoint, write tests, refactor a module |
| 4 - Hard | P51-P75 | Complex tasks: networking, concurrency, database queries, full programs |
| 5 - Expert | P76-P100 | Real-world projects: multi-file apps, distributed systems, full-stack |
| Multi-lang | P101-P110 | TypeScript, Rust, Go, SQL tasks across all difficulty levels |
Languages: Python (P1-P100), TypeScript, Rust, Go, SQL (P101-P110). Filter with --tags typescript,rust,go.
Tasks define what to generate. Context size is controlled separately - so you can benchmark a trivial fix inside a massive 100K-token coding session.
Context Control
Context size simulates where you are in a real coding session:
| Profile | Tokens | What it simulates |
|---|---|---|
fresh |
~6K | Just opened the project - system prompt + first question |
short |
~20K | A few turns in - read a couple files, made one edit |
medium |
~40K | Mid-session - several file reads, tool calls, error traces |
long |
~70K | Deep session - many edits, test runs, debugging cycles |
full |
~100K | Long session approaching context limit - everything accumulated |
xl |
~200K | Extended session - large codebases, long test output, multi-file edits |
xxl |
~400K | Maximum depth - for models with 400K+ context windows |
realistic |
Mixed | Sweeps fresh → full (default) - simulates a full session lifecycle |
Every request is padded with content that looks like a real agentic coding session:
- System prompt with tool schemas (Read, Write, Edit, Bash, Grep, etc.)
- Prior conversation turns with file contents
- Tool call results and error traces
- Growing context that mimics how sessions actually evolve
# Simulate a deep coding session (70K context)
asb speed -e URL -m MODEL --context-profile long
# Long-context models: test at 200K or 400K
asb speed -e URL -m MODEL --context-profile xl
asb speed -e URL -m MODEL --context-profile xxl
# Exact token count
asb speed -e URL -m MODEL --context-tokens 50000
# Default: sweeps fresh → short → medium → long → full
asb speed -e URL -m MODEL
Model Context Window
Use --model-context-length to tell the benchmark your model's maximum context window. Any profiles that exceed it are automatically skipped:
# Model supports up to 128K - xl (200K) and xxl (400K) are skipped automatically
asb speed -e URL -m MODEL --suite full --model-context-length 128000
# Model supports 400K - run everything including xxl
asb speed -e URL -m MODEL --context-profile xxl --model-context-length 400000
This is useful when running suites or realistic sweeps against models with different context limits - no need to manually pick a profile.
Prefix Cache Defeat
Every request includes a unique salt:
[session_id=abc123... ts=1234567890 rand=847291...]
This ensures prefix caching cannot mask cold-start prefill costs. Every measurement reflects true inference performance.
# Default: cache defeat enabled (cold-start measurement)
asb speed -e URL -m MODEL --defeat-cache
# Measure production-like performance with caching
asb speed -e URL -m MODEL --allow-cache
# Measure BOTH - shows exact cache speedup
asb speed -e URL -m MODEL --cache-mode both
--cache-mode both runs each scenario twice (first cold, then warm) and reports the delta. Anthropic charges 10x less for cached tokens ($0.30 vs $3.00/M), so knowing your cache hit rate matters.
Reasoning Token Detection
Models like DeepSeek R1, o3, and Claude Extended Thinking produce thinking tokens before visible output. AgenticSwarmBench automatically detects reasoning_content in the streaming response and reports:
| Metric | Description |
|---|---|
| TTFT (thinking) | Time to first reasoning token |
| TTFT (visible) | Time to first visible output token |
| Thinking overhead | Extra latency from reasoning before visible output |
| Thinking tokens | Count of reasoning tokens generated |
This is critical for agentic swarm workloads - a reasoning model that takes 5 seconds to "think" before emitting code changes the UX of the entire editing session.
What Good Looks Like
Reference ranges from common setups (your numbers will vary by hardware, model size, and serving stack):
| Setup | Context | Users | TTFT | Tok/s/user | Notes |
|---|---|---|---|---|---|
| vLLM on 1x A100 (80GB), 7B model | 6K | 1 | ~100ms | ~80-120 | Baseline: fast model, short context |
| vLLM on 1x A100 (80GB), 7B model | 40K | 8 | ~2-4s | ~20-40 | Typical agentic workload |
| vLLM on 1x A100 (80GB), 7B model | 100K | 32 | ~8-15s | ~5-15 | Stress test |
| SGLang on 1x H100, 70B model | 6K | 1 | ~200ms | ~40-60 | Larger model, faster GPU |
| SGLang on 1x H100, 70B model | 40K | 8 | ~3-6s | ~10-25 | Agentic sweet spot |
| API provider (e.g. Together, Fireworks) | 40K | 8 | ~2-8s | ~15-40 | Varies by provider/load |
Rules of thumb for agentic swarm workloads:
- TTFT < 3s at 40K context → responsive editing experience
- Tok/s > 30/user → code appears to stream smoothly
- TTFT < 10s at 100K context → acceptable for deep sessions
- Agg tok/s scales sub-linearly with users - expect ~60-70% efficiency at 8x concurrency
A note on ITL: Inter-Token Latency measures the gap between SSE
data:lines as received by the client. Very low values (< 1ms) typically reflect HTTP/TCP buffering, not actual token generation speed. Use tok/s as the primary throughput metric; ITL is best for relative comparisons across scenarios.
Reports
Reports are designed to answer one question fast: is this endpoint good enough for agentic swarm workloads?
Every report includes:
- Verdict - 🟢 GOOD / 🟡 MARGINAL / 🔴 POOR for agentic swarm workloads, with the key numbers
- Key Findings - auto-generated insights: TTFT scaling ratio, throughput range, concurrency efficiency, thinking overhead
- Summary table - TTFT, tok/s, ITL at each concurrency level and context size, with color-coded grade icons per row
- What This Means for Agentic Swarm - maps raw metrics to user experience ("instant response", "smooth streaming", "sluggish, frustrating")
- Context Scaling chart - ASCII chart showing how TTFT and tok/s change as context grows
- Concurrency Scaling - efficiency percentages at each concurrency level with color grades
- Per-profile breakdown - detailed numbers per context size
- Reasoning token analysis - thinking overhead when using reasoning models
- Methodology - what was measured, how, and what the grade thresholds mean
The CLI also prints a final verdict line after every benchmark:
Verdict: GOOD for agentic swarm workloads at medium context
Compare two runs (includes head-to-head table, ASCII bar chart, and winner summary):
asb compare --baseline a.json --candidate b.json -o comparison.md
Docker
Build
docker build -t swarmone/agentic-swarm-bench .
Run
# Speed benchmark
docker run --rm -v $(pwd)/results:/results \
swarmone/agentic-swarm-bench speed \
-e http://host.docker.internal:8000 \
-m my-model --suite full \
-o /results/report.md
# Recording proxy for agentic mode
docker-compose up proxy
Docker Compose
export ASB_ENDPOINT=http://your-gpu-server:8000
export ASB_MODEL=your-model-name
docker-compose run agentic-swarm-bench
Configuration
AgenticSwarmBench merges configuration from four sources (highest priority first):
- CLI arguments -
--endpoint,--model,--context-tokens, etc. - Environment variables -
ASB_ENDPOINT,ASB_MODEL, etc. - YAML config file -
asb --config bench.yml speed ... - Defaults - sensible defaults for everything
YAML Config Example
# bench.yml
endpoint: http://my-gpu-server:8000
model: my-model
suite: standard
defeat_cache: true
Environment Variables
| Variable | Description |
|---|---|
ASB_ENDPOINT |
OpenAI-compatible endpoint URL |
ASB_MODEL |
Model name |
ASB_API_KEY |
API key for the endpoint |
ASB_CONTEXT_TOKENS |
Default context size in tokens |
ASB_CONTEXT_PROFILE |
Default context profile |
ASB_MODEL_CONTEXT_LENGTH |
Model's max context window - skips larger scenarios |
ASB_DEFEAT_CACHE |
Defeat prefix caching (true/false) |
Architecture
agentic-swarm-bench/
agentic_swarm_bench/
cli.py ← Click CLI (asb speed | eval | agent | record | replay | ...)
config.py ← Config: CLI > env > YAML > defaults
tasks/
tasks.json ← 110 agentic swarm tasks, P1-P110
registry.py ← Load/filter tasks by tier, range, tags, language
context/
codebase_context.py ← Agentic session context: tool schemas, file contents, cache defeat salt
runner/
direct.py ← Speed mode: direct endpoint benchmark with agentic context
eval_runner.py ← Eval mode: code correctness validation
claude_code.py ← Agent mode: Claude Code orchestration through recording proxy
workloads/
recorder.py ← Recording proxy: captures real sessions as JSONL workloads
player.py ← Replay engine: replays workloads against any endpoint
registry.py ← Load/list/resolve workloads (file path or built-in name)
data/ ← Built-in workload files
proxy/
server.py ← Agent-mode proxy (FastAPI) - Anthropic ↔ OpenAI translation
padding.py ← Context padding for proxy mode
translators.py ← API format translation
metrics/
collector.py ← Per-request metrics: TTFT, tok/s, ITL, thinking tokens
stats.py ← Statistical analysis (p50, p95, p99, distributions)
report/
markdown.py ← Markdown report: verdict, insights, grades, ASCII charts
skill/
SKILL.md ← Claude Code skill: auto-optimize LLM deployments using asb
Claude Code Optimization Skill
The repo includes a Claude Code skill (skill/SKILL.md) that turns Claude Code into an automated deployment optimizer. Point it at your serving stack and it will:
- Run
asb speedto establish a baseline - Read the verdict and key findings
- Identify the bottleneck (prefill-bound, decode-bound, scheduling, or context scaling)
- Tweak one deployment knob (tensor parallelism, batch size, chunked prefill, etc.)
- Re-run and compare - repeat until targets are met or 5 iterations show no improvement
# Add the skill to Claude Code, then ask:
# "Optimize my vLLM deployment at http://localhost:8000 for agentic workload"
See skill/SKILL.md for the full skill definition and available knobs.
Contributing
We welcome contributions! Here's how to get started:
git clone https://github.com/swarmone/agentic-swarm-bench.git
cd agentic-swarm-bench
pip install -e ".[dev,proxy]"
make test
Development
make lint # Check code style
make format # Auto-format
make test # Run tests
Adding tasks
Tasks are defined in agentic_swarm_bench/tasks/tasks.json. Each task has:
id: P1 through P110tier: trivial, easy, medium, hard, expertprompt: the agentic swarm tasktags: categorization (language, domain)max_output_tokens: token limit for the response
License
Apache 2.0 - see LICENSE.
Built by SwarmOne
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agentic_swarm_bench-1.0.0.tar.gz.
File metadata
- Download URL: agentic_swarm_bench-1.0.0.tar.gz
- Upload date:
- Size: 1.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2536b4e0185cbb571f5a096881213b12c91c334d7dda82cdf0685d5f759057a9
|
|
| MD5 |
c20790e5558accf7eb83743ae45d1de6
|
|
| BLAKE2b-256 |
878ff616fd78ca06facd08f1ec9dd256bc3d7e1ea990d5cbc915222c100b8128
|
File details
Details for the file agentic_swarm_bench-1.0.0-py3-none-any.whl.
File metadata
- Download URL: agentic_swarm_bench-1.0.0-py3-none-any.whl
- Upload date:
- Size: 74.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
efffccba2d7a176dac2c2dd6e62db35d0fe99cadf61015f935e41ef670d05a64
|
|
| MD5 |
0b388d952ce27f4f02ed4a5e76e64b3d
|
|
| BLAKE2b-256 |
41082f4e14003527231047d8eb6629a895bb8fe788d9e2d0addd80e871852597
|