AgenticSwarmBench
The open-source benchmark for LLM inference under agentic scenarios
Created by SwarmOne - the AI-native cloud for agentic scenarios
Quick Start • Why Agentic Swarm • Built-in Scenarios • Record & Replay • Modes • Tasks • Context Control • Cache Poisoning • Reports • Docker
Why Agentic Swarm?
When Claude Code opens a file, reads 2,000 lines, edits three functions, runs tests, and reads the error output - that's 5+ LLM round-trips with a 40-100K-token context that grows every turn. Each turn adds tool results, file contents, and error traces to the conversation.
No existing benchmark simulates this.
- SWE-bench measures model quality on GitHub issues. It doesn't measure inference speed.
- LMSys / Chatbot Arena measures chatbot throughput at ~2K context. Agentic swarm contexts are 20-80x larger.
- Generic LLM benchmarks send uniform requests. Agentic swarm scenarios have system prompts with tool schemas, multi-turn history, code files, and growing context windows.
AgenticSwarmBench fills that gap - it benchmarks your LLM serving stack under the exact access patterns that Claude Code, Cursor, Windsurf, and Copilot generate.
| What makes it different | Details |
|---|---|
| Record & replay | The headline feature. Capture real coding sessions as replayable JSONL scenarios, then benchmark them against any endpoint. Your actual traffic, your actual context patterns - no synthetic approximation needed. |
| Three benchmark modes | Record/replay (your real sessions), speed (synthetic load), agent (real multi-turn) - plus reporting and comparison |
| Agentic swarm context | Pads requests with real-looking agentic sessions - system prompts with tool definitions, prior conversation turns, code files, tool call results, error traces |
| Growing context simulation | Profiles simulate how context grows during a real coding session: fresh (6K) → short (20K) → medium (40K) → long (70K) → full (100K) → xl (200K) → xxl (400K) |
| Prefix cache poisoning | Built-in scenarios ship with pre-poisoned recordings that break prefix caching without adding artificial content, ensuring valid cold-start measurements for one pass per task |
| Reasoning token detection | Automatically detects thinking/reasoning tokens (DeepSeek R1, o3, Claude Extended Thinking) and reports thinking overhead vs visible output latency |
| 110 agentic swarm tasks | 5 difficulty tiers, 5 languages (Python, TypeScript, Rust, Go, SQL) - from single-function fixes to full-stack refactors |
| Docker one-liner | Point at any vLLM / SGLang / TGI / OpenAI-compatible endpoint and go |
Quick Start
Install
uv pip install agentic-swarm-bench # with uv (recommended)
pip install agentic-swarm-bench # or with pip
uv pip install "agentic-swarm-bench[proxy]" # add proxy support (for agent / record modes)
Quick smoke test (30 seconds)
Verify your endpoint works before doing anything else - replay the built-in trivial-qa scenario:
asb replay -e http://your-server:8000 -m your-model --scenario trivial-qa
This fires 5 trivial single-turn requests (~20 tokens each) and reports TTFT and tok/s. If numbers come back, your setup works. Then try the real thing:
asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus
This replays 5 multi-turn agentic coding sessions (REST API, CLI tool, WebSocket chat, audit trail, search/batch) recorded with Claude Opus 4.6 - growing context from ~1K to ~40K chars across 4 turns each.
Record a real session, then replay it anywhere
The fastest way to get meaningful numbers: record what you actually do, then replay it.
# 1. Start the recording proxy
asb record -e http://your-gpu-server:8000 -m your-model
# 2. Point your agent at the proxy (runs on localhost:19000)
ANTHROPIC_BASE_URL=http://localhost:19000 claude
# 3. Do your normal work. Ctrl+C when done. You now have a .jsonl recording.
# 4. Replay that session against any endpoint
asb replay -e http://new-server:8000 -m my-model --scenario my-session.jsonl
This captures your real context patterns, real token counts, and real multi-turn behavior - then lets you A/B test endpoints with your actual workload.
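Recordings are plain JSONL, so you can sanity-check one before replaying it. A minimal sketch in Python, assuming each line is a JSON object with the OpenAI-format request under a `request` key (the field names here are illustrative - check your own recording's schema):

```python
import json

# Summarize context growth across a recorded session.
# "request" and "messages" are assumed field names, not a documented schema.
with open("my-session.jsonl") as f:
    for turn, line in enumerate(f, start=1):
        entry = json.loads(line)
        messages = entry.get("request", {}).get("messages", [])
        chars = sum(len(str(m.get("content", ""))) for m in messages)
        print(f"turn {turn}: {len(messages)} messages, ~{chars} chars of context")
```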
Run a synthetic speed test
If you don't have a recording yet, the speed mode generates realistic agentic context synthetically:
# Quick speed test - 1 and 8 concurrent agents at fresh (6K) context
asb speed \
--endpoint http://localhost:8000 \
--model my-model \
--suite quick
# Full suite with report - sweeps all context sizes and concurrency levels
asb speed \
--endpoint http://localhost:8000 \
--model my-model \
--suite full \
--output report.md
`asb` is the short alias; `agentic-swarm-bench` also works.
Endpoint URL: Pass any URL. If it doesn't end with /v1/chat/completions, the path is appended automatically. Both of these work:
asb speed -e http://localhost:8000 -m my-model
asb speed -e https://api.example.com/v1/chat/completions -m my-model
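The appending rule is mechanical; here is a sketch of the documented behavior (not ASB's actual code):

```python
def normalize_endpoint(url: str) -> str:
    # Append the chat-completions path unless the URL already ends with it.
    suffix = "/v1/chat/completions"
    url = url.rstrip("/")
    return url if url.endswith(suffix) else url + suffix

print(normalize_endpoint("http://localhost:8000"))
# -> http://localhost:8000/v1/chat/completions
```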
Authentication: By default, --api-key is sent as Authorization: Bearer <key>. If your endpoint uses a different header:
asb speed -e URL -m MODEL -k MY_KEY --api-key-header X-API-Key
Dry run: Preview what will be sent without making requests:
asb speed -e URL -m MODEL --dry-run
Note: Some inference endpoints may not return detailed error messages on failure. Use `--dry-run` to validate your configuration before running a full benchmark.
Docker
docker run --rm -v $(pwd)/results:/results \
swarmone/agentic-swarm-bench speed \
--endpoint http://host.docker.internal:8000 \
--model my-model \
--suite quick \
--output /results/report.md
Built-in Scenarios
Two ready-made scenarios ship with the package so you can benchmark immediately - no recording needed:
| Scenario | Type | Tasks | Turns/task | Context | What it measures |
|---|---|---|---|---|---|
| `trivial-qa` | Non-agentic baseline | 5 | 1 | ~20 tokens each | Raw single-turn speed (TTFT, tok/s) |
| `js-coding-opus` | Real agentic sessions | 5 | 4 | ~1K → ~40K chars | Multi-turn agentic performance with growing context |
# List all built-in scenarios
asb list-scenarios
# Quick smoke test - 5 trivial questions, ~20 tokens each
asb replay -e http://your-server:8000 -m your-model --scenario trivial-qa
# Real agentic workload - 5 JS coding sessions recorded with Claude Opus 4.6
asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus
# Replay a single task from a scenario
asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus --task build-rest-api
# Run multiple repetitions for stable numbers
asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus --repetitions 3
trivial-qa - Five trivial single-turn questions (capital of France, largest planet, boiling point of water, speed of light, binary conversion). Non-agentic baseline with minimal context. Useful as a quick smoke test and for comparing agentic vs non-agentic performance on the same endpoint.
js-coding-opus - Five independent JavaScript coding sessions (rate limiting middleware, CLI admin tool, WebSocket real-time updates, activity log/audit trail, search & batch operations). Each task has 4 turns of real multi-turn conversation with growing context. Recorded with Claude Opus 4.6 against a TaskFlow API project.
Benchmark Modes
┌───────────────────────────────────────────────────────────────────────────────┐
│ AgenticSwarmBench │
├──────────────────────┬──────────────────┬─────────────────────────────────────┤
│ asb record / replay │ asb speed │ asb agent │
│ ★ recommended │ │ │
│ │ Synthetic │ Runs Claude Code (or any agent) │
│ Capture YOUR real │ agentic context │ end-to-end with benchmark tasks │
│ coding sessions as │ → endpoint │ through a metrics proxy │
│ JSONL, then replay │ │ │
│ against any endpoint│ 1 request per │ 5-15 real requests per task │
│ │ measurement │ with tool use, file I/O, │
│ Real multi-turn │ │ growing context │
│ conversations with │ Measures: │ │
│ your actual context │ TTFT, tok/s │ Measures: │
│ │ ITL, prefill │ Multi-turn latency compounding │
│ Measures: │ │ Context growth over a session │
│ Same metrics, but │ │ │
│ from YOUR real data │ │ │
└──────────────────────┴──────────────────┴─────────────────────────────────────┘
Scenario Recording & Replay
This is the most valuable way to benchmark. Synthetic load tells you what an endpoint can do in theory. Record/replay tells you what it actually does with your traffic. Record a real coding session once, then replay that exact sequence of requests against any endpoint, hardware config, or model - same context, same token counts, same multi-turn patterns.
Why this matters: agentic sessions have a unique shape. Context starts small and grows unpredictably. Some turns are tiny follow-ups; others dump 20K tokens of file contents. Synthetic benchmarks can approximate this, but a recording captures the real thing.
asb record - Capture a Real Session
Starts a recording proxy between your agent and your LLM endpoint. Every request/response pair is saved as a JSONL recording:
# Record with an OpenAI-compatible upstream
asb record \
-e http://your-gpu-server:8000 \
-m your-model
# Record with Anthropic (auto-detected from URL)
asb record \
-e https://api.anthropic.com \
-m claude-sonnet-4-20250514 \
-k $ANTHROPIC_API_KEY \
--api-key-header x-api-key \
-o my-session.jsonl
# Custom output file and port
asb record \
-e http://your-gpu-server:8000 \
-m your-model \
-o my-session.jsonl \
-P 9000
Then point Claude Code at the proxy:
ANTHROPIC_BASE_URL=http://localhost:19000 claude
The recorder supports two upstream modes:
- OpenAI-compatible (default): translates Anthropic Messages API → OpenAI format before forwarding
- Anthropic passthrough: forwards requests natively to Anthropic's API - no translation, full fidelity. Auto-detected when the endpoint is `api.anthropic.com`, or set explicitly with `--upstream-api anthropic`.
Both modes save the recording in OpenAI format for replay. Stop with Ctrl+C when done.
asb replay - Replay Against Any Endpoint
Take a recorded scenario and replay it against a different endpoint, hardware, or configuration:
# Replay a session against a new endpoint
asb replay \
-e http://new-server:8000 \
-m my-model \
--scenario my-session.jsonl
# Replay a scenario directory with schedule
asb replay \
-e http://new-server:8000 \
-m my-model \
--scenario ./scenarios/my-scenario/ \
--repetitions 3 --max-concurrent 5 --policy sequential
# Replay a built-in scenario (recordings include cache-defeat treatments)
asb replay -e URL -m MODEL --scenario NAME
# Preview without sending requests
asb replay -e URL -m MODEL --scenario session.jsonl --dry-run
# Replay just the beginning of a session (up to 1M cumulative prompt tokens)
asb replay -e URL -m MODEL --scenario session.jsonl --slice-tokens 1000000
Scheduling: Control how tasks execute with --repetitions, --max-concurrent, and --policy (round_robin, sequential).
Cache defeat: Built-in scenarios include cache-defeat treatments in their recordings. Results are valid for one pass per task. User-recorded scenarios (asb record output) replay as-is with no cache-defeat treatment. See Prefix Cache Poisoning for details.
History mode: The default (--history-mode live) captures the server's actual responses during streaming and feeds them into the next turn's conversation history. This is essential for correct prefix-cache measurement when replaying against a model different from the one that made the recording - without it, recorded assistant messages from the original model cause KV-cache prefix mismatches on every turn. Use --history-mode recorded for the legacy behavior of sending each entry's recorded messages verbatim.
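Conceptually, live history mode substitutes what this server actually said for what the recorded model said. An illustrative sketch (not ASB's implementation):

```python
def splice_live_history(recorded_messages: list[dict], live_responses: list[str]) -> list[dict]:
    # Replace recorded assistant turns with the responses captured live
    # from the server under test, so the next request's prefix matches
    # what this model really generated (keeping KV-cache prefixes valid).
    out, i = [], 0
    for msg in recorded_messages:
        if msg["role"] == "assistant" and i < len(live_responses):
            out.append({"role": "assistant", "content": live_responses[i]})
            i += 1
        else:
            out.append(msg)
    return out
```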
Slicing scenarios: Real sessions grow from small contexts to large ones. --slice-tokens N replays requests from the start until cumulative prompt tokens reach N.
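The slicing rule is a simple cumulative cutoff; a sketch under the assumption that each recorded entry carries a prompt-token count (the real field name may differ):

```python
def slice_by_tokens(entries: list[dict], max_tokens: int) -> list[dict]:
    # Keep entries from the start of the session until cumulative
    # prompt tokens would exceed the cap (--slice-tokens N).
    total, kept = 0, []
    for entry in entries:
        total += entry["prompt_tokens"]  # assumed field name
        if total > max_tokens:
            break
        kept.append(entry)
    return kept
```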
Output modes: --verbose (-V) shows a Rich live-updating table with per-task progress, phase, request counts, and decode tok/s.
Early abort: --max-consecutive-failures N stops the entire run if any worker slot hits N consecutive failures (HTTP errors, timeouts). Useful when pointing at an endpoint that may be down or producing garbage:
asb replay -e URL -m MODEL --scenario js-coding-opus --max-consecutive-failures 5
asb list-scenarios - Browse Built-in Scenarios
asb list-scenarios
asb list-scenarios --format json
asb speed - Inference Speed Under Agentic Load
When you don't have a recording yet, or want to test at specific context sizes and concurrency levels, asb speed generates realistic agentic context synthetically. Each request is padded so the model sees what it would see in a real coding session - system prompts with tool schemas, multi-turn conversation history, file contents, and error traces:
┌─ system ───────────────────────────────────────────────────────────────────┐
│ "You are an expert software engineer assistant integrated into a code │
│ editor. You have access to the user's full project codebase..." │
└────────────────────────────────────────────────────────────────────────────┘
┌─ user ─────────────────────────────────────────────────────────────────────┐
│ <tool name="Read">...</tool> ← tool definitions (Read, Write, │
│ <tool name="Write">...</tool> Edit, Bash, Grep, etc.) │
│ │
│ <user_turn> Review src/auth/middleware.py ← synthetic prior conversation │
│ ```python turns with code files, error │
│ def handle_request(...) traces, and assistant replies │
│ ``` (repeated to fill target │
│ </user_turn> context size) │
│ <assistant_turn> I can see the issue... │
│ </assistant_turn> │
│ │
│ --- │
│ Based on the codebase above, <task prompt from tasks.json> │
└────────────────────────────────────────────────────────────────────────────┘
The task prompt (e.g. "Write a Python function that takes a list of integers and returns the largest one") comes from the 110 built-in tasks. The padding around it is what makes this an agentic benchmark - it simulates the accumulated context of a real coding session.
# Default: sweeps context sizes from fresh (6K) → full (100K)
asb speed -e http://localhost:8000 -m my-model
# Specific concurrency (32 concurrent agents) at long context
asb speed -e http://localhost:8000 -m my-model -u 32 -p long
# Fixed token count - stress test at exactly 50K tokens
asb speed -e http://localhost:8000 -m my-model -c 50000 -u 16
# Cap max users - run a full suite but limit concurrency to 16
asb speed -e http://localhost:8000 -m my-model --suite full --max-users 16
# Measure prefix cache impact - runs all cold, then all warm
asb speed -e http://localhost:8000 -m my-model --cache-mode realistic
# JSON-only output (for CI/CD pipelines)
asb speed -e http://localhost:8000 -m my-model --format json -o results.json
Metrics: TTFT, decode tok/s per user, prefill tok/s, ITL (p50/p95/p99), aggregate throughput, reasoning token overhead. When the endpoint returns prompt_tokens in the response, actual token counts are shown alongside estimates.
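With `--format json`, a CI job can gate on the numbers. A hypothetical gate - the `results` / `ttft_p50_ms` / `decode_tok_s` field names below are illustrative, so inspect your own results.json for the real schema:

```python
import json
import sys

# Fail the pipeline if any run blows the latency or throughput budget.
with open("results.json") as f:
    data = json.load(f)

for run in data.get("results", []):  # assumed schema
    ttft_ms = run.get("ttft_p50_ms", 0)
    tok_s = run.get("decode_tok_s", float("inf"))
    if ttft_ms > 3000 or tok_s < 30:
        sys.exit(f"FAIL: TTFT {ttft_ms}ms / {tok_s} tok/s out of budget")

print("All runs within budget")
```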
asb agent - End-to-End Agent Benchmark
The other modes measure individual requests. asb agent measures what it feels like to use an endpoint - it runs a real agent process (Claude Code by default) end-to-end and records timing for every LLM call across the entire multi-turn session.
Here's what a single task run looks like:
You run: asb agent -e http://localhost:8000 -m my-model -t p1-p10
What happens for each task:
┌─────────────┐ ┌─────────────────┐ ┌──────────────┐
│ Claude Code │ ──────► │ ASB proxy │ ──────► │ Your endpoint│
│ (real agent)│ ◄────── │ (translates │ ◄────── │ (vLLM, etc.) │
│ │ │ Anthropic → │ │ │
│ reads files │ │ OpenAI, logs │ │ │
│ writes code │ │ per-request │ │ │
│ runs tests │ │ timing) │ │ │
│ iterates │ │ │ │ │
└─────────────┘ └─────────────────┘ └──────────────┘
Turn 1: Claude reads the task → 6K context → TTFT 200ms
Turn 2: Claude reads 3 files → 25K context → TTFT 800ms
Turn 3: Claude writes code → 35K context → TTFT 1.2s
Turn 4: Claude runs tests, gets errors → 50K context → TTFT 2.1s
Turn 5: Claude fixes the code → 60K context → TTFT 3.5s
Turn 6: Claude runs tests again → 70K context → TTFT 4.8s
...
This captures latency compounding over a real session. Each turn's context naturally grows because it includes prior turns, file contents, tool outputs, and error traces. The proxy records TTFT, tok/s, and context size for every request.
asb agent -e http://localhost:8000 -m my-model -t p1-p10
# Use a different agent (any CLI that accepts a prompt)
asb agent -e http://localhost:8000 -m my-model -t p1-p10 --agent-cmd my-agent
Record/Replay vs Speed vs Agent:
| | `record` / `replay` | `speed` | `agent` |
|---|---|---|---|
| What talks to your endpoint | You during record, ASB during replay | ASB directly (one synthetic request) | A real agent (Claude Code) through a proxy |
| Number of requests per task | Whatever the real session had | 1 | 5-15+ (real tool-use turns) |
| Context | Your actual session context | Synthetic padding to target size | Grows naturally as the agent works |
| Use case | Benchmark with your real traffic | Raw throughput at controlled sizes | "What does it feel like to use this endpoint?" |
asb eval - Code Correctness (experimental)
Optional mode that sends the same tasks with agentic context, but validates the generated code instead of measuring speed. Useful for checking if your model still produces correct code under large-context pressure.
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax # does it parse?
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution # does it run?
asb list-tasks - Browse Available Tasks
asb list-tasks # Show all 110 tasks
asb list-tasks -t trivial # Filter by tier
asb list-tasks --tags typescript,rust # Filter by language
asb list-tasks --format json # JSON output
The 110 Tasks
Tasks simulate real agentic coding scenarios across 5 difficulty tiers and 5 languages:
| Tier | Range | What it simulates |
|---|---|---|
| 1 - Trivial | P1-P10 | Quick fixes: rename a variable, add a type hint, write a one-liner |
| 2 - Easy | P11-P25 | Single-file tasks: implement a function, write a CLI tool, parse a file |
| 3 - Medium | P26-P50 | Multi-function work: build an API endpoint, write tests, refactor a module |
| 4 - Hard | P51-P75 | Complex tasks: networking, concurrency, database queries, full programs |
| 5 - Expert | P76-P100 | Real-world projects: multi-file apps, distributed systems, full-stack |
| Multi-lang | P101-P110 | TypeScript, Rust, Go, SQL tasks across all difficulty levels |
Languages: Python (P1-P100), TypeScript, Rust, Go, SQL (P101-P110). Filter with --tags typescript,rust,go.
Tasks define what to generate. Context size is controlled separately - so you can benchmark a trivial fix inside a massive 100K-token coding session.
Context Control
Context size simulates where you are in a real coding session:
| Profile | Tokens | What it simulates |
|---|---|---|
| `fresh` | ~6K | Just opened the project - system prompt + first question |
| `short` | ~20K | A few turns in - read a couple files, made one edit |
| `medium` | ~40K | Mid-session - several file reads, tool calls, error traces |
| `long` | ~70K | Deep session - many edits, test runs, debugging cycles |
| `full` | ~100K | Long session approaching context limit - everything accumulated |
| `xl` | ~200K | Extended session - large codebases, long test output, multi-file edits |
| `xxl` | ~400K | Maximum depth - for models with 400K+ context windows |
| `realistic` | Mixed | Sweeps fresh → full (default) - simulates a full session lifecycle |
Every request is padded with content that looks like a real agentic coding session:
- System prompt with tool schemas (Read, Write, Edit, Bash, Grep, etc.)
- Prior conversation turns with file contents
- Tool call results and error traces
- Growing context that mimics how sessions actually evolve
# Simulate a deep coding session (70K context)
asb speed -e URL -m MODEL --context-profile long
# Long-context models: test at 200K or 400K
asb speed -e URL -m MODEL --context-profile xl
asb speed -e URL -m MODEL --context-profile xxl
# Exact token count
asb speed -e URL -m MODEL --context-tokens 50000
# Default: sweeps fresh → short → medium → long → full
asb speed -e URL -m MODEL
Model Context Window
Use --model-context-length to tell the benchmark your model's maximum context window. Any profiles that exceed it are automatically skipped:
# Model supports up to 128K - xl (200K) and xxl (400K) are skipped automatically
asb speed -e URL -m MODEL --suite full --model-context-length 128000
# Model supports 400K - run everything including xxl
asb speed -e URL -m MODEL --context-profile xxl --model-context-length 400000
This is useful when running suites or realistic sweeps against models with different context limits - no need to manually pick a profile.
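The skip rule just compares each profile's target size to the window you declare; a sketch using the profile sizes from the table above:

```python
PROFILE_TOKENS = {
    "fresh": 6_000, "short": 20_000, "medium": 40_000, "long": 70_000,
    "full": 100_000, "xl": 200_000, "xxl": 400_000,
}

def runnable_profiles(model_context_length: int) -> list[str]:
    # Drop any profile whose padded context would exceed the model's window.
    return [name for name, tokens in PROFILE_TOKENS.items()
            if tokens <= model_context_length]

print(runnable_profiles(128_000))  # ['fresh', 'short', 'medium', 'long', 'full']
```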
Prefix Cache Poisoning
LLM inference engines cache the KV state of common prefixes so repeated requests skip prefill. This makes benchmarks look artificially fast - you're measuring cache hits, not real inference.
AgenticSwarmBench defeats the prefix cache using pre-poisoned recordings. Each task in a built-in scenario has a unique text treatment applied offline that invalidates the KV cache without altering the semantic content the model sees.
This mimics what actually happens in real coding sessions: when an agent edits a file mid-conversation, the context changes from the edit point onward, breaking the cache naturally.
- Built-in scenarios ship with pre-poisoned recordings. Each task has a unique treatment, so results are valid for one pass per task.
- User-recorded scenarios (`asb record` output) replay as-is with no cache-defeat treatment.
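For intuition, the simplest family of cache-defeat treatments makes the earliest bytes of each prompt unique, so no two passes share a cacheable prefix. A generic illustration - not the specific offline treatment the built-in recordings use:

```python
import uuid

def defeat_prefix_cache(system_prompt: str) -> str:
    # Prepend a unique, semantically inert tag so the KV cache never
    # sees a repeated prefix across passes.
    return f"[session-{uuid.uuid4().hex}]\n{system_prompt}"
```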
Reasoning Token Detection
Models like DeepSeek R1, o3, and Claude Extended Thinking produce thinking tokens before visible output. AgenticSwarmBench automatically detects reasoning_content in the streaming response and reports:
| Metric | Description |
|---|---|
| TTFT (thinking) | Time to first reasoning token |
| TTFT (visible) | Time to first visible output token |
| Thinking overhead | Extra latency from reasoning before visible output |
| Thinking tokens | Count of reasoning tokens generated |
This is critical for agentic swarm scenarios - a reasoning model that takes 5 seconds to "think" before emitting code changes the UX of the entire editing session.
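The detection itself is a per-chunk check on the streaming delta. A minimal sketch of the idea, assuming an OpenAI-style stream where thinking tokens arrive in a `reasoning_content` field:

```python
import time

def track_ttft(delta: dict, t0: float, timings: dict) -> None:
    # Record separate time-to-first-token for reasoning vs visible output.
    now = time.monotonic()
    if delta.get("reasoning_content") and "ttft_thinking" not in timings:
        timings["ttft_thinking"] = now - t0
    if delta.get("content") and "ttft_visible" not in timings:
        timings["ttft_visible"] = now - t0

# Thinking overhead = timings["ttft_visible"] - timings["ttft_thinking"]
```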
What Good Looks Like
Reference ranges from common setups (your numbers will vary by hardware, model size, and serving stack):
| Setup | Context | Users | TTFT | Tok/s/user | Notes |
|---|---|---|---|---|---|
| vLLM on 1x A100 (80GB), 7B model | 6K | 1 | ~100ms | ~80-120 | Baseline: fast model, short context |
| vLLM on 1x A100 (80GB), 7B model | 40K | 8 | ~2-4s | ~20-40 | Typical agentic scenario |
| vLLM on 1x A100 (80GB), 7B model | 100K | 32 | ~8-15s | ~5-15 | Stress test |
| SGLang on 1x H100, 70B model | 6K | 1 | ~200ms | ~40-60 | Larger model, faster GPU |
| SGLang on 1x H100, 70B model | 40K | 8 | ~3-6s | ~10-25 | Agentic sweet spot |
| API provider (e.g. Together, Fireworks) | 40K | 8 | ~2-8s | ~15-40 | Varies by provider/load |
Rules of thumb for agentic swarm scenarios:
- TTFT < 3s at 40K context → responsive editing experience
- Tok/s > 30/user → code appears to stream smoothly
- TTFT < 10s at 100K context → acceptable for deep sessions
- Agg tok/s scales sub-linearly with users - expect ~60-70% efficiency at 8x concurrency (see the sketch below)
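Here, efficiency is assumed to mean aggregate throughput relative to perfect linear scaling from the single-user baseline:

```python
def concurrency_efficiency(agg_tok_s_at_n: float, tok_s_at_1: float, n: int) -> float:
    # 8 users at 100 tok/s each would ideally give 800 tok/s aggregate;
    # 520 tok/s measured means 65% efficiency.
    return agg_tok_s_at_n / (tok_s_at_1 * n)

print(f"{concurrency_efficiency(520.0, 100.0, 8):.0%}")  # 65%
```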
A note on ITL: Inter-Token Latency measures the gap between SSE `data:` lines as received by the client. Very low values (< 1ms) typically reflect HTTP/TCP buffering, not actual token generation speed. Use tok/s as the primary throughput metric; ITL is best for relative comparisons across scenarios.
Reports
Reports are designed to answer one question fast: is this endpoint good enough for agentic swarm scenarios?
Every report includes:
- Verdict - 🟢 GOOD / 🟡 MARGINAL / 🔴 POOR for agentic swarm scenarios, with the key numbers
- Key Findings - auto-generated insights: TTFT scaling ratio, throughput range, concurrency efficiency, thinking overhead
- Summary table - TTFT, tok/s, ITL at each concurrency level and context size, with color-coded grade icons per row
- What This Means for Agentic Swarm - maps raw metrics to user experience ("instant response", "smooth streaming", "sluggish, frustrating")
- Context Scaling chart - ASCII chart showing how TTFT and tok/s change as context grows
- Concurrency Scaling - efficiency percentages at each concurrency level with color grades
- Per-profile breakdown - detailed numbers per context size
- Reasoning token analysis - thinking overhead when using reasoning models
- Methodology - what was measured, how, and what the grade thresholds mean
The CLI also prints a final verdict line after every benchmark:
Verdict: GOOD for agentic swarm scenarios at medium context
Compare two runs (includes head-to-head table, ASCII bar chart, and winner summary):
asb compare --baseline a.json --candidate b.json -o comparison.md
Docker
Build
docker build -t swarmone/agentic-swarm-bench .
Run
# Speed benchmark
docker run --rm -v $(pwd)/results:/results \
swarmone/agentic-swarm-bench speed \
-e http://host.docker.internal:8000 \
-m my-model --suite full \
-o /results/report.md
# Recording proxy for agentic mode
docker-compose up proxy
Docker Compose
export ASB_ENDPOINT=http://your-gpu-server:8000
export ASB_MODEL=your-model-name
docker-compose run agentic-swarm-bench
Configuration
AgenticSwarmBench merges configuration from four sources (highest priority first):
- CLI arguments - `--endpoint`, `--model`, `--context-tokens`, etc.
- Environment variables - `ASB_ENDPOINT`, `ASB_MODEL`, etc.
- YAML config file - `asb --config bench.yml speed ...`
- Defaults - sensible defaults for everything
YAML Config Example
# bench.yml
endpoint: http://my-gpu-server:8000
model: my-model
suite: standard
Environment Variables
| Variable | Description |
|---|---|
| `ASB_ENDPOINT` | OpenAI-compatible endpoint URL |
| `ASB_MODEL` | Model name |
| `ASB_API_KEY` | API key for the endpoint |
| `ASB_CONTEXT_TOKENS` | Default context size in tokens |
| `ASB_CONTEXT_PROFILE` | Default context profile |
| `ASB_MODEL_CONTEXT_LENGTH` | Model's max context window - skips larger scenarios |
Architecture
agentic-swarm-bench/
agentic_swarm_bench/
cli.py ← Click CLI (asb record | replay | speed | agent | eval | ...)
config.py ← Config: CLI > env > YAML > defaults
scenarios/
recorder.py ← Recording proxy: captures real sessions as JSONL recordings
player.py ← Replay engine: replays scenarios against any endpoint
registry.py ← Load/list/resolve scenarios (file path or built-in name)
schedule.py ← Execution schedule: repetitions, concurrency, ordering policy
poison.py ← Prefix-cache poisoning hooks (uses pre-processed recordings)
data/
trivial-qa/ ← Non-agentic baseline (5 single-turn Q&A tasks, with evaluations)
js-coding-opus/ ← Agentic JS coding sessions (5 multi-turn tasks)
tasks/
tasks.json ← 110 agentic swarm tasks, P1-P110
registry.py ← Load/filter tasks by tier, range, tags, language
context/
codebase_context.py ← Agentic session context: tool schemas, file contents, conversation turns
runner/
direct.py ← Speed mode: direct endpoint benchmark with agentic context
eval_runner.py ← Eval mode: code correctness validation
claude_code.py ← Agent mode: Claude Code orchestration through recording proxy
proxy/
server.py ← Agent-mode proxy (FastAPI) - Anthropic ↔ OpenAI translation
padding.py ← Context padding for proxy mode
translators.py ← API format translation
metrics/
collector.py ← Per-request metrics: TTFT, tok/s, ITL, thinking tokens
stats.py ← Statistical analysis (p50, p95, p99, distributions)
report/
markdown.py ← Markdown report: verdict, insights, grades, ASCII charts
Contributing
We welcome contributions! Here's how to get started:
git clone https://github.com/swarmone/agentic-swarm-bench.git
cd agentic-swarm-bench
# With uv (recommended)
uv sync --all-extras
uv run pytest tests/ -v
# Or with pip
pip install -e ".[dev,proxy]"
make test
Development
make lint # Check code style
make format # Auto-format
make test # Run tests
Adding tasks
Tasks are defined in agentic_swarm_bench/tasks/tasks.json. Each task has:
- `id`: P1 through P110
- `tier`: trivial, easy, medium, hard, expert
- `prompt`: the agentic swarm task
- `tags`: categorization (language, domain)
- `max_output_tokens`: token limit for the response
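An illustrative entry (the values are made up; the field names match the list above):

```json
{
  "id": "P11",
  "tier": "easy",
  "prompt": "Write a Python function that parses a CSV file and returns each row as a dict.",
  "tags": ["python", "parsing"],
  "max_output_tokens": 512
}
```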
License
Apache 2.0 - see LICENSE.
Built by SwarmOne