agentic-swarm-bench

Open-source benchmark for LLM inference on agentic scenarios

These details have not been verified by PyPI

Project links

Project description

AgenticSwarmBench

The open-source benchmark for LLM inference under agentic scenarios
Created by - the AI-native cloud for agentic scenarios

Quick Start • Why Agentic Swarm • Built-in Scenarios • Record & Replay • Modes • Tasks • Context Control • Cache Poisoning • Reports • Docker

AgenticSwarmBench demo

Why Agentic Swarm?

When Claude Code opens a file, reads 2,000 lines, edits three functions, runs tests, and reads the error output - that's 5+ LLM round-trips with 40-100K token contexts growing each turn. Every turn adds tool results, file contents, and error traces to the conversation.

No existing benchmark simulates this.

SWE-bench measures model quality on GitHub issues. It doesn't measure inference speed.
LMSys / Chatbot Arena measures chatbot throughput at ~2K context. Agentic swarm contexts are 20-80x larger.
Generic LLM benchmarks send uniform requests. Agentic swarm scenarios have system prompts with tool schemas, multi-turn history, code files, and growing context windows.

AgenticSwarmBench fills that gap - it benchmarks your LLM serving stack under the exact access patterns that Claude Code, Cursor, Windsurf, and Copilot generate.

What makes it different
Record & replay	The headline feature. Capture real coding sessions as replayable JSONL scenarios, then benchmark them against any endpoint. Your actual traffic, your actual context patterns - no synthetic approximation needed.
Three benchmark modes	Record/replay (your real sessions), speed (synthetic load), agent (real multi-turn) - plus reporting and comparison
Agentic swarm context	Pads requests with real-looking agentic sessions - system prompts with tool definitions, prior conversation turns, code files, tool call results, error traces
Growing context simulation	Profiles simulate how context grows during a real coding session: fresh (6K) → short (20K) → medium (40K) → long (70K) → full (100K) → xl (200K) → xxl (400K)
Prefix cache poisoning	Built-in scenarios ship with pre-poisoned recordings that break prefix caching without adding artificial content, ensuring valid cold-start measurements for one pass per task
Reasoning token detection	Automatically detects thinking/reasoning tokens (DeepSeek R1, o3, Claude Extended Thinking) and reports thinking overhead vs visible output latency
110 agentic swarm tasks	5 difficulty tiers, 5 languages (Python, TypeScript, Rust, Go, SQL) - from single-function fixes to full-stack refactors
Docker one-liner	Point at any vLLM / SGLang / TGI / OpenAI-compatible endpoint and go

Quick Start

Install

uv pip install agentic-swarm-bench             # with uv (recommended)
pip install agentic-swarm-bench                # or with pip

uv pip install "agentic-swarm-bench[proxy]"    # add proxy support (for agent / record modes)

Quick smoke test (30 seconds)

Verify your endpoint works before doing anything else - replay the built-in trivial-qa scenario:

asb replay -e http://your-server:8000 -m your-model --scenario trivial-qa

This fires 5 trivial single-turn requests (~20 tokens each) and reports TTFT and tok/s. If numbers come back, your setup works. Then try the real thing:

asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus

This replays 5 multi-turn agentic coding sessions (REST API, CLI tool, WebSocket chat, audit trail, search/batch) recorded with Claude Opus 4.6 - growing context from ~1K to ~40K chars across 4 turns each.

Record a real session, then replay it anywhere

The fastest way to get meaningful numbers: record what you actually do, then replay it.

# 1. Start the recording proxy
asb record -e http://your-gpu-server:8000 -m your-model

# 2. Point your agent at the proxy (runs on localhost:19000)
ANTHROPIC_BASE_URL=http://localhost:19000 claude

# 3. Do your normal work. Ctrl+C when done. You now have a .jsonl recording.

# 4. Replay that session against any endpoint
asb replay -e http://new-server:8000 -m my-model --scenario my-session.jsonl

This captures your real context patterns, real token counts, and real multi-turn behavior - then lets you A/B test endpoints with your actual workload.

Run a synthetic speed test

If you don't have a recording yet, the speed mode generates realistic agentic context synthetically:

# Quick speed test - 1 and 8 concurrent agents at fresh (6K) context
asb speed \
  --endpoint http://localhost:8000 \
  --model my-model \
  --suite quick

# Full suite with report - sweeps all context sizes and concurrency levels
asb speed \
  --endpoint http://localhost:8000 \
  --model my-model \
  --suite full \
  --output report.md

asb is the short alias. agentic-swarm-bench also works.

Endpoint URL: Pass any URL. If it doesn't end with /v1/chat/completions, the path is appended automatically. Both of these work:

asb speed -e http://localhost:8000 -m my-model
asb speed -e https://api.example.com/v1/chat/completions -m my-model

Authentication: By default, --api-key is sent as Authorization: Bearer <key>. If your endpoint uses a different header:

asb speed -e URL -m MODEL -k MY_KEY --api-key-header X-API-Key

Dry run: Preview what will be sent without making requests:

asb speed -e URL -m MODEL --dry-run

Note: Some inference endpoints may not return detailed error messages on failure. Use --dry-run to validate your configuration before running a full benchmark.

Docker

docker run --rm -v $(pwd)/results:/results \
  swarmone/agentic-swarm-bench speed \
  --endpoint http://host.docker.internal:8000 \
  --model my-model \
  --suite quick \
  --output /results/report.md

Built-in Scenarios

Two ready-made scenarios ship with the package so you can benchmark immediately - no recording needed:

Scenario	Type	Tasks	Turns/task	Context	What it measures
`trivial-qa`	Non-agentic baseline	5	1	~20 tokens each	Raw single-turn speed (TTFT, tok/s)
`js-coding-opus`	Real agentic sessions	5	4	~1K → ~40K chars	Multi-turn agentic performance with growing context

# List all built-in scenarios
asb list-scenarios

# Quick smoke test - 5 trivial questions, ~20 tokens each
asb replay -e http://your-server:8000 -m your-model --scenario trivial-qa

# Real agentic workload - 5 JS coding sessions recorded with Claude Opus 4.6
asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus

# Replay a single task from a scenario
asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus --task build-rest-api

# Run multiple repetitions for stable numbers
asb replay -e http://your-server:8000 -m your-model --scenario js-coding-opus --repetitions 3

trivial-qa - Five trivial single-turn questions (capital of France, largest planet, boiling point of water, speed of light, binary conversion). Non-agentic baseline with minimal context. Useful as a quick smoke test and for comparing agentic vs non-agentic performance on the same endpoint.

js-coding-opus - Five independent JavaScript coding sessions (rate limiting middleware, CLI admin tool, WebSocket real-time updates, activity log/audit trail, search & batch operations). Each task has 4 turns of real multi-turn conversation with growing context. Recorded with Claude Opus 4.6 against a TaskFlow API project.

Benchmark Modes

┌───────────────────────────────────────────────────────────────────────────────┐
│                             AgenticSwarmBench                                 │
├──────────────────────┬──────────────────┬─────────────────────────────────────┤
│  asb record / replay │  asb speed       │  asb agent                          │
│  ★ recommended       │                  │                                     │
│                      │  Synthetic       │  Runs Claude Code (or any agent)    │
│  Capture YOUR real   │  agentic context │  end-to-end with benchmark tasks    │
│  coding sessions as  │  → endpoint      │  through a metrics proxy            │
│  JSONL, then replay  │                  │                                     │
│  against any endpoint│  1 request per   │  5-15 real requests per task        │
│                      │  measurement     │  with tool use, file I/O,           │
│  Real multi-turn     │                  │  growing context                    │
│  conversations with  │  Measures:       │                                     │
│  your actual context │  TTFT, tok/s     │  Measures:                          │
│                      │  ITL, prefill    │  Multi-turn latency compounding     │
│  Measures:           │                  │  Context growth over a session      │
│  Same metrics, but   │                  │                                     │
│  from YOUR real data │                  │                                     │
└──────────────────────┴──────────────────┴─────────────────────────────────────┘

Scenario Recording & Replay

This is the most valuable way to benchmark. Synthetic load tells you what an endpoint can do in theory. Record/replay tells you what it actually does with your traffic. Record a real coding session once, then replay that exact sequence of requests against any endpoint, hardware config, or model - same context, same token counts, same multi-turn patterns.

Why this matters: agentic sessions have a unique shape. Context starts small and grows unpredictably. Some turns are tiny follow-ups; others dump 20K tokens of file contents. Synthetic benchmarks can approximate this, but a recording captures the real thing.

`asb record` - Capture a Real Session

Starts a recording proxy between your agent and your LLM endpoint. Every request/response pair is saved as a JSONL recording:

# Record with an OpenAI-compatible upstream
asb record \
  -e http://your-gpu-server:8000 \
  -m your-model

# Record with Anthropic (auto-detected from URL)
asb record \
  -e https://api.anthropic.com \
  -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY \
  --api-key-header x-api-key \
  -o my-session.jsonl

# Custom output file and port
asb record \
  -e http://your-gpu-server:8000 \
  -m your-model \
  -o my-session.jsonl \
  -P 9000

Then point Claude Code at the proxy:

ANTHROPIC_BASE_URL=http://localhost:19000 claude

The recorder supports two upstream modes:

OpenAI-compatible (default): translates Anthropic Messages API → OpenAI format before forwarding
Anthropic passthrough: forwards requests natively to Anthropic's API - no translation, full fidelity. Auto-detected when the endpoint is api.anthropic.com, or set explicitly with --upstream-api anthropic.

Both modes save the recording in OpenAI format for replay. Stop with Ctrl+C when done.

`asb replay` - Replay Against Any Endpoint

Take a recorded scenario and replay it against a different endpoint, hardware, or configuration:

# Replay a session against a new endpoint
asb replay \
  -e http://new-server:8000 \
  -m my-model \
  --scenario my-session.jsonl

# Replay a scenario directory with schedule
asb replay \
  -e http://new-server:8000 \
  -m my-model \
  --scenario ./scenarios/my-scenario/ \
  --repetitions 3 --max-concurrent 5 --policy sequential

# Replay a built-in scenario (recordings include cache-defeat treatments)
asb replay -e URL -m MODEL --scenario scenario

# Preview without sending requests
asb replay -e URL -m MODEL --scenario session.jsonl --dry-run

# Replay just the beginning of a session (up to 1M cumulative prompt tokens)
asb replay -e URL -m MODEL --scenario session.jsonl --slice-tokens 1000000

Scheduling: Control how tasks execute with --repetitions, --max-concurrent, and --policy (round_robin, sequential).

Cache defeat: Built-in scenarios include cache-defeat treatments in their recordings. Results are valid for one pass per task. User-recorded scenarios (asb record output) replay as-is with no cache-defeat treatment. See Prefix Cache Poisoning for details.

History mode: The default (--history-mode live) captures the server's actual responses during streaming and feeds them into the next turn's conversation history. This is essential for correct prefix-cache measurement when replaying against a model different from the one that made the recording - without it, recorded assistant messages from the original model cause KV-cache prefix mismatches on every turn. Use --history-mode recorded for the legacy behavior of sending each entry's recorded messages verbatim.

Slicing scenarios: Real sessions grow from small contexts to large ones. --slice-tokens N replays requests from the start until cumulative prompt tokens reach N.

Output modes: --verbose (-V) shows a Rich live-updating table with per-task progress, phase, request counts, and decode tok/s.

Early abort: --max-consecutive-failures N stops the entire run if any worker slot hits N consecutive failures (HTTP errors, timeouts). Useful when pointing at an endpoint that may be down or producing garbage:

asb replay -e URL -m MODEL --scenario js-coding-opus --max-consecutive-failures 5

`asb list-scenarios` - Browse Built-in Scenarios

asb list-scenarios
asb list-scenarios --format json

`asb speed` - Inference Speed Under Agentic Load

When you don't have a recording yet, or want to test at specific context sizes and concurrency levels, asb speed generates realistic agentic context synthetically. Each request is padded so the model sees what it would see in a real coding session - system prompts with tool schemas, multi-turn conversation history, file contents, and error traces:

┌─ system ───────────────────────────────────────────────────────────────────┐
│ "You are an expert software engineer assistant integrated into a code      │
│  editor. You have access to the user's full project codebase..."           │
└────────────────────────────────────────────────────────────────────────────┘
┌─ user ─────────────────────────────────────────────────────────────────────┐
│ <tool name="Read">...</tool>              ← tool definitions (Read, Write, │
│ <tool name="Write">...</tool>               Edit, Bash, Grep, etc.)        │
│                                                                            │
│ <user_turn> Review src/auth/middleware.py  ← synthetic prior conversation  │
│   ```python                                 turns with code files, error   │
│   def handle_request(...)                   traces, and assistant replies  │
│   ```                                       (repeated to fill target       │
│ </user_turn>                                 context size)                 │
│ <assistant_turn> I can see the issue...                                    │
│ </assistant_turn>                                                          │
│                                                                            │
│ ---                                                                        │
│ Based on the codebase above, <task prompt from tasks.json>                 │
└────────────────────────────────────────────────────────────────────────────┘

The task prompt (e.g. "Write a Python function that takes a list of integers and returns the largest one") comes from the 110 built-in tasks. The padding around it is what makes this an agentic benchmark - it simulates the accumulated context of a real coding session.

# Default: sweeps context sizes from fresh (6K) → full (100K)
asb speed -e http://localhost:8000 -m my-model

# Specific concurrency (32 concurrent agents) at long context
asb speed -e http://localhost:8000 -m my-model -u 32 -p long

# Fixed token count - stress test at exactly 50K tokens
asb speed -e http://localhost:8000 -m my-model -c 50000 -u 16

# Cap max users - run a full suite but limit concurrency to 16
asb speed -e http://localhost:8000 -m my-model --suite full --max-users 16

# Measure prefix cache impact - runs allcold then allwarm
asb speed -e http://localhost:8000 -m my-model --cache-mode realistic

# JSON-only output (for CI/CD pipelines)
asb speed -e http://localhost:8000 -m my-model --format json -o results.json

Metrics: TTFT, decode tok/s per user, prefill tok/s, ITL (p50/p95/p99), aggregate throughput, reasoning token overhead. When the endpoint returns prompt_tokens in the response, actual token counts are shown alongside estimates.

`asb agent` - End-to-End Agent Benchmark

The other modes measure individual requests. asb agent measures what it feels like to use an endpoint - it runs a real agent process (Claude Code by default) end-to-end and records timing for every LLM call across the entire multi-turn session.

Here's what a single task run looks like:

You run:    asb agent -e http://localhost:8000 -m my-model -t p1-p10

What happens for each task:

  ┌─────────────┐         ┌─────────────────┐         ┌──────────────┐
  │ Claude Code │ ──────► │  ASB proxy      │ ──────► │ Your endpoint│
  │ (real agent)│ ◄────── │  (translates    │ ◄────── │ (vLLM, etc.) │
  │             │         │   Anthropic →   │         │              │
  │ reads files │         │   OpenAI, logs  │         │              │
  │ writes code │         │   per-request   │         │              │
  │ runs tests  │         │   timing)       │         │              │
  │ iterates    │         │                 │         │              │
  └─────────────┘         └─────────────────┘         └──────────────┘

  Turn 1:  Claude reads the task           →  6K context   →  TTFT 200ms
  Turn 2:  Claude reads 3 files            →  25K context  →  TTFT 800ms
  Turn 3:  Claude writes code              →  35K context  →  TTFT 1.2s
  Turn 4:  Claude runs tests, gets errors  →  50K context  →  TTFT 2.1s
  Turn 5:  Claude fixes the code           →  60K context  →  TTFT 3.5s
  Turn 6:  Claude runs tests again         →  70K context  →  TTFT 4.8s
  ...

This captures latency compounding over a real session. Each turn's context naturally grows because it includes prior turns, file contents, tool outputs, and error traces. The proxy records TTFT, tok/s, and context size for every request.

asb agent -e http://localhost:8000 -m my-model -t p1-p10

# Use a different agent (any CLI that accepts a prompt)
asb agent -e http://localhost:8000 -m my-model -t p1-p10 --agent-cmd my-agent

Record/Replay vs Speed vs Agent:

	`record` / `replay`	`speed`	`agent`
What talks to your endpoint	You during `record`, ASB during `replay`	ASB directly (one synthetic request)	A real agent (Claude Code) through a proxy
Number of requests per task	Whatever the real session had	1	5-15+ (real tool-use turns)
Context	Your actual session context	Synthetic padding to target size	Grows naturally as the agent works
Use case	Benchmark with your real traffic	Raw throughput at controlled sizes	"What does it feel like to use this endpoint?"

`asb eval` - Code Correctness (experimental)

Optional mode that sends the same tasks with agentic context, but validates the generated code instead of measuring speed. Useful for checking if your model still produces correct code under large-context pressure.

asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax      # does it parse?
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution   # does it run?

`asb list-tasks` - Browse Available Tasks

asb list-tasks                        # Show all 110 tasks
asb list-tasks -t trivial             # Filter by tier
asb list-tasks --tags typescript,rust  # Filter by language
asb list-tasks --format json          # JSON output

The 110 Tasks

Tasks simulate real agentic coding scenarios across 5 difficulty tiers and 5 languages:

Tier	Range	What it simulates
1 - Trivial	P1-P10	Quick fixes: rename a variable, add a type hint, write a one-liner
2 - Easy	P11-P25	Single-file tasks: implement a function, write a CLI tool, parse a file
3 - Medium	P26-P50	Multi-function work: build an API endpoint, write tests, refactor a module
4 - Hard	P51-P75	Complex tasks: networking, concurrency, database queries, full programs
5 - Expert	P76-P100	Real-world projects: multi-file apps, distributed systems, full-stack
Multi-lang	P101-P110	TypeScript, Rust, Go, SQL tasks across all difficulty levels

Languages: Python (P1-P100), TypeScript, Rust, Go, SQL (P101-P110). Filter with --tags typescript,rust,go.

Tasks define what to generate. Context size is controlled separately - so you can benchmark a trivial fix inside a massive 100K-token coding session.

Context Control

Context size simulates where you are in a real coding session:

Profile	Tokens	What it simulates
`fresh`	~6K	Just opened the project - system prompt + first question
`short`	~20K	A few turns in - read a couple files, made one edit
`medium`	~40K	Mid-session - several file reads, tool calls, error traces
`long`	~70K	Deep session - many edits, test runs, debugging cycles
`full`	~100K	Long session approaching context limit - everything accumulated
`xl`	~200K	Extended session - large codebases, long test output, multi-file edits
`xxl`	~400K	Maximum depth - for models with 400K+ context windows
`realistic`	Mixed	Sweeps fresh → full (default) - simulates a full session lifecycle

Every request is padded with content that looks like a real agentic coding session:

System prompt with tool schemas (Read, Write, Edit, Bash, Grep, etc.)
Prior conversation turns with file contents
Tool call results and error traces
Growing context that mimics how sessions actually evolve

# Simulate a deep coding session (70K context)
asb speed -e URL -m MODEL --context-profile long

# Long-context models: test at 200K or 400K
asb speed -e URL -m MODEL --context-profile xl
asb speed -e URL -m MODEL --context-profile xxl

# Exact token count
asb speed -e URL -m MODEL --context-tokens 50000

# Default: sweeps fresh → short → medium → long → full
asb speed -e URL -m MODEL

Model Context Window

Use --model-context-length to tell the benchmark your model's maximum context window. Any profiles that exceed it are automatically skipped:

# Model supports up to 128K - xl (200K) and xxl (400K) are skipped automatically
asb speed -e URL -m MODEL --suite full --model-context-length 128000

# Model supports 400K - run everything including xxl
asb speed -e URL -m MODEL --context-profile xxl --model-context-length 400000

This is useful when running suites or realistic sweeps against models with different context limits - no need to manually pick a profile.

Prefix Cache Poisoning

LLM inference engines cache the KV state of common prefixes so repeated requests skip prefill. This makes benchmarks look artificially fast - you're measuring cache hits, not real inference.

AgenticSwarmBench defeats the prefix cache using pre-poisoned recordings. Each task in a built-in scenario has a unique text treatment applied offline that invalidates the KV cache without altering the semantic content the model sees.

This mimics what actually happens in real coding sessions: when an agent edits a file mid-conversation, the context changes from the edit point onward, breaking the cache naturally.

Built-in scenarios ship with pre-poisoned recordings. Each task has a unique treatment, so results are valid for one pass per task.
User-recorded scenarios (asb record output) replay as-is with no cache-defeat treatment.

Reasoning Token Detection

Models like DeepSeek R1, o3, and Claude Extended Thinking produce thinking tokens before visible output. AgenticSwarmBench automatically detects reasoning_content in the streaming response and reports:

Metric	Description
TTFT (thinking)	Time to first reasoning token
TTFT (visible)	Time to first visible output token
Thinking overhead	Extra latency from reasoning before visible output
Thinking tokens	Count of reasoning tokens generated

This is critical for agentic swarm scenarios - a reasoning model that takes 5 seconds to "think" before emitting code changes the UX of the entire editing session.

What Good Looks Like

Reference ranges from common setups (your numbers will vary by hardware, model size, and serving stack):

Setup	Context	Users	TTFT	Tok/s/user	Notes
vLLM on 1x A100 (80GB), 7B model	6K	1	~100ms	~80-120	Baseline: fast model, short context
vLLM on 1x A100 (80GB), 7B model	40K	8	~2-4s	~20-40	Typical agentic scenario
vLLM on 1x A100 (80GB), 7B model	100K	32	~8-15s	~5-15	Stress test
SGLang on 1x H100, 70B model	6K	1	~200ms	~40-60	Larger model, faster GPU
SGLang on 1x H100, 70B model	40K	8	~3-6s	~10-25	Agentic sweet spot
API provider (e.g. Together, Fireworks)	40K	8	~2-8s	~15-40	Varies by provider/load

Rules of thumb for agentic swarm scenarios:

TTFT < 3s at 40K context → responsive editing experience
Tok/s > 30/user → code appears to stream smoothly
TTFT < 10s at 100K context → acceptable for deep sessions
Agg tok/s scales sub-linearly with users - expect ~60-70% efficiency at 8x concurrency

A note on ITL: Inter-Token Latency measures the gap between SSE data: lines as received by the client. Very low values (< 1ms) typically reflect HTTP/TCP buffering, not actual token generation speed. Use tok/s as the primary throughput metric; ITL is best for relative comparisons across scenarios.

Reports

Reports are designed to answer one question fast: is this endpoint good enough for agentic swarm scenarios?

Every report includes:

Verdict - 🟢 GOOD / 🟡 MARGINAL / 🔴 POOR for agentic swarm scenarios, with the key numbers
Key Findings - auto-generated insights: TTFT scaling ratio, throughput range, concurrency efficiency, thinking overhead
Summary table - TTFT, tok/s, ITL at each concurrency level and context size, with color-coded grade icons per row
What This Means for Agentic Swarm - maps raw metrics to user experience ("instant response", "smooth streaming", "sluggish, frustrating")
Context Scaling chart - ASCII chart showing how TTFT and tok/s change as context grows
Concurrency Scaling - efficiency percentages at each concurrency level with color grades
Per-profile breakdown - detailed numbers per context size
Reasoning token analysis - thinking overhead when using reasoning models
Methodology - what was measured, how, and what the grade thresholds mean

The CLI also prints a final verdict line after every benchmark:

  Verdict: GOOD for agentic swarm scenarios at medium context

Compare two runs (includes head-to-head table, ASCII bar chart, and winner summary):

asb compare --baseline a.json --candidate b.json -o comparison.md

Docker

Build

docker build -t swarmone/agentic-swarm-bench .

Run

# Speed benchmark
docker run --rm -v $(pwd)/results:/results \
  swarmone/agentic-swarm-bench speed \
  -e http://host.docker.internal:8000 \
  -m my-model --suite full \
  -o /results/report.md

# Recording proxy for agentic mode
docker-compose up proxy

Docker Compose

export ASB_ENDPOINT=http://your-gpu-server:8000
export ASB_MODEL=your-model-name

docker-compose run agentic-swarm-bench

Configuration

AgenticSwarmBench merges configuration from four sources (highest priority first):

CLI arguments - --endpoint, --model, --context-tokens, etc.
Environment variables - ASB_ENDPOINT, ASB_MODEL, etc.
YAML config file - asb --config bench.yml speed ...
Defaults - sensible defaults for everything

YAML Config Example

# bench.yml
endpoint: http://my-gpu-server:8000
model: my-model
suite: standard

Environment Variables

Variable	Description
`ASB_ENDPOINT`	OpenAI-compatible endpoint URL
`ASB_MODEL`	Model name
`ASB_API_KEY`	API key for the endpoint
`ASB_CONTEXT_TOKENS`	Default context size in tokens
`ASB_CONTEXT_PROFILE`	Default context profile
`ASB_MODEL_CONTEXT_LENGTH`	Model's max context window - skips larger scenarios

Architecture

agentic-swarm-bench/
  agentic_swarm_bench/
    cli.py              ← Click CLI (asb record | replay | speed | agent | eval | ...)
    config.py           ← Config: CLI > env > YAML > defaults

    scenarios/
      recorder.py       ← Recording proxy: captures real sessions as JSONL recordings
      player.py         ← Replay engine: replays scenarios against any endpoint
      registry.py       ← Load/list/resolve scenarios (file path or built-in name)
      schedule.py       ← Execution schedule: repetitions, concurrency, ordering policy
      poison.py         ← Prefix-cache poisoning hooks (uses pre-processed recordings)
      data/
        trivial-qa/     ← Non-agentic baseline (5 single-turn Q&A tasks, with evaluations)
        js-coding-opus/ ← Agentic JS coding sessions (5 multi-turn tasks)

    tasks/
      tasks.json        ← 110 agentic swarm tasks, P1-P110
      registry.py       ← Load/filter tasks by tier, range, tags, language
      context/
        codebase_context.py ← Agentic session context: tool schemas, file contents, conversation turns

    runner/
      direct.py         ← Speed mode: direct endpoint benchmark with agentic context
      eval_runner.py    ← Eval mode: code correctness validation
      claude_code.py    ← Agent mode: Claude Code orchestration through recording proxy

    proxy/
      server.py         ← Agent-mode proxy (FastAPI) - Anthropic ↔ OpenAI translation
      padding.py        ← Context padding for proxy mode
      translators.py    ← API format translation

    metrics/
      collector.py      ← Per-request metrics: TTFT, tok/s, ITL, thinking tokens
      stats.py          ← Statistical analysis (p50, p95, p99, distributions)

    report/
      markdown.py       ← Markdown report: verdict, insights, grades, ASCII charts

Contributing

We welcome contributions! Here's how to get started:

git clone https://github.com/swarmone/agentic-swarm-bench.git
cd agentic-swarm-bench

# With uv (recommended)
uv sync --all-extras
uv run pytest tests/ -v

# Or with pip
pip install -e ".[dev,proxy]"
make test

Development

make lint      # Check code style
make format    # Auto-format
make test      # Run tests

Adding tasks

Tasks are defined in agentic_swarm_bench/tasks/tasks.json. Each task has:

id: P1 through P110
tier: trivial, easy, medium, hard, expert
prompt: the agentic swarm task
tags: categorization (language, domain)
max_output_tokens: token limit for the response

License

Apache 2.0 - see LICENSE.

Built by SwarmOne

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

4.0.4

Apr 30, 2026

This version

4.0.3

Apr 22, 2026

4.0.2

Apr 22, 2026

4.0.1

Apr 21, 2026

4.0.0

Apr 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentic_swarm_bench-4.0.3.tar.gz (15.2 MB view details)

Uploaded Apr 22, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

agentic_swarm_bench-4.0.3-py3-none-any.whl (13.6 MB view details)

Uploaded Apr 22, 2026 Python 3

File details

Details for the file agentic_swarm_bench-4.0.3.tar.gz.

File metadata

Download URL: agentic_swarm_bench-4.0.3.tar.gz
Upload date: Apr 22, 2026
Size: 15.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for agentic_swarm_bench-4.0.3.tar.gz
Algorithm	Hash digest
SHA256	`9e1e0447134751457268cb18461f37259db5dda76345aa44b95a5a8fda48e267`
MD5	`cdf181a117c3a589922983d9126cb973`
BLAKE2b-256	`516ddb82e59d65e870812427f0d9e0fa0d1181c6dd948c00083232a4d4aa859c`

See more details on using hashes here.

File details

Details for the file agentic_swarm_bench-4.0.3-py3-none-any.whl.

File metadata

Download URL: agentic_swarm_bench-4.0.3-py3-none-any.whl
Upload date: Apr 22, 2026
Size: 13.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for agentic_swarm_bench-4.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`88c419de927fdcfc6917409c1573299046c3d3fe0aefd3ea582f86221fd6945e`
MD5	`35e5faba06897624a6b3d705195548b7`
BLAKE2b-256	`933b941afed2d14531592173d2357febb9917400bda1dcca3a381845759b9a99`

See more details on using hashes here.

agentic-swarm-bench 4.0.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Why Agentic Swarm?

Quick Start

Install

Quick smoke test (30 seconds)

Record a real session, then replay it anywhere

Run a synthetic speed test

Docker

Built-in Scenarios

Benchmark Modes

Scenario Recording & Replay

asb record - Capture a Real Session

asb replay - Replay Against Any Endpoint

asb list-scenarios - Browse Built-in Scenarios

asb speed - Inference Speed Under Agentic Load

asb agent - End-to-End Agent Benchmark

asb eval - Code Correctness (experimental)

asb list-tasks - Browse Available Tasks

The 110 Tasks

Context Control

Model Context Window

Prefix Cache Poisoning

Reasoning Token Detection

What Good Looks Like

Reports

Docker

Build

Run

Docker Compose

Configuration

YAML Config Example

Environment Variables

Architecture

Contributing

Development

Adding tasks

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`asb record` - Capture a Real Session

`asb replay` - Replay Against Any Endpoint

`asb list-scenarios` - Browse Built-in Scenarios

`asb speed` - Inference Speed Under Agentic Load

`asb agent` - End-to-End Agent Benchmark

`asb eval` - Code Correctness (experimental)

`asb list-tasks` - Browse Available Tasks