# AI Workflow Benchmark (AWB)

**Measure AI coding tool+workflow performance, not just model capability.**
Install from PyPI, validate 100 tasks, run vanilla vs custom, get capability profiles and improvement suggestions.
## Why This Exists
SWE-bench tests models. AWB tests workflows. The same model running vanilla Claude Code vs. a purpose-built setup with a tuned CLAUDE.md, hooks, and structured agents produces meaningfully different results on real engineering tasks. No existing benchmark captures that gap — they all evaluate the model in isolation.
AWB benchmarks the full stack: tool + configuration + workflow + model, together, on 100 tasks drawn from real open-source repositories.
## Quick Start

```bash
pip install awb

awb quickstart                                        # verify your setup
awb warmup                                            # pre-build workspace templates (one-time, ~5 min)
awb run --fast-check claude-code-custom               # 8 tasks, ~15 min, ~$4 (quick signal)
awb run --progressive --adaptive claude-code-custom   # full suite with early exit + smart re-runs
awb gap results/runs/<run_dir>/                       # analyze capability gaps
```
**New in v1.1.0:** `awb warmup` caches workspaces for 10-30x faster setup. `--fast-check` gives a quick signal in 15 min for ~$4. `--progressive` stops early on weak tools. `--use-uv` swaps pip for uv. See Execution Modes below.
## How It Works

```
Clone repo at pinned SHA
  → Run setup commands
  → Capture baseline lint/security counts
  → Execute tool with task prompt
  → Run test suite + partial credit rubric
  → Sigmoid-normalize 7 metrics
  → Produce weighted composite + capability profile
```
Each task starts from a fresh git clone at a pinned commit. Every tool gets the same prompt, the same timeout, and the same verification suite. Results are scored with sigmoid normalization so scores are never negative and never collapse at the boundary.
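In code, the loop looks roughly like this sketch. The scoring helpers are placeholders for AWB's real implementations, and the adapter interface is the one shown under Adding Tools below:

```python
import subprocess
from pathlib import Path

def fresh_workspace(url: str, commit: str, dest: Path) -> Path:
    # Fresh clone at the pinned SHA — no state leaks between runs
    subprocess.run(["git", "clone", url, str(dest)], check=True)
    subprocess.run(["git", "checkout", commit], cwd=dest, check=True)
    return dest

async def run_task(task, adapter, dest: Path) -> float:
    ws = fresh_workspace(task.repo.url, task.repo.commit, dest)
    for cmd in task.repo.setup_commands:
        subprocess.run(cmd, shell=True, cwd=ws, check=True)
    baseline = capture_lint_and_security_counts(ws)    # placeholder: pre-change counts

    result = await adapter.execute(                    # same prompt/timeout for every tool
        prompt=task.prompt,
        workspace=ws,
        max_turns=task.constraints.max_iterations,
        timeout_seconds=task.constraints.timeout_seconds,
    )

    verification = run_tests_and_partial_credit(ws, task.verification)            # placeholder
    metrics = sigmoid_normalize(result, verification, baseline, task.difficulty)  # placeholder
    return weighted_composite(metrics)                 # placeholder: composite + profile
```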
## Scoring System
Seven dimensions, sigmoid-normalized with per-task baselines derived from difficulty:
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 55% | Pass/fail (60%) + partial credit rubric (40%) |
| Cost efficiency | 15% | Estimated USD per task |
| Speed | 10% | Wall-clock seconds vs. estimated task time |
| Code quality | 10% | Lint warning delta (pre vs. post) |
| Reliability | 5% | Pre-existing tests broken by the change |
| Security | 3% | New security issues introduced |
| Efficiency | 2% | Blend of iteration count and tokens-per-iteration |
Weight profiles (select with `load_weight_profile(name)`):

| Profile | Focus | Use When |
|---|---|---|
| `default` | Balanced | Standard evaluation |
| `correctness_focused` | 70% correctness | Research-grade rigor |
| `production` | 45% correctness, 20% cost, 10% reliability, 8% security | Shipping to users |
| `token_efficient` | 25% cost, 15% efficiency | Tight API budgets |
| `rate_limited` | 30% cost, 15% efficiency | Hitting TPM/RPM limits |
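Usage sketch (the import path here is an assumption; the canonical profile definitions live in `awb/scoring/weights.yaml`):

```python
# Illustrative only — check the package for the actual import path.
from awb.scoring import load_weight_profile

weights = load_weight_profile("production")  # 45% correctness, 20% cost, ...
```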
Sigmoid curve:

```
score = 100 / (1 + exp(k * (value - baseline)))
```

- Optimal performance (excellent) → ~95
- Baseline performance (adequate) → ~50
- Worse than baseline → smooth decay toward 0, never negative
Difficulty-weighted aggregation: hard tasks count 2.5×, medium 1.5×, easy 1.0×. A tool that solves hard tasks beats one that only solves easy ones even if the easy-task count is higher.
Per-task baselines by difficulty:
| Metric | Easy | Medium | Hard |
|---|---|---|---|
| Cost optimal / baseline | $0.05 / $0.30 | $0.20 / $1.00 | $1.00 / $3.00 |
| Speed | 50% / 100% of estimated_minutes | same | same |
| Iterations | 3 / max_iters | 8 / max_iters | 15 / max_iters |
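A runnable sketch of both steps, using the medium-difficulty cost baselines above. The steepness `k` is derived here so that the optimal value scores ~95; AWB's actual calibration may differ:

```python
import math

def sigmoid_score(value: float, baseline: float, optimal: float) -> float:
    """Map a lower-is-better metric (e.g. cost in USD) onto 0-100.

    k is chosen so value == optimal scores ~95 and value == baseline scores 50.
    """
    k = math.log(19) / (baseline - optimal)  # solves 100 / (1 + e^(k*(optimal-baseline))) = 95
    return 100 / (1 + math.exp(k * (value - baseline)))

# Medium-difficulty cost baselines from the table: optimal $0.20, baseline $1.00
print(round(sigmoid_score(0.20, baseline=1.00, optimal=0.20), 1))  # ~95.0
print(round(sigmoid_score(1.00, baseline=1.00, optimal=0.20), 1))  # 50.0
print(round(sigmoid_score(3.00, baseline=1.00, optimal=0.20), 1))  # ~0.1 — decays, never negative

# Difficulty-weighted aggregation: hard 2.5x, medium 1.5x, easy 1.0x
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 1.5, "hard": 2.5}

def aggregate(task_scores: list[tuple[str, float]]) -> float:
    """task_scores: [(difficulty, composite_score), ...] — weighted mean."""
    total = sum(DIFFICULTY_WEIGHTS[d] * s for d, s in task_scores)
    weight = sum(DIFFICULTY_WEIGHTS[d] for d, _ in task_scores)
    return total / weight
```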
## The 100 Tasks
Real open-source repos, pinned to release tag SHAs. Setup runs in under 15 seconds via venv + pip (Python) or npm (TypeScript).
| Category | Count | Easy / Med / Hard | What It Tests |
|---|---|---|---|
| bug-fix | 12 | 7 / 1 / 4 | Root cause analysis, test-first diagnosis, N+1 queries |
| feature-addition | 9 | 3 / 0 / 6 | Convention adherence, ambiguous requirements, Dockerfiles, TypeScript typing |
| refactoring | 11 | 5 / 2 / 4 | Multi-file consistency, O(n^2) optimization, CI/CD config, async migration |
| code-review | 9 | 4 / 2 / 3 | Security review (report-only), concurrency analysis, migration guides, OWASP |
| debugging | 10 | 7 / 0 / 3 | Performance profiling, regression bisection, stack trace diagnosis |
| multi-file | 7 | 4 / 0 / 3 | Merge conflicts, plugin systems, auth chains |
| legacy-code | 12 | 9 / 0 / 3 | SQLAlchemy 2.0 migration, 20-file codebase navigation, dead code removal |
| workflow | 30 | 9 / 12 / 9 | Completeness tracking, convention discovery, security methodology, context utilization, async safety, config extraction, test-driven implementation |
Repos used: FastAPI, httpx, Flask, Starlette, Click, Pydantic, SQLAlchemy 2.0, Hono
Task IDs:
BF-001–014 · FA-001–010 · RF-001–012 · CR-001–010 · DB-001–011 · MF-001–009 · LC-001–012 · WF-001–030
## Capability Profiles
Each task maps to 1–3 capabilities, producing a radar chart of tool strengths:
| Capability | Tasks | What It Measures |
|---|---|---|
| code_comprehension | 41 | Understanding existing code before modifying |
| framework_knowledge | 35 | Knowing API patterns (Pydantic v2, async SQLAlchemy, etc.) |
| bug_diagnosis | 26 | Structured root cause analysis, test-first diagnosis |
| refactoring_discipline | 26 | Changing code without breaking behavior |
| multi_file_reasoning | 23 | Coordinating changes across multiple files |
| completeness_tracking | 10 | Following all requirements, not stopping at 80% |
| convention_adherence | 10 | Discovering and following project conventions |
| context_discovery | 10 | Reading project docs and config before editing |
| test_writing | 10 | Writing correct, meaningful tests |
| security_awareness | 10 | Identifying and fixing vulnerabilities |
| security_methodology | 10 | Applying security checklists systematically |
| cost_discipline | derived | Token efficiency across all tasks |
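The roll-up logic is straightforward, as this sketch shows — group each task's composite score under its capability tags and average (names here are illustrative, not AWB's internals):

```python
from collections import defaultdict
from statistics import mean

def capability_profile(results: list[dict]) -> dict[str, tuple[float, int]]:
    """results: one {"capabilities": [...], "score": float} dict per task run."""
    by_cap: dict[str, list[float]] = defaultdict(list)
    for r in results:
        for cap in r["capabilities"]:      # each task tags 1-3 capabilities
            by_cap[cap].append(r["score"])
    # The sample size n is what drives the conf=high/med/low labels in `awb gap`
    return {cap: (round(mean(scores), 1), len(scores))
            for cap, scores in by_cap.items()}
```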
Example `awb gap` output:

```
Capability Profile
------------------
code_comprehension     ████████████████████ 82.4 (n=27, conf=high)
framework_knowledge    ████████████████░░░░ 68.1 (n=26, conf=high)
refactoring_discipline ████████████████░░░░ 65.3 (n=23, conf=high)
multi_file_reasoning   ████████████░░░░░░░░ 51.2 (n=20, conf=high)
bug_diagnosis          ███████████████░░░░░ 63.7 (n=17, conf=med)
test_writing           ██████████░░░░░░░░░░ 44.1 (n=8, conf=low)
security_awareness     █████████████░░░░░░░ 55.8 (n=8, conf=low)

Systematic Patterns
-------------------
- Fails 70%+ of multi_file_reasoning tasks → consider multi-agent workflows
- Token spend on failed hard tasks: $4.20 → add early-exit heuristics
- No failures on easy tasks → baseline is solid

Top Suggestions
---------------
1. Enable subagent mode for tasks spanning >3 files (impact: high)
2. Add repo-level CLAUDE.md with architecture overview (impact: medium)
3. Use --think flag for debugging tasks (impact: medium)
```
## Vanilla vs Custom
AWB ships two Claude Code adapters that run the same model with different configurations:
| Setting | Vanilla | Custom |
|---|---|---|
| Hooks | Disabled | Your full hook suite |
| Skills | Disabled | Your registered skills |
| Auto-memory | Disabled | Active |
| System prompt | Generic | Default (loads CLAUDE.md) |
Both use the same model, same API, same task prompts. The only difference is whether your workflow automation (hooks, skills, memory) is active. This isolates the contribution of workflow configuration from model capability.
## Workflow Lift Score

When `awb run` executes both vanilla and custom (the default), it produces a **Workflow Lift** — a single number measuring how much your workflow configuration improves over the baseline:

```
Workflow Lift: +4.2 pts (p=0.031, significant)
Pass rate: vanilla 62% vs custom 68%
Wins: custom 8 / vanilla 3 / ties 69

Where your workflow helps:
  bug diagnosis         +12.3 pts (17 tasks)
  multi file reasoning   +8.1 pts (20 tasks)
  security awareness     +5.4 pts (10 tasks)

Where it hurts:
  cost discipline        -4.2 pts (100 tasks)

Biggest task-level differences:
  BF-014  +40  (V=35 C=75)
  LC-012  +15  (V=65 C=80)
```
The lift is computed per-task (custom score minus vanilla score), averaged across all tasks, and tested for statistical significance. Capability-level breakdowns show where your workflow configuration actually helps vs. where it adds overhead.
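A minimal sketch of that computation, with the sign test done as an exact two-sided binomial test (function names are illustrative, not AWB's internals):

```python
from math import comb
from statistics import mean

def workflow_lift(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """pairs: one (vanilla_score, custom_score) tuple per task."""
    diffs = [custom - vanilla for vanilla, custom in pairs]
    lift = mean(diffs)

    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses                       # ties drop out of the sign test
    if n == 0:
        return lift, 1.0
    # Exact two-sided binomial p-value under H0: P(custom wins) = 0.5
    k = min(wins, losses)
    p = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return lift, p
```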
## CLI Reference
### `awb run` — Run benchmark tasks

```bash
awb run                              # all tools, all tasks, 3 runs (vanilla vs custom comparison)
awb run claude-code-custom           # single tool
awb run -t BF-001                    # single task
awb run --category legacy-code       # filter by category
awb run --difficulty hard            # filter by difficulty
awb run --capability bug_diagnosis   # filter by capability
awb run --runs 1 --dry-run           # preview without executing
awb run --resume                     # skip tasks with existing results
awb run --parallel -j 4              # run 4 tasks concurrently
awb run --adaptive                   # re-run near-miss tasks (60-99%) after initial pass
awb run --progressive                # easy → medium → hard, stop early if pass rate too low
awb run --fast-check                 # 8 representative tasks, 1 run (~15 min, ~$4)
awb run --use-uv                     # use uv instead of pip for 10-30x faster installs
```
### Execution Modes
AWB v1.1 ships four execution modes tuned for different evaluation scenarios:
| Mode | Tasks run | Wall clock | Token cost | Use when |
|---|---|---|---|---|
| Full suite | 300 (100 × 3 runs) | ~3 hrs | ~$150 | Final evaluation, publishing results |
| Full + adaptive | ~180 | ~1.5 hrs | ~$100 | Standard workflow, strong tools |
| Progressive | ~150 on weak tools | ~1 hr | ~$40-75 | Unknown/mediocre tools |
| Fast-check | 8 | ~15 min | ~$4 | PR gates, iterating on config |
### `awb warmup` — Pre-build workspace templates

```bash
awb warmup            # build templates for all 63 unique (repo, commit, setup) combos
awb warmup --dry-run  # show combos without building
awb warmup --clear    # reset template cache
awb warmup --use-uv   # use uv for faster initial builds
```
Workspace templates are cached at `~/.cache/awb/templates/`. The first build takes ~5 min; subsequent `awb run` invocations copy templates in ~2s instead of running `pip install` from scratch, cutting ~55 min off a full benchmark run with 74 FastAPI tasks.
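Conceptually, the cache works like this sketch — the key derivation and directory layout are assumptions, not AWB's actual implementation:

```python
import hashlib
import shutil
from pathlib import Path

CACHE = Path.home() / ".cache" / "awb" / "templates"

def template_key(repo_url: str, commit: str, setup_commands: list[str]) -> str:
    # Hash the (repo, commit, setup) combo that defines a unique template
    blob = "\n".join([repo_url, commit, *setup_commands]).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

def materialize(repo_url: str, commit: str, setup_commands: list[str], dest: Path) -> Path:
    template = CACHE / template_key(repo_url, commit, setup_commands)
    if template.exists():
        # Cache hit: a plain copy (~2s) replaces clone + pip install
        shutil.copytree(template, dest, symlinks=True)
    else:
        build_workspace(repo_url, commit, setup_commands, dest)  # placeholder: clone + setup
        shutil.copytree(dest, template, symlinks=True)           # seed the cache for next time
    return dest
```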
### `awb gap` — Capability gap analysis
Analyzes results to produce a capability radar, failure classification, systematic patterns, and ranked improvement suggestions.
### `awb compare` — Compare two runs
Side-by-side comparison of two benchmark runs with significance testing.
### `awb tools` — List adapters
Shows all registered tool adapters and their availability status.
### `awb validate` — Validate task YAMLs
Checks all 100 task YAML files against the schema, including partial credit sum-to-100 validation.
### `awb info` — Task details
Displays full details for a specific task including repo, capabilities, and partial credit rubric.
### `awb stability` — Score stability report
Per-task score variance across multiple runs. Flags unstable tasks for prompt clarification or tighter verification.
### `awb leaderboard` — Generate HTML leaderboard
Generates a static HTML site with Chart.js radar chart, CSV export, and historical run tracking.
### `awb calibrate-difficulty` — Recalibrate difficulty labels

Recalibrates task difficulty labels from empirical pass rates. Use `--apply` to write changes back to task YAMLs.
### `awb calibrate-timeouts` — Tighten timeouts

Recomputes task timeouts from empirical p95 wall-clock data. Use `--apply` to write changes.
### Other commands

| Command | Description |
|---|---|
| `awb quickstart` | Verify setup: tools available, tasks load |
| `awb export <run_dir> -o file.json` | Export results in submission format |
| `awb submit <file.json>` | Validate an external submission |
| `awb compare-submissions <a> <b>` | Cross-tool comparison with statistics |
| `awb migrate-results <old_dir>` | Convert v0.5.x results to v1.0 format |
| `awb workflow <subcommand>` | Export, validate, diff, or init descriptors |
| `awb --version` | Show version |
| `awb run --dry-run` | Preview tasks without executing |
## Adding Tasks

Tasks live in `awb/tasks/<category>/`. Copy `awb/tasks/_template.yaml`:
```yaml
id: BF-012
category: bug-fix
title: "Fix response_model silently dropping extra fields in FastAPI"
difficulty: easy
estimated_minutes: 15
languages: [python]
capabilities: [framework_knowledge, test_writing]
repo:
  url: "https://github.com/tiangolo/fastapi"
  commit: "628c34e0"
  setup_commands:
    - "python3 -m venv .venv && source .venv/bin/activate && pip install -e '.[all]'"
issue:
  description: |
    The endpoint's response_model silently strips extra fields...
  files_to_examine:
    - "fastapi/routing.py"
verification:
  test_commands:
    - "source .venv/bin/activate && python3 -m pytest tests/test_extra_fields.py -v"
  partial_credit:
    - criterion: "Uses Pydantic v2 ConfigDict"
      points: 50
      check: "grep -q 'ConfigDict' tests/test_extra_fields.py"
    - criterion: "Tests pass"
      points: 50
      check: "source .venv/bin/activate && python3 -m pytest tests/test_extra_fields.py -v"
constraints:
  max_iterations: 20
  timeout_seconds: 1800
```
Run `awb validate` to check your task before opening a PR. Full guide: CONTRIBUTING.md
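For reference, the sum-to-100 rubric check is simple enough to sketch (assumes PyYAML; function names are illustrative, not AWB's internals):

```python
# Sketch of the partial-credit sum-to-100 check that `awb validate` performs.
import sys

import yaml  # PyYAML

def check_partial_credit(task_path: str) -> bool:
    with open(task_path) as f:
        task = yaml.safe_load(f)
    rubric = task.get("verification", {}).get("partial_credit", [])
    total = sum(item["points"] for item in rubric)
    if rubric and total != 100:
        print(f"{task['id']}: partial credit sums to {total}, expected 100")
        return False
    return True

if __name__ == "__main__":
    ok = all(check_partial_credit(p) for p in sys.argv[1:])
    sys.exit(0 if ok else 1)
```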
## Supported Tools
| Adapter | Name | Status |
|---|---|---|
| Claude Code (vanilla) | `claude-code-vanilla` | Full |
| Claude Code (custom) | `claude-code-custom` | Full |
| Pi | `pi` | Full |
| Gemini CLI | `gemini-cli` | Full |
| Codex CLI | `codex-cli` | Full |
| Cursor | `cursor` | Planned |
| Aider | `aider` | Planned |
| Windsurf | `windsurf` | Planned |
| Copilot | `copilot` | Planned |

Run `awb tools` to see which are available in your environment.
## Adding Tools

Implement the `ToolAdapter` ABC in `awb/adapters/`. v1.0 adds four optional methods to the ABC:
```python
from pathlib import Path

from awb.adapters.base import ToolAdapter, ToolResult


class MyToolAdapter(ToolAdapter):
    name = "my-tool"
    display_name = "My Tool"

    async def execute(self, prompt: str, workspace: Path,
                      max_turns: int = 20, timeout_seconds: int = 1800,
                      on_event=None) -> ToolResult:
        ...  # on_event(event) callback for the streaming token monitor; return False to abort

    def check_available(self) -> bool:
        ...

    def get_config_hash(self) -> str:
        ...

    # Optional — implement to enable pre-flight auth checks
    def supports_auth_check(self) -> bool: ...
    def check_auth(self) -> tuple[bool, str]: ...

    # Optional — implement to enable streaming metrics
    def supports_streaming(self) -> bool: ...
    def get_model_pricing(self) -> dict[str, float]: ...
```
Register in `awb/adapters/registry.py` and add an entry point in `pyproject.toml`.
## External Submissions

Anyone can share results using the submission format defined in `results/submission-schema.json`:

```bash
awb run --runs 3
awb export results/runs/<run_dir>/ -o my-results.json
awb submit my-results.json              # validate locally
awb compare-submissions a.json b.json   # compare with significance testing
```
The format captures tool version, model, hardware class, and per-task run results. Hardware classes (e.g., `apple_m5_24gb`, `linux_x86_16gb`) enable fair speed comparisons: speed is only compared within the same tier.
## Statistical Framework
- Confidence intervals via t-distribution (no scipy required for core scoring)
- Significance testing via sign test for paired tool comparison
- Integrity checks: contamination detection (completions <10s flagged), variance anomalies (identical times/tokens across runs)
- Weight profiles: `default`, `correctness_focused`, `production`, `token_efficient`, `rate_limited` (see `awb/scoring/weights.yaml`)
- Stability metric: per-task `TaskStability(std_dev, score_range, is_unstable)`; high-variance tasks can be down-weighted in composite scoring
- Token efficiency: sigmoid normalizer (optimal = 2k tokens/iter, baseline = 15k) blended 50/50 with iteration count in the efficiency dimension
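For instance, the scipy-free confidence interval can be done with a small lookup table of two-sided 95% t critical values (an illustration, not necessarily AWB's exact implementation):

```python
from statistics import mean, stdev

# Two-sided 95% t critical values by degrees of freedom (standard table values)
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        10: 2.228, 20: 2.086, 30: 2.042, 60: 2.000, 120: 1.980}

def t_crit(df: int) -> float:
    # Nearest tabulated df at or below → slightly wider, i.e. conservative, CI
    return T_95[max(k for k in T_95 if k <= df)]

def confidence_interval_95(scores: list[float]) -> tuple[float, float]:
    n = len(scores)
    m, s = mean(scores), stdev(scores)
    half = t_crit(n - 1) * s / n ** 0.5
    return m - half, m + half
```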
## Changelog

### 1.1.0 (2026-04-07)
Performance and token optimization release. 33-50% faster full runs, ~97% cheaper quick evaluations.
- Workspace template cache — ~55 min saved on full runs (74 FastAPI tasks no longer re-run pip install)
- `awb warmup` — pre-build all unique workspace templates in parallel
- `--use-uv` — 10-30x faster pip installs via uv
- `--progressive` — easy → medium → hard execution, stops early on weak tools (50-80% token savings)
- `--fast-check` — 8 representative tasks, 1 run, ~15 min, ~$4 (97% cheaper than full suite)
- Token budget enforcement — `max_input_tokens`/`max_output_tokens` in task constraints, streaming kill switch
- Streaming token monitor — Claude Code adapter parses stream events as they arrive
- Parallel partial credit — independent grep/file checks run via `asyncio.gather`; pytest stays sequential
- Adaptive timeouts — runs 2+ tighten the timeout to `min(original, 2 * run1_actual)`
- Richer `RunCost` — cache_read, cache_creation, thinking token fields
- Token efficiency in scoring — efficiency dimension blends iterations + tokens-per-iteration
- Two new weight profiles — `token_efficient` and `rate_limited` for cost-sensitive evaluation
- Token-aware gap analysis — cost-per-point outliers, cache hit rate patterns, token burn detection
- JSONL results — additive output format alongside per-file JSON for fast batch loading
- 184 tests (up from 135)
### 1.0.9 (2026-04-04)
- Add Python 3.13 and 3.14 to CI test matrix and PyPI classifiers
### 1.0.8 (2026-04-04)
- Sync README changelog with PyPI long description; update GitHub repo description (80 → 100 tasks)
### 1.0.7 (2026-04-04)
Product audit fixes: 27 findings across observability, scoring, reliability, performance, and CLI safety.
- Observability: `--verbose` flag, test output logging, captured partial credit output, specific exception handlers, integrity checks in `awb run`
- Scoring: `SECURITY_METHODOLOGY` capability, signed lint delta, removed hardcoded `METRIC_WEIGHTS`, timeout calibrator can increase, leaderboard uses per-task aggregate scoring
- Reliability: `KeyboardInterrupt` handling, `load_single` None guard, `find_incomplete_run` scans all `_runN` dirs, 600s setup timeout, `return_exceptions` in gather, `finally` cleanup
- Performance: bare-clone cache (`~/.cache/awb/clones/`), cached `RunEnvironment`/adapter, schema cache
- CLI safety: confirmation prompt (`--yes`), `quickstart` is an env-only check, resolved paths, `check_available` guard for stubs
### 1.0.6 (2026-04-03)
- Add trustme to 4 real httpx repo tasks (BF-003, BF-011, BF-013, FA-005)
### 1.0.5 (2026-04-02)
- Add trio to 16 httpx-based tasks (fixes silent pytest crash on Python 3.13+)
### 1.0.4 (2026-04-01)
- Fix 4 verification bugs (FA-010, RF-012, CR-007, BF-003)
### Older releases
See CHANGELOG.md for the full history (v1.0.0, v0.5.x, v0.4.x, v0.3.x, v0.2.x, v0.1.0).
## Links
- Methodology — Fair comparison principles, metric definitions, known limitations
- Architecture — Module graph, data models, pipeline diagrams
- Contributing — Adding tasks, tools, and submitting results
- PyPI — `pip install awb`
## License
MIT