
AI Workflow Benchmark (AWB)

Measure AI coding tool+workflow performance, not just model capability.



Why This Exists

SWE-bench tests models. AWB tests workflows. The same model running vanilla Claude Code vs. a purpose-built setup with a tuned CLAUDE.md, hooks, and structured agents produces meaningfully different results on real engineering tasks. No existing benchmark captures that gap — they all evaluate the model in isolation.

AWB benchmarks the full stack: tool + configuration + workflow + model, together, on 60 tasks drawn from real open-source repositories.

Quick Start

pip install awb

awb quickstart                    # verify your setup
awb run --runs 3                  # full benchmark (3 runs each for stable scores)
awb gap results/runs/<run_dir>/   # analyze capability gaps

How It Works

Clone repo at pinned SHA
  → Run setup commands
  → Capture baseline lint/security counts
  → Execute tool with task prompt
  → Run test suite + partial credit rubric
  → Sigmoid-normalize 7 metrics
  → Produce weighted composite + capability profile

Each task starts from a fresh git clone at a pinned commit. Every tool gets the same prompt, the same timeout, and the same verification suite. Results are scored with sigmoid normalization so scores are never negative and never collapse at the boundary.
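
In pseudocode, the per-task protocol looks roughly like this. This is an illustrative sketch, not the harness's actual code; the dictionary keys mirror the task YAML shown under "Adding Tasks", and the adapter interface is the one described under "Adding Tools":

import asyncio
import subprocess
from pathlib import Path

def run_task(task: dict, adapter, workdir: Path) -> bool:
    """Clone at the pinned SHA, run setup, execute the tool, then run the verification suite."""
    repo_dir = workdir / task["id"]
    subprocess.run(["git", "clone", task["repo"]["url"], str(repo_dir)], check=True)
    subprocess.run(["git", "checkout", task["repo"]["commit"]], cwd=repo_dir, check=True)
    for cmd in task["repo"]["setup_commands"]:
        subprocess.run(cmd, shell=True, cwd=repo_dir, check=True)

    # Every tool gets the same prompt, turn budget, and timeout
    asyncio.run(adapter.execute(
        task["issue"]["description"], repo_dir,
        max_turns=task["constraints"]["max_iterations"],
        timeout_seconds=task["constraints"]["timeout_seconds"],
    ))

    # Pass/fail portion of correctness: every test command must succeed
    # (baseline lint/security capture and partial-credit checks are omitted here)
    return all(
        subprocess.run(cmd, shell=True, cwd=repo_dir).returncode == 0
        for cmd in task["verification"]["test_commands"]
    )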

Scoring System

Seven dimensions, sigmoid-normalized with per-task baselines derived from difficulty:

Dimension        Weight  What It Measures
Correctness      55%     Pass/fail (60%) + partial credit rubric (40%)
Cost efficiency  15%     Estimated USD per task
Speed            10%     Wall-clock seconds vs. estimated task time
Code quality     10%     Lint warning delta (pre vs. post)
Reliability      5%      Pre-existing tests broken by the change
Security         3%      New security issues introduced
Efficiency       2%      Tool turns used vs. task max

Sigmoid curve: score = 100 / (1 + exp(k * (value - baseline)))

  • Optimal performance (excellent) → ~95
  • Baseline performance (adequate) → ~50
  • Worse than baseline → smooth decay toward 0, never negative
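
A minimal sketch of the curve in Python, with k chosen so the stated anchors hold (≈95 at the optimal value, 50 at the baseline); how AWB actually derives k is not documented here:

import math

def sigmoid_score(value: float, optimal: float, baseline: float) -> float:
    # Lower raw values (cost, seconds, turns) are better.
    # k solves 95 = 100 / (1 + exp(k * (optimal - baseline)))
    k = math.log(19) / (baseline - optimal)
    return 100 / (1 + math.exp(k * (value - baseline)))

# Cost on an easy task, using the $0.05 optimal / $0.30 baseline from the table below
for cost in (0.05, 0.30, 1.00):
    print(f"${cost:.2f} -> {sigmoid_score(cost, 0.05, 0.30):.1f}")
# $0.05 -> 95.0, $0.30 -> 50.0, $1.00 -> 0.0 (approaches 0, never goes negative)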

Difficulty-weighted aggregation: hard tasks count 2.5×, medium 1.5×, easy 1.0×. A tool that solves the hard tasks outscores one that only solves easy ones, even when the easy-only tool completes more tasks.
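
A sketch of the weighting principle (the real aggregation lives in the awb package and may differ in detail):

DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 1.5, "hard": 2.5}

def weighted_composite(task_scores: list[tuple[str, float]]) -> float:
    # task_scores is a list of (difficulty, composite score 0-100) pairs
    total = sum(DIFFICULTY_WEIGHT[d] * s for d, s in task_scores)
    weight = sum(DIFFICULTY_WEIGHT[d] for d, _ in task_scores)
    return total / weight

# Two solved hard tasks outweigh three solved easy ones:
print(weighted_composite([("hard", 90), ("hard", 90), ("easy", 0), ("easy", 0), ("easy", 0)]))   # 56.25
print(weighted_composite([("easy", 90), ("easy", 90), ("easy", 90), ("hard", 0), ("hard", 0)]))  # 33.75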

Per-task baselines by difficulty:

Metric (optimal / baseline)  Easy           Medium         Hard
Cost                         $0.05 / $0.30  $0.20 / $1.00  $1.00 / $3.00
Speed                        50% / 100% of estimated_minutes (same for all difficulties)
Iterations                   3 / max_iters  8 / max_iters  15 / max_iters

The 60 Tasks

Real open-source repos, pinned to release tag SHAs. Setup runs in under 15 seconds via venv + pip (Python) or npm (TypeScript).

Category Count Easy / Med / Hard What It Tests
bug-fix 10 3 / 3 / 2 Root cause analysis, None handling, async bugs, race conditions
feature-addition 8 2 / 3 / 2 Convention adherence, middleware patterns, cross-cutting features
refactoring 10 2 / 3 / 2 Multi-file consistency, pattern extraction, async migration
code-review 7 2 / 3 / 1 Security awareness, OWASP, concurrency bugs, CORS/auth
debugging 7 2 / 1 / 3 Hypothesis testing, connection leaks, pipeline tracing
multi-file 8 0 / 3 / 3 Cross-module architecture, plugin systems, auth chains
legacy-code 10 4 / 4 / 2 Modernization, migration, dead code removal, type annotations

Repos used: FastAPI, httpx, Flask, Starlette, Click, Pydantic, SQLAlchemy 2.0, Hono

Task IDs: BF-001–011 · FA-001–008 · RF-001–010 · CR-001–007 · DB-001–007 · MF-001–008 · LC-001–010

Capability Profiles

Each task maps to 1–3 capabilities, producing a radar chart of tool strengths:

Capability Tasks What It Measures
code_comprehension 27 Understanding existing code before modifying
framework_knowledge 26 Knowing API patterns (Pydantic v2, async SQLAlchemy, etc.)
refactoring_discipline 23 Changing code without breaking behavior
multi_file_reasoning 20 Coordinating changes across multiple files
bug_diagnosis 17 Structured root cause analysis
test_writing 8 Writing correct, meaningful tests
security_awareness 8 Identifying and fixing vulnerabilities
cost_discipline derived Token efficiency across all tasks

Example awb gap output:

Capability Profile
------------------
code_comprehension    ████████████████████  82.4  (n=27, conf=high)
framework_knowledge   ████████████████░░░░  68.1  (n=26, conf=high)
refactoring_discipline████████████████░░░░  65.3  (n=23, conf=high)
multi_file_reasoning  ████████████░░░░░░░░  51.2  (n=20, conf=high)
bug_diagnosis         ███████████████░░░░░  63.7  (n=17, conf=med)
test_writing          ██████████░░░░░░░░░░  44.1  (n=8,  conf=low)
security_awareness    █████████████░░░░░░░  55.8  (n=8,  conf=low)

Systematic Patterns
-------------------
- Fails 70%+ of multi_file_reasoning tasks → consider multi-agent workflows
- Token spend on failed hard tasks: $4.20 → add early-exit heuristics
- No failures on easy tasks → baseline is solid

Top Suggestions
---------------
1. Enable subagent mode for tasks spanning >3 files (impact: high)
2. Add repo-level CLAUDE.md with architecture overview (impact: medium)
3. Use --think flag for debugging tasks (impact: medium)

CLI Reference

Command Description
awb run [tool] [options] Run benchmark tasks
awb gap <run_dir> Analyze capability gaps and generate improvement suggestions
awb compare <run1> <run2> Compare two runs with significance testing
awb export <run_dir> -o file.json Export results in external submission format
awb submit <file.json> Validate and display an external submission
awb compare-submissions <a.json> <b.json> Cross-tool comparison with statistics
awb quickstart Verify setup: tools available, tasks load, validation passes
awb info <task_id> Show task details
awb tools List registered adapters and availability
awb validate Validate all task YAMLs against schema
awb leaderboard Generate HTML leaderboard from run results
awb workflow <subcommand> Export, validate, diff, or init workflow descriptors

Common options for awb run:

awb run                            # all tools, all tasks, 3 runs
awb run claude-code-custom         # single tool
awb run -t BF-001                  # single task
awb run --category legacy-code     # filter by category
awb run --difficulty hard          # filter by difficulty
awb run --capability bug_diagnosis # filter by capability
awb run --runs 1 --dry-run        # preview without executing

Adding Tasks

Tasks live in awb/tasks/<category>/. Copy awb/tasks/_template.yaml:

id: BF-012
category: bug-fix
title: "Fix response_model silently dropping extra fields in FastAPI"
difficulty: easy
estimated_minutes: 15
languages: [python]
capabilities: [framework_knowledge, test_writing]

repo:
  url: "https://github.com/tiangolo/fastapi"
  commit: "628c34e0"
  setup_commands:
    - "python3 -m venv .venv && source .venv/bin/activate && pip install -e '.[all]'"

issue:
  description: |
    The endpoint's response_model silently strips extra fields...
  files_to_examine:
    - "fastapi/routing.py"

verification:
  test_commands:
    - "source .venv/bin/activate && python3 -m pytest tests/test_extra_fields.py -v"
  partial_credit:
    - criterion: "Uses Pydantic v2 ConfigDict"
      points: 50
      check: "grep -q 'ConfigDict' tests/test_extra_fields.py"
    - criterion: "Tests pass"
      points: 50
      check: "source .venv/bin/activate && python3 -m pytest tests/test_extra_fields.py -v"

constraints:
  max_iterations: 20
  timeout_seconds: 1800

Run awb validate to check your task before opening a PR. Full guide: CONTRIBUTING.md

Adding Tools

Implement the ToolAdapter ABC in awb/adapters/:

from awb.adapters.base import ToolAdapter, ToolResult
from pathlib import Path

class MyToolAdapter(ToolAdapter):
    name = "my-tool"
    display_name = "My Tool"

    async def execute(self, prompt: str, workspace: Path,
                      max_turns: int = 20, timeout_seconds: int = 1800) -> ToolResult:
        ...

    def check_available(self) -> bool:
        ...

    def get_config_hash(self) -> str:
        ...

Register in awb/adapters/registry.py and add an entry point in pyproject.toml.
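
For a CLI-driven tool, the adapter might wrap an async subprocess call, as in the sketch below. The mytool command and its flags are placeholders, and the ToolResult fields used here (success, output) are assumptions; check awb/adapters/base.py for the actual fields:

import asyncio
import hashlib
import shutil
from pathlib import Path

from awb.adapters.base import ToolAdapter, ToolResult

class MyCliAdapter(ToolAdapter):
    name = "my-cli-tool"
    display_name = "My CLI Tool"

    async def execute(self, prompt: str, workspace: Path,
                      max_turns: int = 20, timeout_seconds: int = 1800) -> ToolResult:
        # Hypothetical CLI invocation; substitute your tool's real command and flags
        proc = await asyncio.create_subprocess_exec(
            "mytool", "--prompt", prompt, "--max-turns", str(max_turns),
            cwd=workspace,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
        try:
            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return ToolResult(success=False, output="timed out")  # field names are assumptions
        return ToolResult(success=proc.returncode == 0, output=stdout.decode())

    def check_available(self) -> bool:
        # Available only if the CLI is on PATH
        return shutil.which("mytool") is not None

    def get_config_hash(self) -> str:
        # Hash whatever configuration affects behaviour (config files, flags, model name)
        return hashlib.sha256(b"mytool-default-config").hexdigest()[:12]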

External Submissions

Anyone can share results using the submission format defined in results/submission-schema.json:

awb run --runs 3
awb export results/runs/<run_dir>/ -o my-results.json
awb submit my-results.json                        # validate locally
awb compare-submissions a.json b.json             # compare with significance testing

The format captures tool version, model, hardware class, and per-task run results. Hardware classes (e.g., apple_m5_24gb, linux_x86_16gb) keep speed comparisons fair: wall-clock times are only compared within the same tier.

Statistical Framework

  • Confidence intervals via t-distribution (no scipy required for core scoring)
  • Significance testing via sign test for paired tool comparison (see the sketch after this list)
  • Integrity checks: contamination detection (completions <10s flagged), variance anomalies (identical times/tokens across runs)
  • Weight profiles: default, correctness_focused, production (see awb/scoring/weights.yaml)
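
A minimal sketch of the paired sign test (illustrative only; the shipped implementation may differ):

from math import comb

def sign_test_p(scores_a: list[float], scores_b: list[float]) -> float:
    # Two-sided sign test on paired per-task scores; ties are dropped, no scipy needed
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n, wins = len(diffs), sum(d > 0 for d in diffs)
    k = min(wins, n - wins)
    # Under the null, each tool is equally likely to win any task (Binomial(n, 0.5))
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

# Tool A beats tool B on 14 of the 16 tasks where they differ:
print(sign_test_p([1.0] * 14 + [0.0] * 2, [0.0] * 14 + [1.0] * 2))  # ≈ 0.004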

Links

  • Methodology — Fair comparison principles, metric definitions, known limitations
  • Architecture — Module graph, data models, pipeline diagrams
  • Contributing — Adding tasks, tools, and submitting results
  • PyPI — pip install awb

License

MIT
