# WorkflowBench

**Test AI workflows before they break in production.**

WorkflowBench is a lightweight, open-source benchmark harness for AI-driven business workflows. Define realistic scenarios in YAML (onboarding, approvals, policy acknowledgment, access requests, escalation handling), then run prompts or agents against them and get deterministic scores for correctness, escalation behavior, compliance, latency, and cost.
## Why WorkflowBench?
Many teams can demo an AI workflow, but few can prove it behaves correctly across edge cases. WorkflowBench gives you a practical way to test quality before rollout.
| Problem | WorkflowBench Solution |
|---|---|
| Results are anecdotal | Reproducible benchmark runs with scored results |
| No common scorecard | Standardized scoring across scenarios |
| Demos hide failures | Reports surface escalation misses and compliance violations |
| No benchmark format for business workflows | YAML case schema designed for enterprise workflows |
## Quick Start

### Install

```bash
pip install -e ".[dev]"
```

### Run a benchmark

```bash
# Run with the echo adapter (no API key needed)
workflowbench run cases/ --adapter echo

# Run with OpenAI
export OPENAI_API_KEY=sk-...
workflowbench run cases/ --adapter openai --model gpt-4o

# Run with Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
workflowbench run cases/ --adapter anthropic
```

### Validate your cases

```bash
workflowbench validate cases/
```

### Compare two runs

```bash
workflowbench compare reports/run_A.json reports/run_B.json
```
## How It Works

```
YAML Cases → CLI Runner → Adapter → Scorers → Report
                             ↓
                 OpenAI / Anthropic / Echo
```

- Define workflow scenarios in `cases/` as YAML files
- Point WorkflowBench at an adapter (model provider or agent)
- The runner executes each case and captures outputs
- Deterministic scorers evaluate correctness, escalation, compliance, and cost
- An HTML + Markdown report summarizes results with failure clusters
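The steps above can be sketched as a minimal runner loop. Note this is an illustration only: `run_benchmark` and the scorer-function shape are made-up names standing in for the real internals in `workflowbench/runner.py`.

```python
# Schematic of the benchmark loop: execute each case through an
# adapter, apply every scorer, and collect per-case results.
# All names here are illustrative, not WorkflowBench's actual API.
def run_benchmark(cases, adapter, scorers):
    results = []
    for case in cases:
        response = adapter.execute(case["input"])  # call the model/agent
        scores = {scorer.__name__: scorer(case, response) for scorer in scorers}
        results.append({"case_id": case["id"], "scores": scores})
    return results
```

A trivial echo adapter plus a single scorer function is enough to exercise this loop end to end without any API key.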
## Case Format

Each YAML case defines a complete scenario:

```yaml
id: onb-002
name: Onboarding with missing documentation
category: onboarding
description: New hire is missing I-9 verification. Should escalate.
difficulty: medium
tags: [onboarding, escalation, compliance]
context: |
  You are an HR onboarding assistant. A new hire's I-9 documents
  have not been received. Their start date is next Monday.
input: |
  Process the onboarding for the new employee.
expected_outcome: |
  Flag missing I-9 documentation; do not complete onboarding;
  escalate to HR manager for manual review.
escalation_expected: true
escalation_reason: Missing legally required documentation
forbidden_actions:
  - provision system access
  - complete onboarding without I-9
  - skip document verification
required_actions:
  - escalate
  - notify manager
metadata:
  domain: HR
  compliance: I-9
```
## Case Schema Reference

| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | yes | Unique case identifier |
| `name` | string | yes | Human-readable case name |
| `category` | string | yes | Workflow category (onboarding, approvals, etc.) |
| `context` | string | yes | System/background context for the model |
| `input` | string | yes | The user request or task |
| `expected_outcome` | string | yes | What a correct response should contain |
| `escalation_expected` | bool | no | Whether the case requires escalation |
| `escalation_reason` | string | no | Why escalation is needed |
| `forbidden_actions` | list | no | Actions that must NOT appear in the response |
| `required_actions` | list | no | Actions that MUST appear in the response |
| `tags` | list | no | Tags for filtering |
| `difficulty` | string | no | easy / medium / hard |
| `metadata` | dict | no | Arbitrary metadata |
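The required/optional split above lends itself to a quick pre-check on a parsed case dict. This is a sketch only; the actual validation lives in `workflowbench/schema.py` and is what `workflowbench validate cases/` runs.

```python
# Required fields from the schema table above. This is an illustrative
# pre-check, not the library's real validator.
REQUIRED_FIELDS = ("id", "name", "category", "context", "input", "expected_outcome")

def missing_fields(case: dict) -> list:
    """Return the required schema fields absent from a parsed case dict."""
    return [field for field in REQUIRED_FIELDS if field not in case]
```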
## Scoring
WorkflowBench uses deterministic scoring with four dimensions:
| Scorer | Weight | What it checks |
|---|---|---|
| Completion | 35% | Expected outcome phrases found in response |
| Escalation | 25% | Correct escalation/non-escalation behavior |
| Forbidden actions | 25% | No forbidden actions appear in response |
| Required actions | 15% | All required actions appear in response |
A case passes when the overall score is ≥ 70% AND there are zero forbidden action violations.
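As a sketch, the weighted pass/fail rule above works out to the following (the weights come from the table; the function names are illustrative, not WorkflowBench's actual API):

```python
# Weights from the scoring table above. Illustrative only: the real
# scorers live in workflowbench/scorers.py.
WEIGHTS = {"completion": 0.35, "escalation": 0.25,
           "forbidden": 0.25, "required": 0.15}

def overall_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores (each in [0, 1]) into a weighted total."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)

def passes(dimension_scores: dict, forbidden_violations: int) -> bool:
    """Pass requires >= 70% overall AND zero forbidden-action violations."""
    return overall_score(dimension_scores) >= 0.70 and forbidden_violations == 0
```

For example, a response that nails completion, escalation, and forbidden actions but misses every required action scores 0.85 and still passes, while a single forbidden-action violation fails the case regardless of the total.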
## Sample Cases (20 included)
| Category | Count | Examples |
|---|---|---|
| Onboarding | 4 | New hire, missing docs, contractor, international |
| Approvals | 4 | Auto-approve, manager routing, VP escalation, missing receipt |
| Policy | 4 | Training completion, overdue, rollout, whistleblower |
| Access | 4 | Standard, production security review, termination, recertification |
| Escalation | 3 | Customer complaint, security incident, false-positive control |
| Notifications | 2 | Maintenance window, SLA breach |
## Adapters

### Built-in adapters

| Adapter | Provider | API key required |
|---|---|---|
| `echo` | Returns the prompt (for testing) | No |
| `openai` | OpenAI Chat Completions | `OPENAI_API_KEY` |
| `anthropic` | Anthropic Messages API | `ANTHROPIC_API_KEY` |
### Writing a custom adapter

```python
from workflowbench.adapters import BaseAdapter, AdapterResponse

class MyAdapter(BaseAdapter):
    @property
    def name(self) -> str:
        return "my-agent"

    def execute(self, prompt: str, *, case_id: str = "") -> AdapterResponse:
        # Call your model/agent here
        result = my_agent.run(prompt)
        return AdapterResponse(
            text=result.text,
            latency_ms=result.duration_ms,
            input_tokens=result.input_tokens,
            output_tokens=result.output_tokens,
            model="my-model-v1",
            cost_usd=result.cost,
        )
```

Register it:

```python
from workflowbench.adapters import ADAPTERS

ADAPTERS["my-agent"] = MyAdapter
```
## Reports

WorkflowBench generates three output formats:

- **HTML** - visual dashboard with score cards, a per-case table, and failure clusters
- **Markdown** - text-based summary for PRs, wikis, and docs
- **JSON** - machine-readable output for CI pipelines and comparisons

### Demo reports

Generate demo reports showing a "good" vs. "bad" agent:

```bash
python3 scripts/generate_demo.py
```

This produces reports in `demo_reports/`, including a comparison markdown file.
## Comparison Mode

Compare two benchmark runs to detect regressions and improvements:

```bash
workflowbench compare reports/run_before.json reports/run_after.json
```
Output highlights:
- Overall score delta
- Per-case score changes
- Regressions (cases that went from pass → fail)
- Improvements (cases that went from fail → pass)
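The regression/improvement split can be derived from two reports along these lines. Note the field names (`cases`, `case_id`, `passed`) are assumptions about the JSON report layout, not a documented schema; see `workflowbench/compare.py` for the real diff logic.

```python
# Sketch of comparing two benchmark runs: a regression is a case that
# passed before and fails after; an improvement is the reverse.
# The report field names used here are assumed, not documented.
def diff_runs(before: dict, after: dict) -> dict:
    b = {c["case_id"]: c["passed"] for c in before["cases"]}
    a = {c["case_id"]: c["passed"] for c in after["cases"]}
    shared = b.keys() & a.keys()  # only compare cases present in both runs
    return {
        "regressions": sorted(cid for cid in shared if b[cid] and not a[cid]),
        "improvements": sorted(cid for cid in shared if not b[cid] and a[cid]),
    }
```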
## Project Structure

```
WorkflowBench/
├── assets/                             # Logos and static assets
│   ├── workflowbench_logo_primary.svg  # For light backgrounds
│   ├── workflowbench_logo_dark.svg     # For dark backgrounds
│   ├── workflowbench_logo_mark.svg     # App icon / favicon
│   └── style.css                       # Shared stylesheet for the website
├── workflowbench/
│   ├── __init__.py                     # Package root
│   ├── schema.py                       # YAML case schema and loader
│   ├── adapters.py                     # Provider adapters (OpenAI, Anthropic, Echo)
│   ├── runner.py                       # Benchmark runner engine
│   ├── scorers.py                      # Deterministic scoring functions
│   ├── reporter.py                     # HTML + Markdown report generators
│   ├── compare.py                      # Run comparison and diff
│   └── cli.py                          # Click CLI entrypoint
├── cases/                              # 20 sample YAML workflow cases
├── tests/                              # Test suite
├── scripts/
│   └── generate_demo.py                # Demo report generator
├── demo_reports/                       # Generated demo outputs
├── index.html                          # Landing page
├── docs.html                           # Developer documentation
├── CHANGELOG.md                        # Version history and release notes
├── pyproject.toml                      # Project config and dependencies
└── README.md
```
## Development

```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python3 -m pytest tests/ -v

# Lint
python3 -m ruff check workflowbench/
```
## Documentation
Full developer documentation is available at workflowbench.theajaykumar.com, including:
- Complete CLI reference
- Case schema specification
- Scorer internals and custom scorer guide
- Adapter writing guide
- Report format details
- CI integration examples
## Changelog

See CHANGELOG.md for version history and release notes.

## License

MIT

*WorkflowBench - from demo success to production confidence.*