AI Workbench MCP
Acceptance gates for AI coding-agent runs.
Goose-first acceptance, validation, routing, and audit layer for agentic work.
AI agents can produce code. AI Workbench MCP helps decide whether that work is accepted.
It records the task, captures agent output, runs deterministic validation, applies a quality gate, and creates an auditable run trail.
Works with Goose today. Designed as a host-agnostic acceptance layer for MCP-compatible agent workflows. Codex local/IDE is the first additional host target, supported through explicit execution_host and response_source evidence metadata.
Before
The agent says: "Done."
After
AI Workbench shows:
- what task was requested
- what agent/model/runtime was used
- what output was produced
- what validation ran
- whether the quality gate accepted, rejected, or requested review
- where the evidence lives
runs/example/
task_metadata.json
final_prompt.md
model_selection.json
model_output.md
validation_report.json
revision_decision.json
run_log.jsonl
Problem
AI coding agents can produce useful work, but "done" is not the same as accepted. A useful acceptance workflow needs reproducible evidence.
AI Workbench MCP provides that acceptance and audit layer, turning agent output into evidence-backed accepted runs.
Why Goose + Acceptance Gates
Goose already owns the agent execution surface: CLI, desktop, providers, recipes, MCP hosting, and the agent loop.
AI Workbench MCP stays complementary:
- opens run evidence folders
- recommends model/runtime tiers
- records model or Goose output
- runs deterministic validation
- makes quality-gate decisions
- summarizes accepted-run analytics
It does not provide a chat UI, editor fork, provider marketplace, or general agent runner.
What MCP Does And Does Not Do
MCP is the connection protocol.
AI Workbench MCP is the tool server. MCP lets Goose, Codex local/IDE, or another compatible host call Workbench tools, but the protocol itself does not verify correctness, inspect code quality, or decide whether a run is accepted.
Workbench applies the acceptance policy by recording evidence, running deterministic validation, and applying the quality gate. See how acceptance works for the full distinction.
Prompt DoD vs Acceptance Gate
A prompt definition-of-done tells the agent what to attempt and what evidence to report. Prompt instructions are not enforcement.
An acceptance gate checks the resulting evidence after the agent acts. It uses explicit validation profiles, command-backed checks, required artifacts, changed-file policies, and quality-gate rules. An agent saying "done" is never, by itself, enough for acceptance.
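To make the contrast concrete, a validation profile can be pictured as declarative data that the server executes against the run. The sketch below is illustrative only: the field names are assumptions, not the committed profile schema, and the test command is borrowed from the tiny-python-fix example elsewhere in this README.

```python
# Illustrative sketch of a validation profile; field names are assumptions,
# not the committed schema. The real profiles are repo assets.
tiny_python_fix_profile = {
    "name": "tiny_python_fix",
    "commands": [
        # Deterministic, command-backed check that must exit 0.
        "python -m unittest discover -s examples/tiny-python-fix -p test_*.py",
    ],
    "required_artifacts": ["model_output.md"],  # must exist after the run
    "changed_file_policy": ["examples/tiny-python-fix/**"],  # edits outside scope fail
}
```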
What Decides Acceptance
Acceptance is decided by the selected validation profile and quality gate.
The validation profile runs deterministic checks such as tests, build or lint commands, artifact existence checks, and changed-file policy. The quality gate then accepts the run, requests review, requests revision, or leaves the run failed based on that evidence and the configured risk policy.
The agent performs. Workbench accepts. MCP connects them.
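In heavily simplified form, that split of responsibilities looks like the sketch below. The real gate reads evidence files and the configured risk policy rather than two parameters, and the revision/failed outcomes are collapsed into one branch here for brevity.

```python
# Simplified illustration of the quality-gate outcome space; assumptions only.
def gate_decision(validation_passed: bool, risk: str) -> str:
    if not validation_passed:
        return "revision_required"  # evidence does not support acceptance
    if risk in ("medium", "high"):
        return "needs_review"       # passing evidence, but a human signs off
    return "accepted"               # low risk plus passing deterministic checks

print(gate_decision(validation_passed=True, risk="low"))   # accepted
print(gate_decision(validation_passed=False, risk="low"))  # revision_required
```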
5-Minute Quickstart
Install from the repository root:
python -m pip install -e .
The PyPI package is not published yet. The current wheel is code/server only; full Goose recipe workflows require this checked-out repo because configs, prompts, recipes, examples, evals, and validation profiles are repo assets. See the PyPI publishing prep guide for the packaging boundary and release checklist.
Register the MCP server in Goose:
goose configure
Choose:
- Add Extension -> Command-line Extension
- Name: AI Workbench MCP
- Command: ai-workbench-mcp
- Timeout: 300
On slower local models, start with the two-tool smoke to verify Goose can reach the MCP server:
goose run --no-session --max-turns 4 --recipe ./recipes/workbench-mcp-tool-smoke.yaml \
--params project=ai_workbench_mcp \
--params run_dir=runs/goose-tool-smoke \
--params task="Local Goose MCP tool smoke. Do not edit tracked files." \
--params risk=low \
--params complexity_score=4
Then run the full sample recipe smoke after Goose has a provider configured:
goose run --recipe ./recipes/workbench-engineering-acceptance.yaml \
--params project=ai_workbench_mcp \
--params run_dir=runs/goose-tiny-python-fix \
--params task="Fix examples/tiny-python-fix/calculator.py so python -m unittest discover -s examples/tiny-python-fix -p test_*.py passes. Keep the change minimal and report the validation result." \
--params task_type=implement \
--params risk=low \
--params validation_profile=tiny_python_fix \
--params complexity_score=4
For bounded documentation-only changes, use the focused v0.2 recipe:
goose run --recipe ./recipes/workbench-docs-only-acceptance.yaml \
--params project=ai_workbench_mcp \
--params run_dir=runs/goose-docs-only \
--params task="Update the public docs for the requested documentation-only change." \
--params risk=low
For bounded Python package maintenance, use:
goose run --recipe ./recipes/workbench-python-package-maintenance.yaml \
--params project=ai_workbench_mcp \
--params run_dir=runs/goose-package-maintenance \
--params task="Make the requested bounded Python package maintenance change and keep the full test suite passing." \
--params task_type=implement \
--params risk=medium
For bounded test-fix work, use:
goose run --recipe ./recipes/workbench-test-fix-acceptance.yaml \
--params project=ai_workbench_mcp \
--params run_dir=runs/goose-test-fix \
--params task="Fix the requested failing test signal with the smallest justified change and report the exact validation command." \
--params task_test_command="python -m unittest discover -s examples/tiny-python-fix -p test_*.py" \
--params risk=medium
For a general low-risk implementation task with deterministic test coverage, use the engineering recipe with the low-risk coding profile:
goose run --recipe ./recipes/workbench-engineering-acceptance.yaml \
--params project=ai_workbench_mcp \
--params run_dir=runs/goose-low-risk-coding \
--params task="Make the requested bounded low-risk code change and keep deterministic tests passing." \
--params task_type=implement \
--params risk=low \
--params validation_profile=low_risk_coding \
--params complexity_score=8
Inspect the evidence folder:
runs/goose-tiny-python-fix/
task_metadata.json
final_prompt.md
model_selection.json
model_output.md
validation_report.json
revision_decision.json
run_log.jsonl
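Because every evidence file is plain JSON or JSONL, a short script can spot-check a run before opening individual files. This sketch only lists top-level keys rather than assuming any field names.

```python
import json
from pathlib import Path

run = Path("runs/goose-tiny-python-fix")
for name in ("task_metadata.json", "validation_report.json", "revision_decision.json"):
    doc = json.loads((run / name).read_text())
    print(name, "->", sorted(doc))  # inspect available keys before relying on them

# run_log.jsonl holds one JSON event per line
events = [json.loads(line) for line in (run / "run_log.jsonl").read_text().splitlines()]
print(f"{len(events)} run-log events")
```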
Do not commit runs/. It is the local evidence ledger.
Codex Local/IDE
Codex uses the same ai-workbench-mcp server. The first Codex slice is local/IDE MCP support, not Codex cloud.
- Codex setup: configure Codex to call the existing MCP stdio server.
- Codex acceptance workflow: use the six-tool lifecycle with execution_host="codex" and response_source="codex".
- Codex AGENTS.md snippet: reusable repository instruction block for Codex runs.
- Codex cloud limitations: evidence persistence and export questions deferred to a later design pass.
- Codex live-test handoff: batch/Python helper that runs safe preflight checks, shows a timer, prints a one-shot prompt, and checks the resulting Codex evidence folders.
- Codex acceptance demo walkthrough: bounded local/IDE proof path with loop and crash guardrails.
Sample Analytics Demo
To inspect the trust loop without provider setup, run analytics over the committed synthetic sample runs:
python tools/run_analyze.py --runs-dir examples/sample-runs --out-dir runs/sample-run-analytics
The sample set includes accepted, docs-only accepted, and revision-required test-fix evidence. Read the analytics guide to interpret run_metrics.json, run_summary.md, outcome buckets, failure reasons, routing feedback candidates, and optional cost fields. Read the evidence dashboard guide to use the generated run_dashboard.html for local scanning and demos.
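For a quick tally before opening the full report, something like the following works over the committed sample runs. The only structural assumption is one revision_decision.json per run folder; the outcome key name is a guess to confirm against a real file first.

```python
import json
from collections import Counter
from pathlib import Path

buckets = Counter()
for decision_file in Path("examples/sample-runs").glob("*/revision_decision.json"):
    decision = json.loads(decision_file.read_text())
    # "decision" as the outcome key is an assumption; print sorted(decision) to check.
    buckets[str(decision.get("decision", "unknown"))] += 1
print(buckets)
```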
Core MCP operations also write best-effort local events.jsonl ledgers beside evidence artifacts. Read the event ledger guide before using operation events in analytics or CI prototypes.
Run the committed golden-case eval smoke to score accepted sample evidence:
python tools/golden_eval.py --cases-dir evals/golden_cases --source-runs-dir examples/sample-runs --out-dir runs/golden_eval_smoke
The harness writes model_eval_metadata.json and score_report.json under one child folder per case. Read the golden-case harness guide before treating eval results as anything beyond local evidence-contract regression checks.
Advisory Routing Feedback
workbench_select_model can optionally read routing_feedback_candidates from a previous analytics report. The feedback is advisory only: it records whether historical evidence supports the current tier, suggests escalation, or asks for more evidence, but it never changes selected_tier.
Focused recipes pass runs/_reports/run_metrics.json as the default feedback source. Missing, invalid, or low-volume feedback is non-fatal and is recorded in model_selection.json under routing_feedback.
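In sketch form, the advisory contract looks like this. The candidate fields, volume threshold, and status strings are assumptions for illustration; the property the real tool guarantees is the last line, where selected_tier is never overwritten.

```python
# Advisory-only sketch: feedback annotates the selection but never changes it.
def apply_routing_feedback(selected_tier: str, candidates: list | None) -> dict:
    if not candidates:  # missing, invalid, or empty feedback is non-fatal
        feedback = {"status": "no_usable_feedback"}
    else:
        # Hypothetical candidate shape: {"tier": ..., "accept_rate": ..., "runs": ...}
        match = next((c for c in candidates if c.get("tier") == selected_tier), None)
        if match is None or match.get("runs", 0) < 5:
            feedback = {"status": "insufficient_evidence_for_tier"}
        elif match["accept_rate"] >= 0.8:
            feedback = {"status": "evidence_supports_tier", "candidate": match}
        else:
            feedback = {"status": "consider_escalation", "candidate": match}
    return {"selected_tier": selected_tier, "routing_feedback": feedback}
```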
Bring Your Own Models
The committed model registry lives at configs/model_registry.yaml. To customize model IDs or providers locally, copy configs/model_registry.example.yaml to configs/model_registry.local.yaml and edit the local file. The local override is ignored by git, recursively merges into the base registry, and is recorded in model_selection.json with repo-relative source metadata.
See the model registry guide for merge rules, required tier fields, selector-reference validation, and the advisory-only scope.
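The merge behavior can be pictured with a short sketch (illustrative, not the server's actual loader): nested mappings merge key by key, while anything else in the local file, including lists, replaces the base value outright. The list-replacement behavior is an assumption to verify against the registry guide.

```python
def merge_registry(base: dict, override: dict) -> dict:
    """Recursively merge a local override into the base registry (sketch)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_registry(merged[key], value)  # merge nested tables
        else:
            merged[key] = value  # scalars and lists replace the base value
    return merged

# Hypothetical tier entries, for illustration only.
base = {"tiers": {"fast": {"model": "base-model", "max_tokens": 4096}}}
local = {"tiers": {"fast": {"model": "my-local-model"}}}
assert merge_registry(base, local)["tiers"]["fast"] == {
    "model": "my-local-model",
    "max_tokens": 4096,
}
```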
Six MCP Tools
workbench_open_run
-> creates the run folder, task metadata, final prompt, context packet, and initial run log; records execution_host, defaulting to goose
workbench_select_model
-> recommends a model/runtime tier and writes model_selection.json
workbench_record_execution
-> captures raw Goose/Codex/model response text into model_output.md and appends run_log.jsonl; records response_source, defaulting to goose
workbench_validate_run
-> runs deterministic validation and writes validation_report.json
workbench_quality_gate
-> accepts, rejects, or requests review and writes revision_decision.json
workbench_analyze_runs
-> summarizes accepted-run metrics by execution host, response source, recipe, validation profile, model tier, failure reason, and quality-gate outcome under runs/_reports, and writes run_dashboard.html for local evidence scanning
Workflow
Goose recipe
-> workbench_open_run
-> workbench_select_model
-> Goose performs the task
-> workbench_record_execution
-> workbench_validate_run
-> workbench_quality_gate
-> workbench_analyze_runs
A run is accepted only when deterministic validation and the quality gate support acceptance.
For the detailed acceptance model, read how acceptance works.
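For hosts without a recipe runner, the same lifecycle can be driven directly over stdio with the MCP Python SDK. The sketch below covers only the first step; the tool argument names mirror the recipe parameters shown earlier and should be treated as assumptions to check against the server's published tool schema.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the installed console script as a stdio MCP server.
    params = StdioServerParameters(command="ai-workbench-mcp")
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # expect the six workbench_* tools
            result = await session.call_tool(
                "workbench_open_run",
                arguments={  # argument names assumed from the recipe params
                    "project": "ai_workbench_mcp",
                    "run_dir": "runs/sdk-smoke",
                    "task": "SDK connectivity smoke. Do not edit tracked files.",
                    "risk": "low",
                },
            )
            print(result)

asyncio.run(main())
```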
Approved Prompt Catalog
Approved prompts live in prompts/approved/. The public library contains 12 reusable Workbench prompts:
| Prompt | Use |
|---|---|
| bug_root_cause_investigation.md | Investigate a bug, identify likely root cause, and define the smallest safe fix. |
| code_review_patch_risk_audit.md | Review a patch or AI-generated change set for correctness, regression, contract, and validation risk. |
| data_acquisition_surface_audit.md | Audit data acquisition, ingestion, scraping, upload, webhook, and external data surfaces. |
| documentation_accuracy_audit.md | Check documentation against actual code, commands, behavior, and configuration. |
| implement_request_change_request.md | Implement a bounded PRD, feature request, bug-fix request, or change request. |
| navigation_page_title_ia_audit.md | Audit navigation, page titles, labels, routing, and information architecture. |
| performance_latency_hotspot_audit.md | Identify performance and latency hot spots with concrete validation steps. |
| prompt_failure_improvement_log.md | Analyze prompt failures and record improvements for future runs. |
| repository_context_index_audit.md | Build or audit a repository context map for agent orientation. |
| security_privacy_risk_review.md | Review security and privacy risk in code, data flows, APIs, logs, and AI features. |
| test_case_development_meaningful_coverage.md | Develop meaningful test coverage for features, bug fixes, APIs, and workflows. |
| ux_visual_accessibility_audit.md | Audit UX, visual clarity, accessibility, and task completion quality. |
Focused v0.2 recipes use the most specific prompt by default: docs-only uses documentation_accuracy_audit.md, test-fix uses bug_root_cause_investigation.md, and a later test-creation workflow should use test_case_development_meaningful_coverage.md.
Examples
- Tiny Python fix: a deliberately broken one-function project for recipe smoke tests.
- Goose tool smoke: two-tool live smoke for slow local models.
- Goose recipe smoke: exact command for a low-risk Goose acceptance run.
- Codex tool smoke: two-tool local/IDE MCP smoke using execution_host="codex".
- Codex acceptance smoke: full six-tool local/IDE lifecycle using response_source="codex".
- Focused v0.2 workflows: command examples for docs-only, package maintenance, test-fix, and low-risk coding workflows.
- How acceptance works: the MCP protocol, Workbench server, validation profile, and quality-gate distinction.
- Docs-only acceptance recipe: focused documentation-only workflow using the docs_only validation profile.
- Python package maintenance recipe: focused package workflow using the python_package_maintenance validation profile.
- Test-fix acceptance recipe: focused failing-test repair workflow using the test_fix validation profile.
- low_risk_coding validation profile: bounded implementation profile for the engineering acceptance recipe.
- Sample accepted run: sanitized committed evidence showing an accepted run folder.
- Sample Codex accepted run: sanitized Codex local/IDE evidence showing execution_host="codex" and response_source="codex".
- Sample docs-only accepted run: sanitized focused workflow evidence using documentation_accuracy_audit and docs_only.
- Sample needs-review run: sanitized synthetic evidence showing failed validation and a revision-required quality gate.
- Acceptance analytics guide: how to read run_metrics.json, run_summary.md, outcome buckets, routing feedback candidates, and optional cost fields.
- Evidence dashboard guide: how to read the static run_dashboard.html generated by run analytics.
- Event ledger guide: how local events.jsonl operation telemetry is written and why it stays out of committed runs by default.
- Golden-case harness guide: how to score sanitized accepted evidence baselines locally.
- Phase 5 dogfooding protocol: how to collect real Goose acceptance runs before changing routing policy.
- Model registry configuration: how to bring your own model tiers with a local ignored override.
- CI gate prototype: what the repo self-validation workflow proves and why semantic PR acceptance comes later.
- Launch issue seeds: public alpha issue backlog for dogfooding, routing feedback, cost evidence, policy packs, CI, and demo work.
- PyPI publishing prep: package build, twine, wheel smoke, and release boundary.
- Repository topics: recommended GitHub topics and setup commands.
- Launch issue drafts: ready-to-post public issue commands.
- Goose acceptance demo walkthrough: recording-ready 3-5 minute public demo runbook.
- Codex acceptance demo walkthrough: local/IDE proof path that avoids nested Codex or foreground stdio-server loops.
Development
Install development dependencies:
python -m pip install -e ".[dev]"
Run tests:
python -m pytest -q -p no:cacheprovider
Run scaffold validation:
python tools/validate_run.py --project ai_workbench_mcp --profile scaffold --out-dir runs/scaffold-smoke
Roadmap
- v0.1.0-alpha: first public Goose MCP acceptance workflow.
- v0.2.0-alpha: focused recipe library and validation policy profiles.
- v0.3: Codex local/IDE first-class proof and accepted-artifact routing feedback.
- v0.4: accepted-artifact analytics.
- v0.5: CI mode for PR acceptance.
- v1.0: stable MCP contracts and recipe API.
GitHub Topics
Suggested repository topics:
goose
mcp
model-context-protocol
ai-agents
agentic-ai
coding-agents
developer-tools
validation
evals
quality-gates
audit-trail
License
Apache-2.0. See LICENSE.