Goose-first acceptance, validation, routing, and audit layer for agentic work.

AI Workbench MCP

Acceptance gates for AI coding-agent runs.

AI agents can produce code. AI Workbench MCP helps decide whether that work is accepted.

It records the task, captures agent output, runs deterministic validation, applies a quality gate, and creates an auditable run trail.

Works with Goose today. Designed as a host-agnostic acceptance layer for MCP-compatible agent workflows. Codex local/IDE is the first additional host target, supported through explicit execution_host and response_source evidence metadata.

Current repo work is the v0.3 Semantic PR Acceptance Alpha release branch. Package metadata and workflow defaults are prepared for ai-workbench-mcp==0.3.0a0; TestPyPI, PyPI, and MCP Registry upload verification still require explicit release approval. The most recent verified package publication before this branch is ai-workbench-mcp==0.2.0a0.

Before

The agent says: "Done."

After

AI Workbench shows:

  • what task was requested
  • what agent/model/runtime was used
  • what output was produced
  • what validation ran
  • whether the quality gate accepted, rejected, or requested review
  • where the evidence lives

runs/example/
  task_metadata.json
  final_prompt.md
  model_selection.json
  model_output.md
  validation_report.json
  revision_decision.json
  run_log.jsonl
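
A quick way to confirm that a run folder holds the full trail is to check for each listed artifact. The sketch below is illustrative only: the helper name missing_artifacts is hypothetical and not part of the Workbench API, and it assumes the artifacts sit directly in the run folder as shown above.

```python
from pathlib import Path

# Artifact names copied from the run folder listing above.
EXPECTED_ARTIFACTS = (
    "task_metadata.json",
    "final_prompt.md",
    "model_selection.json",
    "model_output.md",
    "validation_report.json",
    "revision_decision.json",
    "run_log.jsonl",
)


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    root = Path(run_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list means every expected file is present; it says nothing about whether the run was accepted.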

Problem

AI coding agents can produce useful work, but "done" is not the same as accepted. A useful acceptance workflow needs reproducible evidence.

AI Workbench MCP provides that acceptance and audit layer, turning agent output into evidence-backed accepted runs.

Why Goose + Acceptance Gates

Goose already owns the agent execution surface: CLI, desktop, providers, recipes, MCP hosting, and the agent loop.

AI Workbench MCP stays complementary:

  • opens run evidence folders
  • recommends model/runtime tiers
  • records model or Goose output
  • runs deterministic validation
  • makes quality-gate decisions
  • summarizes accepted-run analytics

It does not provide a chat UI, editor fork, provider marketplace, or general agent runner.

What MCP Does And Does Not Do

MCP is the connection protocol.

AI Workbench MCP is the tool server. MCP lets Goose, Codex local/IDE, or another compatible host call Workbench tools, but the protocol itself does not verify correctness, inspect code quality, or decide whether a run is accepted.

Workbench applies the acceptance policy by recording evidence, running deterministic validation, and applying the quality gate. See how acceptance works for the full distinction.

Prompt DoD vs Acceptance Gate

A prompt definition-of-done tells the agent what to attempt and what evidence to report. Prompt instructions are not enforcement.

An acceptance gate checks the resulting evidence after the agent acts. It uses explicit validation profiles, command-backed checks, required artifacts, changed-file policies, and quality-gate rules. The same agent saying "done" is never enough for acceptance.

What Decides Acceptance

Acceptance is decided by the selected validation profile and quality gate.

The validation profile runs deterministic checks such as tests, build or lint commands, artifact existence checks, and changed-file policy. The quality gate then accepts the run, requests review, requests revision, or leaves the run failed based on that evidence and the configured risk policy.
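
To make "deterministic checks" concrete, here is a minimal sketch of command-backed validation: each command either exits with 0 or it does not, and the overall status follows from that alone. The function run_checks and the shape of its return value are assumptions made for illustration; only the overall_status field name comes from this document, and the real profile logic lives in Workbench.

```python
import subprocess


def run_checks(commands):
    """Run (name, argv) pairs and derive pass/fail from exit codes only."""
    results = []
    for name, argv in commands:
        completed = subprocess.run(argv, capture_output=True, text=True)
        results.append(
            {
                "name": name,
                "passed": completed.returncode == 0,
                "exit_code": completed.returncode,
            }
        )
    # An empty check list is treated as failed: no evidence, no pass.
    all_passed = bool(results) and all(r["passed"] for r in results)
    return {
        "overall_status": "passed" if all_passed else "failed",
        "checks": results,
    }
```

The agent's own report never enters this function, which is the point: the outcome depends only on commands that anyone can rerun.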

The agent performs. Workbench accepts. MCP connects them.

GitHub PR Acceptance Alpha

The v0.3 alpha makes the PR-facing surface consume real Workbench evidence instead of treating green CI as acceptance.

For a PR gate to report accept, the referenced run must include deterministic validation and quality-gate artifacts, especially validation_report.json and revision_decision.json. The PR gate reports exactly one of:

  • accept: validation passed, sign-off is ready, and the quality gate accepted the run.
  • needs_review: validation or the quality gate requires review and no blocker-severity reason is present.
  • block: required evidence is missing or unreadable, validation failed, revision is required, blocker-severity evidence is present, or only scaffold fallback evidence exists.
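
The three outcomes can be read as a small decision function in which block conditions are checked first. This is a sketch under stated assumptions: pr_gate_decision and every key of the evidence dictionary are hypothetical names chosen for illustration, not the gate's actual input format.

```python
def pr_gate_decision(evidence):
    """Map a simplified evidence summary to accept, needs_review, or block.

    `evidence` is a hypothetical dict of booleans; the real gate reads
    validation_report.json and revision_decision.json.
    """
    block_conditions = (
        not evidence.get("evidence_readable", False),
        evidence.get("scaffold_only", False),
        evidence.get("validation_failed", False),
        evidence.get("revision_required", False),
        evidence.get("blocker_present", False),
    )
    if any(block_conditions):
        return "block"

    accepted = (
        evidence.get("validation_passed", False)
        and evidence.get("sign_off_ready", False)
        and evidence.get("gate_accepted", False)
    )
    # Anything readable that is neither blocked nor fully accepted goes to review.
    return "accept" if accepted else "needs_review"
```

Checking block conditions before acceptance means a run with any blocker can never report accept, even if other fields look healthy.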

Scaffold-only evidence is visibility evidence, not semantic acceptance evidence, and blocks with pr_gate.acceptance_evidence_missing. The copy-paste GitHub Actions template in docs/github/pr-gate-workflow-template.md renders PR comments and JSON decisions from Workbench evidence; it does not run Goose, replace the evidence artifacts, or turn CI status into acceptance. For the short external-repository path, see the guide "Use AI Workbench PR Gate in your repo in 10 minutes". For a true separate-repo proof target, use the external sample repository proof plan.

5-Minute Quickstart

Install the prepared v0.3 package after publication:

python -m pip install ai-workbench-mcp==0.3.0a0

If the 0.3.0a0 upload has not been completed yet, install from the checked-out repository:

python -m pip install -e .

The prepared 0.3.0a0 source build includes bootstrappable configs, prompts, and recipes through ai-workbench-bootstrap-assets. Historical 0.2.0a0 wheels are code/server only. See the PyPI publishing prep guide for the package boundary and release checklist.

Bootstrap assets do not include private runs/ evidence, committed proof fixtures, provider setup, examples, or evals.

For a first local run, use this order:

  1. Install from the checked-out repository with python -m pip install -e . (the trailing dot is part of the command).
  2. Register ai-workbench-mcp in Goose.
  3. Run the two-tool smoke to prove Goose can reach the MCP server.
  4. Run one full acceptance recipe with a focused validation profile.
  5. Inspect validation_report.json and revision_decision.json before calling the run accepted.

If Goose or a provider is not configured yet, use the committed sample evidence under examples/sample-runs/ and the Goose acceptance demo walkthrough first. That path shows the acceptance artifacts without creating private local evidence.

Register the MCP server in Goose:

goose configure

Choose:

  • Add Extension
  • Command-line Extension
  • Name: AI Workbench MCP
  • Command: ai-workbench-mcp
  • Timeout: 300

On slower local models, start with the two-tool smoke to verify Goose can reach the MCP server:

goose run --no-session --max-turns 4 --recipe ./recipes/workbench-mcp-tool-smoke.yaml \
  --params project=ai_workbench_mcp \
  --params run_dir=runs/goose-tool-smoke \
  --params task="Local Goose MCP tool smoke. Do not edit tracked files." \
  --params risk=low \
  --params complexity_score=4

Then run the full sample recipe smoke after Goose has a provider configured:

goose run --recipe ./recipes/workbench-engineering-acceptance.yaml \
  --params project=ai_workbench_mcp \
  --params run_dir=runs/goose-tiny-python-fix \
  --params task="Fix examples/tiny-python-fix/calculator.py so python -m unittest discover -s examples/tiny-python-fix -p test_*.py passes. Keep the change minimal and report the validation result." \
  --params task_type=implement \
  --params risk=low \
  --params validation_profile=tiny_python_fix \
  --params complexity_score=4

Choose the validation profile from the task shape:

| Task shape | Recipe | Validation profile |
| --- | --- | --- |
| Documentation-only Markdown or example docs | recipes/workbench-docs-only-acceptance.yaml | docs_only |
| Low-risk bug fix with a focused regression command | recipes/workbench-test-fix-acceptance.yaml | low_risk_bug_fix |
| Bounded package, config, tool, recipe, or test maintenance | recipes/workbench-python-package-maintenance.yaml | python_package_maintenance |
| Repo-target failing test repair with a focused test command | recipes/workbench-test-fix-acceptance.yaml | test_fix |
| API or MCP contract change | recipes/workbench-engineering-acceptance.yaml | api_contract_change |
| Security or privacy-sensitive change | recipes/workbench-engineering-acceptance.yaml | security_privacy_sensitive |
| Intentionally broken demo fixture proof | recipes/workbench-test-fix-acceptance.yaml | fixture_repair_proof |
| General low-risk implementation with deterministic tests | recipes/workbench-engineering-acceptance.yaml | low_risk_coding |
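
For scripted use, the mapping above can be kept as a lookup from validation profile to recipe and turned into a goose run argument list. This is a sketch, not a shipped helper: RECIPE_BY_PROFILE and goose_command are hypothetical names, and the only flags used are the --recipe and --params flags that appear in this README's commands. Which parameters each recipe accepts is defined by the recipe itself, so the caller supplies them.

```python
# Profile-to-recipe pairs copied from the table above.
RECIPE_BY_PROFILE = {
    "docs_only": "recipes/workbench-docs-only-acceptance.yaml",
    "low_risk_bug_fix": "recipes/workbench-test-fix-acceptance.yaml",
    "python_package_maintenance": "recipes/workbench-python-package-maintenance.yaml",
    "test_fix": "recipes/workbench-test-fix-acceptance.yaml",
    "api_contract_change": "recipes/workbench-engineering-acceptance.yaml",
    "security_privacy_sensitive": "recipes/workbench-engineering-acceptance.yaml",
    "fixture_repair_proof": "recipes/workbench-test-fix-acceptance.yaml",
    "low_risk_coding": "recipes/workbench-engineering-acceptance.yaml",
}


def goose_command(profile, params):
    """Build a `goose run` argument list for the recipe tied to a profile."""
    if profile not in RECIPE_BY_PROFILE:
        raise ValueError(f"unknown validation profile: {profile}")
    argv = ["goose", "run", "--recipe", "./" + RECIPE_BY_PROFILE[profile]]
    for key, value in params.items():
        # Each parameter becomes its own --params key=value pair.
        argv += ["--params", f"{key}={value}"]
    return argv
```

Returning a list rather than a shell string avoids quoting problems when a task description contains spaces or quotes.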

See focused v0.2 workflows for copy-ready commands.

The five first-class v0.3 policy packs are docs_only, low_risk_bug_fix, test_fix, api_contract_change, and security_privacy_sensitive. Their catalog metadata lives in configs/policy_packs.yaml and is loaded into validation profiles so PR gate comments can explain accepted, review-required, and blocked outcomes without parsing prose.

For bounded documentation-only changes, use the focused v0.2 recipe:

goose run --recipe ./recipes/workbench-docs-only-acceptance.yaml \
  --params project=ai_workbench_mcp \
  --params run_dir=runs/goose-docs-only \
  --params task="Update the public docs for the requested documentation-only change." \
  --params risk=low

For bounded Python package maintenance, use:

goose run --recipe ./recipes/workbench-python-package-maintenance.yaml \
  --params project=ai_workbench_mcp \
  --params run_dir=runs/goose-package-maintenance \
  --params task="Make the requested bounded Python package maintenance change and keep the full test suite passing." \
  --params task_type=implement \
  --params risk=medium

For bounded test-fix work, use:

goose run --recipe ./recipes/workbench-test-fix-acceptance.yaml \
  --params project=ai_workbench_mcp \
  --params run_dir=runs/goose-test-fix \
  --params task="Fix the requested failing test signal with the smallest justified change, keep the repo test suite passing, and report the exact validation command." \
  --params task_test_command="python -m pytest tests/test_target.py -q" \
  --params risk=medium

The default test_fix profile is for repo-target repairs and requires the broader project suite. For intentionally broken demo fixtures, use the focused fixture proof profile instead:

goose run --recipe ./recipes/workbench-test-fix-acceptance.yaml \
  --params project=ai_workbench_mcp \
  --params run_dir=runs/goose-fixture-repair-proof \
  --params task="Fix examples/tiny-python-fix/calculator.py so python -m unittest discover -s examples/tiny-python-fix -p test_*.py passes. Keep the change minimal and do not edit unrelated files." \
  --params validation_profile=fixture_repair_proof \
  --params task_test_command="python -m unittest discover -s examples/tiny-python-fix -p test_*.py" \
  --params analytics_runs_dir=runs/goose-fixture-repair-proof \
  --params analytics_out_dir=runs/goose-fixture-repair-proof/_reports \
  --params risk=low

For a general low-risk implementation task with deterministic test coverage, use the engineering recipe with the low-risk coding profile:

goose run --recipe ./recipes/workbench-engineering-acceptance.yaml \
  --params project=ai_workbench_mcp \
  --params run_dir=runs/goose-low-risk-coding \
  --params task="Make the requested bounded low-risk code change and keep deterministic tests passing." \
  --params task_type=implement \
  --params risk=low \
  --params validation_profile=low_risk_coding \
  --params complexity_score=8

Inspect the evidence folder:

runs/goose-tiny-python-fix/
  task_metadata.json
  final_prompt.md
  model_selection.json
  model_output.md
  validation_report.json
  revision_decision.json
  run_log.jsonl

Do not commit runs/. It is the local evidence ledger.

Read outcomes from the evidence, not from the agent's final prose:

  • Accepted: validation_report.json has overall_status="passed" and sign_off_ready=true, and revision_decision.json has final_status="accepted".
  • Needs-review or revision: validation or quality-gate evidence is incomplete, risky, or failed, and the quality gate records a review or revision status such as revision_required.
  • Blocked or failed: deterministic validation fails in a way that is not sign-off ready. Keep the evidence local, inspect the failing check, and do not call the run accepted.
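
The accepted condition above can be checked directly from the two artifacts. In this sketch the field names and values come from the list above; the helper name is_accepted is hypothetical, and it assumes the fields are top-level keys in each JSON file. Missing or unreadable evidence is treated as not accepted.

```python
import json
from pathlib import Path


def _load_object(path):
    """Return the JSON object at path, or None if missing or malformed."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def is_accepted(run_dir):
    """True only when both artifacts exist and record an accepted run."""
    root = Path(run_dir)
    report = _load_object(root / "validation_report.json")
    decision = _load_object(root / "revision_decision.json")
    if report is None or decision is None:
        return False
    return (
        report.get("overall_status") == "passed"
        and report.get("sign_off_ready") is True
        and decision.get("final_status") == "accepted"
    )
```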

Codex Local/IDE

Codex uses the same ai-workbench-mcp server. The first Codex slice is local/IDE MCP support, not Codex cloud.

Proof Pack

The v0.2 public proof pack is in docs/proof/proof-pack-v0.2.md.

It shows:

  • accepted Goose evidence
  • accepted Codex local/IDE evidence
  • a fresh accepted Gemini Goose fixture proof summary
  • a fresh accepted Codex local/IDE fixture proof summary
  • review-required evidence
  • analytics by execution host and response source
  • a 3-5 minute demo script

The proof pack uses committed sanitized sample evidence under examples/sample-runs/. Raw local runs/ evidence stays ignored.

The v0.3 PR gate outcome demos are in docs/proof/pr-gate-outcome-demos.md with sanitized fixtures under examples/pr-gate-outcomes/. They show accept, needs_review, and block decisions generated from Workbench evidence, not from private local run history.

Sample Analytics Demo

To inspect the trust loop without provider setup, run analytics over the committed synthetic sample runs:

python tools/run_analyze.py --runs-dir examples/sample-runs --out-dir runs/sample-run-analytics

The sample set includes accepted, docs-only accepted, and revision-required test-fix evidence. Read the analytics guide to interpret run_metrics.json, run_summary.md, outcome buckets, failure reasons, routing feedback candidates, and optional cost fields. Read the evidence dashboard guide to use the generated run_dashboard.html for local scanning and demos.

Core MCP operations also write best-effort local events.jsonl ledgers beside evidence artifacts. Read the event ledger guide before using operation events in analytics or CI prototypes.

Run the committed golden-case eval smoke to score accepted sample evidence:

python tools/golden_eval.py --cases-dir evals/golden_cases --source-runs-dir examples/sample-runs --out-dir runs/golden_eval_smoke

The harness writes model_eval_metadata.json and score_report.json under one child folder per case. Read the golden-case harness guide before treating eval results as anything beyond local evidence-contract regression checks.

Advisory Routing Feedback

workbench_select_model can optionally read routing_feedback_candidates from a previous analytics report. The feedback is advisory only: it records whether historical evidence supports the current tier, suggests escalation, or asks for more evidence, but it never changes selected_tier.

The implemented prefer_current_tier path is intentionally narrow: docs_only_current_tier_when_accepted applies only to low-risk, easy docs-only work that already selects local_coding and has enough accepted Workbench evidence. Other high-acceptance buckets stay advisory as no_change until a bounded policy is implemented for them.

Focused recipes pass runs/_reports/run_metrics.json as the default feedback source. Missing, invalid, or low-volume feedback is non-fatal and is recorded in model_selection.json under routing_feedback.
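
The non-fatal, advisory-only behavior can be pictured with a small loader. This is a sketch under assumptions: load_routing_feedback, annotate_selection, and the status strings are hypothetical, and it assumes routing_feedback_candidates is a top-level list in run_metrics.json. Low-volume handling is omitted because the threshold is not stated here.

```python
import json
from pathlib import Path


def load_routing_feedback(metrics_path):
    """Load advisory feedback; bad input yields a status, never an exception."""
    path = Path(metrics_path)
    if not path.is_file():
        return {"status": "missing", "candidates": []}
    try:
        metrics = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {"status": "invalid", "candidates": []}
    candidates = None
    if isinstance(metrics, dict):
        candidates = metrics.get("routing_feedback_candidates")
    if not isinstance(candidates, list):
        return {"status": "invalid", "candidates": []}
    return {"status": "loaded", "candidates": candidates}


def annotate_selection(selection, feedback):
    """Attach feedback under routing_feedback; selected_tier is not modified."""
    annotated = dict(selection)
    annotated["routing_feedback"] = feedback
    return annotated
```

Because annotate_selection only adds a key, the selected tier in the result is always the one the selector chose.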

Bring Your Own Models

The committed model registry lives at configs/model_registry.yaml. To customize model IDs or providers locally, copy configs/model_registry.example.yaml to configs/model_registry.local.yaml and edit the local file. The local override is ignored by git, recursively merges into the base registry, and is recorded in model_selection.json with repo-relative source metadata.
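
A recursive merge of this kind can be sketched as follows. This illustrates the general technique rather than the project's implementation: merge_registry is a hypothetical name, the example keys in the usage note are invented, and replacing lists and scalars wholesale is an assumption. The model registry guide defines the actual merge rules.

```python
import copy


def merge_registry(base, override):
    """Recursively merge override into a copy of base.

    Nested dicts merge key by key; any other value in override
    (scalar or list) replaces the base value outright.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_registry(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

For example, merging {"tiers": {"local_coding": {"model": "b"}}} into a base whose local_coding entry has both provider and model keys changes only model and leaves provider intact.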

See the model registry guide for merge rules, required tier fields, selector-reference validation, and the advisory-only scope.

Six MCP Tools

workbench_open_run
  -> creates the run folder, task metadata, final prompt, context packet, and initial run log
     records execution_host, defaulting to goose

workbench_select_model
  -> recommends a model/runtime tier and writes model_selection.json

workbench_record_execution
  -> captures raw Goose/Codex/model response text into model_output.md and appends run_log.jsonl
     records response_source, defaulting to goose

workbench_validate_run
  -> runs deterministic validation and writes validation_report.json

workbench_quality_gate
  -> accepts, rejects, or requests review and writes revision_decision.json

workbench_analyze_runs
  -> summarizes accepted-run metrics by execution host, response source, recipe, validation profile, model tier, failure reason, and quality-gate outcome under runs/_reports
     and writes run_dashboard.html for local evidence scanning

Workflow

Goose recipe
  -> workbench_open_run
  -> workbench_select_model
  -> Goose performs the task
  -> workbench_record_execution
  -> workbench_validate_run
  -> workbench_quality_gate
  -> workbench_analyze_runs

A run is accepted only when deterministic validation and the quality gate support acceptance.
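
The tool sequence above can be expressed as an ordering check over a list of tool names. This sketch assumes a single pass through the workflow, so it does not model revision loops that call a tool again after a later one. The helper first_out_of_order is hypothetical, and how tool calls would be extracted from run_log.jsonl is not specified here.

```python
# Tool order copied from the workflow above.
WORKFLOW_ORDER = (
    "workbench_open_run",
    "workbench_select_model",
    "workbench_record_execution",
    "workbench_validate_run",
    "workbench_quality_gate",
    "workbench_analyze_runs",
)


def first_out_of_order(calls):
    """Return the first tool name that breaks the workflow order, else None.

    Names that are not Workbench tools are ignored.
    """
    last_index = -1
    for name in calls:
        if name not in WORKFLOW_ORDER:
            continue
        index = WORKFLOW_ORDER.index(name)
        if index < last_index:
            return name
        last_index = index
    return None
```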

For the detailed acceptance model, read how acceptance works.

Approved Prompt Catalog

Approved prompts live in prompts/approved/. The public library contains 12 reusable Workbench prompts:

| Prompt | Use |
| --- | --- |
| bug_root_cause_investigation.md | Investigate a bug, identify likely root cause, and define the smallest safe fix. |
| code_review_patch_risk_audit.md | Review a patch or AI-generated change set for correctness, regression, contract, and validation risk. |
| data_acquisition_surface_audit.md | Audit data acquisition, ingestion, scraping, upload, webhook, and external data surfaces. |
| documentation_accuracy_audit.md | Check documentation against actual code, commands, behavior, and configuration. |
| implement_request_change_request.md | Implement a bounded PRD, feature request, bug-fix request, or change request. |
| navigation_page_title_ia_audit.md | Audit navigation, page titles, labels, routing, and information architecture. |
| performance_latency_hotspot_audit.md | Identify performance and latency hot spots with concrete validation steps. |
| prompt_failure_improvement_log.md | Analyze prompt failures and record improvements for future runs. |
| repository_context_index_audit.md | Build or audit a repository context map for agent orientation. |
| security_privacy_risk_review.md | Review security and privacy risk in code, data flows, APIs, logs, and AI features. |
| test_case_development_meaningful_coverage.md | Develop meaningful test coverage for features, bug fixes, APIs, and workflows. |
| ux_visual_accessibility_audit.md | Audit UX, visual clarity, accessibility, and task completion quality. |

Focused v0.2 recipes use the most specific prompt by default: docs-only uses documentation_accuracy_audit.md, test-fix uses bug_root_cause_investigation.md, and a later test-creation workflow should use test_case_development_meaningful_coverage.md.

Development

Install development dependencies:

python -m pip install -e ".[dev]"

Run tests:

python -m pytest -q -p no:cacheprovider

Run scaffold validation:

python tools/validate_run.py --project ai_workbench_mcp --profile scaffold --out-dir runs/scaffold-smoke

Roadmap

  • v0.1.0-alpha: first public Goose MCP acceptance workflow.
  • v0.2.0-alpha: focused recipe library and validation policy profiles.
  • Phase 5 complete: accepted-artifact analytics, Codex local/IDE proof, PyPI/MCP Registry publication, and 31 complete dogfood evidence runs.
  • Current: v0.3 Semantic PR Acceptance Alpha with real Workbench evidence PR decisions, scaffold-only blocking, five first-class policy packs, copy-paste GitHub workflow template, bootstrap assets, external-repo setup docs, an external sample repo proof plan, and sanitized PR gate outcome demos.
  • Next: complete 0.3.0a0 TestPyPI/PyPI/MCP Registry verification, then continue Checks API integration, fork-comment strategy, cost/time evidence, and stable v1 contract packaging.
  • v1.0: stable MCP contracts and recipe API.

GitHub Topics

Suggested repository topics:

goose
mcp
model-context-protocol
ai-agents
agentic-ai
coding-agents
developer-tools
validation
evals
quality-gates
audit-trail

License

Apache-2.0. See LICENSE.
