Controlled Execution System (CES)

Deterministic governance for AI agent-driven software delivery

Python 3.12+ | Coverage 88%+ | 3000+ Tests | License MIT


What is CES?

AI agents can write code. But should you let them ship it without guardrails?

CES is a CLI tool that gives engineering teams (2-50 people) structured oversight of AI agents building their software. Instead of hoping agents produce correct code, CES provides deterministic controls that verify it — with trust that scales based on measured evidence, not faith.

Think of it as a governance layer between "an AI wrote this code" and "this code is in production." Every change is classified by risk, reviewed by independent agents, and tracked in a tamper-evident audit ledger. Low-risk changes flow through automatically. High-risk changes get human review. Trust expands as agents prove themselves and contracts when they don't.

The core principle: No autonomy expansion may rely solely on advisory controls. Every escalation in agent freedom must be backed by hard-enforced verification.

Default posture: CES is builder-first. Start with ces build, let CES set up local project state if needed, and only drop into expert commands when you need direct governance control.


How It Works

CES operates across three planes, each with a distinct role:

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   CONTROL PLANE  (deterministic — no LLM calls)                     │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│   │ Manifest │  │  Policy  │  │ Workflow │  │   Audit Ledger   │   │
│   │ Manager  │  │  Engine  │  │  State   │  │ (append-only,    │   │
│   │          │  │          │  │ Machine  │  │  hash-chained)   │   │
│   └────┬─────┘  └────┬─────┘  └────┬─────┘  └──────────────────┘   │
│        │             │             │                                 │
│────────┼─────────────┼─────────────┼────────────────────────────────│
│        │             │             │                                 │
│   HARNESS PLANE  (orchestration + quality assurance)                │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│   │  Trust   │  │ Evidence │  │  Review  │  │   Classification │   │
│   │ Manager  │  │Synthesiz.│  │  Router  │  │   Engine+Oracle  │   │
│   └────┬─────┘  └────┬─────┘  └────┬─────┘  └──────────────────┘   │
│        │             │             │                                 │
│────────┼─────────────┼─────────────┼────────────────────────────────│
│        │             │             │                                 │
│   EXECUTION PLANE  (bounded agent work)                             │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│   │  Agent   │  │  Guide   │  │  Self-   │  │    Sensor        │   │
│   │  Runner  │  │Pack Build│  │Correction│  │  Orchestrator    │   │
│   └──────────┘  └──────────┘  └──────────┘  └──────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Task Lifecycle

Every change follows this path:

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  Create  │────▶│ Classify │────▶│ Execute  │────▶│  Review  │
  │ Manifest │     │ (risk +  │     │ (bounded │     │(independ.│
  │          │     │  class)  │     │  agent)  │     │  agents) │
  └──────────┘     └──────────┘     └──────────┘     └────┬─────┘
                                                          │
  ┌──────────┐     ┌──────────┐     ┌──────────┐         │
  │ Deployed │◀────│  Merged  │◀────│ Approved │◀────────┘
  │          │     │          │     │(human or │
  │          │     │          │     │ auto)    │
  └──────────┘     └──────────┘     └──────────┘
  • Tier C tasks (low risk): Can flow through fully autonomously
  • Tier B tasks (medium risk): Require hybrid human+agent review
  • Tier A tasks (high risk): Require full human approval, plus review by 3 different LLMs for review diversity
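
The lifecycle above can be read as a small linear state machine. The sketch below is illustrative only: the stage names mirror the diagram, but `Stage`, `NEXT_STAGE`, and `advance` are hypothetical names chosen for this example, not the CES API.

```python
from enum import Enum


class Stage(Enum):
    # Stage names follow the lifecycle diagram; the identifiers are assumptions.
    MANIFEST_CREATED = "create_manifest"
    CLASSIFIED = "classify"
    EXECUTED = "execute"
    REVIEWED = "review"
    APPROVED = "approved"
    MERGED = "merged"
    DEPLOYED = "deployed"


# Each stage has exactly one legal successor; DEPLOYED is terminal.
_ORDER = list(Stage)
NEXT_STAGE = dict(zip(_ORDER, _ORDER[1:]))


def advance(current: Stage, target: Stage) -> Stage:
    """Move to `target` only if it is the direct successor of `current`."""
    if NEXT_STAGE.get(current) is not target:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Keeping the transition table as data means a skipped step (for example, merging work that was never reviewed) fails loudly instead of silently succeeding.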

Key Concepts

Manifests

A manifest is the governance contract for a task. It defines what an agent is allowed to do: which files it can touch, which tools it can use, how many tokens it can spend, and when it expires. No manifest = no work.
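
As a rough mental model, a manifest can be pictured as an immutable record with a single boundary check. This is a minimal sketch under stated assumptions: the field names, the glob-based path matching, and the `permits` method are hypothetical and do not reflect the actual CES manifest schema.

```python
from dataclasses import dataclass
from datetime import datetime
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Manifest:
    # All field names are illustrative assumptions, not the CES schema.
    allowed_paths: tuple[str, ...]  # glob patterns the agent may touch
    allowed_tools: frozenset[str]   # tool names the agent may invoke
    token_budget: int               # maximum tokens the agent may spend
    expires_at: datetime            # the manifest is void from this instant

    def permits(self, path: str, tool: str, tokens_used: int, now: datetime) -> bool:
        """Return True only if every boundary in the manifest is respected."""
        return (
            now < self.expires_at
            and tokens_used <= self.token_budget
            and tool in self.allowed_tools
            and any(fnmatchcase(path, pattern) for pattern in self.allowed_paths)
        )
```

For example, a manifest built with `allowed_paths=("src/api/*",)` would permit an edit to `src/api/users.py` but reject one to `src/db/schema.py`, and would reject everything once `expires_at` has passed.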

Classification

Every manifest gets classified along three dimensions:

| Dimension | What It Measures | Values |
| --- | --- | --- |
| Risk Tier | Blast radius if something goes wrong | A (highest) → B → C (lowest) |
| Behavior Confidence | How predictable the output is | BC1 (deterministic) → BC2 → BC3 (subjective) |
| Change Class | Type of modification | Class 1 (new) → Class 5 (removal) |

The aggregate classification determines the review workflow. The classifier must always be a different agent than the implementer.
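
To make the routing concrete, here is a simplified sketch that maps only the risk tier to a review plan and enforces the classifier/implementer separation. It deliberately ignores behavior confidence and change class. `ReviewPlan`, `plan_review`, and `check_independence` are hypothetical names, and the single-model count for Tiers B and C is an assumption (the source only specifies three models for Tier A).

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    A = "A"  # highest risk
    B = "B"
    C = "C"  # lowest risk


@dataclass(frozen=True)
class ReviewPlan:
    human_approval: bool
    agent_review: bool
    distinct_models: int  # number of different LLMs used for review


def plan_review(tier: RiskTier) -> ReviewPlan:
    """Map a risk tier to the review requirements described for each tier."""
    if tier is RiskTier.A:
        return ReviewPlan(human_approval=True, agent_review=True, distinct_models=3)
    if tier is RiskTier.B:
        return ReviewPlan(human_approval=True, agent_review=True, distinct_models=1)
    return ReviewPlan(human_approval=False, agent_review=True, distinct_models=1)


def check_independence(classifier_id: str, implementer_id: str) -> None:
    """Reject any setup where an agent would classify its own work."""
    if classifier_id == implementer_id:
        raise ValueError("classifier must be a different agent than the implementer")
```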

Trust Tiers

Agent profiles earn trust over time:

  candidate ──────▶ trusted ◀──────▶ watch ◀──────▶ constrained
  (new agent)      (proven)        (slipping)       (restricted)

Trust moves based on measured performance: defect rates, escape rates, calibration probe results. Hidden checks (agents don't know they're being tested) prevent gaming.
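
The tier diagram can be sketched as two lookup tables plus a threshold rule. The edges come from the diagram above; everything else is an assumption for illustration, including the use of a single defect rate as the only signal and the threshold values `promote_below` and `demote_above`.

```python
from enum import Enum


class TrustTier(Enum):
    CANDIDATE = "candidate"
    TRUSTED = "trusted"
    WATCH = "watch"
    CONSTRAINED = "constrained"


# Edges taken from the diagram: candidate -> trusted is one-way,
# while trusted <-> watch and watch <-> constrained go both ways.
PROMOTE = {
    TrustTier.CANDIDATE: TrustTier.TRUSTED,
    TrustTier.WATCH: TrustTier.TRUSTED,
    TrustTier.CONSTRAINED: TrustTier.WATCH,
}
DEMOTE = {
    TrustTier.TRUSTED: TrustTier.WATCH,
    TrustTier.WATCH: TrustTier.CONSTRAINED,
}


def next_tier(
    current: TrustTier,
    defect_rate: float,
    promote_below: float = 0.02,  # hypothetical threshold
    demote_above: float = 0.10,   # hypothetical threshold
) -> TrustTier:
    """Move one step along the diagram based on a measured defect rate."""
    if defect_rate < promote_below:
        return PROMOTE.get(current, current)
    if defect_rate > demote_above:
        return DEMOTE.get(current, current)
    return current
```

Because each call moves at most one step, an agent cannot jump from constrained straight to trusted on a single good measurement.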

Evidence Packets

Every task produces an evidence packet — a structured bundle of test results, sensor outputs, review findings, and the decision view. This is the proof trail. Evidence packets are immutable once assembled.
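
One way to picture "immutable once assembled" is a frozen record with a content digest. This is a sketch, not the CES data model: the field names are loosely based on the bundle contents listed above, and `digest` is a hypothetical helper.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class EvidencePacket:
    # Field names are illustrative assumptions, not the CES schema.
    manifest_id: str
    test_results: tuple[str, ...]
    sensor_outputs: tuple[str, ...]
    review_findings: tuple[str, ...]
    decision_view: str

    def digest(self) -> str:
        """SHA-256 over a canonical JSON form, usable as a content fingerprint."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Assigning to any field after construction raises `FrozenInstanceError`, and two packets with identical contents produce the same digest, so a stored digest can later confirm that a packet was not altered.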

Audit Ledger

Every significant action is recorded in an append-only, hash-chained ledger. Updates and deletes are not allowed (enforced by database triggers). The ledger records 14 event types, from approvals to kill-switch activations, and detects tampering via an HMAC chain.
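
The general idea of an HMAC chain can be shown with the standard library alone. The following is an in-memory sketch of the technique, not the CES implementation: the entry layout, the `GENESIS` value, and the `append`/`verify` helpers are assumptions, and the real ledger enforces append-only behavior in the database rather than in Python.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry (assumption)


def _mac(secret: bytes, prev_mac: str, event: dict) -> str:
    """HMAC-SHA256 over the previous MAC plus a canonical JSON form of the event."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, prev_mac.encode("ascii") + body, hashlib.sha256).hexdigest()


def append(ledger: list[dict], secret: bytes, event: dict) -> None:
    """Add an entry whose MAC binds it to everything recorded before it."""
    prev = ledger[-1]["mac"] if ledger else GENESIS
    ledger.append({"event": event, "prev": prev, "mac": _mac(secret, prev, event)})


def verify(ledger: list[dict], secret: bytes) -> bool:
    """Recompute the chain from the start; any edit, removal, or reorder breaks it."""
    prev = GENESIS
    for entry in ledger:
        expected = _mac(secret, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

Because each MAC covers the previous one, changing an early entry invalidates every later entry, and an attacker without the secret cannot recompute the chain.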


Quick Start

Prerequisites

  • Python 3.12+
  • uv (package manager)
  • A supported local runtime: codex or claude

Just want to try it? See the 5-Minute Quickstart — no Docker, no Postgres, no API keys.

1. Clone and install

git clone https://github.com/chrisduvillard/controlled-execution-system.git
cd controlled-execution-system
uv sync

2. Verify a local runtime

ces doctor

3. Configure environment

cp .env.example .env
# Optional: set CES_AUDIT_HMAC_SECRET and CES_DEMO_MODE=1.
# Real local execution uses the installed `codex` or `claude` CLI.

4. Start with ces build

# Fresh repo or existing repo: `ces build` auto-creates `.ces/` on first run
# Builder-first path: describe what you want and let CES draft the contract
ces build "Add input validation to the user registration endpoint" --yes

# Resume the latest builder session without re-entering context
ces continue --yes

# Explain the latest builder state in plain language
ces explain
ces explain --view decisioning
ces explain --view brownfield

# Check the latest request, activity, and next step
ces status

# Export a concise builder run report for audit or handoff
ces report builder

If you prefer manual setup before the first build, CES still supports:

ces init my-project

For existing repos, CES auto-detects brownfield mode and asks what must be preserved before it runs the change. ces continue resumes the saved session stage instead of replaying the whole flow, and once a run is finished it points you back to ces build for the next request. You can force either path with --greenfield or --brownfield.

For day-to-day brownfield delivery, stay builder-first with ces build, ces continue, and ces explain --view brownfield. Use the expert brownfield commands only when you need an explicit legacy-behavior decision such as ces brownfield review OLB-<entry-id> --disposition preserve. The Brownfield Guide covers that handoff in more detail.

Use the builder-first flow when you want CES to carry the current request context for you. Switch to the expert workflow when you need explicit review, triage, approval, or audit/handoff artifacts. The Operator Playbook shows the boundary and recommended command sequences.

When you leave the single-request builder loop and need system-wide visibility or incident response, use the expert operations surfaces instead of relying on the builder-first ces status view:

ces status --expert
ces status --expert --watch
ces audit --limit 20
ces emergency declare "Security incident detected"

The Operations Runbook covers incident drills and recovery expectations for those commands.

If you want the explicit expert workflow instead, CES still supports:

ces manifest "Add input validation to the user registration endpoint" --yes
ces classify M-<manifest-id>

5. Use local expert commands directly

ces execute M-<manifest-id> --runtime auto
ces review M-<manifest-id>
ces triage M-<manifest-id>
ces approve M-<manifest-id> --yes

Try the FreshCart example

These demo commands are intended for a source checkout of CES after git clone and uv sync.

# Seed sample data
uv run python -m examples.freshcart.seed_data

# Run the end-to-end workflow
uv run python -m examples.freshcart.run_e2e

Use CES on CES

If you want CES to review changes to this repository itself, initialize repo-local state first:

ces init controlled-execution-system
ces dogfood --base origin/master

This creates a local .ces/ directory for repo-specific state. Keep that directory untracked; it is operational state, not project source.


CLI Commands

ces <command> [options]

Start Here

| Command | Description |
| --- | --- |
| build | Builder-first local workflow: describe the change, gather only the missing context, run, review, and approve |
| continue | Resume the latest saved builder session without re-entering the same setup context |
| explain | Summarize the latest builder brief, evidence, blockers, and next step in plain language |
| status | Show builder-first project status; add --expert for the full expert view |
| report builder | Export the latest builder run report for audit or reviewer handoff |
| init | Optional manual setup before the first builder-first run |

Advanced Governance

| Command | Description |
| --- | --- |
| manifest | Generate a task manifest from a natural-language description |
| classify | Classify a manifest (risk tier, behavior confidence, change class) |
| execute | Execute an agent task within manifest boundaries |
| review | Run the review pipeline on completed work |
| triage | Pre-screen evidence with triage color (green/amber/red) |
| approve | Approve or reject an evidence packet |
| gate | Evaluate a phase gate (computational + agent checks) |
| intake | Run intake interview for a project phase |
| calibrate | Run hidden calibration probes against an agent |
| audit | Expert operations audit inspection; for example, ces audit --limit 20 |
| emergency declare | Expert operations emergency declaration; for example, ces emergency declare "Security incident detected" |

Command Groups

| Command | Description |
| --- | --- |
| vault ... | Knowledge vault operations (Zettelkasten-style notes) |
| brownfield ... | Expert legacy behavior capture, review, and promotion |

Configuration

Copy .env.example to .env and configure:

| Variable | Description | Default |
| --- | --- | --- |
| CES_AUDIT_HMAC_SECRET | HMAC secret for audit chain integrity (change in managed environments) | |
| CES_LOG_LEVEL | Logging level | INFO |
| CES_LOG_FORMAT | Log format (json or text) | json |
| CES_DEFAULT_RUNTIME | Default local runtime | codex |
| CES_DEMO_MODE | Use demo helper responses when no CLI-backed provider is available | 0 |

For local runtime execution, CES relies on an installed codex or claude CLI. Any provider-specific credentials are handled by that runtime rather than through CES package extras.


Project Structure

controlled-execution-system/
├── src/ces/
│   ├── cli/               # Typer CLI
│   │   ├── __init__.py    #   App entry point
│   │   ├── run_cmd.py     #   Builder-first local workflow
│   │   ├── status_cmd.py  #   Status and explanation surfaces
│   │   └── *_cmd.py       #   Expert command modules
│   ├── control/           # Governance engine
│   │   ├── db/            #   SQLAlchemy tables and repositories
│   │   ├── models/        #   Governance domain models
│   │   └── services/      #   Manifest, policy, workflow, merge
│   ├── harness/           # Quality assurance
│   │   ├── models/        #   Harness-facing models
│   │   ├── sensors/       #   Computational sensors
│   │   └── services/      #   Trust, evidence, review, guide packs
│   ├── execution/         # Agent orchestration
│   │   ├── agent_runner.py    #   Agent runner
│   │   ├── providers/         #   LLM provider adapters
│   │   ├── runtimes/          #   Runtime registry + adapters
│   │   └── sandbox.py         #   Execution sandboxing
│   ├── intake/            # Intake interview flow
│   ├── knowledge/         # Vault services and ranking
│   ├── emergency/         # Kill switch
│   ├── brownfield/        # Legacy integration
│   ├── observability/     # Internal metrics and telemetry helpers
│   └── shared/            # Enums, crypto, config, logging
├── tests/
│   ├── unit/              # 157 test files
│   └── integration/       # End-to-end and regression integration tests
├── examples/              # FreshCart demo project
├── docs/                  # PRD, guides, reference cards
├── pyproject.toml         # Project config + dependencies
└── .env.example           # Environment template

Testing

CES maintains an 88%+ branch coverage gate enforced by CI.

# Run the full test suite
uv run pytest

# Run with coverage report
uv run pytest --cov=ces --cov-report=term-missing

# Skip integration tests (no Docker required)
uv run pytest -m "not integration"

# Run integration tests only (requires Docker)
uv run pytest -m integration

Current local suite: 3,000+ tests.


Tech Stack

| Technology | Version | Role |
| --- | --- | --- |
| Python | 3.12+ | Runtime |
| uv | 0.11+ | Package management |
| Typer + Rich | 0.24+ | CLI interface |
| SQLAlchemy | 2.0+ | ORM |
| Pydantic | 2.12+ | Schema validation + domain models |
| Codex CLI | current | Local GPT-backed execution runtime |
| Claude Code CLI | current | Local Claude-backed execution runtime |
| cryptography | 46+ | Manifest signing, audit chain integrity |
| structlog | 25+ | Structured JSON logging |
| python-statemachine | 3.0+ | Workflow state transitions |
| pytest | 9.0+ | Testing (88%+ coverage) |
| ruff | 0.15+ | Linting + formatting |
| mypy | 1.20+ | Static type checking (strict mode with targeted relaxations; see [tool.mypy] in pyproject.toml) |

Documentation

| Document | Description |
| --- | --- |
| Product Requirements (PRD) | Complete specification (5,600+ lines); the authoritative reference |
| Implementation Guide | Architectural guidance and build order |
| FreshCart Worked Example | End-to-end walkthrough using a sample project |
| Quick Reference Card | Classification tables, merge checklists, TTL rules |
| Security Audit | Security model and threat mitigations |
| Getting Started | Setup guide with step-by-step instructions |
| Operator Playbook | Builder-first vs expert workflow guidance and evidence/handoff patterns |
| Brownfield Guide | Applying CES to existing codebases |
| GNHF Trial Guide | Guardrails for using gnhf externally to develop CES |
| Troubleshooting | Common issues and solutions |
| Production Deployment | Production configuration and operational guidance |

Contributing

See CONTRIBUTING.md for development setup, testing, and workflow details.

If you want to evaluate external agent loops such as gnhf, use the GNHF Trial Guide and scripts/gnhf_trial.sh rather than treating them as part of CES itself:

  • Run them from a clean sibling worktree or clean clone.
  • Keep the scope to contributor-side docs, tests, and CLI polish.
  • Exclude manifest/policy, approval/triage/review, audit, kill-switch, sandbox, and runtime-boundary changes.
  • Review every generated branch manually before cherry-picking or merging.
  • Keep CES's own builder-first or expert workflows for actual delivery work.

Changelog

See CHANGELOG.md for release history.


License

MIT


Built with the Agent-Native Software Delivery Operating Model v4
