agent-assure
Local-first process assurance for agentic AI pipelines.
Core thesis: output equivalence is not process equivalence.
A candidate agent pipeline can return the same final approval, denial,
recommendation, or summary while silently changing material evidence, review
routing, provider/tool boundaries, redaction behavior, retries, or provenance.
agent-assure produces local evidence packets and CI gates so reviewers can
detect those observable process regressions.
Install
Install from PyPI and run the flagship demo:
pip install agent-assure
agent-assure demo flagship
The demo runs offline with bundled deterministic fixtures. It writes local
review artifacts under .tmp/demo/flagship by default.
One-command demo
Expected punchline:
output equivalence: preserved
missing evidence link: claim-duration
classification: new_failure
CI gate: blocked as expected
The baseline and candidate both keep
recommendation=approve; outcome=approve. The candidate still fails because it
drops the material evidence link for claim-duration.
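The distinction can be illustrated with a minimal Python sketch. The `RunRecord` class, its fields, and the evidence ids (`doc-1`, `doc-2`, `claim-diagnosis`) are hypothetical stand-ins for illustration only; they are not agent-assure's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical, simplified stand-in for one recorded case."""
    recommendation: str
    outcome: str
    # Maps each material claim id to the evidence ids that support it.
    evidence_links: dict = field(default_factory=dict)


def outputs_equivalent(baseline: RunRecord, candidate: RunRecord) -> bool:
    """Compare only the visible answer, as an ordinary output check would."""
    return (baseline.recommendation, baseline.outcome) == (
        candidate.recommendation,
        candidate.outcome,
    )


def claims_missing_evidence(baseline: RunRecord, candidate: RunRecord) -> list:
    """Return claims that had evidence in the baseline but none in the candidate."""
    return sorted(
        claim
        for claim, evidence in baseline.evidence_links.items()
        if evidence and not candidate.evidence_links.get(claim)
    )


baseline = RunRecord(
    "approve",
    "approve",
    {"claim-duration": {"doc-1"}, "claim-diagnosis": {"doc-2"}},
)
candidate = RunRecord("approve", "approve", {"claim-diagnosis": {"doc-2"}})

same_output = outputs_equivalent(baseline, candidate)
missing = claims_missing_evidence(baseline, candidate)
```

Here the output check passes while the process check reports `claim-duration`, which mirrors the shape of the demo result without reproducing the tool's real evaluation logic.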
Claim boundary
agent-assure produces local review evidence, traceability, evidence mapping,
artifact digests, and CI-gate signals. It does not replace legal, regulatory,
clinical, provider-quality, model-quality, or business-impact review.
This project is not a compliance attestation. Safety review remains a separate human and organizational responsibility.
Schemas
Schema changes are versioned. Development work uses schemas/unreleased/.
Stable releases freeze a copy into schemas/vX.Y.Z/.
The release gate verifies the latest frozen schema directory, while schema
staging exports the current development schema surface to schemas/unreleased/.
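As a rough illustration of the directory convention, the following sketch picks the latest frozen schema directory from a list of names. The function name and the selection rule (highest numeric version wins) are assumptions; the release gate's real selection logic is not described here.

```python
import re

# Matches frozen directory names such as "v0.3.0"; "unreleased" is ignored.
_FROZEN_NAME = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def latest_frozen_schema_dir(names):
    """Return the directory name with the highest X.Y.Z version, or None.

    Hypothetical helper: versions are compared numerically, so "v0.10.0"
    sorts after "v0.2.0".
    """
    versions = []
    for name in names:
        match = _FROZEN_NAME.match(name)
        if match:
            key = tuple(int(part) for part in match.groups())
            versions.append((key, name))
    return max(versions)[1] if versions else None


example = latest_frozen_schema_dir(["unreleased", "v0.2.0", "v0.10.0", "v0.1.0"])
```

Comparing numeric tuples rather than strings avoids ranking `v0.2.0` above `v0.10.0`.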
Local development
From a repository checkout:
pip install -e .
For validation checks, install the development extras:
pip install -e ".[dev]"
Five-minute fixture walkthrough
Run these commands one at a time from the repository root. The last three
commands (the candidate evaluation, the comparison, and the CI command) write
reports and are expected to exit 1; the GitHub Actions snippet below shows how
to assert those expected failures in set -e contexts.
pip install -e ".[dev]"
mkdir -p .tmp/showcase
agent-assure suite compile examples/prior_auth_synthetic/suite.yaml --out .tmp/showcase/prior-auth.compiled.json --manifest .tmp/showcase/prior-auth.fixtures.json
agent-assure suite run .tmp/showcase/prior-auth.compiled.json --variant examples/prior_auth_synthetic/variants/baseline.yaml --manifest .tmp/showcase/prior-auth.fixtures.json --out .tmp/showcase/prior-auth.baseline.json
agent-assure suite run .tmp/showcase/prior-auth.compiled.json --variant examples/prior_auth_synthetic/variants/candidate_evidence_normalization.yaml --manifest .tmp/showcase/prior-auth.fixtures.json --out .tmp/showcase/prior-auth.evidence-candidate.json
agent-assure evaluate .tmp/showcase/prior-auth.baseline.json --suite .tmp/showcase/prior-auth.compiled.json --out-dir .tmp/showcase/baseline-report
agent-assure evaluate .tmp/showcase/prior-auth.evidence-candidate.json --suite .tmp/showcase/prior-auth.compiled.json --out-dir .tmp/showcase/evidence-report
agent-assure compare .tmp/showcase/prior-auth.baseline.json .tmp/showcase/prior-auth.evidence-candidate.json --suite .tmp/showcase/prior-auth.compiled.json --out-dir .tmp/showcase/comparison-report
agent-assure ci .tmp/showcase/prior-auth.evidence-candidate.json --suite .tmp/showcase/prior-auth.compiled.json --baseline .tmp/showcase/prior-auth.baseline.json --out-dir .tmp/showcase/ci-report --report-mode full
The baseline evaluation exits 0 and writes a pass summary with ten evaluated
cases and zero blocking findings. The candidate evaluation is expected to exit
1; its report contains one blocking finding for
shared-source-multi-claim with reason code
MATERIAL_CLAIM_MISSING_EVIDENCE.
The comparison command is also expected to exit 1. It writes
.tmp/showcase/comparison-report/comparison-report.md with classification
new_failure and fixture-equivalence state pass. For the failing case, the
baseline and candidate both keep recommendation=approve; outcome=approve; the
material regression is the missing claim-duration evidence link. See
docs/showcase.md for the expected report fields, GitHub Actions snippet, and
artifact digest summary.
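When these commands are driven from a Python script rather than a shell, an expected non-zero exit can be asserted explicitly. The sketch below is one possible approach; `run_expecting` is a hypothetical helper, and the demo command is a stand-in that simply exits 1, not an agent-assure invocation.

```python
import subprocess
import sys


def run_expecting(cmd, expected_code):
    """Run cmd without a shell and fail loudly on an unexpected exit code."""
    completed = subprocess.run(cmd, check=False)
    if completed.returncode != expected_code:
        raise AssertionError(
            f"expected exit {expected_code}, got {completed.returncode}: {cmd}"
        )
    return completed.returncode


# Stand-in for a command that is expected to fail, such as the candidate
# evaluation; here it is just a Python process that exits with status 1.
demo_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
demo_status = run_expecting(demo_cmd, 1)
```

Treating any code other than the expected one as an error keeps a crash (for example, exit 2) from being mistaken for the expected blocking result.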
After reports exist, an evidence packet can also be built and gated from summaries:
agent-assure packet build .tmp/showcase/evidence-report/evaluation-summary.json --comparison .tmp/showcase/comparison-report/comparison-summary.json --out .tmp/showcase/evidence-packet.json
agent-assure ci gate .tmp/showcase/evidence-packet.json
For this known failing candidate, both the CI command and packet gate are
expected to exit 1. The CI command writes JSON/Markdown reports,
evidence-packet.json, evidence-packet.md, dependency-inventory.json,
release-artifact-manifest.json, and ci-diagnostics.json.
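A downstream script could confirm that the named artifacts were written before archiving them. This is a minimal sketch: the file names come from the list above, while `missing_ci_artifacts` is a hypothetical helper and the JSON/Markdown report file names are left out because they are not spelled out here.

```python
from pathlib import Path

# Artifact names listed for the CI command's output directory.
EXPECTED_CI_ARTIFACTS = (
    "evidence-packet.json",
    "evidence-packet.md",
    "dependency-inventory.json",
    "release-artifact-manifest.json",
    "ci-diagnostics.json",
)


def missing_ci_artifacts(out_dir):
    """Return the expected artifact names that are absent from out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_CI_ARTIFACTS if not (root / name).is_file()]
```

For the walkthrough above, `missing_ci_artifacts(".tmp/showcase/ci-report")` would be expected to return an empty list once the CI command has run.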
Release evidence can be bundled and replayed. Replay checks raw digests for stable source artifacts and stable JSON projection digests for environment-bearing packet artifacts:
python scripts/build_release_bundle.py --out .tmp/release --write-digests .tmp/release/release-digest-replay.json
agent-assure release replay .tmp/release/release-digest-replay.json --artifact-root . --require-current-commit
The release bundle includes the evidence packet, release manifest, replay file,
SBOM, source distribution, wheel, manifest-listed digest cross-checks, and
exact cosign-verifiable blobs when built by the release workflow. For keyless
cosign verification of workflow-signed release blobs, see
docs/release_evidence.md.
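The difference between a raw digest and a stable JSON projection digest can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: SHA-256, the canonical JSON settings, the top-level key filtering, and the `python_version` field are all hypothetical choices.

```python
import hashlib
import json


def raw_digest(data: bytes) -> str:
    """Digest the exact bytes of an artifact (assumes SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def projection_digest(document: dict, volatile_keys: frozenset) -> str:
    """Digest a canonical JSON form of the document without volatile keys.

    Assumption: environment-bearing fields live at the top level and can be
    dropped by name before hashing.
    """
    projected = {k: v for k, v in document.items() if k not in volatile_keys}
    canonical = json.dumps(projected, sort_keys=True, separators=(",", ":"))
    return raw_digest(canonical.encode("utf-8"))


# Two packets that differ only in a hypothetical environment field.
packet_a = {"verdict": "fail", "python_version": "3.11.9"}
packet_b = {"python_version": "3.13.1", "verdict": "fail"}
volatile = frozenset({"python_version"})

same_projection = projection_digest(packet_a, volatile) == projection_digest(
    packet_b, volatile
)
```

The raw bytes of the two packets differ, so their raw digests differ, but the projection digests match because key order is normalized and the environment field is excluded.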
What the demo shows
The flagship demo is intentionally narrow. It shows that a candidate can keep the same visible answer while losing a material evidence link, and that the evaluation report identifies the failing invariant under equivalent fixtures. It is deterministic review evidence for a declared fixture, not a broad model or provider assessment.
Flagship regression at a glance
The key idea: ordinary output comparison can miss governance regressions. In the
flagship fixture, the candidate keeps the same visible recommendation and
outcome as the baseline, but drops a material evidence link. agent-assure
catches the missing evidence invariant and classifies the baseline-to-candidate
comparison as a new_failure under passing fixture equivalence.
flowchart LR
subgraph OutputCheck["Ordinary visible-output check"]
BOut["Baseline output<br/>recommendation=approve<br/>outcome=approve"]
COut["Candidate output<br/>recommendation=approve<br/>outcome=approve"]
Same["Visible answer unchanged"]
BOut --> Same
COut --> Same
end
subgraph InvariantCheck["agent-assure invariant check"]
BEv["Baseline evidence<br/>claim-duration linked"]
CEv["Candidate evidence<br/>claim-duration missing link"]
Pass["Baseline evaluation: pass"]
Fail["Candidate evaluation: fail<br/>MATERIAL_CLAIM_MISSING_EVIDENCE"]
BEv --> Pass
CEv --> Fail
end
Same --> Tension["Output unchanged<br/>but governance invariant regressed"]
Equiv["Fixture equivalence: pass"] --> Compare["Baseline-to-candidate comparison"]
Pass --> Compare
Fail --> Compare
Tension --> Compare
Compare --> NewFailure["Classification: new_failure"]
classDef pass fill:#e5f5ff,stroke:#0072b2,color:#003b5c;
classDef fail fill:#fff1e0,stroke:#d55e00,color:#5c2a00;
classDef neutral fill:#eef3ff,stroke:#3f51b5,color:#1a237e;
classDef warn fill:#fff8e1,stroke:#f9a825,color:#5d4037;
class Pass,Equiv pass;
class Fail,NewFailure fail;
class Same,Compare neutral;
class Tension warn;
Architecture
This is the full toolkit shape. The five-minute demo exercises the fixture-mode path and evidence outputs.
flowchart LR
A[Authoring<br/>YAML suites<br/>live protocols] --> B[Compile and bind<br/>strict JSON<br/>canonical digests]
B --> C{Execution}
C -->|Fixture mode| D[Fixed local fixtures<br/>offline<br/>no token spend]
C -->|Live mode| E[Declared adapters<br/>static JSONL<br/>external script<br/>OpenAI-compatible]
D --> F[RunSet records<br/>redacted summaries<br/>provenance<br/>trace context]
E --> F
F --> G[Evaluate controls<br/>expectations<br/>policies<br/>privacy checks]
G --> H[Change review<br/>fixture equivalence<br/>verdicts<br/>provenance diffs]
G --> I[Live review<br/>cluster rates<br/>rare-event bounds<br/>drift and trajectories]
H --> J[Evidence outputs<br/>reports<br/>packets<br/>CI gates<br/>release replay]
I --> J
J --> K[Observability<br/>span plans<br/>optional SDK/OTLP]
Small generic example
The expense-approval example is a compact non-healthcare suite that uses the same offline fixture and expectation method. It is a generic demonstration, not a benchmark.
agent-assure suite compile examples/expense_approval_minimal/suite.yaml --out .tmp/expense.compiled.json --manifest .tmp/expense.fixtures.json
agent-assure suite run .tmp/expense.compiled.json --variant examples/expense_approval_minimal/variants/baseline.yaml --manifest .tmp/expense.fixtures.json --out .tmp/expense.baseline.json
agent-assure suite run .tmp/expense.compiled.json --variant examples/expense_approval_minimal/variants/candidate_provider_policy.yaml --manifest .tmp/expense.fixtures.json --out .tmp/expense.candidate.json
agent-assure evaluate .tmp/expense.baseline.json --suite .tmp/expense.compiled.json --out-dir .tmp/expense.baseline-report
agent-assure evaluate .tmp/expense.candidate.json --suite .tmp/expense.compiled.json --out-dir .tmp/expense.candidate-report
The baseline evaluation exits 0. The provider-policy candidate is expected to
exit 1 with deterministic provider, outcome, and human-review control
findings.
Current claim boundary
The project currently claims deterministic offline controls and
protocol-bound live operational evaluation implemented in this repository.
Public claims are tracked in
docs/claims_traceability_matrix.yaml.
A statistical protocol is documented in
docs/measurement/experiment_protocol.md for live stochastic evaluation. The
agent-assure live commands require a machine-readable protocol, run
explicitly configured adapters, and analyze repeated observations with
cluster-aware rates, protocol-declared comparison methods, and exploratory
guardrails for low cluster counts. Optional advanced endpoint plans bind
confirmatory/exploratory labels, Bonferroni multiplicity controls, rare-event upper
bounds, observed cluster-correlation summaries, and paired randomization-test
prerequisites to the protocol digest. Optional trajectory reports derive
privacy-filtered observable state paths, canonical transition profiles,
sequence invariants, and operational event-process summaries from structured
run artifacts. Live results remain bounded by the declared
protocol, data boundary, provider/model configuration, and execution window.
They are not general model-quality, safety, compliance, or clinical-validation
claims.
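Two of the named ideas have simple textbook forms, sketched below. These are generic formulas, not agent-assure's code: the function names are hypothetical, and the zero-event bound assumes independent trials, which clustered observations may violate. The methods the tool actually applies are whatever the declared protocol specifies.

```python
def bonferroni_alpha(alpha: float, n_endpoints: int) -> float:
    """Per-endpoint significance level under a Bonferroni correction."""
    if n_endpoints < 1:
        raise ValueError("n_endpoints must be at least 1")
    return alpha / n_endpoints


def zero_event_upper_bound(n_trials: int, alpha: float) -> float:
    """Exact one-sided upper bound on an event rate after zero events.

    Solves (1 - p) ** n_trials == alpha for p, assuming independent trials.
    """
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    return 1.0 - alpha ** (1.0 / n_trials)


per_endpoint = bonferroni_alpha(0.05, 5)
upper = zero_event_upper_bound(100, 0.05)
```

With 100 event-free trials at alpha 0.05, the bound is slightly under 3%, close to the familiar "rule of three" approximation of 3/n.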
Synthetic calibration and regression coverage for the live statistical,
drift-monitoring, trajectory, and event-process paths is summarized in
docs/live_calibration.md.
The external-script live adapter runs configured scripts through a no-shell
subprocess harness and records redacted emergency-process-record artifacts
for process failures. It passes only declared environment allowlist entries,
explicit config variables, and runner-injected trace/request variables.
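The general pattern of a no-shell subprocess with an environment allowlist can be sketched with the standard library. The helper names, the variable names, and the merge order are assumptions for illustration; the adapter's real harness is not shown here.

```python
import os
import subprocess
import sys


def build_child_env(allowlist, explicit, parent=None):
    """Build a child environment from allowlisted parent variables only.

    Hypothetical helper: `explicit` stands in for config and runner-injected
    variables and overrides allowlisted values on a name collision.
    """
    parent = os.environ if parent is None else parent
    env = {name: parent[name] for name in allowlist if name in parent}
    env.update(explicit)
    return env


def run_script(argv, env):
    """Run argv as a list with shell=False so no shell parses the arguments."""
    return subprocess.run(
        argv, shell=False, env=env, capture_output=True, text=True, check=False
    )


parent_env = {"ALLOWED": "yes", "SECRET": "do-not-forward"}
child_env = build_child_env(["ALLOWED"], {"TRACE_ID": "t-1"}, parent_env)
```

Passing `env=` to `subprocess.run` replaces the inherited environment rather than extending it, so variables outside the allowlist, such as `SECRET` here, never reach the script.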
OpenTelemetry export is optional:
pip install -e ".[otel]"
agent-assure otel export RUNSET_OR_RECORD_OR_SPAN_PLAN.json --protocol otlp-http --endpoint http://localhost:4318/v1/traces
Exported spans are derived from span plans and structured run records, not live SDK instrumentation of provider calls; raw prompts, raw outputs, tool arguments, and unredacted summaries are not emitted.
GitHub Actions snippet
name: agent-assure-showcase
on: [push, pull_request]
jobs:
  flagship:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]"
      - run: mkdir -p .tmp/showcase
      - run: agent-assure suite compile examples/prior_auth_synthetic/suite.yaml --out .tmp/showcase/prior-auth.compiled.json --manifest .tmp/showcase/prior-auth.fixtures.json
      - run: agent-assure suite run .tmp/showcase/prior-auth.compiled.json --variant examples/prior_auth_synthetic/variants/baseline.yaml --manifest .tmp/showcase/prior-auth.fixtures.json --out .tmp/showcase/prior-auth.baseline.json
      - run: agent-assure suite run .tmp/showcase/prior-auth.compiled.json --variant examples/prior_auth_synthetic/variants/candidate_evidence_normalization.yaml --manifest .tmp/showcase/prior-auth.fixtures.json --out .tmp/showcase/prior-auth.evidence-candidate.json
      - run: agent-assure evaluate .tmp/showcase/prior-auth.baseline.json --suite .tmp/showcase/prior-auth.compiled.json --out-dir .tmp/showcase/baseline-report
      - name: Evaluate evidence candidate
        run: |
          set +e
          agent-assure evaluate .tmp/showcase/prior-auth.evidence-candidate.json --suite .tmp/showcase/prior-auth.compiled.json --out-dir .tmp/showcase/evidence-report
          status=$?
          set -e
          if [ "$status" -ne 1 ]; then
            echo "expected exit 1, got $status"
            exit 1
          fi
          grep -q "MATERIAL_CLAIM_MISSING_EVIDENCE" .tmp/showcase/evidence-report/evaluation-report.md
      - name: Compare baseline to candidate
        run: |
          set +e
          agent-assure compare .tmp/showcase/prior-auth.baseline.json .tmp/showcase/prior-auth.evidence-candidate.json --suite .tmp/showcase/prior-auth.compiled.json --out-dir .tmp/showcase/comparison-report
          status=$?
          set -e
          if [ "$status" -ne 1 ]; then
            echo "expected exit 1, got $status"
            exit 1
          fi
          grep -q 'Classification: `new_failure`' .tmp/showcase/comparison-report/comparison-report.md
          grep -q 'Fixture-Equivalence Result' .tmp/showcase/comparison-report/comparison-report.md
          grep -q 'State: `pass`' .tmp/showcase/comparison-report/comparison-report.md
Development
git config core.hooksPath .githooks
python scripts/check_docs_alignment.py
ruff check .
mypy src
pytest
python -m build
Dependency locking for release builds is documented in
docs/dependency_locking.md. Release bundle reproduction, SBOM generation, and
cosign verification are documented in docs/release_evidence.md.
The installed package includes bundled deterministic examples for reproducible
local demos. The top-level examples/ tree mirrors those packaged resources
for repository-oriented docs and tests; scripts/check_packaged_examples.py
keeps the copies aligned. The bundled examples are not a stable extension API;
see docs/api_surface.md.