Capture, replay, and audit AI agent decisions. An open-source SDK for auditable AI agents.

auditable

Agents act on dependency state that quietly drifts. A budget read minutes ago can fall below the amount already committed; a price pinned at plan time can move before the action lands. Most tools log what happened, yet they cannot re-decide under the state that is live now, so a stale decision stands until a human notices. auditable closes that recovery gap across the lifecycle on one graph: it captures the decision, replays it against live state, and reverses the committed action when it no longer holds.

Three Novelties

The full lifecycle chain. One graph kernel carries the analysis from design to live operation to review. PRE lints a declared plan, LIVE captures and recovers a live decision, POST ranks a finished run. examples/end_to_end.py walks a single payment through all three on one dataset and one state.

The graph model. Every pillar reads the same typed two-layer decision graph: an execution layer (control flow, observed from the trace) over a dependency layer (what each step relied on, declared or inferred, never read off the trace). The two-layer model is introduced in GRADE (arXiv:2606.22741).

The orthogonal decomposition. One agent decision crosses three spans: data (what it read and the snapshot it relied on), model (which model produced the output and its stated basis), and harness (the action executed and its cost). auditable binds all three in a single signed, hash-chained record, so a decision is judged as a unit.

The Flagship Moment

One payment, walked through the whole lifecycle in 18 auditable calls. The agent approves a $2,083.20 vendor payment against a budget snapshot that covered it. Six days later the live budget has dropped below the amount. replay re-decides on the live state, and the gate reverses the committed payment. This is recovery, not a log line.

pip install "auditable[graph]"
python examples/end_to_end.py

The run prints a single audit report: a REVIEW verdict, the keystone with its coverage reason, six findings with severity tags and recommended actions, and the LIVE recovery that rolled the payment back. Paste it into a pull request or an issue.

Figure: audit report page for a payment-approver run. A REVIEW verdict banner sits above a roll-up of zero blocks, one rollback, and five PRE lints; the sections below it list the keystone decision, the six findings, and the LIVE recovery tally.

The shortest taste is LIVE on its own. Capture a decision, replay it under the live state, reverse the action when it no longer holds:

from auditable import Action, ActionGate, DependencySnapshot, ReferenceLedger, audit, replay

def policy(state, action):
    ok = action.cost <= state["budget"]
    return ok, ("within budget" if ok else "over budget")

ledger = ReferenceLedger(balance=10_000)
gate = ActionGate(ledger)
payment = Action("payment", {"to": "acme"}, cost=4_200)

# The agent pays $4,200 against a budget snapshot that read $10,000.
with audit("payment", snapshot=DependencySnapshot(state={"budget": 10_000})) as decision:
    decision.act(payment)
receipt = gate.commit(payment)                       # paid; balance is now 5,800

# The live budget is now $3,000. Replay re-decides; the gate reverses the payment.
verdict = replay(decision.record, live_state={"budget": 3_000}, policy=policy)
gate.enforce_post_commit(verdict, receipt=receipt)
print(verdict.action.value, "->", ledger.balance)   # rollback -> 10000

replay is pure: it deep-copies the live state and the action, so a policy can never alter the signed record.
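The purity property described above can be illustrated with a standard-library sketch. This is not auditable's implementation: `pure_replay`, `meddling_policy`, and the dict-shaped record are hypothetical stand-ins, and the only technique shown is the deep copy that keeps a misbehaving policy away from the caller's objects.

```python
import copy


def pure_replay(record, live_state, policy):
    """Run the policy on deep copies so the caller's objects stay untouched.

    Hypothetical helper for illustration; not the library's replay().
    """
    state_copy = copy.deepcopy(live_state)
    action_copy = copy.deepcopy(record["action"])
    return policy(state_copy, action_copy)


def meddling_policy(state, action):
    # A badly behaved policy that mutates both of its inputs.
    state["budget"] = 0
    action["cost"] = 0
    return False, "tampered"


record = {"action": {"name": "payment", "cost": 4_200}}
live_state = {"budget": 3_000}

result = pure_replay(record, live_state, meddling_policy)
print(result, record["action"]["cost"], live_state["budget"])
# (False, 'tampered') 4200 3000
```

The policy only ever sees copies, so the recorded action and the live state are unchanged after the call.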

The Lifecycle

auditable runs the same detection-and-report pass over one typed decision graph at three points in an agent's life. The graph kernel stays constant; only the pillar changes (when it fires, what it scores).

Figure: one typed two-layer decision graph at the center, read by three lifecycle attach points. PRE lints a declared plan before deploy, LIVE replays and recovers while the agent runs, and POST ranks a finished run.

Pillar	When It Fires	Public Entry	Focus
PRE	Before deploy	`analyze_plan`	Read-only structural lints on a declared plan. Names the control-flow chokepoint. Dependency-state risk withheld.
LIVE	While running	`audit` + `replay` + `ActionGate`	Capture a decision, re-decide under live state, route a fix (allow, block, review, rollback) through a rail. The sharpest pillar.
POST	After a run	`analyze_run`	Rank a finished run by structural blast share. Name the keystone the run rests on, so you review that step first.

Install

pip install auditable

The core is dependency-free and torch-free. Structural-graph analysis (analyze_plan for PRE, analyze_run for POST) needs the optional graph extra (NetworkX):

pip install "auditable[graph]"

The LIVE snippet above runs on the core install alone.

The Three Pillars, in Detail

PRE: lint the plan before deploy (four read-only lints, the chokepoint, a coverage report)

Point analyze_plan at a declared plan (a plain dict, the neutral target a LangGraph, CrewAI, or AutoGen front-end would lower into) and it runs read-only structural lints over the plan graph. Every check is a pure NetworkX query: no value is executed, and every finding is a structural design warning, not a validated failure prediction.

from auditable.graph.pre import analyze_plan
from auditable.graph.adapters import declared_plan_v1

plan = {
    "nodes": [
        # 0: read a volatile price, but grant scope far beyond what it read.
        {"idx": 0, "agent": "planner", "kind": "tool_call",
         "reads": [{"id": "price", "volatile": True}],
         "scope": ["price", "ledger", "vendor_db"]},
        # 1: a decision that rests on the unpinned, un-revalidated price.
        {"idx": 1, "agent": "planner", "kind": "decision",
         "reads": [{"id": "price", "producer": 0, "volatile": True}],
         "control_preds": [0]},
        # 2: a consequential write of 'order', with no prior read of 'order'
        #    and no re-read of 'price' between the volatile read and the action.
        {"idx": 2, "agent": "executor", "kind": "tool_call",
         "reads": [{"id": "price", "producer": 0, "volatile": True}],
         "writes": ["order"], "control_preds": [1]},
    ]
}

report = analyze_plan(plan, adapter=declared_plan_v1)
print(report)

The four shipping lints, all read-only queries at severity='warning':

Lint	Fires When
`write_with_no_prior_read`	A node writes a resource that nothing in its backward slice ever read.
`flippable_dependency_annotation`	An unpinned, non-revalidated volatile dependency feeds a decision. This is an annotation; the would-it-flip question needs runtime values and is out of scope at PRE.
`scope_vs_snapshot`	Granted tool scope strictly exceeds the snapshot the node read.
`missing_revalidation_barrier`	A volatile read reaches a consequential action with no intervening re-read. Drift confirmation needs runtime values and is out of scope at PRE.
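To make two of the lints in the table concrete, here is a minimal standard-library sketch over the same plan-dict shape. It is an illustration under assumptions, not the library's code: auditable runs these as NetworkX queries, while this sketch walks `control_preds` by hand, treats a node's backward slice as the node plus everything it transitively follows in control flow, and uses hypothetical function names.

```python
def backward_slice(nodes, idx):
    """Return idx plus every node it transitively follows via control_preds."""
    by_idx = {n["idx"]: n for n in nodes}
    seen, stack = set(), [idx]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(by_idx[cur].get("control_preds", []))
    return seen


def write_with_no_prior_read(nodes):
    """Flag (node, resource) pairs where the written resource was never read upstream."""
    by_idx = {n["idx"]: n for n in nodes}
    findings = []
    for node in nodes:
        read_ids = {
            read["id"]
            for i in backward_slice(nodes, node["idx"])
            for read in by_idx[i].get("reads", [])
        }
        for resource in node.get("writes", []):
            if resource not in read_ids:
                findings.append((node["idx"], resource))
    return findings


def scope_vs_snapshot(nodes):
    """Flag nodes whose granted scope is a strict superset of what they read."""
    findings = []
    for node in nodes:
        scope = set(node.get("scope", []))
        read_ids = {read["id"] for read in node.get("reads", [])}
        if scope > read_ids:  # strict superset test
            findings.append((node["idx"], sorted(scope - read_ids)))
    return findings


# A trimmed copy of the three-node plan shown earlier.
nodes = [
    {"idx": 0, "reads": [{"id": "price"}], "scope": ["price", "ledger", "vendor_db"]},
    {"idx": 1, "reads": [{"id": "price"}], "control_preds": [0]},
    {"idx": 2, "reads": [{"id": "price"}], "writes": ["order"], "control_preds": [1]},
]

no_read = write_with_no_prior_read(nodes)
over_scope = scope_vs_snapshot(nodes)
print(no_read)     # [(2, 'order')]
print(over_scope)  # [(0, ['ledger', 'vendor_db'])]
```

Both checks only read the plan structure and execute nothing, which is the property the PRE pillar relies on.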

The report also names the execution-topology keystone: the structural chokepoint of the declared plan, the node that the most other nodes transitively follow in control flow (the argmax of execution_reach over the handoff_to projection). This is a structural design lint, a separate concept from the POST blast-radius keystone, and it does not predict failure.
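The chokepoint idea can be sketched in plain Python as counting, for each node, how many other nodes transitively follow it, and then taking the argmax. This is an assumption-laden illustration: it uses `control_preds` from the plan dict as a stand-in for the handoff_to projection, and `execution_reach` here is a hypothetical helper rather than the library's function.

```python
def execution_reach(nodes):
    """Map each node index to the number of nodes that transitively follow it."""
    successors = {n["idx"]: [] for n in nodes}
    for n in nodes:
        for pred in n.get("control_preds", []):
            successors[pred].append(n["idx"])

    reach = {}
    for start in successors:
        seen, stack = set(), list(successors[start])
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(successors[cur])
        reach[start] = len(seen)
    return reach


# Control-flow skeleton of the three-node plan: 0 -> 1 -> 2.
nodes = [
    {"idx": 0},
    {"idx": 1, "control_preds": [0]},
    {"idx": 2, "control_preds": [1]},
]

reach = execution_reach(nodes)
keystone = max(reach, key=reach.get)
print(reach, keystone)  # {0: 2, 1: 1, 2: 0} 0
```

In a linear plan the first node is trivially the chokepoint; the measure becomes informative once the plan branches and rejoins.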

Alongside the lints, the Preflight Coverage Report is a descriptive coverage-readiness view, explicitly not a risk score. It reports the dependency-edge grade mix, the observed fraction, the saturation ratio, the exact no-score reason the runtime scorer would apply, which declared reads, writes, and edges still lack a resource identity, and the declared revalidation barriers per resource.

Two boundaries, stated plainly. Dependency-state blast-share risk is withheld at PRE: the declared dependency layer is declared-only (observed fraction zero), so analyze_plan returns state_b_risk=None with state_b_withheld=True and a reason string, and it raises an error rather than emitting a number if a scored verdict ever comes back. A table-stakes OWASP-Agentic and CWE rule floor is planned, not shipping.
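The withheld-score behaviour can be pictured with a small sketch. Everything named here (`RiskField`, `withhold_if_declared_only`, the reason string, and the choice of `RuntimeError`) is hypothetical; the only point is the pattern of returning `None` plus an explicit flag, and failing loudly instead of leaking a number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskField:
    value: Optional[float]  # None means "no score", never "zero risk"
    withheld: bool
    reason: str


def withhold_if_declared_only(observed_fraction, scored_value=None):
    """Withhold risk when no dependency edge was observed at runtime."""
    if observed_fraction == 0.0:
        if scored_value is not None:
            # A number for a declared-only layer would be misleading: fail loudly.
            raise RuntimeError("scored verdict returned for a declared-only layer")
        return RiskField(None, True, "dependency layer is declared-only")
    return RiskField(scored_value, False, "")


field = withhold_if_declared_only(0.0)
print(field.value, field.withheld)  # None True
```

A caller that tests `field.value is None` can tell a withheld score apart from a score of `0.0`.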

See examples/analyze_plan.py and the PRE rules reference.

POST: rank a finished run, find the keystone (`analyze_run` over a recorded trajectory)

analyze_run reads a recorded agent run, builds one decision graph, and ranks every step by how much of the run transitively rests on it, so you review the keystone first. On a tau-bench airline trajectory, the one reservation read that both later writes depend on is the keystone.

Figure: analyze_run ranks a recorded tau-bench run by structural blast share and names the keystone decision.

from auditable import analyze_run
from auditable.graph.adapters import tau_bench_prior_db_reads_v1

# `run` is a recorded tau-bench trajectory loaded beforehand; see examples/analyze_run.py.
report = analyze_run(run, adapter=tau_bench_prior_db_reads_v1)
k = report.keystone
print(k.idx, k.node_attrs["tool"])   # 2  get_reservation_details

The score is an uncalibrated triage ranking, not a calibrated probability. In a no-score state (no_score:single_decision, no_score:low_coverage) the scores are None, so a withheld score never reads as zero risk. The corpus write-to-read edges are modeled (a conservative prior-read upper bound, not a causal label); the report carries these caveats in report.notes. The trajectory is modeled on tau-bench (Sierra Research, MIT). See examples/analyze_run.py and the POST analysis reference.
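As a rough picture of ranking by how much of a run transitively rests on each step, the sketch below computes, per step, the fraction of the other steps that depend on it directly or indirectly. The formula, the step names, and the `blast_share` helper are assumptions made for illustration; the library's actual score may be defined differently. The single-step case returns `None` to mirror the no-score behaviour described above.

```python
def blast_share(depends_on):
    """depends_on maps each step to the list of steps it relied on."""
    steps = list(depends_on)
    if len(steps) < 2:
        # Nothing to rank against: withhold the score rather than report 0.
        return {step: None for step in steps}

    dependents = {step: set() for step in steps}
    for step, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(step)

    shares = {}
    for step in steps:
        seen, stack = set(), list(dependents[step])
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(dependents[cur])
        shares[step] = len(seen) / (len(steps) - 1)
    return shares


# Hypothetical run: two writes rest on one read, one step is independent.
run_deps = {
    "read_reservation": [],
    "update_baggage": ["read_reservation"],
    "update_flights": ["read_reservation"],
    "send_note": [],
}

shares = blast_share(run_deps)
keystone = max(shares, key=shares.get)
print(keystone)                     # read_reservation
print(blast_share({"only": []}))    # {'only': None}
```

The result is an ordering for triage, so the step with the largest share is simply the one to review first.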

How It Works: the two-layer graph, the signed record, the rail (kernel internals)

Figure: auditable links a run into one graph with two edge layers, execution (control flow, observed from the trace) over dependency (what each step relied on). When a step rested on a value that has since gone stale, such as price, replay catches it. The record itself binds three spans per decision (data, model, harness).

The graph kernel has two edge layers, and the distinction is load-bearing. Execution edges (emits, handoff_to) are observed from the trace. Dependency edges (depends_on) are declared or inferred, never read off the trace. PRE and POST both run over this same typed graph; audit() is the ergonomic capture entry, so you are never asked to build the graph by hand.

One agent decision crosses three spans, and auditable binds all three in a single signed, hash-chained record:

Span	What the Record Binds	Signal in v0.1
Data	What the agent read and the dependency snapshot it relied on	Snapshot freshness
Model	Which model produced the output, and its stated basis	Decision-basis trust flag
Harness	The action executed and its cost	A static cost-cap rule, plus the replay verdict

replay() re-derives whether the action still holds under the live dependency state versus the snapshot the agent used, and returns one of four routed verdicts: ALLOW, ROLLBACK (justified on the snapshot but not on live state, the stale-state case), BLOCK (justified on neither), or HUMAN_REVIEW (the policy raised ReplayUndecidable). A Policy is any callable (state, action) -> (justified, reason).

The ActionGate then executes that verdict through a Rail: a post-commit ROLLBACK or BLOCK calls rail.compensate(receipt) to reverse a committed action, rather than printing a recommendation. The shipped ReferenceLedger is an in-process reference rail for demos and tests; it is not a production payment rail.
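The four-way routing that replay() performs can be sketched as a small decision table. This is a minimal illustration, not the library's code: `Verdict`, `Undecidable`, and `route` are hypothetical names, only the "rollback" value is taken from the example output earlier, and the sketch assumes ALLOW whenever the action holds on live state.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    ROLLBACK = "rollback"
    BLOCK = "block"
    HUMAN_REVIEW = "human_review"


class Undecidable(Exception):
    """Stand-in for the library's ReplayUndecidable."""


def route(snapshot_state, live_state, action, policy):
    """Evaluate the policy on both states and map the pair to a verdict."""
    try:
        on_snapshot, _ = policy(snapshot_state, action)
        on_live, _ = policy(live_state, action)
    except Undecidable:
        return Verdict.HUMAN_REVIEW
    if on_live:
        return Verdict.ALLOW  # assumption: holding on live state is enough
    return Verdict.ROLLBACK if on_snapshot else Verdict.BLOCK


def budget_policy(state, action):
    if "budget" not in state:
        raise Undecidable("no budget in state")
    ok = action["cost"] <= state["budget"]
    return ok, ("within budget" if ok else "over budget")


payment = {"cost": 4_200}
print(route({"budget": 10_000}, {"budget": 3_000}, payment, budget_policy).value)  # rollback
print(route({"budget": 10_000}, {"budget": 9_000}, payment, budget_policy).value)  # allow
print(route({"budget": 1_000}, {"budget": 3_000}, payment, budget_policy).value)   # block
print(route({"budget": 10_000}, {}, payment, budget_policy).value)                 # human_review
```

Keeping the routing separate from the policy means the same policy callable is evaluated twice, once per state, and never needs to know about verdicts.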

Records are signed and hash-chained: each record carries a prev_digest and a content-addressed record_id. Two sinks ship today, MemorySink (in-process) and FileSink (append-only JSONL, durable across process exit, fails closed on a corrupt tail).
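A hash chain of this shape can be sketched with the standard library. The details are assumptions: SHA-256 over canonical JSON, `prev_digest` set to the previous record's `record_id`, an all-zero genesis digest, and hypothetical helper names. Signing is left out entirely, so this shows only the chaining and the content addressing.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder digest for the first record


def _digest(body):
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def make_record(payload, prev_digest):
    """Build a record whose id is the hash of its own content."""
    body = {"payload": payload, "prev_digest": prev_digest}
    return {**body, "record_id": _digest(body)}


def verify_chain(records):
    """Check every link and every content address, in order."""
    prev = GENESIS
    for rec in records:
        body = {"payload": rec["payload"], "prev_digest": rec["prev_digest"]}
        if rec["prev_digest"] != prev or rec["record_id"] != _digest(body):
            return False
        prev = rec["record_id"]
    return True


first = make_record({"action": "payment", "cost": 4200}, GENESIS)
second = make_record({"action": "rollback", "cost": 4200}, first["record_id"])
chain = [first, second]
print(verify_chain(chain))  # True

# Editing a payload without recomputing the id breaks verification.
tampered = [dict(first, payload={"action": "payment", "cost": 1}), second]
print(verify_chain(tampered))  # False
```

Because each id covers the previous digest, changing any earlier record invalidates every record after it.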

Ingestion is source-agnostic through the public Adapter protocol. Three adapters ship: tau_bench_prior_db_reads_v1 (public-corpus trajectory, POST), own_record_v1 (auditable's own signed records, POST), and declared_plan_v1 (a framework-agnostic declared plan dict, PRE). The declared-plan adapter is the neutral seam a LangGraph, CrewAI, or AutoGen front-end would lower into; it is not a parser for any framework.

See the architecture reference for the full kernel, adapters, and sinks.

Using a single layer (standalone auditors as inputs to the record)

Each span's check is a standalone Auditor that runs on its own, with no agent and no chain, and returns a signed Report. DataAuditor scores snapshot freshness, ModelAuditor produces a decision-basis trust flag, and HarnessAuditor applies one static cost-cap rule.

import time
from auditable import DataAuditor, DependencySnapshot

snapshot = DependencySnapshot(state={"budget_remaining": 1000}, captured_at=time.time() - 7 * 86400)
report = DataAuditor(max_age_seconds=86400).assess(snapshot)
print(report.flag, report.score)   # stale 1.0

These standalone auditors are inputs to the record, not the headline. The composition (capture, replay, recovery, and the lifecycle pillars) is the main line; the modules feed it. See examples/standalone_report.py.
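The freshness check can be approximated in a few lines. The formula below (age divided by the maximum age, capped at 1.0) and the "fresh" flag name are assumptions chosen to agree with the `stale 1.0` output shown above; `freshness` is a hypothetical function, not DataAuditor itself.

```python
import time


def freshness(captured_at, max_age_seconds, now=None):
    """Return (flag, score), where the score grows with age and caps at 1.0."""
    now = time.time() if now is None else now
    age = max(0.0, now - captured_at)
    score = min(1.0, age / max_age_seconds)
    flag = "stale" if age > max_age_seconds else "fresh"
    return flag, score


now = 1_000_000.0  # fixed clock so the example is deterministic
week_old = freshness(now - 7 * 86400, 86400, now=now)
print(week_old)  # ('stale', 1.0)
```

Passing `now` explicitly keeps the check testable, since the result no longer depends on the wall clock.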

Scope, stated honestly (what ships today vs. what is planned)

What ships today. The full signed chain, replay under live state, executed recovery through a rail-neutral gate, two sinks (in-memory and append-only JSONL), the POST analyze_run ranking, and the PRE analyze_plan lints plus preflight coverage report.

The release does not yet claim a learned data-anomaly method, a calibrated model-trust score, calibrated cross-layer risk, or live incremental scoring. The POST structural score is an uncalibrated ranking; the compound report is a transparent, explicitly uncalibrated debug bundle. Everything beyond this list is on the roadmap.

Roadmap

Every item below is planned, not shipping.

v0.2 Data: a fitted anomaly score on the dependency state (PyOD backend), with snapshot freshness as a fallback
v0.3 Model and compound: a calibrated cross-layer compound, and model as a first-class graph node attribute with grounding beyond the current deterministic basis check
v0.3b Live: live and incremental scoring, plus the runtime resource-touch contract that fills observed dependency edges
v0.4 Control: data refresh or quarantine, and model fallback or sign-off control faces
PRE rule floor: a table-stakes OWASP-Agentic and CWE rule floor for CI legibility, consumed rather than forked
v1.0: pluggable sinks (OpenTelemetry, LangSmith), exportable evidence bundles, and a stable public API
Framework integrations (LangChain, LangGraph, CrewAI) and an MCP server

Citation

If you use auditable in your work, please cite the GRADE paper (the typed two-layer graph model the library is built on) and, optionally, the software:

@article{zhao2026grade,
  title   = {GRADE: Graph Representation of LLM Agent Dependency and Execution},
  author  = {Zhao, Yue},
  journal = {arXiv preprint arXiv:2606.22741},
  year    = {2026}
}

@software{auditable,
  title  = {auditable: Capture, Replay, and Recover AI Agent Decisions},
  author = {Zhao, Yue},
  year   = {2026},
  url    = {https://github.com/yzhao062/auditable}
}

License

Apache-2.0.

Project details

Release history

0.1.0 (this version): Jun 23, 2026
0.0.1: Jun 16, 2026

Download files

Source Distribution

auditable-0.1.0.tar.gz (138.1 kB), uploaded Jun 23, 2026, Source

Built Distribution

auditable-0.1.0-py3-none-any.whl (101.6 kB), uploaded Jun 23, 2026, Python 3

File details

Details for the file auditable-0.1.0.tar.gz.

File metadata

Download URL: auditable-0.1.0.tar.gz
Upload date: Jun 23, 2026
Size: 138.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for auditable-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`502660156e7618772d48844ac921f979c50f08b3642f48fc0de519935dacab0a`
MD5	`7d3c36d7029ce57ba18b63ed2675f1a0`
BLAKE2b-256	`a507a9fbe3cd1e95a3504ca0f9a7b892ba088e833be66da74fdba4152009739d`

File details

Details for the file auditable-0.1.0-py3-none-any.whl.

File metadata

Download URL: auditable-0.1.0-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 101.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for auditable-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5200e7f1eff4437865dbd391c72dbe87edb585e5d952bc7cd5c245a435616cc4`
MD5	`b8a7005d940613518d2c52d790c0e89a`
BLAKE2b-256	`1249df717bdc2650e0c2fb8d717f822cc82576ad0cf0e4669890d051fe370432`
