Deterministic Observability Framework — formal governance, privacy benchmarks, and adversarial testing for multi-agent LLM systems
VERIFY. PROVE. ATTEST.
Deterministic Observability Framework (DOF)
Deterministic governance for multi-agent LLM systems. Constitutional rules, formal proofs, and on-chain attestation on Avalanche.
Built with Python 3.11+ · Z3 SMT Solver · web3.py · BLAKE3 · Avalanche C-Chain · PostgreSQL
```bash
pip install dof-sdk
```

```python
from dof import GenericAdapter

result = GenericAdapter().wrap_output("your agent output here")
# → {status: "pass", violations: [], score: 8.5}
```
30ms. Zero LLM tokens. Works with CrewAI, LangGraph, AutoGen, or anything that produces text.
```bash
python -m dof verify "your text here"   # governance check
python -m dof prove                     # Z3 formal verification
python -m dof health                    # component status
python -m dof benchmark                 # adversarial benchmark
python -m dof privacy                   # privacy benchmark
python -m dof version                   # show version
```
Contents
The Problem · Highlights · Architecture · Governance Layers · Z3 Verification · On-Chain · Benchmarks · Comparison · External Validation · Limitations · Citation
The Problem
LLM agents hallucinate. Nobody catches it deterministically. Using LLMs to verify LLMs is circular — the evaluator shares failure modes with the evaluated. Rate limits, cascading retries, and non-deterministic output quality interact across execution steps, producing unstable system-level behavior that cannot be attributed to specific infrastructure variables.
DOF solves this with 7 deterministic governance layers, formal Z3 proofs, and on-chain attestation — zero LLM tokens in the verification path.
Highlights
- 7 governance layers — Constitution → AST → Supervisor → Z3 → Red/Blue → Memory → Signer
- SS(f) = 1 − f³ — Z3 verified stability formula under bounded retries
- GCR(f) = 1.0 — governance invariant under any failure rate (Z3 proven)
- 21 on-chain attestations on Avalanche C-Chain mainnet
- Merkle batching — 10,000 attestations = 1 tx ≈ $0.01
- Automated benchmark — Governance 100%, Hallucination 90%, Consistency 100% FDR, 0% FPR
- Privacy benchmark — 71% detection rate across 7 AgentLeak channels (PII, API keys, memory, tool inputs)
- OpenTelemetry ready — optional OTLP tracing (`pip install dof-sdk[otel]`)
- EventBus — in-memory pub/sub with circular buffer, Redis/Kafka ready
- Framework agnostic — CrewAI, LangGraph, AutoGen, or raw Python
- A2A server (8 skills) + MCP server (10 tools) + REST API (14 endpoints)
- 719 tests, 27K+ LOC, 25 core modules, 36 contributions
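The Merkle batching highlight above implies that many attestation hashes are folded into one root, and only that root is published in a single transaction. The following is a minimal sketch of the idea, not the SDK's implementation: `merkle_root` is a hypothetical helper, SHA-256 stands in for BLAKE3 (which is not in the Python standard library), and duplicating the last node on odd-sized levels is one common convention that the project may or may not use.

```python
import hashlib


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf payloads into a single 32-byte Merkle root (hypothetical helper)."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    # Hash every leaf first so all nodes have the same size.
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: odd levels are padded by repeating the last node.
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# 10,000 placeholder attestations collapse into one 32-byte value.
attestations = [f"attestation-{i}".encode() for i in range(10_000)]
root = merkle_root(attestations)
print(root.hex())
```

Because the root changes if any single leaf changes, publishing it once commits to the whole batch.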
Architecture
```
+----------------------------------------------------+
| L7 Signer        HMAC + Avalanche          ~2s     |
+----------------------------------------------------+
| L6 Memory Gov    Bi-temporal + decay       <1ms    |
+----------------------------------------------------+
| L5 Red/Blue      Red -> Guard -> Arb       ~50ms   |
+----------------------------------------------------+
| L4 Z3 Proofs     4 theorems UNSAT          ~10ms   |
+----------------------------------------------------+
| L3 Supervisor    Q+A+C+F scoring           ~5ms    |
+----------------------------------------------------+
| L2 AST Verifier  eval/exec/secrets         <1ms    |
+----------------------------------------------------+
| L1 Constitution  4 HARD + 5 SOFT           <1ms    |
+----------------------------------------------------+
| Engine           DAG + LoopGuard + TokenTracker    |
+----------------------------------------------------+
| Data Oracle      6 verification strategies <1ms    |
+----------------------------------------------------+
```
Total governance latency: < 70ms (layers 1-6). On-chain signing adds ~2s when enabled.
Seven Governance Layers
Layer 1 — Constitution. Hard rules block output (hallucination claims without sources, non-English text, empty output, >50K chars). Soft rules score but don't block (missing sources, no structure, repetition, no actionable steps). Pure regex + keyword matching. <1ms.
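To illustrate the hard/soft split described for Layer 1, here is a minimal sketch using only rules named above. `check_output` and its result keys are hypothetical, "50K" is read as 50,000 characters, and the URL regex is an assumed stand-in for however the SDK actually detects sources.

```python
import re

MAX_CHARS = 50_000  # assumption: ">50K chars" means more than 50,000 characters
URL_PATTERN = re.compile(r"https?://\S+")  # assumed proxy for "has a source"


def check_output(text: str) -> dict:
    """Apply two hard rules and one soft rule (hypothetical helper)."""
    hard, soft = [], []
    if not text.strip():
        hard.append("empty_output")
    if len(text) > MAX_CHARS:
        hard.append("too_long")
    if not URL_PATTERN.search(text):
        soft.append("missing_sources")
    # Hard violations block the output; soft violations are only reported.
    return {"status": "block" if hard else "pass", "hard": hard, "soft": soft}


print(check_output("See https://example.com for details."))
print(check_output(""))
```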
Layer 2 — AST Verifier. Static analysis of agent-generated code via Python ast module. Blocks eval(), exec(), subprocess, os.system(), __import__(), and hardcoded secrets (OpenAI, GitHub, AWS patterns). <1ms.
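The kind of static check Layer 2 describes can be sketched with the standard `ast` module. This is an illustration under assumptions, not the SDK's verifier: `find_violations` is a hypothetical name, only direct calls and imports are inspected, and secret detection is left out.

```python
import ast

BLOCKED_CALLS = {"eval", "exec", "__import__"}
BLOCKED_ATTR_CALLS = {("os", "system")}
BLOCKED_MODULES = {"subprocess"}


def find_violations(source: str) -> list[str]:
    """Return a description of each blocked construct found in the source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in BLOCKED_CALLS:
                violations.append(f"line {node.lineno}: call to {func.id}()")
            elif (
                isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in BLOCKED_ATTR_CALLS
            ):
                violations.append(
                    f"line {node.lineno}: call to {func.value.id}.{func.attr}()"
                )
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"line {node.lineno}: import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"line {node.lineno}: import from {node.module}")
    return violations


sample = "import subprocess\nx = eval('1+1')\nimport os\nos.system('ls')\n"
for violation in find_violations(sample):
    print(violation)
```

The code is parsed but never executed, which is what keeps this kind of check deterministic and safe to run on untrusted agent output.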
Layer 3 — Meta-Supervisor. Weighted quality score: S = Q(0.40) + A(0.25) + C(0.20) + F(0.15). ACCEPT ≥ 7.0, RETRY ≥ 5.0, ESCALATE < 5.0. Cross-provider execution. ~5ms.
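The weighted score and thresholds for Layer 3 can be written out directly. In this sketch the function names are hypothetical, and each component (Q, A, C, F) is assumed to be scored on a 0 to 10 scale, which is consistent with the 7.0 and 5.0 thresholds but is not stated explicitly.

```python
WEIGHTS = {"Q": 0.40, "A": 0.25, "C": 0.20, "F": 0.15}


def supervisor_score(components: dict[str, float]) -> float:
    """Weighted sum S = 0.40*Q + 0.25*A + 0.20*C + 0.15*F."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def decide(score: float) -> str:
    """Map a score to ACCEPT (>= 7.0), RETRY (>= 5.0), or ESCALATE (< 5.0)."""
    if score >= 7.0:
        return "ACCEPT"
    if score >= 5.0:
        return "RETRY"
    return "ESCALATE"


example = {"Q": 9.0, "A": 8.0, "C": 8.0, "F": 8.0}  # made-up component scores
score = supervisor_score(example)
print(round(score, 2), decide(score))
```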
Layer 4 — Z3 Formal Proofs. Four machine-checked theorems via Z3 SMT solver. GCR invariance, SS cubic derivation, SS strict monotonicity, SS boundary conditions. All UNSAT (no counterexample exists). ~10ms total.
Layer 5 — Red/Blue Adversarial. RedTeamAgent finds defects, GuardianAgent defends with evidence, DeterministicArbiter adjudicates using only passing tests / governance compliance / AST results. Zero LLM in final adjudication. ACR metric. ~50ms.
Layer 6 — Memory Governance. GovernedMemoryStore validates every write against Constitution. Bi-temporal versioning (valid_from, valid_to, recorded_at). Constitutional decay (λ=0.99/hour) with protected categories (decisions, errors immune to decay). <1ms.
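A small sketch of the decay rule in Layer 6, assuming λ=0.99/hour is a multiplicative retention factor applied once per elapsed hour. `decayed_relevance` and the idea of a numeric relevance value are illustrative assumptions; only the rate and the two protected categories come from the description above.

```python
DECAY_PER_HOUR = 0.99  # λ from the Layer 6 description
PROTECTED_CATEGORIES = {"decisions", "errors"}  # immune to decay


def decayed_relevance(initial: float, hours_elapsed: float, category: str) -> float:
    """Return the relevance after decay, leaving protected categories untouched."""
    if category in PROTECTED_CATEGORIES:
        return initial
    return initial * DECAY_PER_HOUR ** hours_elapsed


print(decayed_relevance(1.0, 24, "notes"))      # ordinary entry after one day
print(decayed_relevance(1.0, 24, "decisions"))  # protected entry after one day
```

Under this reading, an unprotected entry keeps roughly 79% of its relevance after 24 hours, while decisions and errors stay at their original value.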
Layer 7 — On-Chain Signer. HMAC-SHA256 signed attestation certificates. Compliance-gated: only GCR=1.0 attestations are published. BLAKE3 certificate hashing. Avalanche C-Chain mainnet via web3.py. ~2s.
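The signing and compliance gate in Layer 7 can be sketched with the standard `hmac` module. The certificate fields, the `gcr` key, and all function names here are hypothetical. BLAKE3 hashing and the web3.py transaction are left out because they need third-party packages.

```python
import hashlib
import hmac
import json


def sign_certificate(certificate: dict, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding."""
    payload = json.dumps(certificate, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_certificate(certificate: dict, signature: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_certificate(certificate, secret), signature)


def should_publish(certificate: dict) -> bool:
    """Compliance gate: only certificates with GCR = 1.0 go on-chain."""
    return certificate.get("gcr") == 1.0


secret = b"demo-secret"  # placeholder key for illustration only
cert = {"agent": "example-agent", "gcr": 1.0, "violations": []}
signature = sign_certificate(cert, secret)
print(signature, verify_certificate(cert, signature, secret), should_publish(cert))
```

Sorting keys before signing keeps the signature stable regardless of the order in which fields were added to the certificate.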
Formal Verification (Z3)
| Theorem | Math | English | Z3 Result |
|---|---|---|---|
| GCR Invariant | ∀f∈[0,1]: GCR(f)=1.0 | Governance is independent of failure rate | UNSAT |
| SS Cubic | ∀f∈[0,1]: SS(f)=1−f³ | Stability follows cubic decay (r=2 retries) | UNSAT |
| SS Monotonicity | f₁<f₂ ⟹ SS(f₁)>SS(f₂) | More failures = less stability | UNSAT |
| SS Boundaries | SS(0)=1.0 ∧ SS(1)=0.0 | Perfect at 0% failure, zero at 100% | UNSAT |
10ms total. Proof certificates: logs/z3_proofs.json.
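The cubic form follows if f is read as the independent failure probability of each attempt: with r=2 retries there are three attempts, all three fail with probability f³, so at least one succeeds with probability 1 − f³. The sketch below checks that reading with exact rational arithmetic on a grid of values. It is a numerical sanity check under that assumption, not a substitute for the Z3 proofs, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

RETRIES = 2  # r = 2 retries, i.e. 3 attempts in total


def ss_closed_form(f: Fraction) -> Fraction:
    """SS(f) = 1 - f^(r+1), which is 1 - f^3 for r = 2."""
    return 1 - f ** (RETRIES + 1)


def ss_enumerated(f: Fraction) -> Fraction:
    """Sum the probability of every attempt sequence with at least one success."""
    total = Fraction(0)
    for outcome in product([True, False], repeat=RETRIES + 1):  # True = failed
        if all(outcome):
            continue  # every attempt failed, so this sequence is not stable
        prob = Fraction(1)
        for failed in outcome:
            prob *= f if failed else 1 - f
        total += prob
    return total


grid = [Fraction(i, 20) for i in range(21)]  # f = 0, 0.05, ..., 1
values = [ss_closed_form(f) for f in grid]

print("matches enumeration:", all(v == ss_enumerated(f) for f, v in zip(grid, values)))
print("strictly decreasing:", all(a > b for a, b in zip(values, values[1:])))
print("SS(0) =", values[0], " SS(1) =", values[-1])
```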
On-Chain Attestation
| Field | Value |
|---|---|
| Contract | 0x88f6043B091055Bbd896Fc8D2c6234A47C02C052 |
| Network | Avalanche C-Chain (43114) |
| Attestations | 21 (March 2026) |
| Functions | registerAttestation(), registerBatch(), isCompliant(), getAttestation() |
| Cost | |
| Deployer | 0xB529f4f99ab244cfa7a48596Bf165CAc5B317929 |
Three verification layers: PostgreSQL (200ms) → Enigma Scanner (900ms) → Avalanche on-chain (2-3s, immutable).
Benchmark Results
Adversarial Benchmark (400 generated tests, deterministic)
| Category | FDR | FPR | F1 | Tests |
|---|---|---|---|---|
| Governance | 100.0% | 0.0% | 100.0% | 100 |
| Code Safety | 86.0% | 0.0% | 92.5% | 100 |
| Hallucination | 90.0% | 0.0% | 94.7% | 100 |
| Consistency | 100.0% | 0.0% | 100.0% | 100 |
| Overall | | | 96.8% | 400 |
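The F1 column can be reproduced from the other two columns if FDR is read as the detection rate (recall on adversarial cases) and FPR as the false positive rate on clean cases. That reading is an assumption, as is the 50/50 split of positives and negatives used below; with FPR at 0% the split does not affect the result.

```python
def f1_from_rates(fdr: float, fpr: float, positives: int = 50, negatives: int = 50) -> float:
    """Derive F1 from a detection rate and a false positive rate (hypothetical helper)."""
    tp = fdr * positives          # faults correctly detected
    fp = fpr * negatives          # clean cases wrongly flagged
    fn = positives - tp           # faults missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rates = {
    "Governance": (1.00, 0.0),
    "Code Safety": (0.86, 0.0),
    "Hallucination": (0.90, 0.0),
    "Consistency": (1.00, 0.0),
}
scores = {name: f1_from_rates(fdr, fpr) for name, (fdr, fpr) in rates.items()}
for name, f1 in scores.items():
    print(f"{name}: {f1 * 100:.1f}%")
print(f"Mean F1: {sum(scores.values()) / len(scores) * 100:.1f}%")
```

Under these assumptions the overall figure matches the unweighted mean of the four category F1 scores.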
Production Results (n=30 runs, real infrastructure)
| Metric | Value | Interpretation |
|---|---|---|
| SS | 0.90 ± 0.31 | 90% execution stability |
| GCR | 1.00 ± 0.00 | Perfect governance invariance |
| PFI | 0.61 ± 0.18 | Provider failures recovered via rotation |
| Supervisor | 27/30 ACCEPT | 90% acceptance rate |
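The SS row is consistent with treating each of the 30 runs as a binary outcome, with 27 successes and 3 failures. That interpretation is an assumption, since the table does not say how SS is aggregated, but the arithmetic can be checked with the standard library:

```python
import statistics

# Assumption: each run contributes 1.0 (stable) or 0.0 (failed); 27 of 30 succeeded.
outcomes = [1.0] * 27 + [0.0] * 3

mean_ss = statistics.mean(outcomes)
std_ss = statistics.stdev(outcomes)  # sample standard deviation (n - 1 denominator)

print(f"SS = {mean_ss:.2f} ± {std_ss:.2f}")
```

With so few failures the spread is dominated by the three failed runs, which is why the standard deviation is large relative to the mean's distance from 1.0.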
Comparison
| Feature | DOF | LangChain | CrewAI | Langfuse |
|---|---|---|---|---|
| Constitutional governance | 7 layers | — | — | — |
| Z3 formal proofs | 4 theorems | — | — | — |
| AST code safety | Deterministic | — | — | — |
| On-chain attestation | Avalanche | — | — | — |
| Adversarial Red/Blue | DeterministicArbiter | — | — | — |
| Governed memory | Bi-temporal + decay | — | — | — |
| FDR/FPR benchmark | Automated | — | — | — |
| Token tracking | Per-call | — | — | Per-call |
| Execution DAG | Critical path | — | — | Trace tree |
| Framework agnostic | Any string output | LangChain only | CrewAI only | Any (tracing) |
| MCP server | 10 tools | — | — | — |
| REST API | 14 endpoints | — | — | API |
| Open source | BSL 1.1 (Apache 2.0 from 2028-03-08) | MIT | MIT | MIT/Commercial |
Production Agents
Two DOF-governed agents operating on Avalanche mainnet, ranked #1 and #2 of 1,772 agents on erc-8004scan.xyz:
| Agent | Token ID | Wallet | Protocols | Status |
|---|---|---|---|---|
| Apex Arbitrage | #1687 | 0xcd59...a983 | A2A + OASF (7 skills) | ACTIVE |
| AvaBuilder | #1686 | 0x29a4...E71a | A2A + OASF (5 skills) | ACTIVE |
Combined trust score: 0.85 (component weights: governance 0.35, safety 0.15, infrastructure 0.15, activity 0.15, community 0.20).
External Validation (Google Colab)
Tested externally via pip install dof-sdk==0.2.2 — zero internal dependencies.
| Test | Result | Time |
|---|---|---|
| Z3 Formal Proofs (4/4) | VERIFIED | 19.25ms |
| MerkleBatcher (plain text) | PASSED | 0.31ms |
| Error Classifier (7/7 classes) | PASSED | 1.28ms |
Full reports: tests/external/dof_enterprise_report.json (v0.2.1) and tests/external/dof_enterprise_report_v2.json (v0.2.2)
Honest Limitations
- Hallucination detection is regex-based — 6 deterministic strategies (pattern matching, cross-reference, consistency, entity extraction, numerical plausibility, self-consistency) achieve 90% FDR on adversarial tests. Misses semantic hallucinations without known-facts coverage.
- No correlated or cascading failure modeling — SS(f)=1−f³ assumes independent failures.
- Supervisor is itself an LLM — mitigated by cross-provider execution and deterministic governance layer, but circularity is bounded, not eliminated.
- Free-tier infrastructure — 3/30 runs fail from provider exhaustion cascades where all 5 providers hit rate limits simultaneously.
- Finite sample sizes — n=20-30 per configuration; rare tail events not statistically guaranteed.
- No economic cost modeling — token costs tracked but not optimized.
Links
| Resource | URL |
|---|---|
| PyPI | pypi.org/project/dof-sdk |
| GitHub | github.com/Cyberpaisa/deterministic-observability-framework |
| Snowtrace | snowtrace.io/address/0x88f6...C052 |
| Enigma Scanner | erc-8004scan.xyz |
| Paper | paper/PAPER_OBSERVABILITY_LAB.md |
| Getting Started | docs/GETTING_STARTED.md |
| Architecture | docs/ARCHITECTURAL_REDESIGN_v1.md |
Citation
```bibtex
@article{cyberpaisa2026deterministic,
  title={Deterministic Observability and Resilience Engineering for
         Multi-Agent LLM Systems: An Experimental Framework
         with Formal Verification},
  author={Cyber Paisa and Enigma Group},
  year={2026},
  note={27K+ LOC, 719 tests, 25 modules, 4 Z3 theorems,
        21 Avalanche attestations, Apache 2.0, pip install dof-sdk}
}
```
Contributing
Contributions welcome. See CONTRIBUTING.md for guidelines.
License
This project is licensed under the Business Source License 1.1. Free for non-commercial use, research, and personal projects. Commercial use requires a separate agreement. Contact: @Cyber_paisa on Telegram.
On 2028-03-08 this project converts to Apache License 2.0.