Deterministic Observability Framework — formal governance, privacy benchmarks, and adversarial testing for multi-agent LLM systems

VERIFY. PROVE. ATTEST.

Deterministic Observability Framework (DOF)

Deterministic governance for multi-agent LLM systems. Constitutional rules, formal proofs, and on-chain attestation on Avalanche.

Built with Python 3.11+ · Z3 SMT Solver · web3.py · BLAKE3 · Avalanche C-Chain · PostgreSQL

pip install dof-sdk

from dof import GenericAdapter
result = GenericAdapter().wrap_output("your agent output here")
# → {status: "pass", violations: [], score: 8.5}

30ms. Zero LLM tokens. Works with CrewAI, LangGraph, AutoGen, or anything that produces text.

python -m dof verify "your text here"   # governance check
python -m dof prove                      # Z3 formal verification
python -m dof health                     # component status
python -m dof benchmark                  # adversarial benchmark
python -m dof privacy                    # privacy benchmark
python -m dof version                    # show version

Contents

The Problem · Highlights · Architecture · Governance Layers · Z3 Verification · On-Chain · Benchmarks · Comparison · External Validation · Limitations · Citation


The Problem

LLM agents hallucinate. Nobody catches it deterministically. Using LLMs to verify LLMs is circular — the evaluator shares failure modes with the evaluated. Rate limits, cascading retries, and non-deterministic output quality interact across execution steps, producing unstable system-level behavior that cannot be attributed to specific infrastructure variables.

DOF solves this with 7 deterministic governance layers, formal Z3 proofs, and on-chain attestation — zero LLM tokens in the verification path.


Highlights

  • 7 governance layers — Constitution → AST → Supervisor → Z3 → Red/Blue → Memory → Signer
  • SS(f) = 1 − f³ — Z3 verified stability formula under bounded retries
  • GCR(f) = 1.0 — governance invariant under any failure rate (Z3 proven)
  • 21 on-chain attestations on Avalanche C-Chain mainnet
  • Merkle batching — 10,000 attestations = 1 tx ≈ $0.01
  • Automated benchmark — Governance 100%, Hallucination 90%, Consistency 100% FDR, 0% FPR
  • Privacy benchmark — 71% detection rate across 7 AgentLeak channels (PII, API keys, memory, tool inputs)
  • OpenTelemetry ready — optional OTLP tracing (pip install dof-sdk[otel])
  • EventBus — in-memory pub/sub with circular buffer, Redis/Kafka ready
  • Framework agnostic — CrewAI, LangGraph, AutoGen, or raw Python
  • A2A server (8 skills) + MCP server (10 tools) + REST API (14 endpoints)
  • 719 tests, 27K+ LOC, 25 core modules, 36 contributions
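
The EventBus highlight describes an in-memory pub/sub with a circular buffer. A minimal sketch of that idea, assuming a bounded `deque` as the buffer, might look like the following. The class name `TinyEventBus` and its methods are hypothetical and are not the DOF EventBus API.

```python
from collections import defaultdict, deque
from typing import Any, Callable


class TinyEventBus:
    """Hypothetical in-memory pub/sub with bounded history (not DOF's API)."""

    def __init__(self, history_size: int = 3) -> None:
        # deque(maxlen=N) behaves as a circular buffer: when full,
        # appending a new event silently drops the oldest one.
        self.history: deque = deque(maxlen=history_size)
        self._subscribers: dict = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        self.history.append((topic, payload))
        for handler in self._subscribers[topic]:
            handler(payload)


bus = TinyEventBus(history_size=3)
seen = []
bus.subscribe("governance", seen.append)
for i in range(5):
    bus.publish("governance", i)
```

Subscribers receive every event, while the history keeps only the most recent three, which bounds memory use regardless of event volume.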

Architecture

+----------------------------------------------------+
| L7  Signer       HMAC + Avalanche           ~2s    |
+----------------------------------------------------+
| L6  Memory Gov   Bi-temporal + decay        <1ms   |
+----------------------------------------------------+
| L5  Red/Blue     Red -> Guard -> Arb       ~50ms   |
+----------------------------------------------------+
| L4  Z3 Proofs    4 theorems UNSAT          ~10ms   |
+----------------------------------------------------+
| L3  Supervisor   Q+A+C+F scoring            ~5ms   |
+----------------------------------------------------+
| L2  AST Verifier eval/exec/secrets          <1ms   |
+----------------------------------------------------+
| L1  Constitution 4 HARD + 5 SOFT            <1ms   |
+----------------------------------------------------+
| Engine  DAG + LoopGuard + TokenTracker             |
+----------------------------------------------------+
| Data Oracle  6 verification strategies      <1ms   |
+----------------------------------------------------+

Total governance latency: < 70ms (layers 1-6). On-chain signing adds ~2s when enabled.


Seven Governance Layers

Layer 1 — Constitution. Hard rules block output (hallucination claims without sources, non-English text, empty output, >50K chars). Soft rules score but don't block (missing sources, no structure, repetition, no actionable steps). Pure regex + keyword matching. <1ms.
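
To illustrate the hard/soft split described above, here is a small sketch covering two of the hard rules (empty output, more than 50K characters) and two of the soft rules (missing sources, no actionable steps). The function name, the rule identifiers, and both regular expressions are assumptions made for illustration; DOF's actual patterns are not shown in this document.

```python
import re
from dataclasses import dataclass, field

MAX_CHARS = 50_000  # limit taken from the hard rule above
# Assumption: a URL counts as a source.
SOURCE_PATTERN = re.compile(r"https?://\S+")
# Assumption: numbered or bulleted list items count as actionable steps.
STEP_PATTERN = re.compile(r"^\s*(\d+\.|[-*])\s+", re.MULTILINE)


@dataclass
class Verdict:
    status: str
    hard_violations: list = field(default_factory=list)
    soft_violations: list = field(default_factory=list)


def check_constitution(text: str) -> Verdict:
    """Hypothetical rule check: hard rules block, soft rules only annotate."""
    hard, soft = [], []
    if not text.strip():
        hard.append("empty_output")
    if len(text) > MAX_CHARS:
        hard.append("too_long")
    if not SOURCE_PATTERN.search(text):
        soft.append("missing_sources")
    if not STEP_PATTERN.search(text):
        soft.append("no_actionable_steps")
    return Verdict("block" if hard else "pass", hard, soft)
```

Because every rule is a regex or length check, the same input always yields the same verdict, which is the property the layer relies on.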

Layer 2 — AST Verifier. Static analysis of agent-generated code via Python ast module. Blocks eval(), exec(), subprocess, os.system(), __import__(), and hardcoded secrets (OpenAI, GitHub, AWS patterns). <1ms.
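
The kind of static check described above can be sketched with the standard `ast` module. This is an illustrative approximation, not the DOF verifier: the function name and finding labels are hypothetical, and secret-pattern detection is left out.

```python
import ast

BLOCKED_CALLS = {"eval", "exec", "__import__"}
BLOCKED_ATTR_CALLS = {("os", "system")}
BLOCKED_MODULES = {"subprocess"}


def find_unsafe_constructs(source: str) -> list:
    """Return sorted labels for blocked calls and imports found in source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Direct calls such as eval(...) or exec(...)
            if isinstance(func, ast.Name) and func.id in BLOCKED_CALLS:
                findings.append(f"call:{func.id}")
            # Attribute calls such as os.system(...)
            elif (isinstance(func, ast.Attribute)
                  and isinstance(func.value, ast.Name)
                  and (func.value.id, func.attr) in BLOCKED_ATTR_CALLS):
                findings.append(f"call:{func.value.id}.{func.attr}")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    findings.append(f"import:{alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                findings.append(f"import:{node.module}")
    return sorted(findings)
```

The code is parsed but never executed, so the check itself cannot trigger the behavior it is looking for. Aliased names (for example `from os import system`) would need extra handling that this sketch omits.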

Layer 3 — Meta-Supervisor. Weighted quality score: S = Q(0.40) + A(0.25) + C(0.20) + F(0.15). ACCEPT ≥ 7.0, RETRY ≥ 5.0, ESCALATE < 5.0. Cross-provider execution. ~5ms.
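
The weighted score and its thresholds can be written out directly. The sketch below assumes each component is scored on a 0 to 10 scale, which the thresholds suggest but the text does not state; the function names and the example values are hypothetical.

```python
# Weights as given above; Q, A, C, F are treated as opaque component scores.
WEIGHTS = {"Q": 0.40, "A": 0.25, "C": 0.20, "F": 0.15}


def supervisor_score(components: dict) -> float:
    """Weighted sum S = 0.40*Q + 0.25*A + 0.20*C + 0.15*F."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def decide(score: float) -> str:
    """Map a score to ACCEPT (>= 7.0), RETRY (>= 5.0), or ESCALATE (< 5.0)."""
    if score >= 7.0:
        return "ACCEPT"
    if score >= 5.0:
        return "RETRY"
    return "ESCALATE"


example = {"Q": 9.0, "A": 8.0, "C": 7.0, "F": 6.0}  # made-up component scores
score = supervisor_score(example)  # 3.6 + 2.0 + 1.4 + 0.9, about 7.9
decision = decide(score)
```

Since the weights sum to 1.0, the combined score stays on the same scale as its components.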

Layer 4 — Z3 Formal Proofs. Four machine-checked theorems via Z3 SMT solver. GCR invariance, SS cubic derivation, SS strict monotonicity, SS boundary conditions. All UNSAT (no counterexample exists). ~10ms total.

Layer 5 — Red/Blue Adversarial. RedTeamAgent finds defects, GuardianAgent defends with evidence, DeterministicArbiter adjudicates using only passing tests / governance compliance / AST results. Zero LLM in final adjudication. ACR metric. ~50ms.

Layer 6 — Memory Governance. GovernedMemoryStore validates every write against Constitution. Bi-temporal versioning (valid_from, valid_to, recorded_at). Constitutional decay (λ=0.99/hour) with protected categories (decisions, errors immune to decay). <1ms.
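
A sketch of how bi-temporal fields and hourly decay could fit together is shown below. It reads λ=0.99/hour as a multiplicative factor applied per elapsed hour and measures age from `recorded_at`; both are assumptions, as are the `MemoryRecord` and `relevance` names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

DECAY_PER_HOUR = 0.99               # lambda from the text above
PROTECTED = {"decisions", "errors"}  # categories immune to decay


@dataclass
class MemoryRecord:
    content: str
    category: str
    valid_from: datetime                  # when the fact became true
    recorded_at: datetime                 # when the store learned it
    valid_to: Optional[datetime] = None   # None means still valid


def relevance(record: MemoryRecord, now: datetime) -> float:
    """Hypothetical decay weight: 0.99 ** hours since recording."""
    if record.category in PROTECTED:
        return 1.0
    hours = max((now - record.recorded_at).total_seconds() / 3600.0, 0.0)
    return DECAY_PER_HOUR ** hours


t0 = datetime(2026, 3, 1, tzinfo=timezone.utc)
note = MemoryRecord("provider X was slow", "observations", t0, t0)
decision = MemoryRecord("rotate providers on 429", "decisions", t0, t0)
```

Under this reading, an unprotected record keeps about 79% of its weight after one day (0.99 to the power 24), while protected records stay at full weight.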

Layer 7 — On-Chain Signer. HMAC-SHA256 signed attestation certificates. Compliance-gated: only GCR=1.0 attestations are published. BLAKE3 certificate hashing. Avalanche C-Chain mainnet via web3.py. ~2s.
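
The compliance gate and HMAC-SHA256 signature can be sketched with the standard library alone. The payload layout, the `gcr` field name, and the JSON canonicalization are assumptions for illustration; BLAKE3 hashing and on-chain publishing are omitted.

```python
import hashlib
import hmac
import json
from typing import Optional


def _canonical(payload: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte string to sign.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_attestation(payload: dict, key: bytes) -> Optional[dict]:
    """Sign only compliant payloads (gcr == 1.0); return None otherwise."""
    if payload.get("gcr") != 1.0:
        return None
    signature = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_attestation(cert: dict, key: bytes) -> bool:
    expected = hmac.new(key, _canonical(cert["payload"]), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, cert["signature"])


key = b"demo-key"  # placeholder; a real key would come from secret storage
cert = sign_attestation({"agent": "demo", "gcr": 1.0}, key)
```

Any change to the payload after signing makes verification fail, and non-compliant payloads never receive a signature in the first place.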


Formal Verification (Z3)

| Theorem | Math | English | Z3 Result |
| --- | --- | --- | --- |
| GCR Invariant | ∀f∈[0,1]: GCR(f)=1.0 | Governance is independent of failure rate | UNSAT |
| SS Cubic | ∀f∈[0,1]: SS(f)=1−f³ | Stability follows cubic decay (r=2 retries) | UNSAT |
| SS Monotonicity | f₁<f₂ ⟹ SS(f₁)>SS(f₂) | More failures = less stability | UNSAT |
| SS Boundaries | SS(0)=1.0 ∧ SS(1)=0.0 | Perfect at 0% failure, zero at 100% | UNSAT |

10ms total. Proof certificates: logs/z3_proofs.json.
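
One way to read the cubic formula, assuming each attempt fails independently with probability f and that r=2 retries give three attempts in total, is the short derivation below. It is a sketch of why the three SS theorems are plausible, not a reproduction of the Z3 encoding.

```latex
% Assumption: attempts fail independently, each with probability f.
\begin{align*}
SS(f) &= 1 - \Pr[\text{all } r+1 \text{ attempts fail}] = 1 - f^{\,r+1} \\
      &= 1 - f^{3} \qquad (r = 2) \\[4pt]
0 \le f_1 < f_2 \le 1 &\;\Longrightarrow\; f_1^{3} < f_2^{3}
      \;\Longrightarrow\; SS(f_1) > SS(f_2) \\[4pt]
SS(0) &= 1 - 0^{3} = 1, \qquad SS(1) = 1 - 1^{3} = 0
\end{align*}
```

An UNSAT result for each theorem means the solver found no assignment that satisfies the negated claim, which is how the table above should be read.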


On-Chain Attestation

| Field | Value |
| --- | --- |
| Contract | 0x88f6043B091055Bbd896Fc8D2c6234A47C02C052 |
| Network | Avalanche C-Chain (chain ID 43114) |
| Attestations | 21 (March 2026) |
| Functions | registerAttestation(), registerBatch(), isCompliant(), getAttestation() |
| Cost | ≈ $0.01 per transaction, whether it carries a single attestation or a Merkle batch of 10,000 |
| Deployer | 0xB529f4f99ab244cfa7a48596Bf165CAc5B317929 |

Three verification layers: PostgreSQL (200ms) → Enigma Scanner (900ms) → Avalanche on-chain (2-3s, immutable).
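
Merkle batching is what lets 10,000 attestations share one transaction: only a single root hash needs to go on-chain. The sketch below builds such a root with SHA-256 from the standard library. The hash function, the odd-level duplication rule, and the function names are assumptions; the layout used by DOF's MerkleBatcher is not described here.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> bytes:
    """Fold hashed leaves pairwise until a single 32-byte root remains."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# 10,000 placeholder attestations collapse into one root for one transaction.
attestations = [f"attestation-{i}".encode() for i in range(10_000)]
root = merkle_root(attestations)
```

Changing any single attestation changes the root, so the one on-chain value commits to the whole batch.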


Benchmark Results

Adversarial Benchmark (400 generated tests, deterministic)

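| Category | FDR | FPR | F1 | Tests |
| --- | --- | --- | --- | --- |
| Governance | 100.0% | 0.0% | 100.0% | 100 |
| Code Safety | 86.0% | 0.0% | 92.5% | 100 |
| Hallucination | 90.0% | 0.0% | 94.7% | 100 |
| Consistency | 100.0% | 0.0% | 100.0% | 100 |
| Overall F1 | | | 96.8% | 400 |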
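
The F1 column is consistent with treating FDR as recall and, because FPR is 0%, precision as 1.0, with the overall figure being the unweighted mean across categories. That reading is an assumption rather than something the benchmark description states; the sketch below only checks that the arithmetic lines up.

```python
def f1_from_fdr(fdr: float, precision: float = 1.0) -> float:
    """Harmonic mean of precision and recall, with FDR taken as recall."""
    if fdr + precision == 0:
        return 0.0
    return 2 * precision * fdr / (precision + fdr)


# FDR values from the benchmark table above.
fdr_by_category = {
    "Governance": 1.00,
    "Code Safety": 0.86,
    "Hallucination": 0.90,
    "Consistency": 1.00,
}
f1_by_category = {name: f1_from_fdr(fdr) for name, fdr in fdr_by_category.items()}
overall_f1 = sum(f1_by_category.values()) / len(f1_by_category)
```

Rounded to one decimal place, these values match the reported 92.5%, 94.7%, and 96.8%.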

Production Results (n=30 runs, real infrastructure)

| Metric | Value | Interpretation |
| --- | --- | --- |
| SS | 0.90 ± 0.31 | 90% execution stability |
| GCR | 1.00 ± 0.00 | Perfect governance invariance |
| PFI | 0.61 ± 0.18 | Provider failures recovered via rotation |
| Supervisor | 27/30 ACCEPT | 90% acceptance rate |

Comparison

Feature DOF LangChain CrewAI Langfuse
Constitutional governance 7 layers
Z3 formal proofs 4 theorems
AST code safety Deterministic
On-chain attestation Avalanche
Adversarial Red/Blue DeterministicArbiter
Governed memory Bi-temporal + decay
FDR/FPR benchmark Automated
Token tracking Per-call Per-call
Execution DAG Critical path Trace tree
Framework agnostic Any string output LangChain only CrewAI only Any (tracing)
MCP server 10 tools
REST API 14 endpoints API
Open source BSL 1.1 (converts to Apache 2.0 on 2028-03-08) MIT MIT MIT/Commercial

Production Agents

Two DOF-governed agents operating on Avalanche mainnet, ranked #1 and #2 of 1,772 agents on erc-8004scan.xyz:

| Agent | Token ID | Wallet | Protocols | Status |
| --- | --- | --- | --- | --- |
| Apex Arbitrage | #1687 | 0xcd59...a983 | A2A + OASF (7 skills) | ACTIVE |
| AvaBuilder | #1686 | 0x29a4...E71a | A2A + OASF (5 skills) | ACTIVE |

Combined trust score: 0.85. Component weights: governance 0.35, safety 0.15, infrastructure 0.15, activity 0.15, community 0.20 (the weights sum to 1.00).


External Validation (Google Colab)

Tested externally via pip install dof-sdk==0.2.2 — zero internal dependencies.

| Test | Result | Time |
| --- | --- | --- |
| Z3 Formal Proofs (4/4) | VERIFIED | 19.25ms |
| MerkleBatcher (plain text) | PASSED | 0.31ms |
| Error Classifier (7/7 classes) | PASSED | 1.28ms |

Full reports: tests/external/dof_enterprise_report.json (v0.2.1) and tests/external/dof_enterprise_report_v2.json (v0.2.2)


Honest Limitations

  • Hallucination detection is regex-based — 6 deterministic strategies (pattern matching, cross-reference, consistency, entity extraction, numerical plausibility, self-consistency) achieve 90% FDR on adversarial tests. Misses semantic hallucinations without known-facts coverage.
  • No correlated or cascading failure modeling — SS(f)=1−f³ assumes independent failures.
  • Supervisor is itself an LLM — mitigated by cross-provider execution and deterministic governance layer, but circularity is bounded, not eliminated.
  • Free-tier infrastructure — 3/30 runs fail from provider exhaustion cascades where all 5 providers hit rate limits simultaneously.
  • Finite sample sizes — n=20-30 per configuration; rare tail events not statistically guaranteed.
  • No economic cost modeling — token costs tracked but not optimized.
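
To make the independence limitation concrete, the sketch below uses a deliberately simple mixture model: with probability `shared_outage` all attempts share one outcome (as when every provider hits its rate limit at once), and otherwise they fail independently. The model and the parameter name are hypothetical and are not part of DOF.

```python
def stability(f: float, shared_outage: float = 0.0, attempts: int = 3) -> float:
    """Probability that at least one attempt succeeds under the mixture model.

    shared_outage = 0.0 reproduces SS(f) = 1 - f**3 (independent failures);
    shared_outage = 1.0 means retries never help, giving 1 - f.
    """
    if not (0.0 <= f <= 1.0 and 0.0 <= shared_outage <= 1.0):
        raise ValueError("f and shared_outage must be in [0, 1]")
    p_all_fail = shared_outage * f + (1.0 - shared_outage) * f ** attempts
    return 1.0 - p_all_fail


independent = stability(0.1)                 # about 0.999
fully_correlated = stability(0.1, 1.0)       # about 0.9
```

Even a modest share of correlated outages pulls stability well below the cubic bound, which is why the formula should be read as an upper estimate when failures can cascade.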

Links

| Resource | URL |
| --- | --- |
| PyPI | pypi.org/project/dof-sdk |
| GitHub | github.com/Cyberpaisa/deterministic-observability-framework |
| Snowtrace | snowtrace.io/address/0x88f6...C052 |
| Enigma Scanner | erc-8004scan.xyz |
| Paper | paper/PAPER_OBSERVABILITY_LAB.md |
| Getting Started | docs/GETTING_STARTED.md |
| Architecture | docs/ARCHITECTURAL_REDESIGN_v1.md |

Citation

@article{cyberpaisa2026deterministic,
  title={Deterministic Observability and Resilience Engineering for
         Multi-Agent LLM Systems: An Experimental Framework
         with Formal Verification},
  author={Cyber Paisa and Enigma Group},
  year={2026},
  note={27K+ LOC, 719 tests, 25 modules, 4 Z3 theorems,
        21 Avalanche attestations, Apache 2.0, pip install dof-sdk}
}

Contributing

Contributions welcome. See CONTRIBUTING.md for guidelines.


License

This project is licensed under the Business Source License 1.1. Free for non-commercial use, research, and personal projects. Commercial use requires a separate agreement. Contact: @Cyber_paisa on Telegram.

On 2028-03-08 this project converts to Apache License 2.0.

