Formal behavioral specification and runtime enforcement for autonomous AI agents. Agent Behavioral Contracts (ABC).
AgentAssert
Formal Behavioral Contracts for AI Agents
AgentAssert is the formal behavioral specification and runtime enforcement engine for autonomous AI agents. Define what your agent must and must not do in a YAML contract, then enforce those rules at runtime with mathematical guarantees.
It is the only framework combining all 6 pillars of rigorous agent governance:
- ContractSpec DSL -- YAML-based behavioral specification with 14 operators
- Hard/Soft Constraints -- Formal separation with graduated enforcement and recovery
- Drift Detection -- Jensen-Shannon Divergence for distributional behavioral analysis
- (p, delta, k)-Satisfaction -- Probabilistic compliance guarantees with statistical bounds
- Compositional Safety Proofs -- Formal bounds for multi-agent pipelines
- Mathematical Stability -- Ornstein-Uhlenbeck dynamics with Lyapunov stability proof
Paper: Bhardwaj, V.P. (2026). AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents. arXiv:2602.22302
Install
pip install "agentassert-abc[yaml,math]"
Requires Python 3.12+. Licensed under AGPL-3.0.
Optional extras:
| Extra | What it adds |
|---|---|
| yaml | YAML contract parsing (ruamel.yaml) |
| math | Drift detection, Theta computation (scipy, numpy) |
| llm | Recovery re-prompting (LiteLLM) |
| otel | OpenTelemetry metric export |
| all | Everything above |
Quick Start -- 5 Minutes to Behavioral Contracts
import agentassert_abc as aa
from agentassert_abc.integrations.generic import GenericAdapter

# 1. Load a domain contract (12 included out of the box)
contract = aa.load("contracts/examples/ecommerce-product-recommendation.yaml")

# 2. Create an adapter
adapter = GenericAdapter(contract)

# 3. Monitor agent output on every turn
result = adapter.check({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})
print(f"Hard violations: {result.hard_violations}")
print(f"Soft violations: {result.soft_violations}")

# 4. Raise on critical violations
adapter.check_and_raise({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})

# 5. Get session reliability score (Theta)
summary = adapter.session_summary()
print(f"Reliability (Theta): {summary.theta:.3f}")
print(f"Deploy-ready: {summary.theta >= 0.90}")
Framework Integration
AgentAssert ships adapters for major agent frameworks, including LangGraph, CrewAI, and the OpenAI Agents SDK.
LangGraph -- Node Interception
import agentassert_abc as aa
from langgraph.graph import StateGraph, START, END
from agentassert_abc.exceptions import ContractBreachError
from agentassert_abc.integrations.langgraph import LangGraphAdapter

# State, classify_fn, respond_fn, and initial_state come from your application.
contract = aa.load("contracts/examples/customer-support.yaml")
adapter = LangGraphAdapter(contract)

# Wrap each node so its output is checked against the contract.
builder = StateGraph(State)
builder.add_node("classify", adapter.wrap_node(classify_fn))
builder.add_node("respond", adapter.wrap_node(respond_fn))
builder.add_edge(START, "classify")
builder.add_edge("classify", "respond")
builder.add_edge("respond", END)
graph = builder.compile()

try:
    result = graph.invoke(initial_state)
except ContractBreachError as e:
    print(f"Hard violation blocked: {e}")

print(f"Session Theta: {adapter.session_summary().theta:.3f}")
CrewAI -- Task Guardrails
import agentassert_abc as aa
from crewai import Agent, Task, Crew
from agentassert_abc.integrations.crewai import CrewAIAdapter

contract = aa.load("contracts/examples/research-assistant.yaml")
adapter = CrewAIAdapter(contract)

# Guardrail rejects output on hard violations -- CrewAI retries automatically.
# `researcher` is a crewai Agent defined elsewhere in your application.
research_task = Task(
    description="Research AI agent frameworks in 2026",
    expected_output="Cited report on top 5 frameworks",
    agent=researcher,
    guardrail=adapter.guardrail,
    guardrail_max_retries=3,
)
OpenAI Agents SDK -- Output Guardrails
import agentassert_abc as aa
from agents import Agent, Runner
from agentassert_abc.integrations.openai_agents import OpenAIAgentsAdapter

contract = aa.load("contracts/examples/healthcare-triage.yaml")
adapter = OpenAIAgentsAdapter(contract)

# TriageOutput is your application's structured output type.
agent = Agent(
    name="triage-agent",
    instructions="You are a medical triage assistant.",
    output_guardrails=[adapter.output_guardrail],
    output_type=TriageOutput,
)

# Run inside an async function or an async-aware REPL.
result = await Runner.run(agent, "I have chest pain", hooks=adapter.run_hooks)
print(f"Theta: {adapter.session_summary().theta:.3f}")
AgentContract-Bench -- 293 Scenarios, 12 Domains
AgentAssert ships with AgentContract-Bench, a benchmark suite of 293 scenarios across 12 real-world domains for testing contract enforcement accuracy.
Benchmark Results (v0.1.0)
| Domain | Scenarios | Pass Rate | Hard P/R/F1 | Soft P/R/F1 |
|---|---|---|---|---|
| E-Commerce (Product) | 50 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Financial Advisor | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Healthcare Triage | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| MCP Tool Server | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| RAG Agent | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Code Generation | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Customer Support | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (CS) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (Order) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Research Assistant | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Retail Shopping | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Telecom Support | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Total | 293 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
# Run benchmarks locally
python benchmarks/runner.py # All 293 scenarios
python benchmarks/runner.py --domain ecommerce # Single domain
python benchmarks/runner.py --verbose # Show details
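For readers unfamiliar with the P/R/F1 columns in the table above, the sketch below shows one common way to compute micro-averaged precision, recall, and F1 from expected versus detected violations. It is illustrative only: the function name `precision_recall_f1`, the scenario format, and the convention of returning 1.0 when a denominator is zero are assumptions, not the benchmark runner's actual implementation.

```python
def precision_recall_f1(scenarios):
    """Micro-averaged precision, recall, and F1.

    `scenarios` is an iterable of (expected, detected) pairs, each a set
    of violation names. This input format is hypothetical.
    """
    tp = fp = fn = 0
    for expected, detected in scenarios:
        tp += len(expected & detected)   # violations correctly flagged
        fp += len(detected - expected)   # flagged but not expected
        fn += len(expected - detected)   # expected but missed
    # Assumed convention: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical scenarios: one hit, one clean turn, one missed violation.
scenarios = [
    ({"no-pii-leak"}, {"no-pii-leak"}),
    (set(), set()),
    ({"no-false-availability"}, set()),
]
precision, recall, f1 = precision_recall_f1(scenarios)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

A score of 1.00 / 1.00 / 1.00 under this definition means every expected violation was flagged and nothing else was.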
Live LLM Benchmark -- Real Models, Real Contracts
We tested AgentAssert against 3 production LLMs on a 10-16 turn e-commerce session using the retail-shopping-assistant contract with real Azure AI Foundry endpoints:
| Model | Turns | Hard Violations | Soft Violations | Theta | Mean Drift |
|---|---|---|---|---|---|
| GPT-5.3 (OpenAI) | 16 | 0 | 11 | 0.688 | 0.034 |
| Claude Sonnet 4.6 (Anthropic) | 10 | 4 | 0 | 0.823 | 0.020 |
| Mistral-Large-3 (Mistral) | 10 | 5 | 0 | 0.813 | 0.025 |
Key findings:
- GPT-5.3 achieved zero hard violations but exhibited soft quality drift (response completeness and latency)
- Claude Sonnet 4.6 and Mistral-Large-3 triggered no-false-availability hard violations -- fabricating product availability without catalog access
- All three models scored below the 0.90 Theta threshold for autonomous deployment, demonstrating why runtime behavioral contracts are essential
These results are consistent with the findings reported in arXiv:2602.22302. AgentAssert catches violations that traditional guardrails miss because it tracks behavioral drift over entire sessions, not just individual outputs.
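To make the idea of session-level drift concrete, here is a minimal standard-library sketch of Jensen-Shannon Divergence between a baseline window and a recent window of behavior labels. The labels, window sizes, and helper names (`distribution`, `js_divergence`) are hypothetical; how AgentAssert builds its distributions internally is not described here, so treat this as an illustration of the metric only.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Relative frequency of each category in a non-empty list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon Divergence in bits: symmetric and bounded by [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical per-turn behavior labels for two windows of a session.
categories = ["answer", "refuse", "escalate"]
baseline = ["answer"] * 8 + ["refuse"] * 1 + ["escalate"] * 1
recent = ["answer"] * 4 + ["refuse"] * 5 + ["escalate"] * 1

drift = js_divergence(
    distribution(baseline, categories),
    distribution(recent, categories),
)
print(f"JSD between windows: {drift:.3f}")
```

Because the mixture distribution is nonzero wherever either input is, the divergence stays finite even when a category is absent from one window, which is one reason JSD suits this kind of comparison.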
Domain Contracts -- Ready to Use
12 production-ready contracts ship with AgentAssert in contracts/examples/:
| Contract | Domain | Hard | Soft | Key Checks |
|---|---|---|---|---|
| ecommerce-product-recommendation | E-Commerce | 7 | 8 | PII, competitor mentions, sponsored disclosure |
| ecommerce-order-management | E-Commerce | 7 | 8 | Payment data, order accuracy, refund policy |
| ecommerce-customer-service | E-Commerce | 7 | 8 | Escalation, SLA, customer sentiment |
| financial-advisor | Finance | 7 | 8 | Regulatory compliance, risk disclosure, suitability |
| healthcare-triage | Healthcare | 9 | 7 | Medical safety, urgency detection, no diagnosis |
| retail-shopping-assistant | Retail | 7 | 9 | Availability, pricing accuracy, upsell limits |
| telecom-customer-support | Telecom | 7 | 9 | Plan accuracy, billing, cancellation handling |
| code-generation | Dev Tools | 7 | 7 | License compliance, security, test coverage |
| research-assistant | Research | 6 | 7 | Citation accuracy, source attribution, bias |
| customer-support | General | 6 | 5 | Tone, escalation, resolution quality |
| mcp-tool-server | MCP (2026) | 6 | 5 | Tool authorization, rate limits, output bounds |
| rag-agent | RAG (2026) | 7 | 7 | Hallucination, source grounding, retrieval quality |
ContractSpec DSL
Define behavioral contracts in YAML:
contractspec: "0.1"
kind: agent
name: my-agent-contract
description: Behavioral contract for my agent
version: "1.0.0"
invariants:
  hard:
    - name: no-pii-leak
      description: Never expose personal information
      check:
        field: output.pii_detected
        equals: false
  soft:
    - name: tone-quality
      description: Maintain professional tone
      check:
        field: output.tone_score
        gte: 0.7
      recovery: fix-tone
      recovery_window: 2
recovery:
  strategies:
    - name: fix-tone
      type: inject_correction
      actions:
        - "Rewrite with professional tone"
satisfaction:
  p: 0.95
  delta: 0.1
  k: 3
14 operators: equals, not_equals, gt, gte, lt, lte, in, not_in, contains, not_contains, matches, exists, expr, between
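The sketch below illustrates how checks of this shape could be evaluated against a flat state dict. It is not AgentAssert's engine: the `OPERATORS` table covers only a subset of the 14 operators, and the semantics chosen here (a missing field fails the check, `between` is inclusive, `matches` is a regex search) are assumptions made for illustration.

```python
import operator
import re

# Hypothetical semantics for a subset of the operators listed above.
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "in": lambda value, options: value in options,
    "not_in": lambda value, options: value not in options,
    "contains": lambda value, item: item in value,
    "matches": lambda value, pattern: re.search(pattern, str(value)) is not None,
    "between": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


def check_passes(check, state):
    """True if every operator in `check` holds for the named field."""
    field = check["field"]
    if field not in state:
        return False  # assumption: a missing field counts as a failure
    value = state[field]
    return all(
        OPERATORS[name](value, expected)
        for name, expected in check.items()
        if name != "field"
    )


def evaluate(invariants, state):
    """Return the names of violated invariants, split into hard and soft."""
    return {
        kind: [
            inv["name"]
            for inv in invariants.get(kind, [])
            if not check_passes(inv["check"], state)
        ]
        for kind in ("hard", "soft")
    }


# The invariants from the YAML example, written as a Python dict.
invariants = {
    "hard": [{"name": "no-pii-leak",
              "check": {"field": "output.pii_detected", "equals": False}}],
    "soft": [{"name": "tone-quality",
              "check": {"field": "output.tone_score", "gte": 0.7}}],
}
report = evaluate(invariants, {"output.pii_detected": False,
                               "output.tone_score": 0.55})
print(report)
```

Keeping hard and soft results in separate lists mirrors the contract's split: a caller can halt on any hard entry and route soft entries to a recovery strategy.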
Writing Your Own Contract
- Identify fields -- Examine your agent's output and list the fields that matter for safety and quality
- Map to flat dict -- AgentAssert uses output.field_name as keys (e.g., {"output.safe": True})
- Choose constraint type -- Hard for non-negotiable safety (violations halt execution), Soft for quality goals (violations trigger recovery)
- Set satisfaction -- p = target compliance rate, delta = tolerance, k = max violations before alert
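For the "map to flat dict" step, a small helper can turn a nested agent result into the dotted output.field_name keys used throughout this README. The helper name `flatten` and the choice to join nested keys with dots are assumptions for illustration; check your contract's field names before relying on a particular nesting scheme.

```python
def flatten(nested, prefix="output"):
    """Flatten a nested dict into dotted keys, e.g. {"output.safe": True}."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse so {"scores": {"tone": 0.8}} becomes "output.scores.tone".
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical agent result produced by your own detectors and scorers.
agent_result = {
    "pii_detected": False,
    "brand_tone_score": 0.85,
    "scores": {"relevance": 0.9},
}
state = flatten(agent_result)
print(state)
```

The resulting dict can then be passed to an adapter's check call in the same form as the Quick Start example.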
SPRT Certification
Certify agents for production with 50-80% fewer test sessions using Sequential Probability Ratio Testing:
from agentassert_abc.certification.sprt import SPRTCertifier, SPRTDecision

certifier = SPRTCertifier(p0=0.85, p1=0.95, alpha=0.05, beta=0.10)

# session_results is an iterable of booleans, one per completed test session.
for session_passed in session_results:
    result = certifier.update(session_passed)
    if result.decision != SPRTDecision.CONTINUE:
        print(f"Decision: {result.decision.value} after {result.sessions_used} sessions")
        break
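To show why a sequential test can stop early, here is a minimal sketch of Wald's classic SPRT for pass/fail sessions using the same p0, p1, alpha, and beta parameters. The class `WaldSPRT` and its string return values are hypothetical and stand in for the textbook procedure; whether SPRTCertifier uses exactly these thresholds is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class WaldSPRT:
    """Textbook sequential test of H0: p = p0 against H1: p = p1 (p1 > p0)."""
    p0: float
    p1: float
    alpha: float  # tolerated probability of accepting H1 when H0 is true
    beta: float   # tolerated probability of accepting H0 when H1 is true
    llr: float = 0.0   # running log-likelihood ratio
    sessions: int = 0

    def update(self, passed: bool) -> str:
        self.sessions += 1
        if passed:
            self.llr += math.log(self.p1 / self.p0)
        else:
            self.llr += math.log((1 - self.p1) / (1 - self.p0))
        upper = math.log((1 - self.beta) / self.alpha)
        lower = math.log(self.beta / (1 - self.alpha))
        if self.llr >= upper:
            return "accept"    # evidence favors the higher pass rate p1
        if self.llr <= lower:
            return "reject"    # evidence favors the lower pass rate p0
        return "continue"


# Hypothetical session outcomes: True means the session satisfied the contract.
test = WaldSPRT(p0=0.85, p1=0.95, alpha=0.05, beta=0.10)
decision = "continue"
for outcome in [True] * 40:
    decision = test.update(outcome)
    if decision != "continue":
        break
print(decision, test.sessions)
```

Each failure moves the log-likelihood ratio much further than each pass when p0 and p1 are both high, so a few early failures end the test quickly while an unbroken run of passes needs more sessions to certify.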
Compositional Guarantees
Prove safety bounds for multi-agent pipelines:
from agentassert_abc.certification.composition import compose_guarantees
# Agent A (p=0.95) -> Agent B (p=0.98), handoff reliability 0.99
bound = compose_guarantees(p_a=0.95, p_b=0.98, p_h=0.99)
print(f"Pipeline bound: {bound:.3f}") # p_{A+B} >= 0.921
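As a worked check of the numbers in that example, the sketch below multiplies the three probabilities, which is the bound one would get if both agents and the handoff succeed independently. That independence reading, and the helper name `product_bound`, are assumptions; the exact formula used by compose_guarantees is given in the paper, not here.

```python
def product_bound(p_a, p_b, p_h):
    """Lower bound on pipeline compliance, assuming independent stages."""
    for name, p in (("p_a", p_a), ("p_b", p_b), ("p_h", p_h)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    return p_a * p_b * p_h


bound = product_bound(p_a=0.95, p_b=0.98, p_h=0.99)
print(f"Independent-stage bound: {bound:.5f}")
```

The product 0.95 x 0.98 x 0.99 is 0.92169, which is consistent with the 0.921 lower bound noted in the comment above. It also shows how quickly guarantees erode as stages are chained: every added stage can only lower the pipeline bound.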
How AgentAssert Differs
| Dimension | AgentAssert | Guardrails AI | NeMo Guardrails | Microsoft AGT |
|---|---|---|---|---|
| Formal math (Theta, SPRT) | Yes | No | No | No |
| Session drift detection (JSD) | Yes | No | No | No |
| Compositional safety proofs | Yes | No | No | No |
| Hard/Soft constraint separation | Yes | Partial | No | No |
| Recovery re-prompting | Yes | Yes | Yes | No |
| Framework integrations | 10 adapters | 3 | 1 (LangChain) | 2 |
| Statistical certification (SPRT) | Yes | No | No | No |
| Benchmark suite | 293 scenarios | No | No | No |
| Academic paper | arXiv:2602.22302 | No | No | No |
Examples
See examples/ for runnable demos:
| Example | What It Shows |
|---|---|
| 01_basic_monitoring.py | Simplest usage -- load, monitor, get Theta |
| 02_ecommerce_session.py | Full e-commerce session from the paper |
| 03_drift_detection.py | JSD-based behavioral drift over 20 turns |
| 04_sprt_certification.py | SPRT statistical certification |
| 05_langgraph_middleware.py | LangGraph StateGraph integration |
| 06_crewai_integration.py | CrewAI task guardrails |
| 07_composition_pipeline.py | Multi-agent compositional bounds |
| 08_mcp_tool_monitoring.py | MCP tool server monitoring |
Research Paper
"AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents"
The theoretical foundations, formal proofs, and experimental validation are presented in an arXiv preprint covering all 6 pillars of the framework, with full mathematical treatment of the Reliability Index, drift dynamics, compositional guarantees, and SPRT certification.
Read the paper on arXiv (cs.AI + cs.SE)
Cite This Work
@article{bhardwaj2026agentassert,
title={AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents},
author={Bhardwaj, Varun Pratap},
journal={arXiv preprint arXiv:2602.22302},
year={2026},
url={https://arxiv.org/abs/2602.22302}
}
Contributing
Contributions welcome. See CONTRIBUTING.md for setup instructions, coding standards, and submission guidelines.
License
GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE.
For commercial licensing (closed-source, proprietary, or hosted use), see COMMERCIAL-LICENSE.md or contact varun.pratap.bhardwaj@gmail.com.
Copyright (c) 2026 Varun Pratap Bhardwaj / Qualixar.
Part of Qualixar -- AI Agent Reliability Engineering
A research initiative by Varun Pratap Bhardwaj
qualixar.com · varunpratap.com · arXiv:2602.22302 · agentassert.com
⭐ Support This Project
If this project solves a real problem for you, please star the repo — it helps other developers discover Qualixar and signals that the AI agent reliability community is growing. Every star matters.
Part of the Qualixar AI Agent Reliability Platform
Qualixar is building the open-source infrastructure for AI agent reliability engineering. Seven products, seven peer-reviewed papers, one coherent platform. Each tool solves one reliability pillar:
| Product | Purpose | Install | Paper |
|---|---|---|---|
| SuperLocalMemory | Persistent memory + learning for AI agents | npx superlocalmemory | arXiv:2604.04514 |
| Qualixar OS | Universal agent runtime (13 execution topologies) | npx qualixar-os | arXiv:2604.06392 |
| SLM Mesh | P2P coordination across AI agent sessions | npm i slm-mesh | — |
| SLM MCP Hub | Federate 430+ MCP tools through one gateway | pip install slm-mcp-hub | — |
| AgentAssay | Token-efficient AI agent testing | pip install agentassay | arXiv:2603.02601 |
| AgentAssert | Behavioral contracts + drift detection | pip install agentassert-abc | arXiv:2602.22302 |
| SkillFortify | Formal verification for AI agent skills | pip install skillfortify | arXiv:2603.00195 |
Zero cloud dependency. Local-first. EU AI Act compliant.
Start here → qualixar.com · All papers on Qualixar HuggingFace
File details
Details for the file agentassert_abc-0.2.3.tar.gz.
File metadata
- Download URL: agentassert_abc-0.2.3.tar.gz
- Upload date:
- Size: 154.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9f499f5bfa2a0665280790fe93f255c56b44f6e97f2661503767c8cef299753f |
| MD5 | a5e7344c62ffc0ca9ae6952e5abb62a4 |
| BLAKE2b-256 | 240149e858f4d31ca54f128674f11a050fb3810a303550ff8366c977143cc3d9 |
File details
Details for the file agentassert_abc-0.2.3-py3-none-any.whl.
File metadata
- Download URL: agentassert_abc-0.2.3-py3-none-any.whl
- Upload date:
- Size: 59.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3ee40edaaf0992be3bdd97017d268e7c2ee0bb2981069ee65b2f535be856b1f8 |
| MD5 | 944f1abc79fbbf4f0ca7a2a89c38ebd6 |
| BLAKE2b-256 | 4bbeb352b5585078718711e9aa1eede415a01ed5903591bdc08adbc4d9f99f4b |