Formal behavioral specification and runtime enforcement for autonomous AI agents. Agent Behavioral Contracts (ABC).

AgentAssert
Formal Behavioral Contracts for AI Agents


AgentAssert is the formal behavioral specification and runtime enforcement engine for autonomous AI agents. Define what your agent must and must not do in a YAML contract, then enforce those rules at runtime with mathematical guarantees.

It is the only framework combining all 6 pillars of rigorous agent governance:

  1. ContractSpec DSL -- YAML-based behavioral specification with 14 operators
  2. Hard/Soft Constraints -- Formal separation with graduated enforcement and recovery
  3. Drift Detection -- Jensen-Shannon Divergence for distributional behavioral analysis
  4. (p, delta, k)-Satisfaction -- Probabilistic compliance guarantees with statistical bounds
  5. Compositional Safety Proofs -- Formal bounds for multi-agent pipelines
  6. Mathematical Stability -- Ornstein-Uhlenbeck dynamics with Lyapunov stability proof

Paper: Bhardwaj, V.P. (2026). AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents. arXiv:2602.22302


Install

# Quote the requirement so shells such as zsh do not expand the square brackets
pip install "agentassert-abc[yaml,math]"

Requires Python 3.12+. Licensed under Elastic License 2.0.

Optional extras:

| Extra | What it adds |
| --- | --- |
| yaml | YAML contract parsing (ruamel.yaml) |
| math | Drift detection, Theta computation (scipy, numpy) |
| llm | Recovery re-prompting (LiteLLM) |
| otel | OpenTelemetry metric export |
| all | Everything above |

Quick Start -- 5 Minutes to Behavioral Contracts

import agentassert_abc as aa
from agentassert_abc.integrations.generic import GenericAdapter

# 1. Load a domain contract (12 included out of the box)
contract = aa.load("contracts/examples/ecommerce-product-recommendation.yaml")

# 2. Create an adapter
adapter = GenericAdapter(contract)

# 3. Monitor agent output on every turn
result = adapter.check({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})

print(f"Hard violations: {result.hard_violations}")
print(f"Soft violations: {result.soft_violations}")

# 4. Raise on critical violations
adapter.check_and_raise({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})

# 5. Get session reliability score (Theta)
summary = adapter.session_summary()
print(f"Reliability (Theta): {summary.theta:.3f}")
print(f"Deploy-ready: {summary.theta >= 0.90}")

Framework Integration

AgentAssert is plug-and-play with the major 2026 agent frameworks.

LangGraph -- Node Interception

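import agentassert_abc as aa
from langgraph.graph import StateGraph, START, END
from agentassert_abc.exceptions import ContractBreachError
from agentassert_abc.integrations.langgraph import LangGraphAdapter

# State, classify_fn, respond_fn, and initial_state come from your own application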

contract = aa.load("contracts/examples/customer-support.yaml")
adapter = LangGraphAdapter(contract)

builder = StateGraph(State)
builder.add_node("classify", adapter.wrap_node(classify_fn))
builder.add_node("respond", adapter.wrap_node(respond_fn))
builder.add_edge(START, "classify")
builder.add_edge("classify", "respond")
builder.add_edge("respond", END)

graph = builder.compile()

try:
    result = graph.invoke(initial_state)
except ContractBreachError as e:
    print(f"Hard violation blocked: {e}")

print(f"Session Theta: {adapter.session_summary().theta:.3f}")

CrewAI -- Task Guardrails

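import agentassert_abc as aa
from crewai import Agent, Task, Crew
from agentassert_abc.integrations.crewai import CrewAIAdapter

# researcher (used below) is a crewai Agent defined by your own application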

contract = aa.load("contracts/examples/research-assistant.yaml")
adapter = CrewAIAdapter(contract)

# Guardrail rejects output on hard violations -- CrewAI retries automatically
research_task = Task(
    description="Research AI agent frameworks in 2026",
    expected_output="Cited report on top 5 frameworks",
    agent=researcher,
    guardrail=adapter.guardrail,
    guardrail_max_retries=3,
)

OpenAI Agents SDK -- Output Guardrails

import agentassert_abc as aa
from agents import Agent, Runner
from agentassert_abc.integrations.openai_agents import OpenAIAgentsAdapter

contract = aa.load("contracts/examples/healthcare-triage.yaml")
adapter = OpenAIAgentsAdapter(contract)

agent = Agent(
    name="triage-agent",
    instructions="You are a medical triage assistant.",
    output_guardrails=[adapter.output_guardrail],
    output_type=TriageOutput,
)

# Runner.run is a coroutine: call it from inside an async function
result = await Runner.run(agent, "I have chest pain", hooks=adapter.run_hooks)
print(f"Theta: {adapter.session_summary().theta:.3f}")

AgentContract-Bench -- 293 Scenarios, 12 Domains

AgentAssert ships with AgentContract-Bench, a benchmark suite of 293 scenarios across 12 real-world domains for testing contract enforcement accuracy.

Benchmark Results (v0.1.0)

| Domain | Scenarios | Pass Rate | Hard P/R/F1 | Soft P/R/F1 |
| --- | --- | --- | --- | --- |
| E-Commerce (Product) | 50 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Financial Advisor | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Healthcare Triage | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| MCP Tool Server | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| RAG Agent | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Code Generation | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Customer Support | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (CS) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (Order) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Research Assistant | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Retail Shopping | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Telecom Support | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Total | 293 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |

# Run benchmarks locally
python benchmarks/runner.py                     # All 293 scenarios
python benchmarks/runner.py --domain ecommerce  # Single domain
python benchmarks/runner.py --verbose           # Show details
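
The P/R/F1 columns can be read as precision, recall, and F1 over detected versus expected violations. The following is a minimal sketch of that arithmetic, not the benchmark runner's code: the function name, the (scenario, constraint) pair representation, and the convention of scoring 1.0 when nothing is expected or detected are all assumptions made for illustration.

```python
def precision_recall_f1(expected, detected):
    """Score detected violations against expected ones.

    Both arguments are sets of (scenario_id, constraint_name) pairs.
    This representation is hypothetical, chosen for illustration.
    """
    true_pos = len(expected & detected)
    false_pos = len(detected - expected)
    false_neg = len(expected - detected)

    # Assumed convention: an empty denominator counts as a perfect score.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 1.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


expected = {("s1", "no-pii-leak"), ("s2", "no-pii-leak")}
detected = {("s1", "no-pii-leak"), ("s3", "tone-quality")}
print(precision_recall_f1(expected, detected))
```

A row of 1.00 / 1.00 / 1.00 then corresponds to the detected set matching the expected set exactly for that domain.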

Live LLM Benchmark -- Real Models, Real Contracts

We tested AgentAssert against 3 production LLMs on a 10-16 turn e-commerce session using the retail-shopping-assistant contract with real Azure AI Foundry endpoints:

| Model | Turns | Hard Violations | Soft Violations | Theta | Mean Drift |
| --- | --- | --- | --- | --- | --- |
| GPT-5.3 (OpenAI) | 16 | 0 | 11 | 0.688 | 0.034 |
| Claude Sonnet 4.6 (Anthropic) | 10 | 4 | 0 | 0.823 | 0.020 |
| Mistral-Large-3 (Mistral) | 10 | 5 | 0 | 0.813 | 0.025 |

Key findings:

  • GPT-5.3 achieved zero hard violations but exhibited soft quality drift (response completeness and latency)
  • Claude Sonnet 4.6 and Mistral-Large-3 triggered no-false-availability hard violations -- fabricating product availability without catalog access
  • All three models scored below the 0.90 Theta threshold for autonomous deployment, demonstrating why runtime behavioral contracts are essential

These results are consistent with the findings reported in arXiv:2602.22302. AgentAssert catches violations that traditional guardrails miss because it tracks behavioral drift over entire sessions, not just individual outputs.
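
To make the session-level drift idea concrete, here is a minimal pure-Python sketch of Jensen-Shannon divergence between two distributions of agent behavior. It is an illustration only, not AgentAssert's drift detector: the behavior categories, the early/late window split, and the function names are assumptions. With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no overlap).

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Turn a list of observed behavior labels into a probability vector."""
    if not labels:
        raise ValueError("need at least one observation")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence between two probability vectors."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical behavior labels for the first and last turns of a session.
categories = ["answer", "clarify", "refuse"]
early_turns = ["answer"] * 8 + ["clarify"] * 2
late_turns = ["answer"] * 4 + ["refuse"] * 6

drift = jensen_shannon_divergence(
    distribution(early_turns, categories),
    distribution(late_turns, categories),
)
print(f"JSD between early and late turns: {drift:.3f}")
```

Comparing a reference window against a recent window in this way flags a shift in the mix of behaviors even when every individual output passes its own checks.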


Domain Contracts -- Ready to Use

12 production-ready contracts ship with AgentAssert in contracts/examples/:

| Contract | Domain | Hard | Soft | Key Checks |
| --- | --- | --- | --- | --- |
| ecommerce-product-recommendation | E-Commerce | 7 | 8 | PII, competitor mentions, sponsored disclosure |
| ecommerce-order-management | E-Commerce | 7 | 8 | Payment data, order accuracy, refund policy |
| ecommerce-customer-service | E-Commerce | 7 | 8 | Escalation, SLA, customer sentiment |
| financial-advisor | Finance | 7 | 8 | Regulatory compliance, risk disclosure, suitability |
| healthcare-triage | Healthcare | 9 | 7 | Medical safety, urgency detection, no diagnosis |
| retail-shopping-assistant | Retail | 7 | 9 | Availability, pricing accuracy, upsell limits |
| telecom-customer-support | Telecom | 7 | 9 | Plan accuracy, billing, cancellation handling |
| code-generation | Dev Tools | 7 | 7 | License compliance, security, test coverage |
| research-assistant | Research | 6 | 7 | Citation accuracy, source attribution, bias |
| customer-support | General | 6 | 5 | Tone, escalation, resolution quality |
| mcp-tool-server | MCP (2026) | 6 | 5 | Tool authorization, rate limits, output bounds |
| rag-agent | RAG (2026) | 7 | 7 | Hallucination, source grounding, retrieval quality |

ContractSpec DSL

Define behavioral contracts in YAML:

contractspec: "0.1"
kind: agent
name: my-agent-contract
description: Behavioral contract for my agent
version: "1.0.0"

invariants:
  hard:
    - name: no-pii-leak
      description: Never expose personal information
      check:
        field: output.pii_detected
        equals: false

  soft:
    - name: tone-quality
      description: Maintain professional tone
      check:
        field: output.tone_score
        gte: 0.7
      recovery: fix-tone
      recovery_window: 2

recovery:
  strategies:
    - name: fix-tone
      type: inject_correction
      actions:
        - "Rewrite with professional tone"

satisfaction:
  p: 0.95
  delta: 0.1
  k: 3

14 operators: equals, not_equals, gt, gte, lt, lte, in, not_in, contains, not_contains, matches, exists, expr, between
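
As a rough mental model of how a `check` block relates to the flat state dict, the sketch below evaluates most of the listed operators in plain Python. It is not AgentAssert's evaluator: the semantics chosen here (inclusive `between`, `matches` as a regex search, a missing field failing every check except `exists: false`) are assumptions, and `expr` is left out because its syntax is not described above.

```python
import operator
import re

# Assumed semantics for each operator name; the real engine may differ.
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "in": lambda value, options: value in options,
    "not_in": lambda value, options: value not in options,
    "contains": lambda value, item: item in value,
    "not_contains": lambda value, item: item not in value,
    "matches": lambda value, pattern: re.search(pattern, value) is not None,
    "between": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


def evaluate_check(check, state):
    """Return True when the flat state dict satisfies the check."""
    field = check["field"]
    operands = {name: arg for name, arg in check.items() if name != "field"}
    expected_presence = operands.pop("exists", None)
    present = field in state

    if expected_presence is not None and present != expected_presence:
        return False
    if not present:
        # A missing field passes only when absence was the sole requirement.
        return expected_presence is False and not operands
    return all(OPERATORS[name](state[field], arg) for name, arg in operands.items())


state = {"output.pii_detected": False, "output.tone_score": 0.65}
print(evaluate_check({"field": "output.pii_detected", "equals": False}, state))
print(evaluate_check({"field": "output.tone_score", "gte": 0.7}, state))
```

In the contract above, the first check corresponds to the hard invariant no-pii-leak and the second to the soft invariant tone-quality.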


Writing Your Own Contract

  1. Identify fields -- Examine your agent's output and list the fields that matter for safety and quality
  2. Map to flat dict -- AgentAssert uses output.field_name as keys (e.g., {"output.safe": True})
  3. Choose constraint type -- Hard for non-negotiable safety (violations halt execution), Soft for quality goals (violations trigger recovery)
  4. Set satisfaction -- p = target compliance rate, delta = tolerance, k = max violations before alert

SPRT Certification

Certify agents for production with 50-80% fewer test sessions using Sequential Probability Ratio Testing:

from agentassert_abc.certification.sprt import SPRTCertifier, SPRTDecision

certifier = SPRTCertifier(p0=0.85, p1=0.95, alpha=0.05, beta=0.10)
for session_passed in session_results:
    result = certifier.update(session_passed)
    if result.decision != SPRTDecision.CONTINUE:
        print(f"Decision: {result.decision.value} after {result.sessions_used} sessions")
        break
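
For intuition about why a sequential test can stop early, here is a minimal sketch of Wald's classic SPRT for pass/fail sessions, using the same p0, p1, alpha, and beta as above. It is a textbook formulation, not the `SPRTCertifier` implementation; the function name and the decision labels are assumptions.

```python
import math


def sprt(outcomes, p0=0.85, p1=0.95, alpha=0.05, beta=0.10):
    """Wald's SPRT for Bernoulli outcomes (H0: p = p0 versus H1: p = p1).

    Returns (decision, sessions_used). The decision labels are illustrative.
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0
    pass_step = math.log(p1 / p0)
    fail_step = math.log((1 - p1) / (1 - p0))

    log_likelihood_ratio = 0.0
    sessions_used = 0
    for sessions_used, passed in enumerate(outcomes, start=1):
        log_likelihood_ratio += pass_step if passed else fail_step
        if log_likelihood_ratio >= upper:
            return "accept_h1", sessions_used
        if log_likelihood_ratio <= lower:
            return "accept_h0", sessions_used
    return "continue", sessions_used


print(sprt([True] * 40))
print(sprt([False] * 5))
```

Each passing session nudges the log-likelihood ratio up by a small step and each failing session pulls it down by a much larger one, so a run of failures ends the test quickly while certification needs a sustained run of passes.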

Compositional Guarantees

Prove safety bounds for multi-agent pipelines:

from agentassert_abc.certification.composition import compose_guarantees

# Agent A (p=0.95) -> Agent B (p=0.98), handoff reliability 0.99
bound = compose_guarantees(p_a=0.95, p_b=0.98, p_h=0.99)
print(f"Pipeline bound: {bound:.3f}")  # p_{A+B} >= 0.921
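
One simple way to read the numbers in this example is as a product of the per-stage reliabilities. The sketch below shows that arithmetic under that assumption; it is not `compose_guarantees`, the `product_bound` name is hypothetical, and the bound derived in the paper may include further terms. The plain product for these inputs is about 0.9217, which is close to, but not identical to, the 0.921 quoted in the comment above.

```python
import math


def product_bound(*reliabilities):
    """Multiply per-stage success probabilities (assumed independent)."""
    for p in reliabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return math.prod(reliabilities)


# Agent A, Agent B, and the handoff between them.
pipeline = product_bound(0.95, 0.98, 0.99)
print(f"Product of stage reliabilities: {pipeline:.5f}")
```

The practical takeaway is that every additional stage or handoff can only lower the end-to-end figure, so long pipelines need high per-stage reliability.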

How AgentAssert Differs

| Dimension | AgentAssert | Guardrails AI | NeMo Guardrails | Microsoft AGT |
| --- | --- | --- | --- | --- |
| Formal math (Theta, SPRT) | Yes | No | No | No |
| Session drift detection (JSD) | Yes | No | No | No |
| Compositional safety proofs | Yes | No | No | No |
| Hard/Soft constraint separation | Yes | Partial | No | No |
| Recovery re-prompting | Yes | Yes | Yes | No |
| Framework integrations | 10 adapters | 3 | 1 (LangChain) | 2 |
| Statistical certification (SPRT) | Yes | No | No | No |
| Benchmark suite | 293 scenarios | No | No | No |
| Academic paper | arXiv:2602.22302 | No | No | No |

Examples

See examples/ for runnable demos:

| Example | What It Shows |
| --- | --- |
| 01_basic_monitoring.py | Simplest usage -- load, monitor, get Theta |
| 02_ecommerce_session.py | Full e-commerce session from the paper |
| 03_drift_detection.py | JSD-based behavioral drift over 20 turns |
| 04_sprt_certification.py | SPRT statistical certification |
| 05_langgraph_middleware.py | LangGraph StateGraph integration |
| 06_crewai_integration.py | CrewAI task guardrails |
| 07_composition_pipeline.py | Multi-agent compositional bounds |
| 08_mcp_tool_monitoring.py | MCP tool server monitoring |

Research Paper

"AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents"

The theoretical foundations, formal proofs, and experimental validation are presented in an arXiv preprint covering all 6 pillars of the framework, with full mathematical treatment of the Reliability Index, drift dynamics, compositional guarantees, and SPRT certification.

Read the paper on arXiv (cs.AI + cs.SE)

Cite This Work

@article{bhardwaj2026agentassert,
  title={AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents},
  author={Bhardwaj, Varun Pratap},
  journal={arXiv preprint arXiv:2602.22302},
  year={2026},
  url={https://arxiv.org/abs/2602.22302}
}

Contributing

Contributions welcome. See CONTRIBUTING.md for setup instructions, coding standards, and submission guidelines.


License

Elastic License 2.0. See LICENSE for details.


Part of Qualixar -- AI Agent Reliability Engineering
A research initiative by Varun Pratap Bhardwaj

qualixar.com · varunpratap.com · arXiv:2602.22302 · agentassert.com
