a2a-spec
The open specification for testing, validating, and guaranteeing agent-to-agent interactions.

PyPI CI License Python 3.11+ Typed


The Problem

Multi-agent AI systems are impossible to test reliably. When Agent A changes its output format, Agent B silently breaks. LLM outputs are non-deterministic, so CI pipelines either skip testing or flake constantly. Existing tools focus on prompt evaluation or observability — none provide contract testing between agents.

The Solution

a2a-spec is a specification, testing, and validation layer for multi-agent systems. Define what one agent expects from another as a YAML spec. Record LLM outputs as snapshots. Replay them deterministically in CI with zero LLM calls. Detect structural and semantic regressions before they reach production.

Agent A ──[spec]──> Agent B ──[spec]──> Agent C
    │                   │                   │
    └── snapshot ──> replay ──> validate ──> ✓ CI passes

What a2a-spec is NOT

| a2a-spec is not | Examples | What a2a-spec is |
| --- | --- | --- |
| An agent framework | LangChain, CrewAI, AutoGen | A testing layer that sits alongside any framework |
| An observability tool | LangSmith, Arize, Langfuse | A validation engine that runs in CI, not production |
| A prompt evaluation tool | Promptfoo, DeepEval | A contract testing system between agents |
| An agent runtime | n/a | A specification framework for agent boundaries |

Quick Start

Install

pip install a2a-spec

With optional features:

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern.
pip install "a2a-spec[semantic]"    # Embedding-based semantic comparison
pip install "a2a-spec[langchain]"   # LangChain adapter
pip install "a2a-spec[dev]"         # Testing and linting tools
pip install "a2a-spec[all]"         # Everything

Initialize a project

a2aspec init --name my-project

This creates:

my-project/
├── a2a-spec.yaml              # Project configuration
└── a2a_spec/
    ├── specs/                  # Agent-to-agent contracts
    │   └── example-spec.yaml
    ├── snapshots/              # Recorded outputs (committed to git!)
    ├── scenarios/              # Test input scenarios
    └── adapters/               # Agent wrappers

Define a spec

A spec is a YAML contract between a producer agent and a consumer agent. It defines structural, semantic, and policy requirements:

# a2a_spec/specs/triage-to-resolution.yaml
spec:
  name: triage-to-resolution
  version: "1.0"
  producer: triage-agent
  consumer: resolution-agent
  description: "What the resolution agent expects from triage"

  structural:
    type: object
    required: [category, summary, confidence]
    properties:
      category:
        type: string
        enum: [billing, shipping, product, general]
      summary:
        type: string
        minLength: 10
        maxLength: 500
      confidence:
        type: number
        minimum: 0.0
        maximum: 1.0

  semantic:
    - rule: summary_reflects_input
      description: "Summary must faithfully reflect the customer message"
      method: embedding_similarity
      threshold: 0.8

  policy:
    - rule: no_pii
      description: "Output must not contain PII"
      method: regex
      patterns:
        - '\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b'  # Credit card
        - '\b\d{3}-\d{2}-\d{4}\b'                       # SSN
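
To make the regex policy rule concrete, here is a minimal sketch in plain Python. It is not a2a-spec's implementation: the helper name `find_pii` and the choice to scan the JSON-serialized output are assumptions for illustration. Only the two patterns come from the spec above.

```python
import json
import re

# The two patterns from the no_pii rule in the spec above.
PII_PATTERNS = [
    re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),  # Credit card
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # SSN
]


def find_pii(output: dict) -> list[str]:
    """Return every substring of the serialized output that matches a PII pattern.

    Hypothetical helper: serializing the whole output to JSON is one simple way
    to reach nested string values; a2a-spec may scan differently.
    """
    text = json.dumps(output)
    return [m.group(0) for pattern in PII_PATTERNS for m in pattern.finditer(text)]


clean = {"category": "billing", "summary": "Customer reports duplicate charge"}
leaky = {"category": "billing", "summary": "Card 4111 1111 1111 1111 was charged twice"}

print(find_pii(clean))
print(find_pii(leaky))
```

An empty list means the policy passes. Any match is a violation that should fail the test.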

Record snapshots

a2aspec record  # Calls live agents via adapters, saves outputs to disk

Snapshots are JSON files committed to git — they become your deterministic test baselines.
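
As a rough illustration of what "a JSON file committed to git" can look like, the sketch below writes one snapshot to disk. The file layout (`<root>/<agent>/<scenario>.json`), the record fields, and the helper names `fingerprint` and `save_snapshot` are all assumptions. The on-disk format a2a-spec actually uses may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(input_data: dict) -> str:
    """SHA-256 of the canonical (sorted-key, compact) JSON form of the input."""
    canonical = json.dumps(input_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_snapshot(root: Path, agent_id: str, scenario: str, input_data: dict, output: dict) -> Path:
    """Write one snapshot as pretty-printed JSON (hypothetical layout and fields)."""
    path = root / agent_id / f"{scenario}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "agent_id": agent_id,
        "scenario": scenario,
        "input_fingerprint": fingerprint(input_data),
        "input": input_data,
        "output": output,
    }
    # Sorted keys and a fixed indent keep git diffs small and stable.
    path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path


# A temporary directory stands in for ./a2a_spec/snapshots.
root = Path(tempfile.mkdtemp())
snapshot_path = save_snapshot(
    root,
    "triage-agent",
    "billing_overcharge",
    {"message": "I was charged twice"},
    {"category": "billing", "summary": "Customer reports duplicate charge", "confidence": 0.95},
)
print(snapshot_path.relative_to(root))
```

Fingerprinting the canonical form of the input means two inputs that differ only in key order map to the same baseline.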

Test in CI (zero LLM calls)

a2aspec test --replay  # Validates saved snapshots against specs

No API keys needed. No LLM costs. Fully deterministic. Runs in milliseconds.
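
The reason replay is deterministic is that it only reads saved JSON and checks it against the spec. The sketch below shows that idea with a hand-written checker for the handful of structural keywords used in the spec above. It is an illustration only: the function name `check` is hypothetical, and a full JSON Schema validator covers far more than this subset.

```python
import json

# Structural section of the triage-to-resolution spec, as a Python dict.
STRUCTURAL = {
    "type": "object",
    "required": ["category", "summary", "confidence"],
    "properties": {
        "category": {"type": "string", "enum": ["billing", "shipping", "product", "general"]},
        "summary": {"type": "string", "minLength": 10, "maxLength": 500},
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
    },
}

# JSON Schema type names mapped to Python types (only the ones this spec uses).
PYTHON_TYPES = {"string": str, "number": (int, float)}


def check(output: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    errors = [f"{name}: missing required field" for name in schema["required"] if name not in output]
    for name, rules in schema["properties"].items():
        if name not in output:
            continue
        value = output[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[rules["type"]]):
            errors.append(f"{name}: expected {rules['type']}")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} is not an allowed value")
        if "minLength" in rules and len(value) < rules["minLength"]:
            errors.append(f"{name}: shorter than {rules['minLength']} characters")
        if "maxLength" in rules and len(value) > rules["maxLength"]:
            errors.append(f"{name}: longer than {rules['maxLength']} characters")
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: below minimum {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{name}: above maximum {rules['maximum']}")
    return errors


# A snapshot as it might be read from disk; no agent or LLM is called here.
saved = json.loads(
    '{"output": {"category": "billing", "summary": "Customer reports duplicate charge", "confidence": 0.95}}'
)
print(check(saved["output"], STRUCTURAL))
```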

Detect semantic drift

After changing a prompt or upgrading a model:

a2aspec record   # Re-record with the new configuration
a2aspec diff     # Compare new vs. baseline outputs

The diff engine reports structural changes (fields added/removed/type-changed) and semantic drift (meaning shifted beyond threshold), with severity levels from LOW to CRITICAL.
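
The structural half of that report can be pictured with a small sketch. It compares top-level fields only and does not assign severity levels. The function name `structural_diff` and the wording of the change labels are assumptions, not the diff engine's actual output.

```python
def structural_diff(old: dict, new: dict) -> list[tuple[str, str]]:
    """Compare top-level fields and report (field, change) pairs in field-name order."""
    changes = []
    for field in sorted(old.keys() | new.keys()):
        if field not in new:
            changes.append((field, "removed"))
        elif field not in old:
            changes.append((field, "added"))
        elif type(old[field]) is not type(new[field]):
            old_type = type(old[field]).__name__
            new_type = type(new[field]).__name__
            changes.append((field, f"type changed: {old_type} -> {new_type}"))
    return changes


baseline = {"category": "billing", "summary": "Customer charged twice", "confidence": 0.95}
current = {"category": "billing", "confidence": "high", "priority": 2}

for field, change in structural_diff(baseline, current):
    print(f"{field}: {change}")
```

Here `confidence` changes type, `priority` is added, and `summary` is removed, while the unchanged `category` field is not reported.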


Core Concepts

| Concept | Description |
| --- | --- |
| Spec | A YAML file defining what one agent expects from another: structure, semantics, and policy rules |
| Snapshot | A recorded LLM output for a given input, stored as JSON and committed to git |
| Replay | Running validation against saved snapshots with zero LLM calls: fast, free, deterministic |
| Diff | Structural + semantic comparison between old and new agent outputs, with severity levels |
| Pipeline | A DAG of agents with routing conditions, tested end-to-end with spec validation at each step |
| Adapter | A wrapper around your agent (function, HTTP, LangChain) so a2a-spec can call it |

→ See docs/concepts.md for detailed explanations.


Adapters — Wrap Any Agent

a2a-spec is framework-agnostic. Adapters wrap your agents so the framework can call them during recording and testing.

Plain async functions

from a2a_spec import FunctionAdapter

async def my_triage_agent(input_data: dict) -> dict:
    # Your agent logic (calls OpenAI, Anthropic, local model, etc.)
    return {"category": "billing", "summary": "Customer reports duplicate charge", "confidence": 0.95}

adapter = FunctionAdapter(
    fn=my_triage_agent,
    agent_id="triage-agent",
    version="1.0.0",
    model="gpt-4",
)

HTTP endpoints

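import os

from a2a_spec import HTTPAdapter

adapter = HTTPAdapter(
    url="http://localhost:8000/triage",
    agent_id="triage-agent",
    version="1.0.0",
    # Python does not expand "$TOKEN" inside a string, so read it from the environment.
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    timeout=30.0,
)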

Custom adapters

from a2a_spec import AgentAdapter, AgentMetadata, AgentResponse

class MyCrewAIAdapter(AgentAdapter):
    def get_metadata(self) -> AgentMetadata:
        return AgentMetadata(agent_id="my-crew-agent", version="1.0")

    async def call(self, input_data: dict) -> AgentResponse:
        result = await my_crew.kickoff(input_data)
        return AgentResponse(output=result.dict())

→ See docs/writing-adapters.md for the full guide.
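
To show why a single async entry point is enough for recording, here is a self-contained sketch that uses stand-in classes rather than a2a-spec's own. `SketchAdapter`, `SketchFunctionAdapter`, `fake_triage_agent`, and `record_all` are hypothetical names, and the real base class returns richer response objects than a plain dict.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Awaitable, Callable


class SketchAdapter(ABC):
    """Stand-in for an adapter base class: one async entry point per agent."""

    def __init__(self, agent_id: str, version: str) -> None:
        self.agent_id = agent_id
        self.version = version

    @abstractmethod
    async def call(self, input_data: dict) -> dict:
        """Send input to the wrapped agent and return its output."""


class SketchFunctionAdapter(SketchAdapter):
    """Wraps a plain async function so it can be called like any other adapter."""

    def __init__(self, fn: Callable[[dict], Awaitable[dict]], agent_id: str, version: str) -> None:
        super().__init__(agent_id, version)
        self._fn = fn

    async def call(self, input_data: dict) -> dict:
        return await self._fn(input_data)


async def fake_triage_agent(input_data: dict) -> dict:
    # A canned response stands in for a real model call.
    return {"category": "billing", "summary": "Customer reports duplicate charge", "confidence": 0.95}


async def record_all(adapters: list[SketchAdapter], input_data: dict) -> dict[str, dict]:
    """Call every adapter concurrently and key the outputs by agent id."""
    outputs = await asyncio.gather(*(adapter.call(input_data) for adapter in adapters))
    return {adapter.agent_id: output for adapter, output in zip(adapters, outputs)}


adapters = [SketchFunctionAdapter(fake_triage_agent, "triage-agent", "1.0.0")]
recorded = asyncio.run(record_all(adapters, {"message": "I was charged twice"}))
print(recorded)
```

Because the caller only depends on `call`, a function, an HTTP endpoint, or a framework object can be swapped in without changing the recording loop.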


Pipeline Testing

Test entire multi-agent pipelines as a DAG. a2a-spec validates each agent's output against its spec and checks routing conditions:

pipeline:
  name: customer-support
  agents:
    triage-agent: {}
    billing-agent: {}
    shipping-agent: {}
    resolution-agent: {}
  edges:
    - from: triage-agent
      to: billing-agent
      condition: "output.category == 'billing'"
    - from: triage-agent
      to: shipping-agent
      condition: "output.category == 'shipping'"
    - from: [billing-agent, shipping-agent]
      to: resolution-agent
  test_cases:
    - name: billing_flow
      input: { message: "I was charged twice" }

a2aspec pipeline test pipeline.yaml --mode replay

→ See docs/architecture.md for the pipeline execution model.
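
The routing in the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the executor a2a-spec ships: the list-valued `from` edge is expanded into two plain edges, the helper names are hypothetical, and the condition evaluator accepts only the `output.<field> == <literal>` form, parsed with `ast` instead of `eval()`.

```python
import ast

# Edges from the customer-support pipeline; condition None means "always follow".
EDGES = [
    {"from": "triage-agent", "to": "billing-agent", "condition": "output.category == 'billing'"},
    {"from": "triage-agent", "to": "shipping-agent", "condition": "output.category == 'shipping'"},
    {"from": "billing-agent", "to": "resolution-agent", "condition": None},
    {"from": "shipping-agent", "to": "resolution-agent", "condition": None},
]


def condition_holds(expr: str, output: dict) -> bool:
    """Evaluate `output.<field> == <literal>` without calling eval()."""
    node = ast.parse(expr, mode="eval").body
    if not (isinstance(node, ast.Compare) and len(node.ops) == 1 and isinstance(node.ops[0], ast.Eq)):
        raise ValueError(f"unsupported condition: {expr}")
    left, right = node.left, node.comparators[0]
    if not (
        isinstance(left, ast.Attribute)
        and isinstance(left.value, ast.Name)
        and left.value.id == "output"
        and isinstance(right, ast.Constant)
    ):
        raise ValueError(f"unsupported condition: {expr}")
    return output.get(left.attr) == right.value


def next_agents(agent: str, output: dict) -> list[str]:
    """Return the downstream agents whose edge condition is satisfied by this output."""
    return [
        edge["to"]
        for edge in EDGES
        if edge["from"] == agent
        and (edge["condition"] is None or condition_holds(edge["condition"], output))
    ]


def trace_route(start: str, outputs: dict[str, dict]) -> list[str]:
    """Follow edges from `start` using canned per-agent outputs; return the visit order."""
    route: list[str] = []
    frontier = [start]
    while frontier:
        agent = frontier.pop(0)
        if agent in route:
            continue
        route.append(agent)
        frontier.extend(next_agents(agent, outputs.get(agent, {})))
    return route


print(trace_route("triage-agent", {"triage-agent": {"category": "billing"}}))
```

In replay mode the per-agent outputs would come from saved snapshots, so the route taken is fixed by the recorded data.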


Configuration

Project configuration lives in a2a-spec.yaml:

project_name: "my-project"
version: "1.0"

specs_dir: "./a2a_spec/specs"
scenarios_dir: "./a2a_spec/scenarios"

semantic:
  provider: sentence-transformers
  model: all-MiniLM-L6-v2     # Lazy-loaded, only when needed
  enabled: true

storage:
  backend: local
  path: ./a2a_spec/snapshots

ci:
  fail_on_semantic_drift: true
  drift_threshold: 0.15
  replay_mode: exact
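
One way to read `drift_threshold` is as a cap on how far a new output's embedding may move from the baseline. The sketch below assumes drift is defined as 1 minus cosine similarity, which this README does not state, and uses toy 3-dimensional vectors in place of real sentence embeddings. The function names are hypothetical.

```python
import math

DRIFT_THRESHOLD = 0.15  # ci.drift_threshold from the config above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def drifted(baseline_vec: list[float], current_vec: list[float], threshold: float = DRIFT_THRESHOLD) -> bool:
    """Assumed definition: drift = 1 - cosine similarity; True when it exceeds the threshold."""
    return 1.0 - cosine_similarity(baseline_vec, current_vec) > threshold


# Toy vectors stand in for embeddings of the baseline and re-recorded summaries.
baseline = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [0.0, 1.0, 0.0]

print(drifted(baseline, close))
print(drifted(baseline, far))
```

With `fail_on_semantic_drift: true`, a result like the second one would be the kind of change that fails the CI run.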

Python API

Use a2a-spec programmatically in your existing test suite:

from a2a_spec import load_spec, validate_output, SnapshotStore, ReplayEngine

# Load and validate
spec = load_spec("a2a_spec/specs/triage-to-resolution.yaml")
result = validate_output(
    {"category": "billing", "summary": "Customer charged twice", "confidence": 0.95},
    spec,
)
assert result.passed

# Replay snapshots
store = SnapshotStore("./a2a_spec/snapshots")
engine = ReplayEngine(store)
output = engine.replay("triage-agent", "billing_overcharge")

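# Diff two outputs (old_output and new_output are illustrative values)
from a2a_spec import DiffEngine
old_output = {"category": "billing", "summary": "Customer charged twice", "confidence": 0.95}
new_output = {"category": "billing", "summary": "Customer reports a duplicate charge", "confidence": 0.90}
diff = DiffEngine()
results = diff.diff(old_output, new_output, semantic_threshold=0.85)
for r in results:
    print(f"{r.field}: {r.severity} - {r.explanation}")

# Policy enforcement (named policy_engine so it does not shadow the replay engine above)
from a2a_spec.policy.engine import PolicyEngine
from a2a_spec.policy.builtin import no_pii_in_output
policy_engine = PolicyEngine()
policy_engine.register_validator("no_pii", no_pii_in_output)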

CLI Reference

| Command | Description |
| --- | --- |
| a2aspec init [DIR] | Scaffold a new a2a-spec project with examples |
| a2aspec record | Record live agent outputs as snapshots |
| a2aspec test --replay | Validate snapshots against specs (deterministic, zero LLM calls) |
| a2aspec test --live | Validate live agent outputs against specs |
| a2aspec diff | Compare current outputs against baselines |
| a2aspec diff --agent NAME | Diff a specific agent only |
| a2aspec pipeline test FILE | Test a multi-agent pipeline DAG |
| a2aspec --version | Show version |

→ See docs/cli-reference.md for full options and flags.


CI Integration

a2a-spec is designed for CI-first workflows:

# .github/workflows/a2a-spec.yml
name: Agent Contract Tests
on: [push, pull_request]

jobs:
  spec-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install a2a-spec
      - run: a2aspec test --replay

Key principle: Record locally (with API keys), test in CI (with snapshots). Snapshots are committed to git — they are your test baselines.

| Output Format | Flag | Use Case |
| --- | --- | --- |
| Console (Rich) | --format console | Local development |
| Markdown | --format markdown | PR comments |
| JUnit XML | --format junit | CI test reporters |
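
For readers unfamiliar with the JUnit XML shape that CI test reporters consume, the sketch below builds a minimal report with the standard library. It is not the output of `--format junit`: the helper name `to_junit`, the result fields, and the `late_delivery` scenario are hypothetical, and the real report may carry more attributes.

```python
import xml.etree.ElementTree as ET


def to_junit(results: list[dict]) -> str:
    """Render validation results as a single JUnit <testsuite> element."""
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element("testsuite", name="a2a-spec", tests=str(len(results)), failures=str(failures))
    for r in results:
        case = ET.SubElement(suite, "testcase", classname=r["spec"], name=r["scenario"])
        if not r["passed"]:
            # A <failure> child marks the test case as failed for CI reporters.
            failure = ET.SubElement(case, "failure", message=r["message"])
            failure.text = r["message"]
    return ET.tostring(suite, encoding="unicode")


results = [
    {"spec": "triage-to-resolution", "scenario": "billing_overcharge", "passed": True, "message": ""},
    {
        "spec": "triage-to-resolution",
        "scenario": "late_delivery",
        "passed": False,
        "message": "category: 'refund' is not an allowed value",
    },
]

report = to_junit(results)
print(report)
```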

→ See docs/ci-integration.md for GitHub Actions, Jenkins, and more.


Comparison

Feature a2a-spec Pact DeepEval Promptfoo LangSmith
Agent-to-agent contracts
LLM output snapshots
Deterministic CI replay
Semantic drift detection
Policy enforcement (PII, etc.)
Pipeline DAG testing
Framework agnostic
Zero LLM calls in CI N/A
Typed Python API (PEP 561) N/A N/A

Architecture

src/a2a_spec/
├── cli/          # Typer CLI (init, record, test, diff, pipeline)
├── spec/         # Spec schema (Pydantic), YAML loader, JSON Schema validator
├── snapshot/     # Record, store, fingerprint, and replay engine
├── diff/         # Structural (JSON) + semantic (embedding) comparison
├── pipeline/     # DAG builder, topological executor, execution traces
├── adapters/     # Agent wrappers: function, HTTP, LangChain
├── policy/       # Policy engine with regex and custom validators
├── semantic/     # Embedding model interface (sentence-transformers)
├── reporting/    # Console (Rich), Markdown, JUnit XML, GitHub annotations
├── config/       # YAML config loader with Pydantic validation
├── _internal/    # SHA256 hashing, safe expression evaluator, type aliases
└── exceptions.py # Hierarchical error types with actionable messages

→ See docs/architecture.md for the full design.


Examples

The examples/customer_support/ directory contains a complete walkthrough:

  • Two agents (triage + resolution) with an a2a-spec contract
  • YAML spec with structural, semantic, and policy rules
  • Pre-recorded snapshot for deterministic replay
  • Test scenarios and pytest integration
  • Step-by-step README

Documentation

| Guide | Description |
| --- | --- |
| Getting Started | Installation and first test in 2 minutes |
| Core Concepts | Specs, snapshots, replay, diff explained |
| CLI Reference | Every command with all options |
| Writing Specs | Structural, semantic, and policy rules |
| Writing Adapters | Wrap any agent for a2a-spec |
| CI Integration | GitHub Actions, JUnit, exit codes |
| Architecture | Module design and extension points |

Contributing

Contributions are welcome. See CONTRIBUTING.md for the development setup, check commands, and PR process.


License

Apache 2.0 — see LICENSE for details.
