
agentverify


pytest for AI agents. Assert tool calls, not vibes.

agentverify is a pytest plugin for deterministic testing of AI agent behavior. Record real LLM calls once, replay them in CI with zero cost, and assert exactly which tools were called, in what order, with what arguments — plus cost budgets and safety guardrails. Framework-agnostic, provider-agnostic, zero LLM cost in CI.

Why agentverify?

Most AI testing tools evaluate what an LLM says. agentverify tests what an agent does.

When agents move from prototype to production, the questions change: did the agent call the right tools in the right order? Did it stay within budget? Did it avoid dangerous operations? These are deterministic properties you can assert in CI, the same way you test any other code.

Unlike HTTP-level recorders that capture raw network traffic, agentverify records at the LLM SDK level — capturing tool calls, token usage, and model responses as first-class objects you can assert against. And unlike eval frameworks that score output quality with LLM-as-judge, agentverify asserts deterministic properties: routing correctness, cost control, and safety boundaries.

agentverify brings that discipline to agent development. It works with any framework — Strands Agents, LangChain, CrewAI, or plain Python — and any LLM provider. Just build an ExecutionResult from your agent's output and write pytest assertions.

Install

pip install agentverify

Quick Start — No LLM Required

Copy this into test_agent.py and run pytest. No API keys, no cassettes — just pure assertions.

from agentverify import (
    ExecutionResult, ToolCall, ANY,
    assert_tool_calls, assert_cost, assert_no_tool_call, assert_final_output,
)

# Build an ExecutionResult from your agent's output (or a dict)
result = ExecutionResult.from_dict({
    "tool_calls": [
        {"name": "get_location", "arguments": {"city": "Tokyo"}},
        {"name": "get_weather", "arguments": {"lat": 35.6, "lon": 139.7}},
    ],
    "token_usage": {"input_tokens": 50, "output_tokens": 30},
    "total_cost_usd": 0.002,
    "final_output": "The weather in Tokyo is sunny, 22°C.",
})

def test_tool_sequence():
    assert_tool_calls(result, expected=[
        ToolCall("get_location", {"city": "Tokyo"}),
        ToolCall("get_weather", {"lat": ANY, "lon": ANY}),
    ])

def test_budget():
    assert_cost(result, max_tokens=500, max_cost_usd=0.01)

def test_safety():
    assert_no_tool_call(result, forbidden_tools=["delete_user", "drop_table"])

def test_output():
    assert_final_output(result, contains="Tokyo")

Using a real agent framework? You don't need to build dicts by hand — write a small converter function instead. See Building from Your Framework below.

$ pytest test_agent.py -v
test_agent.py::test_tool_sequence PASSED
test_agent.py::test_budget PASSED
test_agent.py::test_safety PASSED
test_agent.py::test_output PASSED

3 Steps to Test Your Agent

Step 1: Build an ExecutionResult

Build an ExecutionResult from your agent's output. You can construct one from a dict, or use a framework-specific converter (see examples/ for Strands Agents and LangChain converters).

from agentverify import ExecutionResult

result = ExecutionResult.from_dict({
    "tool_calls": [
        {"name": "get_location", "arguments": {"city": "Tokyo"}},
        {"name": "get_weather", "arguments": {"lat": 35.6, "lon": 139.7}},
    ],
    "token_usage": {"input_tokens": 50, "output_tokens": 30},
    "total_cost_usd": 0.002,
    "final_output": "The weather in Tokyo is sunny, 22°C.",
})

ExecutionResult.from_dict() accepts the following keys:

Key Type Description
tool_calls list[dict] Each dict has name (str, required), arguments (dict, optional), result (any, optional — tool execution result stored for reference; not used in assertions)
token_usage dict or None {"input_tokens": int, "output_tokens": int}
total_cost_usd float or None Total cost in USD (must be set manually — not auto-calculated from tokens)
final_output str or None The agent's final text response

You can also use ExecutionResult.from_json(json_string) to parse from a JSON string, and to_dict() / to_json() for serialization.
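To picture the shape `from_dict()` expects before installing anything, here is a simplified pure-Python stand-in for the schema above. This is an illustration of the data model, not the library's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)
    result: Any = None  # stored for reference; not used in assertions

@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int

@dataclass
class ExecutionResult:
    tool_calls: list
    token_usage: Optional[TokenUsage] = None
    total_cost_usd: Optional[float] = None
    final_output: Optional[str] = None

    @classmethod
    def from_dict(cls, d: dict) -> "ExecutionResult":
        usage = d.get("token_usage")
        return cls(
            tool_calls=[
                ToolCall(tc["name"], tc.get("arguments", {}), tc.get("result"))
                for tc in d.get("tool_calls", [])
            ],
            token_usage=TokenUsage(**usage) if usage else None,
            total_cost_usd=d.get("total_cost_usd"),
            final_output=d.get("final_output"),
        )

r = ExecutionResult.from_dict({
    "tool_calls": [{"name": "get_weather"}],
    "token_usage": {"input_tokens": 50, "output_tokens": 30},
})
```

Only `name` is required per tool call; everything else defaults to `None` or an empty dict.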

Building from Your Framework

Every agent framework has its own output format. To use agentverify, you need a small converter function (~20–50 lines) that maps your framework's output to the dict schema above. Here's the general pattern:

from agentverify import ExecutionResult, ToolCall, TokenUsage

def my_framework_to_execution_result(agent_output) -> ExecutionResult:
    # 1. Extract tool calls: map your framework's tool call objects
    #    to ToolCall(name=..., arguments=...)
    tool_calls = [
        ToolCall(name=tc.tool_name, arguments=tc.params)
        for tc in agent_output.tool_history
    ]

    # 2. Extract token usage (if available)
    token_usage = TokenUsage(
        input_tokens=agent_output.metrics.prompt_tokens,
        output_tokens=agent_output.metrics.completion_tokens,
    )

    # 3. Extract final output text
    return ExecutionResult(
        tool_calls=tool_calls,
        token_usage=token_usage,
        final_output=agent_output.response_text,
    )

See examples/strands-file-organizer/converter.py and examples/langchain-issue-triage/converter.py for complete, production-ready converters. Built-in framework adapters are planned for a future release (see Roadmap).

Step 2: Assert

from agentverify import assert_tool_calls, assert_cost, assert_no_tool_call, assert_final_output, ToolCall, ANY

# Did the agent call the right tools in the right order?
assert_tool_calls(result, expected=[
    ToolCall("get_location", {"city": "Tokyo"}),
    ToolCall("get_weather", {"lat": ANY, "lon": ANY}),
])

# Did it stay within budget?
assert_cost(result, max_tokens=500, max_cost_usd=0.01)

# Did it avoid dangerous tools?
assert_no_tool_call(result, forbidden_tools=["delete_user", "drop_table"])

# Did the final output contain the expected content?
# See "Final Output Assertions" below for equals and regex options.
assert_final_output(result, contains="Tokyo")

Step 3: Record & Replay with Cassettes

Record real LLM API calls once. Replay them in CI forever — zero cost, deterministic.

⚠️ Important: Cassette replay uses sequential matching — responses are returned in recorded order without verifying request content. If your agent's prompts, tools, or model change, the cassette will still replay but may return stale/incorrect responses silently. Delete and re-record cassettes after significant agent changes. Request content matching is planned for a future release (see Roadmap).
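The staleness risk is easiest to see in a sketch. A purely sequential replayer (illustrative only, not the library's code) hands back recorded responses by position, so a changed prompt still receives the old answer without any error:

```python
class SequentialReplayer:
    """Replays recorded responses in order, ignoring request content."""

    def __init__(self, recorded_responses):
        self._responses = list(recorded_responses)
        self._cursor = 0

    def chat(self, prompt):
        # The prompt is never compared against what was recorded,
        # so a stale cassette replays silently.
        response = self._responses[self._cursor]
        self._cursor += 1
        return response

replayer = SequentialReplayer(["call get_location", "call get_weather"])
first = replayer.chat("What's the weather in Tokyo?")   # recorded response #1
second = replayer.chat("Completely different prompt!")  # recorded response #2 — stale!
```

This is why re-recording after significant agent changes matters: sequential matching keeps replay simple and deterministic, at the cost of silent drift.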

The cassette fixture is a pytest fixture provided by the agentverify plugin. It creates an LLMCassetteRecorder that intercepts LLM SDK calls (not HTTP — it patches the SDK's chat completion method directly). Use @pytest.mark.agentverify to mark your test, and call your agent code inside the with cassette(...) block. After the block exits, call rec.to_execution_result() to build the result for assertions.

import pytest
from agentverify import assert_tool_calls, ToolCall, ANY

@pytest.mark.agentverify
def test_weather_agent(cassette):
    with cassette("weather_agent.yaml", provider="openai") as rec:
        # Replace this with your actual agent invocation, e.g.:
        # agent.run("What's the weather in Tokyo?")
        run_my_agent("What's the weather in Tokyo?")

    # rec.to_execution_result() is called AFTER the with block exits
    result = rec.to_execution_result()
    assert_tool_calls(result, expected=[
        ToolCall("get_location", {"city": "Tokyo"}),
        ToolCall("get_weather", {"lat": ANY, "lon": ANY}),
    ])

Record the cassette once, then replay it forever:

# First run: record real LLM calls to cassette file
pytest --cassette-mode=record

# All subsequent runs: replay from cassette (zero cost, deterministic)
pytest

Cassettes are human-readable YAML (or JSON). Commit them to git, review in PRs.

Cassette modes:

Mode Behavior
AUTO (default) If cassette file exists → REPLAY. Otherwise → call real LLM API but don't save (no cassette file is created).
RECORD Always call real LLM API and save to cassette file.
REPLAY Always replay from cassette file. Raises error if file is missing.

To create a cassette, use mode=CassetteMode.RECORD explicitly or pass --cassette-mode=record on the command line. To re-record, simply run with RECORD again — the existing file is overwritten.
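The three modes reduce to a small decision table. Here is a sketch of the resolution logic, reconstructed from the behavior described above (not the library's source):

```python
from enum import Enum

class CassetteMode(Enum):
    AUTO = "auto"
    RECORD = "record"
    REPLAY = "replay"

def resolve_action(mode: CassetteMode, cassette_exists: bool) -> str:
    """Map a cassette mode plus file presence to the action taken."""
    if mode is CassetteMode.RECORD:
        return "call real API, save cassette"
    if mode is CassetteMode.REPLAY:
        if not cassette_exists:
            raise FileNotFoundError("cassette file is missing")
        return "replay from cassette"
    # AUTO: replay if a cassette exists, otherwise passthrough without saving
    return "replay from cassette" if cassette_exists else "call real API, don't save"
```

Note that AUTO never creates a cassette file on its own, which is why the first recording must be an explicit `--cassette-mode=record` run.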

# Record cassettes for all tests
pytest --cassette-mode=record

# Replay cassettes (default behavior when cassette files exist)
pytest

Other limitations:

  • total_cost_usd is not populated from cassettes. Use assert_cost(max_tokens=...) for cassette-based budget checks, or set total_cost_usd manually in your ExecutionResult.
  • Be mindful not to include sensitive data (API keys, PII, confidential prompts) in cassette files checked into version control.

Assertion Modes

from agentverify import assert_tool_calls, OrderMode, ToolCall, ANY

# Exact match — same tools, same order, same count (default)
assert_tool_calls(result, expected=[...])

# Subsequence — these tools appeared in this order (other calls in between are OK)
assert_tool_calls(result, expected=[...], order=OrderMode.IN_ORDER)

# Set membership — these tools were called (order doesn't matter)
assert_tool_calls(result, expected=[...], order=OrderMode.ANY_ORDER)

# Partial args — only check the keys you care about
assert_tool_calls(result, expected=[
    ToolCall("search", {"query": "Tokyo"}),
], partial_args=True)

# Collect all failures at once (doesn't stop at first)
from agentverify import assert_all
assert_all(
    result,
    lambda r: assert_tool_calls(r, expected=[...]),
    lambda r: assert_cost(r, max_tokens=1000),
    lambda r: assert_no_tool_call(r, forbidden_tools=["delete_user"]),
    lambda r: assert_final_output(r, contains="Tokyo"),
)
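The order modes and partial_args boil down to familiar matching rules. This simplified pure-Python sketch illustrates the semantics (it is not agentverify's implementation):

```python
def args_match(expected: dict, actual: dict, partial: bool = False) -> bool:
    """With partial=True, only the expected keys are compared."""
    if partial:
        return all(actual.get(k) == v for k, v in expected.items())
    return expected == actual

def in_order(expected_names, actual_names) -> bool:
    """Subsequence check: expected names appear in order, gaps allowed."""
    it = iter(actual_names)
    # `name in it` advances the iterator, so order is enforced.
    return all(name in it for name in expected_names)

actual = ["get_location", "search_web", "get_weather"]
in_order(["get_location", "get_weather"], actual)   # subsequence holds
in_order(["get_weather", "get_location"], actual)   # wrong order, fails
args_match({"query": "Tokyo"}, {"query": "Tokyo", "top_k": 5}, partial=True)
```

EXACT mode would additionally require equal length and position-by-position matches, and ANY_ORDER would compare multisets of calls instead of sequences.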

Strict Cost Assertions

By default, assert_cost() silently passes when token_usage or total_cost_usd is None (e.g., during cassette replay where cost data may be unavailable). Use strict=True to require that the data is present:

# Fails if token_usage is None, even if the budget would pass
assert_cost(result, max_tokens=500, strict=True)

# Fails if total_cost_usd is None
assert_cost(result, max_cost_usd=0.01, strict=True)
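The effect of the strict flag can be sketched as follows (an illustration of the semantics described above, not the library's code):

```python
def check_token_budget(token_total, max_tokens, strict=False):
    """Return True when within budget; missing data passes unless strict."""
    if token_total is None:
        if strict:
            raise AssertionError("token_usage is None (strict=True)")
        return True  # silently pass, matching the default behavior
    if token_total > max_tokens:
        raise AssertionError(f"budget exceeded: {token_total} > {max_tokens}")
    return True

check_token_budget(None, 500)             # passes silently (lenient default)
check_token_budget(80, 500, strict=True)  # data present and within budget
```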

Final Output Assertions

assert_final_output() verifies the agent's final text response. Use contains for substring checks, equals for exact match, or matches for regex:

from agentverify import assert_final_output

# Substring check
assert_final_output(result, contains="Tokyo")

# Exact match
assert_final_output(result, equals="The weather in Tokyo is sunny, 22°C.")

# Regex match
assert_final_output(result, matches=r"\d+°C")
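The three options map onto standard string operations. A minimal sketch of the check logic — an illustration of the semantics, which assumes `matches` behaves like an unanchored `re.search` (consistent with the `\d+°C` example above matching mid-sentence):

```python
import re

def check_final_output(text, contains=None, equals=None, matches=None):
    """Raise AssertionError if any provided condition fails."""
    if contains is not None and contains not in text:
        raise AssertionError(f"missing substring: {contains!r}")
    if equals is not None and text != equals:
        raise AssertionError(f"expected exact match: {equals!r}")
    if matches is not None and not re.search(matches, text):
        raise AssertionError(f"regex did not match: {matches!r}")

out = "The weather in Tokyo is sunny, 22°C."
check_final_output(out, contains="Tokyo")
check_final_output(out, matches=r"\d+°C")
```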

Framework Integration

agentverify is framework-agnostic. Build an ExecutionResult from any agent framework's output using a converter function. The examples/ directory includes ready-to-use converters:

Framework Converter Description
Strands Agents strands-file-organizer/converter.py Converts AgentResult → ExecutionResult
LangChain langchain-issue-triage/converter.py Converts AgentExecutor output → ExecutionResult

These converters are small (~50 lines) and easy to adapt for your own framework. Built-in framework adapters are planned for a future release (see Roadmap).

Supported LLM Providers

Provider Extra
OpenAI pip install agentverify[openai]
Amazon Bedrock pip install agentverify[bedrock]
Google Gemini pip install agentverify[gemini]
Anthropic pip install agentverify[anthropic]
LiteLLM pip install agentverify[litellm]
All providers pip install agentverify[all]

CI Integration

agentverify is designed for CI pipelines. Commit your cassette files to git and replay them in CI with zero LLM cost.

GitHub Actions

name: Agent Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-python@v6
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - run: pytest --tb=short -v

Cassette files in tests/cassettes/ are replayed automatically — no API keys or secrets needed in CI.

Error Messages

Clear, structured output when assertions fail:

ToolCallSequenceError: Tool call sequence mismatch at index 1

Expected:
  [0] get_location(city="Tokyo")
  [1] get_news(topic="weather")     ← first mismatch

Actual:
  [0] get_location(city="Tokyo")
  [1] search_web(query="Tokyo weather")  ← actual
CostBudgetError: Token budget exceeded

  Actual:  1,250 tokens
  Limit:   1,000 tokens
  Exceeded by: 250 tokens (25.0%)
SafetyRuleViolationError: 2 forbidden tool calls detected

  [1] delete_database(table="users") at position 3
  [2] drop_table(name="orders") at position 5
FinalOutputError: final_output does not contain expected substring

  Substring: 'Berlin'
  Actual:    'The weather in Tokyo is sunny, 22°C.'

Requirements

  • Python 3.10+
  • pytest 7+

Examples

The examples/ directory contains end-to-end examples with real agent frameworks and MCP servers. Each example ships with pre-recorded cassettes — run the tests without any API keys.

Example Framework Description
strands-file-organizer Strands Agents + Bedrock Scans a directory via Filesystem MCP, suggests organization. Read-only safety verified
langchain-issue-triage LangChain + OpenAI Triages GitHub issues via GitHub MCP. Label and priority suggestions
mcp-server — Mock GitHub MCP server for token-free testing

Try it:

git clone https://github.com/simukappu/agentverify.git
cd agentverify
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Strands File Organizer
pip install -e "examples/strands-file-organizer/.[dev]"
pytest examples/strands-file-organizer/tests -v
tests/test_file_organizer.py::test_tool_call_sequence PASSED
tests/test_file_organizer.py::test_token_budget PASSED
tests/test_file_organizer.py::test_safety_read_only PASSED
# LangChain Issue Triage
pip install -e "examples/langchain-issue-triage/.[dev]"
pytest examples/langchain-issue-triage/tests -v
tests/test_issue_triage.py::TestIssueTriage_MockMCP::test_tool_call_sequence PASSED
tests/test_issue_triage.py::TestIssueTriage_MockMCP::test_safety_read_and_label_only PASSED

See each example's README for agent execution instructions and recording mode details.

Roadmap

  • Agent framework adapters — extract ExecutionResult directly from Strands Agents, LangChain, and others without writing a converter
  • Tool mocking/stubbing — test agent routing logic without calling real tools
  • Async support — first-class asyncio testing for async agents and tools
  • Cassette request matching — verify request content during replay to detect stale cassettes
  • Cassette sanitization — automatic masking of API keys and sensitive data in recorded cassettes
  • Cost estimation from tokens — auto-calculate total_cost_usd from token usage and model pricing
  • YAML/JSON test case definitions — declarative test cases for non-Python CI pipelines
  • CLI test runner — run agent tests without pytest

Changelog

See CHANGELOG.md for release history.

License

MIT

Contributing

Contributions welcome. Please open an issue first to discuss what you'd like to change.

Development setup:

git clone https://github.com/simukappu/agentverify.git
cd agentverify
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
