# agentverify

pytest for AI agents. Assert tool calls, not vibes.
agentverify is a pytest plugin for deterministic testing of AI agent behavior. Record real LLM calls once, replay them in CI with zero cost, and assert exactly which tools were called, in what order, with what arguments — plus cost budgets and safety guardrails. Framework-agnostic, provider-agnostic, zero LLM cost in CI.
## Why agentverify?
Most AI testing tools evaluate what an LLM says. agentverify tests what an agent does.
When agents move from prototype to production, the questions change: did the agent call the right tools in the right order? Did it stay within budget? Did it avoid dangerous operations? These are deterministic properties you can assert in CI, the same way you test any other code.
Unlike HTTP-level recorders that capture raw network traffic, agentverify records at the LLM SDK level — capturing tool calls, token usage, and model responses as first-class objects you can assert against. And unlike eval frameworks that score output quality with LLM-as-judge, agentverify asserts deterministic properties: routing correctness, cost control, and safety boundaries.
agentverify brings that discipline to agent development. It works with any framework — Strands Agents, LangChain, CrewAI, or plain Python — and any LLM provider. Just build an ExecutionResult from your agent's output and write pytest assertions.
## Install

```shell
pip install agentverify
```
## Quick Start — No LLM Required

Copy this into `test_agent.py` and run `pytest`. No API keys, no cassettes — just pure assertions.

```python
from agentverify import (
    ExecutionResult, ToolCall, ANY,
    assert_tool_calls, assert_cost, assert_no_tool_call, assert_final_output,
)

# Build an ExecutionResult from your agent's output (or a dict)
result = ExecutionResult.from_dict({
    "tool_calls": [
        {"name": "get_location", "arguments": {"city": "Tokyo"}},
        {"name": "get_weather", "arguments": {"lat": 35.6, "lon": 139.7}},
    ],
    "token_usage": {"input_tokens": 50, "output_tokens": 30},
    "total_cost_usd": 0.002,
    "final_output": "The weather in Tokyo is sunny, 22°C.",
})

def test_tool_sequence():
    assert_tool_calls(result, expected=[
        ToolCall("get_location", {"city": "Tokyo"}),
        ToolCall("get_weather", {"lat": ANY, "lon": ANY}),
    ])

def test_budget():
    assert_cost(result, max_tokens=500, max_cost_usd=0.01)

def test_safety():
    assert_no_tool_call(result, forbidden_tools=["delete_user", "drop_table"])

def test_output():
    assert_final_output(result, contains="Tokyo")
```

Using a real agent framework? You don't need to build dicts by hand — write a small converter function instead. See Building from Your Framework below.

```shell
$ pytest test_agent.py -v

test_agent.py::test_tool_sequence PASSED
test_agent.py::test_budget PASSED
test_agent.py::test_safety PASSED
test_agent.py::test_output PASSED
```
## 3 Steps to Test Your Agent

### Step 1: Build an ExecutionResult

You can construct an `ExecutionResult` from a dict, or use a framework-specific converter (see `examples/` for Strands Agents and LangChain converters).

```python
from agentverify import ExecutionResult

result = ExecutionResult.from_dict({
    "tool_calls": [
        {"name": "get_location", "arguments": {"city": "Tokyo"}},
        {"name": "get_weather", "arguments": {"lat": 35.6, "lon": 139.7}},
    ],
    "token_usage": {"input_tokens": 50, "output_tokens": 30},
    "total_cost_usd": 0.002,
    "final_output": "The weather in Tokyo is sunny, 22°C.",
})
```
`ExecutionResult.from_dict()` accepts the following keys:

| Key | Type | Description |
|---|---|---|
| `tool_calls` | `list[dict]` | Each dict has `name` (str, required), `arguments` (dict, optional), `result` (any, optional — tool execution result stored for reference; not used in assertions) |
| `token_usage` | `dict` or `None` | `{"input_tokens": int, "output_tokens": int}` |
| `total_cost_usd` | `float` or `None` | Total cost in USD (must be set manually — not auto-calculated from tokens) |
| `final_output` | `str` or `None` | The agent's final text response |

You can also use `ExecutionResult.from_json(json_string)` to parse from a JSON string, and `to_dict()` / `to_json()` for serialization.
### Building from Your Framework

Every agent framework has its own output format. To use agentverify, you need a small converter function (~20–50 lines) that maps your framework's output to the dict schema above. Here's the general pattern:

```python
from agentverify import ExecutionResult, ToolCall, TokenUsage

def my_framework_to_execution_result(agent_output) -> ExecutionResult:
    # 1. Extract tool calls: map your framework's tool call objects
    #    to ToolCall(name=..., arguments=...)
    tool_calls = [
        ToolCall(name=tc.tool_name, arguments=tc.params)
        for tc in agent_output.tool_history
    ]

    # 2. Extract token usage (if available)
    token_usage = TokenUsage(
        input_tokens=agent_output.metrics.prompt_tokens,
        output_tokens=agent_output.metrics.completion_tokens,
    )

    # 3. Extract final output text
    return ExecutionResult(
        tool_calls=tool_calls,
        token_usage=token_usage,
        final_output=agent_output.response_text,
    )
```
See examples/strands-file-organizer/converter.py and examples/langchain-issue-triage/converter.py for complete, production-ready converters. Built-in framework adapters are planned for a future release (see Roadmap).
### Step 2: Assert

```python
from agentverify import (
    assert_tool_calls, assert_cost, assert_no_tool_call, assert_final_output,
    ToolCall, ANY,
)

# Did the agent call the right tools in the right order?
assert_tool_calls(result, expected=[
    ToolCall("get_location", {"city": "Tokyo"}),
    ToolCall("get_weather", {"lat": ANY, "lon": ANY}),
])

# Did it stay within budget?
assert_cost(result, max_tokens=500, max_cost_usd=0.01)

# Did it avoid dangerous tools?
assert_no_tool_call(result, forbidden_tools=["delete_user", "drop_table"])

# Did the final output contain the expected content?
# See "Final Output Assertions" below for equals and regex options.
assert_final_output(result, contains="Tokyo")
```
### Step 3: Record & Replay with Cassettes

Record real LLM API calls once. Replay them in CI forever — zero cost, deterministic.

> ⚠️ **Important:** Cassette replay uses sequential matching — responses are returned in recorded order without verifying request content. If your agent's prompts, tools, or model change, the cassette will still replay but may silently return stale or incorrect responses. Delete and re-record cassettes after significant agent changes. Request content matching is planned for a future release (see Roadmap).

The `cassette` fixture is a pytest fixture provided by the agentverify plugin. It creates an `LLMCassetteRecorder` that intercepts LLM SDK calls (not HTTP — it patches the SDK's chat completion method directly). Use `@pytest.mark.agentverify` to mark your test, and call your agent code inside the `with cassette(...)` block. After the block exits, call `rec.to_execution_result()` to build the result for assertions.

```python
import pytest
from agentverify import assert_tool_calls, ToolCall, ANY

@pytest.mark.agentverify
def test_weather_agent(cassette):
    with cassette("weather_agent.yaml", provider="openai") as rec:
        # Replace this with your actual agent invocation, e.g.:
        # agent.run("What's the weather in Tokyo?")
        run_my_agent("What's the weather in Tokyo?")

    # rec.to_execution_result() is called AFTER the with block exits
    result = rec.to_execution_result()
    assert_tool_calls(result, expected=[
        ToolCall("get_location", {"city": "Tokyo"}),
        ToolCall("get_weather", {"lat": ANY, "lon": ANY}),
    ])
```

Record the cassette once, then replay it forever:

```shell
# First run: record real LLM calls to cassette file
pytest --cassette-mode=record

# All subsequent runs: replay from cassette (zero cost, deterministic)
pytest
```

Cassettes are human-readable YAML (or JSON). Commit them to git and review them in PRs.
Cassette modes:

| Mode | Behavior |
|---|---|
| `AUTO` (default) | If cassette file exists → REPLAY. Otherwise → call real LLM API but don't save (no cassette file is created). |
| `RECORD` | Always call real LLM API and save to cassette file. |
| `REPLAY` | Always replay from cassette file. Raises error if file is missing. |

To create a cassette, use `mode=CassetteMode.RECORD` explicitly or pass `--cassette-mode=record` on the command line. To re-record, simply run with `RECORD` again — the existing file is overwritten.

```shell
# Record cassettes for all tests
pytest --cassette-mode=record

# Replay cassettes (default behavior when cassette files exist)
pytest
```
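The `AUTO` rule above boils down to a single file-existence check: replay if the cassette file exists, otherwise call the real API without saving. A plain-Python illustration of that decision rule (not agentverify's actual implementation):

```python
from pathlib import Path

def resolve_auto_mode(cassette_path: str) -> str:
    """Illustrates the AUTO rule: replay iff the cassette file exists."""
    if Path(cassette_path).exists():
        return "REPLAY"        # deterministic, zero cost
    return "CALL_REAL_API"     # hits the real LLM; nothing is saved
```

This is why the first `pytest --cassette-mode=record` run matters: until a cassette file exists, AUTO mode keeps calling the real API on every run.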
Other limitations:

- `total_cost_usd` is not populated from cassettes. Use `assert_cost(max_tokens=...)` for cassette-based budget checks, or set `total_cost_usd` manually in your `ExecutionResult`.
- Be mindful not to include sensitive data (API keys, PII, confidential prompts) in cassette files checked into version control.
## Assertion Modes

```python
from agentverify import assert_tool_calls, OrderMode, ToolCall, ANY

# Exact match — same tools, same order, same count (default)
assert_tool_calls(result, expected=[...])

# Subsequence — these tools appeared in this order (other calls in between are OK)
assert_tool_calls(result, expected=[...], order=OrderMode.IN_ORDER)

# Set membership — these tools were called (order doesn't matter)
assert_tool_calls(result, expected=[...], order=OrderMode.ANY_ORDER)

# Partial args — only check the keys you care about
assert_tool_calls(result, expected=[
    ToolCall("search", {"query": "Tokyo"}),
], partial_args=True)

# Collect all failures at once (doesn't stop at first)
from agentverify import assert_all

assert_all(
    result,
    lambda r: assert_tool_calls(r, expected=[...]),
    lambda r: assert_cost(r, max_tokens=1000),
    lambda r: assert_no_tool_call(r, forbidden_tools=["delete_user"]),
    lambda r: assert_final_output(r, contains="Tokyo"),
)
```
## Strict Cost Assertions

By default, `assert_cost()` silently passes when `token_usage` or `total_cost_usd` is `None` (e.g., during cassette replay where cost data may be unavailable). Use `strict=True` to require that the data is present:

```python
# Fails if token_usage is None, even if the budget would pass
assert_cost(result, max_tokens=500, strict=True)

# Fails if total_cost_usd is None
assert_cost(result, max_cost_usd=0.01, strict=True)
```
## Final Output Assertions

`assert_final_output()` verifies the agent's final text response. Use `contains` for substring checks, `equals` for exact match, or `matches` for regex:

```python
from agentverify import assert_final_output

# Substring check
assert_final_output(result, contains="Tokyo")

# Exact match
assert_final_output(result, equals="The weather in Tokyo is sunny, 22°C.")

# Regex match
assert_final_output(result, matches=r"\d+°C")
```
## Framework Integration

agentverify is framework-agnostic. Build an `ExecutionResult` from any agent framework's output using a converter function. The `examples/` directory includes ready-to-use converters:

| Framework | Converter | Description |
|---|---|---|
| Strands Agents | `strands-file-organizer/converter.py` | Converts `AgentResult` → `ExecutionResult` |
| LangChain | `langchain-issue-triage/converter.py` | Converts `AgentExecutor` output → `ExecutionResult` |

These converters are small (~50 lines) and easy to adapt for your own framework. Built-in framework adapters are planned for a future release (see Roadmap).
## Supported LLM Providers
| Provider | Extra |
|---|---|
| OpenAI | pip install agentverify[openai] |
| Amazon Bedrock | pip install agentverify[bedrock] |
| Google Gemini | pip install agentverify[gemini] |
| Anthropic | pip install agentverify[anthropic] |
| LiteLLM | pip install agentverify[litellm] |
| All providers | pip install agentverify[all] |
## CI Integration

agentverify is designed for CI pipelines. Commit your cassette files to git and replay them in CI with zero LLM cost.

### GitHub Actions

```yaml
name: Agent Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-python@v6
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - run: pytest --tb=short -v
```

Cassette files in `tests/cassettes/` are replayed automatically — no API keys or secrets needed in CI.
## Error Messages

Clear, structured output when assertions fail:

```text
ToolCallSequenceError: Tool call sequence mismatch at index 1
  Expected:
    [0] get_location(city="Tokyo")
    [1] get_news(topic="weather")         ← first mismatch
  Actual:
    [0] get_location(city="Tokyo")
    [1] search_web(query="Tokyo weather") ← actual

CostBudgetError: Token budget exceeded
  Actual:      1,250 tokens
  Limit:       1,000 tokens
  Exceeded by:   250 tokens (25.0%)

SafetyRuleViolationError: 2 forbidden tool calls detected
  [1] delete_database(table="users") at position 3
  [2] drop_table(name="orders") at position 5

FinalOutputError: final_output does not contain expected substring
  Substring: 'Berlin'
  Actual:    'The weather in Tokyo is sunny, 22°C.'
```
## Requirements

- Python 3.10+
- pytest 7+
## Examples

The `examples/` directory contains end-to-end examples with real agent frameworks and MCP servers. Each example ships with pre-recorded cassettes — run the tests without any API keys.

| Example | Framework | Description |
|---|---|---|
| `strands-file-organizer` | Strands Agents + Bedrock | Scans a directory via Filesystem MCP, suggests organization. Read-only safety verified |
| `langchain-issue-triage` | LangChain + OpenAI | Triages GitHub issues via GitHub MCP. Label and priority suggestions |
| `mcp-server` | — | Mock GitHub MCP server for token-free testing |
Try it:

```shell
git clone https://github.com/simukappu/agentverify.git
cd agentverify
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Strands File Organizer
pip install -e "examples/strands-file-organizer/.[dev]"
pytest examples/strands-file-organizer/tests -v

tests/test_file_organizer.py::test_tool_call_sequence PASSED
tests/test_file_organizer.py::test_token_budget PASSED
tests/test_file_organizer.py::test_safety_read_only PASSED

# LangChain Issue Triage
pip install -e "examples/langchain-issue-triage/.[dev]"
pytest examples/langchain-issue-triage/tests -v

tests/test_issue_triage.py::TestIssueTriage_MockMCP::test_tool_call_sequence PASSED
tests/test_issue_triage.py::TestIssueTriage_MockMCP::test_safety_read_and_label_only PASSED
```
See each example's README for agent execution instructions and recording mode details.
## Roadmap

- Agent framework adapters — extract `ExecutionResult` directly from Strands Agents, LangChain, and others without writing a converter
- Tool mocking/stubbing — test agent routing logic without calling real tools
- Async support — first-class `asyncio` testing for async agents and tools
- Cassette request matching — verify request content during replay to detect stale cassettes
- Cassette sanitization — automatic masking of API keys and sensitive data in recorded cassettes
- Cost estimation from tokens — auto-calculate `total_cost_usd` from token usage and model pricing
- YAML/JSON test case definitions — declarative test cases for non-Python CI pipelines
- CLI test runner — run agent tests without pytest
## Changelog

See CHANGELOG.md for release history.

## License
## Contributing

Contributions welcome. Please open an issue first to discuss what you'd like to change.

Development setup:

```shell
git clone https://github.com/simukappu/agentverify.git
cd agentverify
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
```