pytest-compatible test harness for AI agents — deterministic record & replay for Anthropic Claude

agentprobe

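Test your Claude agents in CI without hitting the real API on every run. Record a session once against the live API, then replay it from the saved fixture on every later run: no API cost, and no flakiness from the network or from varying model output.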

def test_agent_uses_bash(agentprobe):
    with agentprobe.replay("tests/fixtures/list_files.jsonl") as probe:
        result = my_agent.run("list files in /tmp")
        probe.assert_tool_called("bash")
        probe.assert_max_iterations(3)
        probe.assert_output_contains("/tmp")

Install

pip install pytest-agentprobe

Requires Python 3.9+ and anthropic>=0.40.0.


How it works

agentprobe intercepts calls to anthropic.Anthropic.messages.create (and the async equivalent) at the class level — no changes to your agent code needed.

  • Record mode — runs your agent against the real API and saves every request/response pair to a JSONL fixture file.
  • Replay mode — feeds the saved responses back to your agent instead of making real API calls. Deterministic, instant, free.
  • Auto mode — records on first run, replays on every subsequent run.
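The class-level interception described above can be illustrated with a small, self-contained sketch. It is not agentprobe's implementation: FakeMessages is a hypothetical stand-in for the SDK class, the record and replay helpers are invented for illustration, and replay here simply returns saved responses in order without matching requests.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path


class FakeMessages:
    """Hypothetical stand-in for an SDK resource class (not the anthropic SDK)."""

    network_calls = 0  # counts how often the "real API" was hit

    def create(self, **request):
        FakeMessages.network_calls += 1
        return {"content": [{"type": "text", "text": "echo: " + request["prompt"]}]}


@contextmanager
def record(path):
    """Wrap FakeMessages.create on the class and save request/response pairs."""
    original = FakeMessages.create
    entries = []

    def recording_create(self, **request):
        response = original(self, **request)
        entries.append({"request": request, "response": response})
        return response

    FakeMessages.create = recording_create
    try:
        yield entries
    finally:
        FakeMessages.create = original  # always restore the real method
        Path(path).write_text("".join(json.dumps(e) + "\n" for e in entries))


@contextmanager
def replay(path):
    """Swap FakeMessages.create for a function that returns saved responses in order."""
    original = FakeMessages.create
    lines = Path(path).read_text().splitlines()
    saved = iter([json.loads(line) for line in lines if line.strip()])

    def replaying_create(self, **request):
        entry = next(saved, None)
        if entry is None:
            raise AssertionError("agent made more calls than the fixture holds")
        return entry["response"]

    FakeMessages.create = replaying_create
    try:
        yield
    finally:
        FakeMessages.create = original


def agent(client, prompt):
    """Toy agent: it only calls client.create and knows nothing about fixtures."""
    return client.create(prompt=prompt)["content"][0]["text"]


client = FakeMessages()  # created before patching; still intercepted
fixture = Path(tempfile.mkdtemp()) / "demo.jsonl"

with record(fixture):
    live = agent(client, "hi")
with replay(fixture):
    replayed = agent(client, "hi")

print(live, replayed, FakeMessages.network_calls)
```

Because the method is replaced on the class rather than on an instance, a client created before the context manager is entered is intercepted too, which is why the agent code needs no changes. The counter shows that only the recording run reached the stand-in "API".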

Quick start

1. Record a session

from agentprobe import Session
import anthropic

session = Session()
client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY

with session.record("tests/fixtures/my_agent.jsonl") as probe:
    result = my_agent.run(client, "what files are in /tmp?")
    # assertions are optional during recording
    probe.assert_tool_called("bash")

# fixture is written to disk — commit it to your repo

2. Replay in CI

import anthropic

def test_my_agent(agentprobe):
    client = anthropic.Anthropic(api_key="dummy")  # not used during replay

    with agentprobe.replay("tests/fixtures/my_agent.jsonl") as probe:
        result = my_agent.run(client, "what files are in /tmp?")

        probe.assert_tool_called("bash")
        probe.assert_not_tool_called("web_search")
        probe.assert_tool_called_with("bash", command="ls /tmp")
        probe.assert_max_iterations(4)
        probe.assert_output_contains("/tmp")
        probe.assert_stop_reason("end_turn")
        probe.assert_max_tokens(500)

3. Auto mode (record-on-first-run)

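import os
import anthropic

def test_my_agent(agentprobe):
    # A real key is needed only on the first (recording) run.
    client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY", "dummy"))
    with agentprobe.auto("tests/fixtures/my_agent.jsonl") as probe:
        result = my_agent.run(client, "what files are in /tmp?")
        probe.assert_tool_called("bash")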

Async agents

Full async support via AsyncAnthropic:

import pytest
import anthropic
from agentprobe import Session

@pytest.mark.asyncio
async def test_async_agent():
    session = Session()
    client = anthropic.AsyncAnthropic(api_key="dummy")

    async with session.async_replay("tests/fixtures/my_agent.jsonl") as probe:
        result = await my_async_agent.run(client, "list files in /tmp")
        probe.assert_tool_called("bash")
        probe.assert_max_iterations(3)

Async equivalents: async_record, async_replay, async_auto.


Assertion API

All assertions return self for chaining.
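To show what returning self buys you, here is a minimal sketch of the fluent pattern. MiniProbe is a hypothetical stand-in written for this example, not agentprobe's probe class; only the two method names are borrowed from the assertion list.

```python
class MiniProbe:
    """Hypothetical stand-in for a probe; not agentprobe's implementation."""

    def __init__(self, tool_calls, iterations):
        self.tool_calls = tool_calls  # tool names in call order
        self.iterations = iterations  # number of LLM calls

    def assert_tool_called(self, name):
        assert name in self.tool_calls, f"tool {name!r} was never called"
        return self  # returning self is what enables chaining

    def assert_max_iterations(self, n):
        assert self.iterations <= n, f"{self.iterations} calls exceed limit {n}"
        return self


probe = MiniProbe(tool_calls=["bash"], iterations=2)

# Each call hands back the same probe, so checks can be strung together.
chained = probe.assert_tool_called("bash").assert_max_iterations(3)
```

With a real probe the same style would presumably read probe.assert_tool_called("bash").assert_max_iterations(3); the first failing assertion in the chain raises and the rest are not evaluated.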

Iteration count

Assertion                   Description
assert_max_iterations(n)    At most n LLM calls
assert_min_iterations(n)    At least n LLM calls
assert_iteration_count(n)   Exactly n LLM calls

Tool calls

Assertion                                 Description
assert_tool_called(name)                  Tool was called at least once
assert_not_tool_called(name)              Tool was never called
assert_tool_called_with(name, **kwargs)   Tool was called with these input fields
assert_tool_called_before(first, second)  First tool was called before second
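
One plausible reading of the two less obvious checks, sketched over a plain list of recorded calls. The semantics here are assumptions, not confirmed behaviour: called_with treats the keyword arguments as a subset of the tool input, and called_before compares the earliest occurrence of each tool. The helper names and sample data are hypothetical.

```python
# Recorded tool calls in order, as (name, input) pairs. Hypothetical data.
calls = [
    ("bash", {"command": "ls /tmp", "timeout": 10}),
    ("read_file", {"path": "/tmp/file1.txt"}),
]


def called_with(calls, name, **expected):
    """True if some call to `name` contains every expected input field."""
    return any(
        tool == name
        and all(key in args and args[key] == value for key, value in expected.items())
        for tool, args in calls
    )


def called_before(calls, first, second):
    """True if the earliest `first` call precedes the earliest `second` call."""
    names = [tool for tool, _ in calls]
    if first not in names or second not in names:
        return False
    return names.index(first) < names.index(second)


print(called_with(calls, "bash", command="ls /tmp"))
print(called_before(calls, "bash", "read_file"))
```

Under the subset reading, extra input fields such as timeout do not cause a mismatch; an exact-match reading would be stricter.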

Output

Assertion                         Description
assert_output_contains(text)      Final text response contains text
assert_output_not_contains(text)  Final text response does not contain text
assert_stop_reason(reason)        Final call stop reason equals reason (e.g. "end_turn")

Token budget

Assertion             Description
assert_max_tokens(n)  Total tokens across all calls ≤ n

Introspection

probe.iteration_count       # int — number of LLM calls
probe.tools_called          # list[str] — sorted tool names used
probe.final_output          # str | None — last text block in session
probe.total_tokens          # int — input + output tokens across all calls
probe.total_input_tokens    # int
probe.total_output_tokens   # int

CLI

Inspect and compare fixtures without writing Python:

# Pretty-print a fixture
agentprobe show tests/fixtures/my_agent.jsonl

# Compare two fixtures (exits 1 if differences found)
agentprobe diff tests/fixtures/v1.jsonl tests/fixtures/v2.jsonl

Example show output:

fixture: tests/fixtures/my_agent.jsonl  (2 call(s))

── Call 1/2  model=claude-opus-4-8  stop=tool_use  in=50 out=30  312ms
  [tool_use] bash({"command": "ls /tmp"})

── Call 2/2  model=claude-opus-4-8  stop=end_turn  in=80 out=25  280ms
  [text] The /tmp directory contains: file1.txt, file2.txt, temp.log

total tokens: 185  (130 in + 55 out)

pytest fixture

agentprobe is auto-registered as a pytest plugin. The agentprobe fixture is available in all tests without any conftest.py setup:

def test_something(agentprobe):
    with agentprobe.replay("tests/fixtures/session.jsonl") as probe:
        ...

To use Session directly (e.g. in scripts or non-pytest contexts):

from agentprobe import Session

session = Session()
with session.replay("tests/fixtures/session.jsonl") as probe:
    ...

Fixture format

Fixtures are newline-delimited JSON (.jsonl). Each line is one messages.create call:

{"request": {"model": "...", "messages": [...], "tools": [...]}, "response": {"id": "...", "content": [...], "stop_reason": "tool_use", "usage": {"input_tokens": 50, "output_tokens": 30}}, "timestamp": 1748700000.0, "duration_ms": 312.5}

Fixtures are plain text — safe to commit to git, diff in PRs, and edit by hand.
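
Since a fixture is just JSONL, it can also be inspected with the standard library alone. The sketch below summarizes a two-call fixture in the documented shape. The layout of the content blocks ("type", "name", "input", "text") is an assumption based on the Anthropic Messages API, the model name is a placeholder, and summarize is a hypothetical helper, not part of agentprobe.

```python
import json

# Two sample lines in the documented shape; values mirror the CLI example.
SAMPLE = "\n".join([
    json.dumps({
        "request": {"model": "example-model", "messages": [], "tools": []},
        "response": {
            "id": "msg_1",
            "content": [
                {"type": "tool_use", "name": "bash", "input": {"command": "ls /tmp"}}
            ],
            "stop_reason": "tool_use",
            "usage": {"input_tokens": 50, "output_tokens": 30},
        },
        "timestamp": 1748700000.0,
        "duration_ms": 312.5,
    }),
    json.dumps({
        "request": {"model": "example-model", "messages": [], "tools": []},
        "response": {
            "id": "msg_2",
            "content": [{"type": "text", "text": "The /tmp directory contains: file1.txt"}],
            "stop_reason": "end_turn",
            "usage": {"input_tokens": 80, "output_tokens": 25},
        },
        "timestamp": 1748700001.0,
        "duration_ms": 280.0,
    }),
])


def summarize(jsonl_text):
    """Compute call count, tool names, final text, and token totals."""
    calls = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    tools, input_tokens, output_tokens, last_text = set(), 0, 0, None
    for call in calls:
        response = call["response"]
        input_tokens += response["usage"]["input_tokens"]
        output_tokens += response["usage"]["output_tokens"]
        for block in response["content"]:
            if block.get("type") == "tool_use":
                tools.add(block["name"])
            elif block.get("type") == "text":
                last_text = block["text"]
    return {
        "iteration_count": len(calls),
        "tools_called": sorted(tools),
        "final_output": last_text,
        "total_tokens": input_tokens + output_tokens,
        "stop_reason": calls[-1]["response"]["stop_reason"] if calls else None,
    }


summary = summarize(SAMPLE)
print(summary)
```

The totals match the show example above (130 input + 55 output = 185 tokens), which is a quick way to sanity-check a hand-edited fixture.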


License

MIT
