Deterministic snapshot testing harness for AI agents

These details have not been verified by PyPI

agentsnap

Deterministic snapshot testing for AI agents.

agentsnap records your agent's LLM and tool calls during a golden run and writes a snapshot file that you commit to version control. On every subsequent run, it replays the same inputs and compares the new trace against the snapshot across three dimensions:

Dimension	What it checks	How
Structural	Tool call names and order	Levenshtein edit distance on the tool sequence
Arguments	Tool call arguments	`deepdiff` (if installed) or plain dict diff, with configurable ignored fields
Semantic	LLM responses and final output	Cosine similarity via `all-MiniLM-L6-v2`, or an LLM judge for higher accuracy

If any dimension drifts beyond its threshold, agentsnap raises AgentRegressionError with a structured diff report.
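To make the structural dimension concrete, here is a minimal sketch of Levenshtein edit distance applied to two tool-name sequences. The function name `tool_sequence_distance` is hypothetical, and agentsnap's internal implementation and thresholding may differ.

```python
from typing import Sequence


def tool_sequence_distance(golden: Sequence[str], current: Sequence[str]) -> int:
    """Levenshtein distance between two tool-name sequences (illustrative only)."""
    # prev[j] is the distance between golden[:i-1] and current[:j].
    prev = list(range(len(current) + 1))
    for i, g in enumerate(golden, start=1):
        curr = [i]  # turning golden[:i] into an empty sequence takes i deletions
        for j, c in enumerate(current, start=1):
            cost = 0 if g == c else 1
            curr.append(min(
                prev[j] + 1,         # delete a tool call from the golden run
                curr[j - 1] + 1,     # insert a tool call seen only in the new run
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


golden = ["search", "fetch", "summarize"]
current = ["search", "summarize"]
distance = tool_sequence_distance(golden, current)
print(distance)
```

A distance of zero means the same tools were called in the same order. Each dropped, added, or swapped call adds to the distance.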

3-minute quickstart

1 — Install

pip install agentsnap

2 — Run setup

agentsnap init

The wizard asks you to choose a semantic comparison backend:

Option	What it needs	Best for
[1] LLM judge (default)	API key (OpenRouter, OpenAI, Anthropic, or custom)	Factual agents, highest accuracy
[2] Offline embeddings	Nothing — ~22 MB model download, runs anywhere	Any machine, no API key
[3] Local LLM judge	(coming soon)	Strong local machine, no cloud

The wizard saves your choice to pyproject.toml and your API key (if any) to .env. Keys are never written to pyproject.toml.

agentsnap check   # verify your setup at any time

3 — Wrap your client and run

from agentsnap import AgentRecorder, AgentAsserter
from agentsnap.adapters.anthropic import AnthropicAdapter
from agentsnap.adapters.tool import ToolAdapter
import anthropic

client = AnthropicAdapter(anthropic.Anthropic())
search_tool = ToolAdapter(search, name="search")

# First run: records the golden snapshot automatically
with AgentRecorder("my_agent") as rec:
    result = my_agent(client, search_tool, input="What is Python?")
    rec.output = result
# Writes __agent_snapshots__/my_agent.json — commit this file

4 — Assert on future runs

with AgentAsserter("my_agent") as a:
    result = my_agent(client, search_tool, input="What is Python?")
    a.output = result
# Raises AgentRegressionError if behavior drifted

5 — Use the pytest fixture (simplest)

snapshot.run() auto-records on first call and auto-asserts on every run after that — no need to switch between AgentRecorder and AgentAsserter:

def test_my_agent(snapshot):
    with snapshot.run("my_agent") as s:
        result = my_agent(client, search_tool, input="What is Python?")
        s.output = result

pytest
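The record-or-assert decision described for snapshot.run() can be sketched as a check on whether the golden file already exists. This is an assumption about the logic, not the fixture's actual code; `choose_mode` and its `force_record` parameter are hypothetical names.

```python
import tempfile
from pathlib import Path


def choose_mode(snapshot_path: Path, force_record: bool = False) -> str:
    """Return "record" when no golden exists (or re-recording is forced), else "assert"."""
    if force_record or not snapshot_path.exists():
        return "record"
    return "assert"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_agent.json"
    first = choose_mode(path)                      # no golden yet
    path.write_text("{}")                          # pretend the golden was written
    second = choose_mode(path)                     # golden exists
    forced = choose_mode(path, force_record=True)  # e.g. an explicit re-record

print(first, second, forced)
```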

Supported providers

Provider	Adapter	Intercepts
Anthropic	`AnthropicAdapter`	`.messages.create()`
OpenAI	`OpenAIAdapter`	`.chat.completions.create()`
Google Gemini	`GeminiAdapter`	`.models.generate_content()`
Cohere	`CohereAdapter`	`.chat()`
Mistral	`MistralAdapter`	`.chat.complete()`
Groq	`GroqAdapter`	`.chat.completions.create()`
OpenRouter	`OpenRouterAdapter`	`.chat.completions.create()`
LangGraph	`LangGraphAdapter`	`.invoke()` + node-level LLM/tool events via callbacks
Any callable	`ToolAdapter`	direct call

Install provider SDKs as needed:

pip install agentsnap[google]    # google-genai
pip install agentsnap[cohere]    # cohere
pip install agentsnap[mistral]   # mistralai
pip install agentsnap[groq]      # groq
pip install agentsnap[all-providers]

Zero-instrumentation capture

If you don't want to wrap your clients, use PatchSet to patch all installed LLM SDKs at the class level. Any raw client created anywhere in the process is captured automatically:

import anthropic
from agentsnap import PatchSet, AgentRecorder

with PatchSet():
    with AgentRecorder("my_agent") as rec:
        client = anthropic.Anthropic()   # no AnthropicAdapter needed
        result = my_agent(client, "What is Python?")
        rec.output = result

Or via the pytest fixture:

import anthropic

def test_my_agent(snapshot, agentsnap_instrument):
    with snapshot.run("my_agent") as s:
        client = anthropic.Anthropic()   # captured automatically
        s.output = my_agent(client, "What is Python?")

# Or enable globally for all tests in a session
pytest --agentsnap-instrument

Note: Do not combine PatchSet with adapter wrappers on the same client — both interceptors will fire and events will be recorded twice.

Configuration

API key for the LLM judge (optional)

The LLM judge uses a small language model to compare outputs instead of embeddings — more accurate for factual content.

agentsnap resolves the API key automatically, so you do not need a dedicated judge key when a matching provider key is already set. It checks in this order:

1. AGENTSNAP_JUDGE_API_KEY: explicit override, always wins
2. The provider-specific key that matches judge_base_url:

`judge_base_url` contains	Key used automatically
`openrouter.ai` (default)	`OPENROUTER_API_KEY`
`api.openai.com`	`OPENAI_API_KEY`
`anthropic.com`	`ANTHROPIC_API_KEY`
`api.groq.com`	`GROQ_API_KEY`
`api.mistral.ai`	`MISTRAL_API_KEY`
`api.cohere.com`	`COHERE_API_KEY`

Once any matching key is found, the snapshot pytest fixture enables the LLM judge automatically — no code changes needed in tests.
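The lookup order above can be sketched as a small function. The helper name `resolve_judge_key` is hypothetical, and the substring matching is an assumption based on the "contains" column of the table. The real resolver may handle edge cases differently.

```python
import os
from typing import Mapping, Optional

# Taken from the table above: substring of judge_base_url -> environment variable.
PROVIDER_KEYS = {
    "openrouter.ai": "OPENROUTER_API_KEY",
    "api.openai.com": "OPENAI_API_KEY",
    "anthropic.com": "ANTHROPIC_API_KEY",
    "api.groq.com": "GROQ_API_KEY",
    "api.mistral.ai": "MISTRAL_API_KEY",
    "api.cohere.com": "COHERE_API_KEY",
}


def resolve_judge_key(base_url: str,
                      env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: return the judge API key, or None if none is set."""
    # 1. The explicit override always wins.
    explicit = env.get("AGENTSNAP_JUDGE_API_KEY")
    if explicit:
        return explicit
    # 2. Otherwise use the provider key whose host fragment appears in the URL.
    for fragment, var in PROVIDER_KEYS.items():
        if fragment in base_url:
            return env.get(var) or None
    return None


print(resolve_judge_key("https://api.openai.com/v1", {"OPENAI_API_KEY": "sk-example"}))
```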

To use a different provider, change judge_base_url in pyproject.toml and set the matching env var:

export OPENAI_API_KEY=sk-...

[tool.agentsnap]
judge_base_url = "https://api.openai.com/v1"
judge_model    = "gpt-4o-mini"

Project settings (`pyproject.toml`)

[tool.agentsnap]
judge_model        = "openai/gpt-4o-mini"
judge_base_url     = "https://openrouter.ai/api/v1"
semantic_threshold = 0.92   # final agent output (strict)
llm_threshold      = 0.75   # intermediate LLM responses (tolerant)

These can also be set as pytest ini options:

[tool.pytest.ini_options]
agentsnap_judge_model        = "openai/gpt-4o-mini"
agentsnap_judge_base_url     = "https://openrouter.ai/api/v1"
agentsnap_semantic_threshold = "0.92"
agentsnap_llm_threshold      = "0.75"

API reference

`AgentRecorder(test_name, snapshot_dir="__agent_snapshots__", model="unknown")`

Context manager. Intercepts all adapter calls and writes a snapshot on clean exit.

with AgentRecorder("name", model="claude-haiku-4-5") as rec:
    rec.input_data = {"query": "hello"}   # optional metadata
    result = my_agent(wrapped_client, ...)
    rec.output = result

`AgentAsserter(test_name, snapshot_dir, semantic_threshold, llm_threshold, ignored_fields, embed_fn, judge)`

Context manager. Reads the snapshot, intercepts calls, runs the three-layer diff on exit. If no snapshot exists yet, automatically switches to record mode and writes the golden run.

Parameter	Default	Description
`semantic_threshold`	`0.92`	Min similarity for final output
`llm_threshold`	`0.75`	Min similarity for intermediate LLM responses
`ignored_fields`	`None`	Tool arg keys to exclude from argument diff
`embed_fn`	`None`	Custom embedding function (for testing)
`judge`	`None`	`LLMJudge` instance; overrides embedding comparison

with AgentAsserter("name", semantic_threshold=0.95, ignored_fields=["timestamp"]) as a:
    result = my_agent(wrapped_client, ...)
    a.output = result
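The effect of `ignored_fields` on the argument diff can be pictured with a plain dict comparison that skips the listed keys. This is a simplified sketch covering top-level keys only. `diff_args` is a hypothetical name, and agentsnap's own diff (or `deepdiff`) may report changes in a different shape.

```python
from typing import Any, Dict, Iterable, Tuple


def diff_args(golden: Dict[str, Any], current: Dict[str, Any],
              ignored_fields: Iterable[str] = ()) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (golden_value, current_value)} for each differing top-level key."""
    ignored = set(ignored_fields)
    missing = object()  # sentinel so an absent key is detected even if the other side is None
    changes = {}
    for key in sorted((golden.keys() | current.keys()) - ignored):
        old = golden.get(key, missing)
        new = current.get(key, missing)
        if old != new:
            # Absent values are reported as None in the result.
            changes[key] = (None if old is missing else old,
                            None if new is missing else new)
    return changes


golden_args = {"query": "Python", "timestamp": 1}
current_args = {"query": "Python", "timestamp": 2}
print(diff_args(golden_args, current_args))                 # timestamp differs
print(diff_args(golden_args, current_args, ["timestamp"]))  # nothing left to report
```

Volatile values such as timestamps or request IDs are typical candidates for `ignored_fields`, since they change on every run without indicating a behavior change.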

`PatchSet`

Context manager that monkey-patches all installed LLM SDK classes so any client — wrapped or unwrapped — is captured by an active AgentRecorder or AgentAsserter.

from agentsnap import PatchSet

with PatchSet():
    # all anthropic.Anthropic(), openai.OpenAI(), etc. clients are auto-captured
    ...

`LLMJudge(api_key, model, base_url)`

Uses an LLM to score semantic equivalence instead of embeddings. Returns a 0.0–1.0 score and a one-sentence reason explaining any difference.

from agentsnap import LLMJudge

judge = LLMJudge(api_key="sk-or-...", model="openai/gpt-4o-mini")
judge = LLMJudge.from_env()   # returns None if no key found

with AgentAsserter("name", judge=judge) as a:
    ...

`snapshot` pytest fixture

Auto-wired from [tool.agentsnap] and environment variables. No imports needed.

def test_agent(snapshot):
    # Auto mode: records first time, asserts every run after
    with snapshot.run("name") as s:
        s.output = run_agent(...)

    # Explicit record
    with snapshot.record_agent("name") as rec:
        rec.output = run_agent(...)

    # Explicit assert
    with snapshot.assert_agent("name") as a:
        a.output = run_agent(...)

    # Per-test overrides
    with snapshot.assert_agent("name", judge=False) as a:          # force embeddings
        a.output = run_agent(...)
    with snapshot.assert_agent("name", semantic_threshold=0.98) as a:
        a.output = run_agent(...)

Pytest flags

Flag	Description
`--agentsnap-record`	Force re-record all snapshots, overwriting existing goldens
`--agentsnap-instrument`	Auto-patch all installed LLM SDKs (zero-instrumentation mode)

pytest --agentsnap-record        # re-record everything
pytest --agentsnap-instrument    # capture raw clients without adapters

`agentsnap_instrument` fixture

Standalone fixture for zero-instrumentation capture within a single test:

def test_agent(snapshot, agentsnap_instrument):
    with snapshot.run("name") as s:
        client = anthropic.Anthropic()   # no adapter needed
        s.output = my_agent(client, "query")

Exceptions

Exception	When raised
`AgentRegressionError(message, diff_report)`	Behavior drifted beyond threshold
`SnapshotNotFoundError(test_name)`	No snapshot found (only from direct SDK use; `AgentAsserter` auto-records instead)
`AdapterNotWrappedError`	Unwrapped client used inside a recording context without `PatchSet`

AgentRegressionError.diff_report is a DiffReport dataclass with structural_diff, argument_diffs, semantic_scores, semantic_reasons, and failed_checks.

CLI

agentsnap init                                     # interactive setup wizard — choose backend and save config
agentsnap check                                    # verify current backend is working (exits 0/1)
agentsnap list                                     # list all snapshots
agentsnap diff __agent_snapshots__/my_agent.json   # pretty-print a snapshot
agentsnap update my_agent                          # show diff and approve last run as new golden
agentsnap update my_agent --yes                    # approve without confirmation prompt

Snapshot format

{
  "version": "1.0",
  "recorded_at": "2026-01-01T00:00:00+00:00",
  "model": "claude-haiku-4-5",
  "input": { "query": "What is Python?" },
  "trace": [
    { "step": 0, "type": "llm_call", "messages": [...], "response": "...", "tokens": 350 },
    { "step": 1, "type": "tool_call", "name": "search", "args": {"query": "Python"}, "result": "..." }
  ],
  "output": "Python is a high-level programming language..."
}

Golden snapshots live in __agent_snapshots__/ and are committed to git. The .last_run/ subdirectory is written on every assert run and is gitignored — it is only used by agentsnap update.
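Because snapshots are plain JSON, they can be inspected with the standard library. The sketch below parses a snapshot shaped like the example above (with the elided `messages` list replaced by an empty placeholder) and extracts the tool sequence and token count. The helper names are hypothetical, and real snapshots may contain fields not shown here.

```python
import json
from typing import List

SNAPSHOT = """
{
  "version": "1.0",
  "model": "claude-haiku-4-5",
  "input": {"query": "What is Python?"},
  "trace": [
    {"step": 0, "type": "llm_call", "messages": [], "response": "...", "tokens": 350},
    {"step": 1, "type": "tool_call", "name": "search", "args": {"query": "Python"}, "result": "..."}
  ],
  "output": "Python is a high-level programming language..."
}
"""


def tool_sequence(snapshot: dict) -> List[str]:
    """Names of the tool calls in trace order."""
    return [step["name"] for step in snapshot["trace"] if step["type"] == "tool_call"]


def total_tokens(snapshot: dict) -> int:
    """Sum of the tokens field over all llm_call steps."""
    return sum(step.get("tokens", 0)
               for step in snapshot["trace"] if step["type"] == "llm_call")


snap = json.loads(SNAPSHOT)
print(tool_sequence(snap), total_tokens(snap))
```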

CI integration (GitHub Actions)

name: Agent regression tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Install
        run: pip install -e ".[dev]"

      - name: Run agent snapshot tests
        run: pytest tests/ -v
        env:
          # Optional: enables LLM judge for higher-accuracy semantic comparison
          AGENTSNAP_JUDGE_API_KEY: ${{ secrets.AGENTSNAP_JUDGE_API_KEY }}

Snapshots are committed to the repo. CI only runs the asserter — no real agent API calls needed unless your tests explicitly make them.

How to approve an intentional behavior change

When you intentionally change agent behavior (new prompt, model upgrade, new tool):

# 1. Run tests — they fail, new trace saved to .last_run/
pytest tests/test_my_agent.py

# 2. Approve — shows a diff and prompts for confirmation
agentsnap update my_agent

# 3. Commit the new baseline
git add __agent_snapshots__/my_agent.json
git commit -m "approve: updated golden after Sonnet upgrade"

Thresholds

Two independent thresholds control the semantic layer:

Threshold	Default	Applies to
`semantic_threshold`	`0.92`	Final `output` — the agent's actual answer
`llm_threshold`	`0.75`	Intermediate `llm_call[n]` responses — tolerates natural phrasing variance

Tune per-test:

# Critical factual agent — hold output tightly
with AgentAsserter("rag_agent", semantic_threshold=0.97) as a: ...

# Creative agent — allow more paraphrasing
with AgentAsserter("writer_agent", semantic_threshold=0.75) as a: ...
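To illustrate how the two thresholds might gate semantic scores, the sketch below computes cosine similarity in pure Python and applies `semantic_threshold` to the `output` score and `llm_threshold` to everything else. The function names and the score-dictionary layout are assumptions for illustration, not agentsnap's internal API.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def failed_checks(scores: Dict[str, float],
                  semantic_threshold: float = 0.92,
                  llm_threshold: float = 0.75) -> List[str]:
    """Names of the checks whose similarity falls below the applicable threshold."""
    failed = []
    for name, score in scores.items():
        # The final output is held to the strict threshold,
        # intermediate llm_call[n] responses to the tolerant one.
        threshold = semantic_threshold if name == "output" else llm_threshold
        if score < threshold:
            failed.append(name)
    return failed


scores = {"llm_call[0]": 0.80, "output": 0.90}
print(failed_checks(scores))
```

With the defaults, an intermediate response scoring 0.80 passes while a final output scoring 0.90 fails, which reflects the stricter bar on the agent's actual answer.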

Release history

Version	Released
0.1.2	Jun 30, 2026
0.1.1	Jun 30, 2026
0.1.0 (this version)	Jun 29, 2026

Download files

File	Type	Size	Uploaded
agentsnap-0.1.0.tar.gz	Source distribution	318.5 kB	Jun 29, 2026
agentsnap-0.1.0-py3-none-any.whl	Built distribution (Python 3)	35.5 kB	Jun 29, 2026

File details

Details for the file agentsnap-0.1.0.tar.gz.

File metadata

Download URL: agentsnap-0.1.0.tar.gz
Upload date: Jun 29, 2026
Size: 318.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for agentsnap-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8331d8de894a48ed737b05bee34bb58c43d0ea7c9cb04ea88aac90092a8d6721`
MD5	`3412d5a91636b9d8b9ab2a47bb88fa26`
BLAKE2b-256	`626c7115ca4e0cb2c2e905d7ae7042b525ebf7701122a16687a564c2fee4b841`

File details

Details for the file agentsnap-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentsnap-0.1.0-py3-none-any.whl
Upload date: Jun 29, 2026
Size: 35.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for agentsnap-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7b0ebdf9983a15ccac1f5d247a8373884334a19dd499c84391480e4826913002`
MD5	`127c03c453573cd98d34833da9ae1ca1`
BLAKE2b-256	`94009f4656c76cfdbf306e13bbe0b32d6e6dcbf864b62c2fa4999c500520e470`
