
agentprdiff

Guard your LLM agents in CI. Snapshot tests that catch behavioral regressions when models, prompts, or vendors change.

You upgraded Claude. You tweaked a system prompt. You swapped gpt-4o for gpt-4o-mini in the cheap path. Which of your agent's behaviors just changed? agentprdiff tells you — before the PR merges.

pip install agentprdiff


Why

Unit tests assume determinism. Agents aren't deterministic, but they do have behaviors you rely on — a specific tool gets called, a refund amount is quoted, a latency budget is respected, a safety guardrail fires. When a model or prompt changes, those behaviors drift. Today most teams find out in production.

agentprdiff turns those behaviors into versioned, diffable baselines you check into git, and a CI command that fails the build when they regress.

It is not a framework. Your agent stays exactly the way it is. agentprdiff records what it did, lets you assert what should be true about what it did, and compares runs across time.

10-line hello world

# suite.py
from agentprdiff import case, suite
from agentprdiff.graders import contains, tool_called, latency_lt_ms, semantic
from my_agent import run  # your agent — unchanged

support = suite(
    name="customer_support",
    agent=run,
    cases=[
        case(
            name="refund_happy_path",
            input="I want a refund for order #1234",
            expect=[
                contains("refund"),
                tool_called("lookup_order"),
                semantic("agent acknowledges the refund and explains the timeline"),
                latency_lt_ms(10_000),
            ],
        ),
    ],
)
Then, from the shell:

agentprdiff init
agentprdiff record suite.py     # save this run as the baseline
agentprdiff check  suite.py     # in CI: diff vs baseline, exit 1 on regression

That's the whole product. Four CLI commands. One Python file. Zero framework lock-in.

What's in the box

  • Case + Suite model — tiny, opinionated, no magic.
  • 10 batteries-included graders: contains, contains_any, regex_match, tool_called, tool_sequence, no_tool_called, output_length_lt, latency_lt_ms, cost_lt_usd, semantic (LLM-as-judge with pluggable backend).
  • Baseline store — JSON files under .agentprdiff/baselines/, meant to be committed. Reviewers see trace changes in pull requests.
  • Diff engine — per-case TraceDelta with assertion pass/fail changes, cost delta, latency delta, tool-sequence changes, and a unified output diff.
  • CI-ready CLI — exit 1 on regression, --json-out for artifact archiving, Rich-formatted terminal output.
  • Zero SDK lock-in — works with OpenAI, Anthropic, Gemini, Bedrock, LangChain, LangGraph, LlamaIndex, Vercel AI SDK, and custom wrappers. If you can wrap your agent in a function, agentprdiff can test it.
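To make the diff engine concrete, here is a hedged sketch of a per-case delta, assuming a recorded trace carries cost, latency, and an ordered tool list. The helper name `trace_delta` and the field names are illustrative only, not agentprdiff's exact TraceDelta schema:

```python
# Illustration only: a per-case diff over two recorded traces.
# Field names are hypothetical; agentprdiff's TraceDelta may differ.
def trace_delta(baseline: dict, current: dict) -> dict:
    return {
        "cost_delta_usd": current["cost_usd"] - baseline["cost_usd"],
        "latency_delta_ms": current["latency_ms"] - baseline["latency_ms"],
        "tools_changed": baseline["tools"] != current["tools"],
    }

base = {"cost_usd": 0.0012, "latency_ms": 340, "tools": ["lookup_order"]}
curr = {"cost_usd": 0.0018, "latency_ms": 420, "tools": ["lookup_order", "send_email"]}
delta = trace_delta(base, curr)
# tools_changed is True here: the new run called an extra tool
```

A reviewer reading such a delta in a PR sees at a glance that cost and latency rose and the tool sequence changed, which is exactly the signal a unified output diff alone would bury.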

How it compares

                                      Unit tests   LLM-as-judge eval   agentprdiff
Deterministic pass/fail               yes          no                  yes (when assertions are deterministic)
Catches behavioral drift              no           yes                 yes
Runs in CI on every PR                yes          too expensive       yes
Human-readable diff of what changed   n/a          rare                yes
Works without API keys                yes          no                  yes (deterministic graders + fake judge)

The value is in the combination: deterministic assertions for the 80% of behaviors you can encode as rules ("this tool was called", "this word appeared", "cost stayed under $0.02"), plus a semantic grader for the 20% that need a judge — with a fake-judge fallback so your CI stays green and free when API keys aren't available.
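As a hedged sketch of the fake-judge idea (the principle, not agentprdiff's actual implementation): a deterministic stand-in can grade a semantic criterion by checking that required keywords appear in the output, so CI stays green with no API key. The `fake_judge` helper below is hypothetical:

```python
# Illustration only: a deterministic stand-in for an LLM judge.
def fake_judge(required_keywords: list[str], output: str) -> bool:
    """Pass when every required keyword appears in the output (case-insensitive)."""
    text = output.lower()
    return all(k.lower() in text for k in required_keywords)

ok = fake_judge(["refund", "timeline"],
                "We've issued your refund; the typical timeline is 5-7 days.")
bad = fake_judge(["refund", "timeline"], "Please contact support.")
```

It is cruder than a real judge, but it is free, fast, and deterministic, which is what the keyless CI path needs.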

The workflow

  1. Write a Suite alongside your agent code.
  2. Run agentprdiff record once on a known-good version. Commit the resulting .agentprdiff/baselines/ directory.
  3. In CI, on every PR, run agentprdiff check. If any assertion regresses, or cost/latency budgets are breached, the job fails.
  4. When behavior intentionally changes, the PR author re-runs agentprdiff record, commits the new baseline, and explains the change in the PR description. Reviewers see the before/after in the diff.

This is the same loop as Jest snapshot tests or VCR cassettes — applied to LLM agents.
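The loop above is simple enough to sketch in a few lines, independent of agentprdiff itself: record serializes behavior to a committed JSON baseline, and check compares the current run against it. File names and fields here are illustrative:

```python
# Minimal illustration of the snapshot loop (not agentprdiff's internals).
import json
import tempfile
from pathlib import Path

def record(path: Path, behavior: dict) -> None:
    """Snapshot known-good behavior as a diffable, committable baseline."""
    path.write_text(json.dumps(behavior, indent=2, sort_keys=True))

def check(path: Path, behavior: dict) -> bool:
    """True when current behavior matches the baseline; False means regression."""
    return json.loads(path.read_text()) == behavior

baseline = Path(tempfile.gettempdir()) / "refund_happy_path.json"
record(baseline, {"tool_called": "lookup_order", "mentions_refund": True})
same = check(baseline, {"tool_called": "lookup_order", "mentions_refund": True})
drift = check(baseline, {"tool_called": None, "mentions_refund": False})
```

Sorted, indented JSON is what makes the baseline reviewable: git shows a line-level diff of exactly which behavior changed.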

Instrumenting your agent

agentprdiff doesn't monkey-patch anything. Your agent returns (output, Trace):

from agentprdiff import Trace, LLMCall, ToolCall

def my_agent(query: str) -> tuple[str, Trace]:
    trace = Trace(suite_name="", case_name="", input=query)

    # ... call your model, record what happened ...
    trace.record_llm_call(LLMCall(
        provider="anthropic",
        model="claude-sonnet-4-6",
        prompt_tokens=120, completion_tokens=80,
        cost_usd=0.0012, latency_ms=340,
    ))

    # ... call a tool, record what happened ...
    trace.record_tool_call(ToolCall(name="lookup_order", arguments={"id": "1234"}))

    return final_output, trace

Agents that return just an output still work — agentprdiff wraps them and captures wall-clock latency. You can backfill richer instrumentation incrementally, assertion by assertion.
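A plausible sketch of that wrapping, assuming nothing about agentprdiff's internals: time the call and attach the measured latency to a minimal trace dict (the `with_latency` name and trace fields are hypothetical):

```python
# Illustration only: wrap an output-only agent so latency is still captured.
import time

def with_latency(agent):
    def wrapped(query: str):
        start = time.perf_counter()
        output = agent(query)          # the unmodified, output-only agent
        latency_ms = (time.perf_counter() - start) * 1000
        return output, {"input": query, "latency_ms": latency_ms}
    return wrapped

echo = with_latency(lambda q: q.upper())
output, trace = echo("i want a refund")
```

That is enough for latency_lt_ms and output assertions to work on day one; tool and cost assertions come later as you add explicit instrumentation.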

CI integration

# .github/workflows/agents.yml
name: agent-regression
on: [pull_request]
jobs:
  agentprdiff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -e ".[dev]"
      - run: agentprdiff check suites/*.py --json-out artifacts/agentprdiff.json
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: agentprdiff, path: artifacts/ }

See docs/ci-integration.md for GitLab, CircleCI, and Buildkite.

Quickstart

A runnable end-to-end demo, no API keys needed:

git clone https://github.com/vnageshwaran-de/agentprdiff
cd agentprdiff
pip install -e ".[dev]"

cd examples/quickstart
agentprdiff init
agentprdiff record suite.py
agentprdiff check  suite.py   # exit 0

# now break the agent and watch agentprdiff catch it
sed -i "s/refund/noundr/g" agent.py
agentprdiff check suite.py    # exit 1; see the diff

Status

agentprdiff is alpha (0.1.0). The core model and CLI are stable; provider-specific SDK wrappers and a LangChain/LangGraph integration are on the 0.2 roadmap. See CHANGELOG.md.

Feedback, bug reports, and PRs extremely welcome. Open an issue or @ me.

License

MIT. See LICENSE.
