
Pytest for AI agents — record, score, replay, regress. Vendorable, judge-agnostic, domain-neutral. (Placeholder release; full library coming soon.)


agentpytest

Pytest for AI agents. Record trajectories, score them with any LLM judge, replay counterfactuals, and gate your CI with statistical rigor — no server, no SaaS, no vendor lock.


pip install agentpytest

What it is

agentpytest is a pytest plugin that turns agent behavior into regression tests. It records each run as a JSON cassette, scores trajectories with an LLM judge of your choice (Anthropic, OpenAI, Gemini, Groq, Llama, local Ollama — anything LiteLLM supports), and tells you whether a change is a real regression or stochastic noise.

It works for any agent in any domain — coding agents, SRE runbooks, sales outreach, healthcare scheduling, finance automation. You write the agent once, point agentpytest at it, pick a judge, and you have regression tests.

60-second example

from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion, right_tool, redundant_call

judge = Judge("gemini/gemini-2.5-pro")

@trajectory_test(cassette="cassettes/refactor.json", epochs=5)
def test_refactor():
    result = my_agent("Refactor auth.py to use TokenStore")
    assert goal_completion(result, judge=judge).value > 0.8
    assert right_tool(result, judge=judge).value > 0.9
    assert redundant_call(result).value < 0.1

pytest                           # diffs against cassette, fails on regression
pytest --update-trajectories     # regenerate cassette after intentional changes
agentpytest regress \
  --baseline runs/main.json \
  --candidate runs/pr-412.json   # CI gate with p-value + effect size

Why it exists

The agent-eval ecosystem has converged everywhere except the test layer:

  • Model abstraction: LiteLLM
  • Trace storage + UI: Phoenix, Langfuse, Weave
  • Benchmarks / leaderboards: Inspect AI, SWE-bench, τ-bench
  • Hosted eval platforms: Braintrust, LangSmith, Patronus
  • The pytest layer in your repo: agentpytest

agentpytest fills the missing slot. It's the thing you pip install and run locally — like pytest, like mypy, like any other dev tool you'd commit alongside your code.

What makes it different

  • Pytest-native, vendorable, no server. pip install, write tests, commit cassettes to git. No telemetry, no account, runs offline.
  • Any judge model. Anthropic, OpenAI, Gemini, Groq, xAI, DeepSeek, Mistral, local vLLM/Ollama. Swap with one config line. Ensembles supported.
  • Agent-model decoupled from judge-model. Your agent runs OpenAI; judge it with Gemini or local Qwen. Independent configs by design.
  • Statistical regression detection in core. Bootstrap and paired-permutation tests with effect size + CI ship in the OSS library, not behind a paywall.
  • Counterfactual replay. fork_from(cassette, span_id, mutate) replays a trajectory up to step N, mutates one tool response, and runs the agent forward — debug "what if this tool had failed" without rerunning the whole trace.
  • Repo harness — eval on YOUR PRs. Point it at your repo and a list of past merged PRs; the agent attempts each one, and the library scores its diff against what your team actually shipped. Nothing else does this.
  • TRAIL failure-mode detectors. ~20 named failure modes (tool-call repetition, goal deviation, format error, retry storms, hallucinated tool output, ...) shipped as scorers.
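The paired-permutation machinery behind `agentpytest regress` is standard statistics. Since this is still a placeholder release, here is a minimal sketch of the idea in plain Python — the function name and the scores below are illustrative, not the library's API:

```python
import random

def paired_permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-task score differences.

    Under the null hypothesis (no regression), the sign of each paired
    difference is exchangeable, so we randomly flip signs and count how
    often the resampled mean difference is at least as extreme as observed.
    """
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / len(diffs)          # effect size (mean difference)
    extreme = 0
    for _ in range(n_resamples):
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(resampled) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_resamples + 1)  # add-one to avoid p == 0
    return observed, p_value

# Hypothetical per-task goal_completion scores for the same 8 tasks on two branches.
baseline  = [0.9, 0.8, 0.85, 0.9, 0.95, 0.7, 0.8, 0.9]
candidate = [0.6, 0.7, 0.65, 0.8, 0.70, 0.5, 0.6, 0.7]
effect, p = paired_permutation_test(baseline, candidate)
print(f"mean diff={effect:+.3f}, p={p:.4f}")
```

Pairing matters: comparing the same tasks on both branches removes between-task variance, so far fewer epochs are needed to separate a real regression from stochastic noise.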

Comparison

Feature agentpytest Inspect AI Phoenix Braintrust DeepEval LangSmith Ragas
Pytest-native, vendorable, no server ⚠️ harness ❌ SaaS ❌ SaaS
Cassette record/replay with tool-lock ⚠️ cache
Mutate-and-replay debugging ⚠️ prompt-only
Statistical CI gate ⚠️ epochs ✅ paid ⚠️ paid
Eval on YOUR repo's past PRs
Any judge (~100 providers) ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Judge ensemble + variance signal
TRAIL failure-mode detectors

Works for any domain

Same library, different rubrics:

  • Coding agents: assert diff_minimality(r, judge=judge).value > 0.8
  • SRE / incident response: assert dependency_order(r, expected_dag=runbook).value == 1.0
  • Outbound sales: assert can_spam_compliance(r, judge=judge).value > 0.95
  • Healthcare scheduling: assert phi_leak_check(r, judge=judge).value == 1.0
  • Finance / spend: assert policy_adherence(r, judge=judge, policy=spend_policy).value > 0.99
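Scorers like dependency_order need no LLM judge at all. A minimal sketch of the idea — the respects_dag helper and runbook below are hypothetical, not the shipped scorer — checking that the observed tool-call order satisfies an expected dependency DAG:

```python
def respects_dag(call_order, dag):
    """Return 1.0 if every prerequisite edge in the DAG is honored, else 0.0.

    `call_order` is the sequence of tool names from a trajectory;
    `dag` maps each step to the steps that must be called before it.
    """
    position = {name: i for i, name in enumerate(call_order)}
    for step, prereqs in dag.items():
        for pre in prereqs:
            # A missing step or an out-of-order step breaks the runbook.
            if step not in position or pre not in position:
                return 0.0
            if position[pre] >= position[step]:
                return 0.0
    return 1.0

runbook = {  # SRE runbook: what must happen before what
    "restart_service": ["ack_alert", "check_health"],
    "close_incident": ["restart_service"],
}
good = ["ack_alert", "check_health", "restart_service", "close_incident"]
bad  = ["restart_service", "ack_alert", "check_health", "close_incident"]
print(respects_dag(good, runbook), respects_dag(bad, runbook))  # 1.0 0.0
```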

A coding-agent harness is built in on day one; a Harness plugin interface lets SRE / sales / healthcare / finance harnesses (pagerduty, salesforce, epic, netsuite) ship as separate packages.

What you need to provide

Required:
  • An agent function
  • A task / prompt
  • A judge model + key

Optional:
  • Expected tool sequence
  • Custom rubric text
  • Past PR list (for repo harness)
  • Domain policy documents

No labeled corpus. The cassette is the ground truth, captured from the first run you accept. The judge evaluates against rubrics, not strings.
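Deterministic scorers also need no labels: redundant_call-style checks are pure functions over the trajectory. A hypothetical sketch — illustrative names, not the library's implementation:

```python
from collections import Counter

def redundant_call_ratio(tool_calls):
    """Fraction of tool calls that exactly repeat an earlier call.

    Each call is a (tool_name, args) pair; an agent that re-reads the
    same file a second time has made 1 redundant call out of N.
    """
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(tool_calls)

trajectory = [
    ("read_file", "auth.py"),
    ("read_file", "auth.py"),   # exact repeat -> redundant
    ("edit_file", "auth.py"),
    ("run_tests", "tests/"),
]
print(redundant_call_ratio(trajectory))  # 0.25
```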

Install & quickstart

pip install agentpytest

export ANTHROPIC_API_KEY=...    # or OPENAI_API_KEY, GEMINI_API_KEY, etc.

# tests/test_my_agent.py
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion

judge = Judge("anthropic/claude-sonnet-4-6")

@trajectory_test(cassette="cassettes/hello.json", epochs=3)
def test_hello():
    result = my_agent("greet the user politely")
    assert goal_completion(result, judge=judge).value > 0.8

pytest tests/test_my_agent.py

First run records the cassette. Future runs diff against it.
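Under the hood this is VCR-style record/replay for tool calls. A minimal sketch of the mechanism, with hypothetical names — not agentpytest's actual classes:

```python
import json
import os

class Cassette:
    """Record tool calls to JSON on the first run; replay them afterwards."""

    def __init__(self, path):
        self.path = path
        self.recording = not os.path.exists(path)
        if self.recording:
            self.calls = []
        else:
            with open(path) as f:
                self.calls = json.load(f)
        self.cursor = 0

    def tool(self, name, fn, *args):
        if self.recording:
            result = fn(*args)
            self.calls.append({"tool": name, "args": list(args), "result": result})
            return result
        entry = self.calls[self.cursor]
        self.cursor += 1
        # Tool-lock: fail loudly if the agent diverges from the recording.
        assert entry["tool"] == name and entry["args"] == list(args), \
            f"trajectory diverged at step {self.cursor}"
        return entry["result"]

    def save(self):
        if self.recording:
            with open(self.path, "w") as f:
                json.dump(self.calls, f, indent=2)

first = Cassette("hello_cassette.json")   # file absent: records live calls
out1 = first.tool("greet", lambda n: f"Hello, {n}!", "Ada")
first.save()

second = Cassette("hello_cassette.json")  # file present: replays, fn not called
out2 = second.tool("greet", lambda n: "never called", "Ada")
print(out1, out2)  # Hello, Ada! Hello, Ada!
```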

Documentation

Examples

Runnable example repos, one per domain:

Status

v0.1.0-alpha — core (trajectory_test, cassettes), 5 scorers (goal_completion, right_tool, right_args, redundant_call, dependency_order), regression_test, OTel export. Roadmap in ROADMAP.md.

License

MIT. No CLA. Fork freely.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md. One rule: no PR that adds a server, dashboard, or hosted feature. This stays a library.

Acknowledgements

Built on the shoulders of LiteLLM, agent-vcr, pytest, and OpenTelemetry. TRAIL detector taxonomy inspired by Patronus AI's published research.
