# agentpytest

Pytest for AI agents. Record trajectories, score them with any LLM judge, replay counterfactuals, and gate your CI with statistical rigor — no server, no SaaS, no vendor lock.

```bash
pip install agentpytest
```

## What it is
agentpytest is a pytest plugin that turns agent behavior into regression tests. It records each run as a JSON cassette, scores trajectories with an LLM judge of your choice (Anthropic, OpenAI, Gemini, Groq, Llama, local Ollama — anything LiteLLM supports), and tells you whether a change is a real regression or stochastic noise.
It works for any agent in any domain — coding agents, SRE runbooks, sales outreach, healthcare scheduling, finance automation. You write the agent once, point agentpytest at it, pick a judge, and you have regression tests.
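The cassette schema isn't shown on this page; as a rough mental model (field names below are assumptions, not the actual format), a recorded trajectory might look like:

```python
# Illustrative sketch only — the real cassette schema may differ.
# A cassette captures the full trajectory: every model call and tool call.
cassette = {
    "task": "Refactor auth.py to use TokenStore",
    "spans": [
        {"span_id": "s1", "type": "llm_call",
         "output": "I'll read auth.py first."},
        {"span_id": "s2", "type": "tool_call", "tool": "read_file",
         "args": {"path": "auth.py"}, "result": "...file contents..."},
        {"span_id": "s3", "type": "tool_call", "tool": "write_file",
         "args": {"path": "auth.py"}, "result": "ok"},
    ],
    "final_output": "auth.py now uses TokenStore everywhere.",
}
```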
## 60-second example

```python
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion, right_tool, redundant_call

from my_project import my_agent  # your agent under test

judge = Judge("gemini/gemini-2.5-pro")

@trajectory_test(cassette="cassettes/refactor.json", epochs=5)
def test_refactor():
    result = my_agent("Refactor auth.py to use TokenStore")
    assert goal_completion(result, judge=judge).value > 0.8
    assert right_tool(result, judge=judge).value > 0.9
    assert redundant_call(result).value < 0.1
```

```bash
pytest                           # diffs against cassette, fails on regression
pytest --update-trajectories     # regenerate cassette after intentional changes

agentpytest regress \
  --baseline runs/main.json \
  --candidate runs/pr-412.json   # CI gate with p-value + effect size
```
## Why it exists
The agent-eval ecosystem has converged everywhere except the test layer:
| Layer | Owned by |
|---|---|
| Model abstraction | LiteLLM |
| Trace storage + UI | Phoenix, Langfuse, Weave |
| Benchmarks / leaderboards | Inspect AI, SWE-bench, τ-bench |
| Hosted eval platforms | Braintrust, LangSmith, Patronus |
| The pytest layer in your repo | agentpytest |
agentpytest fills the missing slot. It's the thing you pip install and run locally — like pytest, like mypy, like any other dev tool you'd commit alongside your code.
## What makes it different

- **Pytest-native, vendorable, no server.** pip install, write tests, commit cassettes to git. No telemetry, no account, runs offline.
- **Any judge model.** Anthropic, OpenAI, Gemini, Groq, xAI, DeepSeek, Mistral, local vLLM/Ollama. Swap with one config line. Ensembles supported.
- **Agent model decoupled from judge model.** Your agent runs on OpenAI; judge it with Gemini or a local Qwen. Independent configs by design.
- **Statistical regression detection in core.** Bootstrap and paired-permutation tests with effect size and confidence intervals ship in the OSS library, not behind a paywall (see the statistics sketch after this list).
- **Counterfactual replay.** `fork_from(cassette, span_id, mutate)` replays a trajectory up to step N, mutates one tool response, and runs the agent forward — debug "what if this tool had failed" without rerunning the whole trace (see the replay sketch after this list).
- **Repo harness — eval on YOUR PRs.** Point at your repo and a list of past merged PRs; the agent attempts each one, and the library scores its diff against what your team actually shipped. Nothing else does this.
- **TRAIL failure-mode detectors.** ~20 named failure modes (tool-call repetition, goal deviation, format errors, retry storms, hallucinated tool output, ...) shipped as scorers.
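The regression gate's statistics are standard; as an illustration of the underlying technique (not agentpytest's actual implementation), a paired permutation test on per-task scores looks like this:

```python
import random

def paired_permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-task score differences.

    Under the null hypothesis (no real difference), the sign of each
    paired difference is arbitrary, so we randomly flip signs and count
    how often the resampled mean is at least as extreme as observed.
    """
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        resampled = sum(d * rng.choice((-1, 1)) for d in diffs) / len(diffs)
        if abs(resampled) >= abs(observed):
            hits += 1
    return observed, hits / n_resamples  # (mean difference, p-value)

# e.g. goal_completion scores for the same 20 tasks on main vs. a PR branch
effect, p = paired_permutation_test(
    baseline=[0.9, 0.8, 0.85, 0.9, 0.7] * 4,
    candidate=[0.7, 0.75, 0.8, 0.85, 0.6] * 4,
)
print(f"mean diff = {effect:+.3f}, p = {p:.4f}")
```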
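To make counterfactual replay concrete, here is a hedged sketch of how `fork_from` might be used. Only the call signature `fork_from(cassette, span_id, mutate)` comes from the bullet above; the import path, the callback shape, and the result handling are assumptions:

```python
from agentpytest import Judge, fork_from  # import path assumed
from agentpytest.scorers import goal_completion

judge = Judge("gemini/gemini-2.5-pro")

# Hypothetical mutation callback — the callback's shape is an assumption.
# Make the test-runner tool report a failure instead of its recorded result.
def fail_test_runner(tool_response: dict) -> dict:
    return {**tool_response, "result": "ERROR: 3 tests failed"}

# Replay the recorded trajectory up to span "s7", swap in the failing
# tool response, then let the agent run forward live from that point.
forked = fork_from("cassettes/refactor.json", span_id="s7", mutate=fail_test_runner)

# Did the agent notice the injected failure and recover?
assert goal_completion(forked, judge=judge).value > 0.5
```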
## Comparison
| Feature | agentpytest | Inspect AI | Phoenix | Braintrust | DeepEval | LangSmith | Ragas |
|---|---|---|---|---|---|---|---|
| Pytest-native, vendorable, no server | ✅ | ⚠️ harness | ❌ | ❌ SaaS | ✅ | ❌ SaaS | ✅ |
| Cassette record/replay with tool-lock | ✅ | ⚠️ cache | ❌ | ❌ | ❌ | ❌ | ❌ |
| Mutate-and-replay debugging | ✅ | ❌ | ⚠️ prompt-only | ❌ | ❌ | ❌ | ❌ |
| Statistical CI gate | ✅ | ⚠️ epochs | ❌ | ✅ paid | ❌ | ⚠️ paid | ❌ |
| Eval on YOUR repo's past PRs | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Any judge (~100 providers) | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ | ⚠️ | ⚠️ |
| Judge ensemble + variance signal | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| TRAIL failure-mode detectors | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
## Works for any domain
Same library, different rubrics:
| Domain | Example test |
|---|---|
| Coding agents | `assert diff_minimality(r, judge=judge).value > 0.8` |
| SRE / incident response | `assert dependency_order(r, expected_dag=runbook).value == 1.0` |
| Outbound sales | `assert can_spam_compliance(r, judge=judge).value > 0.95` |
| Healthcare scheduling | `assert phi_leak_check(r, judge=judge).value == 1.0` |
| Finance / spend | `assert policy_adherence(r, judge=judge, policy=spend_policy).value > 0.99` |
A coding-agent harness is built in from day one; a `Harness` plugin interface lets SRE / sales / healthcare / finance harnesses (PagerDuty, Salesforce, Epic, NetSuite) ship as separate packages.
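Custom domain scorers like those above follow a simple pattern: serialize the trajectory, hand it to a judge with a rubric, parse a score. A minimal sketch built directly on LiteLLM (which agentpytest uses for judge access) — `rubric_score` and everything in it is illustrative, not agentpytest's API:

```python
import json
import litellm

def rubric_score(transcript: str, rubric: str,
                 model: str = "gemini/gemini-2.5-pro") -> float:
    """Ask a judge model to grade a trajectory transcript against a rubric (0.0-1.0)."""
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": (
                "You are a strict evaluator. Grade the transcript against the rubric. "
                'Reply with JSON: {"score": <float 0..1>, "reason": "<one sentence>"}'
            )},
            {"role": "user", "content": f"RUBRIC:\n{rubric}\n\nTRANSCRIPT:\n{transcript}"},
        ],
        # JSON mode works for many providers via LiteLLM; drop this kwarg
        # if your judge model doesn't support it.
        response_format={"type": "json_object"},
    )
    return float(json.loads(response.choices[0].message.content)["score"])

# Usage, assuming `transcript` holds the serialized trajectory:
# assert rubric_score(transcript, "Never reveals patient PHI.") > 0.95
```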
## What you need to provide
| Required | Optional |
|---|---|
| An agent function | Expected tool sequence |
| A task / prompt | Custom rubric text |
| A judge model + key | Past PR list (for repo harness) |
|  | Domain policy documents |
No labeled corpus. The cassette is the ground truth, captured from the first run you accept. The judge evaluates against rubrics, not strings.
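For the optional inputs, a hedged sketch of how they might be wired in — the keyword arguments below other than `cassette` and `epochs` are hypothetical illustrations of the table above, not documented agentpytest parameters:

```python
from agentpytest import trajectory_test

# `expected_tools` and `rubric` are hypothetical names, illustrating
# the "Optional" column of the table above.
@trajectory_test(
    cassette="cassettes/triage.json",
    epochs=5,
    expected_tools=["fetch_alerts", "page_oncall"],       # expected tool sequence
    rubric="Escalate only if severity is P2 or higher.",  # custom rubric text
)
def test_triage():
    ...
```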
## Install & quickstart

```bash
pip install agentpytest
export ANTHROPIC_API_KEY=...   # or OPENAI_API_KEY, GEMINI_API_KEY, etc.
```

```python
# tests/test_my_agent.py
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion

from my_project import my_agent  # your agent function

judge = Judge("anthropic/claude-sonnet-4-6")

@trajectory_test(cassette="cassettes/hello.json", epochs=3)
def test_hello():
    result = my_agent("greet the user politely")
    assert goal_completion(result, judge=judge).value > 0.8
```

```bash
pytest tests/test_my_agent.py
```
First run records the cassette. Future runs diff against it.
## Documentation
- Quickstart — first test in 5 minutes
- Concepts — trajectories, cassettes, scorers, judges
- Scorers reference — every built-in scorer
- Repo harness guide — eval on your own PRs
- Domain cookbooks — SRE, sales, healthcare, finance, coding
- CI integration — GitHub Actions, GitLab, CircleCI
## Examples
Runnable example repos, one per domain:
- `agentpytest-examples-coding` — Claude Code-style agent
- `agentpytest-examples-sre` — incident-response agent
- `agentpytest-examples-sales` — outbound prospecting
- `agentpytest-examples-healthcare` — appointment booking
- `agentpytest-examples-finance` — expense classification
## Status
v0.1.0-alpha — core (`trajectory_test`, cassettes), 5 scorers (`goal_completion`, `right_tool`, `right_args`, `redundant_call`, `dependency_order`), `regression_test`, OTel export. The current PyPI upload is a placeholder release; the full library is coming soon. Roadmap in ROADMAP.md.
## License
MIT. No CLA. Fork freely.
## Contributing
Issues and PRs welcome. See CONTRIBUTING.md. One rule: no PR that adds a server, dashboard, or hosted feature. This stays a library.
## Acknowledgements
Built on the shoulders of LiteLLM, agent-vcr, pytest, and OpenTelemetry. TRAIL detector taxonomy inspired by Patronus AI's published research.