
Pytest for AI agents — record, score, replay, regress. Vendorable, judge-agnostic, domain-neutral. (Placeholder release; full library coming soon.)


agentpytest

Pytest for AI agents. Record trajectories, score them with any LLM judge, replay counterfactuals, and gate your CI with statistical rigor — no server, no SaaS, no vendor lock.


pip install agentpytest

What it is

agentpytest is a pytest plugin that turns agent behavior into regression tests. It records each run as a JSON cassette, scores trajectories with an LLM judge of your choice (Anthropic, OpenAI, Gemini, Groq, Llama, local Ollama — anything LiteLLM supports), and tells you whether a change is a real regression or stochastic noise.

It works for any agent in any domain — coding agents, SRE runbooks, sales outreach, healthcare scheduling, finance automation. You write the agent once, point agentpytest at it, pick a judge, and you have regression tests.

60-second example

from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion, right_tool, redundant_call

judge = Judge("gemini/gemini-2.5-pro")

@trajectory_test(cassette="cassettes/refactor.json", epochs=5)
def test_refactor():
    result = my_agent("Refactor auth.py to use TokenStore")
    assert goal_completion(result, judge=judge).value > 0.8
    assert right_tool(result, judge=judge).value > 0.9
    assert redundant_call(result).value < 0.1

pytest                           # diffs against cassette, fails on regression
pytest --update-trajectories     # regenerate cassette after intentional changes
agentpytest regress \
  --baseline runs/main.json \
  --candidate runs/pr-412.json   # CI gate with p-value + effect size

Why it exists

The agent-eval ecosystem has converged everywhere except the test layer:

  • Model abstraction: LiteLLM
  • Trace storage + UI: Phoenix, Langfuse, Weave
  • Benchmarks / leaderboards: Inspect AI, SWE-bench, τ-bench
  • Hosted eval platforms: Braintrust, LangSmith, Patronus
  • The pytest layer in your repo: agentpytest

agentpytest fills the missing slot. It's the thing you pip install and run locally — like pytest, like mypy, like any other dev tool you'd commit alongside your code.

What makes it different

  • Pytest-native, vendorable, no server. pip install, write tests, commit cassettes to git. No telemetry, no account, runs offline.
  • Any judge model. Anthropic, OpenAI, Gemini, Groq, xAI, DeepSeek, Mistral, local vLLM/Ollama. Swap with one config line. Ensembles supported.
  • Agent-model decoupled from judge-model. Your agent runs OpenAI; judge it with Gemini or local Qwen. Independent configs by design.
  • Statistical regression detection in core. Bootstrap and paired-permutation tests with effect size + CI ship in the OSS library, not behind a paywall.
  • Counterfactual replay. fork_from(cassette, span_id, mutate) replays a trajectory up to step N, mutates one tool response, and runs the agent forward — debug "what if this tool had failed" without rerunning the whole trace.
  • Repo harness — eval on YOUR PRs. Point it at your repo and a list of past merged PRs; the agent attempts each one, and the library scores its diff against what your team actually shipped. Nothing else does this.
  • TRAIL failure-mode detectors. ~20 named failure modes (tool-call repetition, goal deviation, format error, retry storms, hallucinated tool output, ...) shipped as scorers.
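The paired-permutation machinery behind `agentpytest regress` is standard statistics. Since this is still a placeholder release, here is a minimal sketch of the idea in plain Python — the function name and the scores below are illustrative, not the library's API:

```python
import random

def paired_permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-task score differences.

    Under the null hypothesis (no regression), the sign of each paired
    difference is exchangeable, so we randomly flip signs and count how
    often the resampled mean difference is at least as extreme as observed.
    """
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / len(diffs)          # effect size (mean difference)
    extreme = 0
    for _ in range(n_resamples):
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(resampled) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_resamples + 1)  # add-one to avoid p == 0
    return observed, p_value

# Hypothetical per-task goal_completion scores for the same 8 tasks on two branches.
baseline  = [0.9, 0.8, 0.85, 0.9, 0.95, 0.7, 0.8, 0.9]
candidate = [0.6, 0.7, 0.65, 0.8, 0.70, 0.5, 0.6, 0.7]
effect, p = paired_permutation_test(baseline, candidate)
print(f"mean diff={effect:+.3f}, p={p:.4f}")
```

Pairing matters: comparing the same tasks on both branches removes between-task variance, so far fewer epochs are needed to separate a real regression from stochastic noise.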

Comparison

Feature agentpytest Inspect AI Phoenix Braintrust DeepEval LangSmith Ragas
Pytest-native, vendorable, no server ⚠️ harness ❌ SaaS ❌ SaaS
Cassette record/replay with tool-lock ⚠️ cache
Mutate-and-replay debugging ⚠️ prompt-only
Statistical CI gate ⚠️ epochs ✅ paid ⚠️ paid
Eval on YOUR repo's past PRs
Any judge (~100 providers) ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Judge ensemble + variance signal
TRAIL failure-mode detectors

Works for any domain

Same library, different rubrics:

  • Coding agents: assert diff_minimality(r, judge=judge).value > 0.8
  • SRE / incident response: assert dependency_order(r, expected_dag=runbook).value == 1.0
  • Outbound sales: assert can_spam_compliance(r, judge=judge).value > 0.95
  • Healthcare scheduling: assert phi_leak_check(r, judge=judge).value == 1.0
  • Finance / spend: assert policy_adherence(r, judge=judge, policy=spend_policy).value > 0.99
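Scorers like dependency_order need no LLM judge at all. A minimal sketch of the idea — the respects_dag helper and runbook below are hypothetical, not the shipped scorer — checking that the observed tool-call order satisfies an expected dependency DAG:

```python
def respects_dag(call_order, dag):
    """Return 1.0 if every prerequisite edge in the DAG is honored, else 0.0.

    `call_order` is the sequence of tool names from a trajectory;
    `dag` maps each step to the steps that must be called before it.
    """
    position = {name: i for i, name in enumerate(call_order)}
    for step, prereqs in dag.items():
        for pre in prereqs:
            # A missing step or an out-of-order step breaks the runbook.
            if step not in position or pre not in position:
                return 0.0
            if position[pre] >= position[step]:
                return 0.0
    return 1.0

runbook = {  # SRE runbook: what must happen before what
    "restart_service": ["ack_alert", "check_health"],
    "close_incident": ["restart_service"],
}
good = ["ack_alert", "check_health", "restart_service", "close_incident"]
bad  = ["restart_service", "ack_alert", "check_health", "close_incident"]
print(respects_dag(good, runbook), respects_dag(bad, runbook))  # 1.0 0.0
```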

A coding-agent harness is built in on day one; a Harness plugin interface lets SRE / sales / healthcare / finance harnesses (pagerduty, salesforce, epic, netsuite) ship as separate packages.

What you need to provide

Required:
  • An agent function
  • A task / prompt
  • A judge model + key

Optional:
  • Expected tool sequence
  • Custom rubric text
  • Past PR list (for repo harness)
  • Domain policy documents

No labeled corpus. The cassette is the ground truth, captured from the first run you accept. The judge evaluates against rubrics, not strings.
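Deterministic scorers also need no labels: redundant_call-style checks are pure functions over the trajectory. A hypothetical sketch — illustrative names, not the library's implementation:

```python
from collections import Counter

def redundant_call_ratio(tool_calls):
    """Fraction of tool calls that exactly repeat an earlier call.

    Each call is a (tool_name, args) pair; an agent that re-reads the
    same file a second time has made 1 redundant call out of N.
    """
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(tool_calls)

trajectory = [
    ("read_file", "auth.py"),
    ("read_file", "auth.py"),   # exact repeat -> redundant
    ("edit_file", "auth.py"),
    ("run_tests", "tests/"),
]
print(redundant_call_ratio(trajectory))  # 0.25
```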

Install & quickstart

pip install agentpytest

export ANTHROPIC_API_KEY=...    # or OPENAI_API_KEY, GEMINI_API_KEY, etc.

# tests/test_my_agent.py
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion

judge = Judge("anthropic/claude-sonnet-4-6")

@trajectory_test(cassette="cassettes/hello.json", epochs=3)
def test_hello():
    result = my_agent("greet the user politely")
    assert goal_completion(result, judge=judge).value > 0.8

pytest tests/test_my_agent.py

First run records the cassette. Future runs diff against it.
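Under the hood this is VCR-style record/replay for tool calls. A minimal sketch of the mechanism, with hypothetical names — not agentpytest's actual classes:

```python
import json
import os

class Cassette:
    """Record tool calls to JSON on the first run; replay them afterwards."""

    def __init__(self, path):
        self.path = path
        self.recording = not os.path.exists(path)
        if self.recording:
            self.calls = []
        else:
            with open(path) as f:
                self.calls = json.load(f)
        self.cursor = 0

    def tool(self, name, fn, *args):
        if self.recording:
            result = fn(*args)
            self.calls.append({"tool": name, "args": list(args), "result": result})
            return result
        entry = self.calls[self.cursor]
        self.cursor += 1
        # Tool-lock: fail loudly if the agent diverges from the recording.
        assert entry["tool"] == name and entry["args"] == list(args), \
            f"trajectory diverged at step {self.cursor}"
        return entry["result"]

    def save(self):
        if self.recording:
            with open(self.path, "w") as f:
                json.dump(self.calls, f, indent=2)

first = Cassette("hello_cassette.json")   # file absent: records live calls
out1 = first.tool("greet", lambda n: f"Hello, {n}!", "Ada")
first.save()

second = Cassette("hello_cassette.json")  # file present: replays, fn not called
out2 = second.tool("greet", lambda n: "never called", "Ada")
print(out1, out2)  # Hello, Ada! Hello, Ada!
```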

Documentation

Examples

Runnable example repos, one per domain:

Status

v0.1.0-alpha — core (trajectory_test, cassettes), 5 scorers (goal_completion, right_tool, right_args, redundant_call, dependency_order), regression_test, OTel export. Roadmap in ROADMAP.md.

License

MIT. No CLA. Fork freely.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md. One rule: no PR that adds a server, dashboard, or hosted feature. This stays a library.

Acknowledgements

Built on the shoulders of LiteLLM, agent-vcr, pytest, and OpenTelemetry. TRAIL detector taxonomy inspired by Patronus AI's published research.
