replayd

Turn failed AI agent runs into replayable regression tests.

When an AI agent fails in production, that failure becomes a test that runs before every future deployment. If the same failure returns after a prompt, model, or tool change, the release is blocked.

pip install replayd

The problem

AI agents regress silently. A team fixes a bug, changes a prompt or model, and the same bug quietly returns. Traditional software has regression tests and CI/CD to catch this. AI agents have nothing equivalent.

replayd is the open source fix. It replays known failures before you ship, so a known mistake is caught before release instead of quietly returning.

Quickstart

from replayd import Replayd

rp = Replayd()

# 1. Capture a run — assign run.output inside the block
with rp.capture(input=user_input, model="gpt-4o") as run:
    run.output = your_agent.run(user_input)

# Note: wrap your agent to record tool calls — see "Recording tool calls" below

# 2. Mark it as failed
rp.mark_failed(run.id, reason="agent approved refund after policy limit")

# 3. Save as a regression test
rp.save_test(
    run.id,
    forbidden_actions=["approve_refund"],
    expected_action="escalate",
)

# 4. Later — after changing your prompt or model — replay all tests
results = rp.replay_all(agent=your_agent_fn)

for r in results:
    print(r.verdict, r.reason)

See it working

Run the included example (python examples/basic_example.py) and you get:

Capturing a refund-approval agent run...
  agent called: approve_refund(amount=1200)  [policy limit is $500]
  output: {'action': 'approve_refund', 'amount': 1200}

Marking run as failed...
  reason: agent approved refund of $1200, exceeding $500 policy limit

Saving as regression test...
  forbidden: approve_refund  |  expected: escalate

-----------------------------------------
Replay #1 -- buggy agent (regression should be caught)
  [FAIL] Forbidden action 'approve_refund' was called during replay.

Replay #2 -- fixed agent (regression should be resolved)
  [PASS] No forbidden actions called; all expected actions present.
-----------------------------------------
1 failure caught. 1 resolved.

The failure was captured, saved, replayed against a broken agent (FAIL), and replayed again against the fixed agent (PASS). That is the full loop.

Recording tool calls

replayd cannot intercept tool calls automatically. Wrap your agent's tool dispatcher to record them:

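def my_agent(input, run_ctx):
    # call_tool stands for your agent's own tool dispatcher
    args = {"query": input["query"]}
    result = call_tool("search", args)
    # Record the tool name, its arguments, and its result on the run context
    run_ctx.record_tool_call("search", args, result)
    # ... rest of agent logic
    return final_output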

Pass this two-argument callable to replay_all:

results = rp.replay_all(agent=my_agent)
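One way to avoid repeating the record_tool_call line at every call site is to route all tool calls through a single recording dispatcher. The sketch below is an illustration only, not replayd's implementation: FakeRunContext, TOOLS, and make_recording_dispatcher are hypothetical names, and the only thing assumed about replayd is the record_tool_call(name, args, result) call shape shown above.

```python
from typing import Any, Callable, Dict, List, Tuple


class FakeRunContext:
    """Hypothetical stand-in for the run_ctx object passed to the agent."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any], Any]] = []

    def record_tool_call(self, name: str, args: Dict[str, Any], result: Any) -> None:
        # Same argument order as in the example above: name, args, result.
        self.calls.append((name, args, result))


# Hypothetical tool registry; replace with your agent's real tools.
TOOLS: Dict[str, Callable[..., Any]] = {
    "search": lambda query: [f"result for {query}"],
}


def make_recording_dispatcher(run_ctx: Any) -> Callable[[str, Dict[str, Any]], Any]:
    """Return a call_tool function that records every call on run_ctx."""

    def call_tool(name: str, args: Dict[str, Any]) -> Any:
        result = TOOLS[name](**args)
        run_ctx.record_tool_call(name, args, result)
        return result

    return call_tool


def my_agent(input: Dict[str, Any], run_ctx: Any) -> Dict[str, Any]:
    # Every tool call made through call_tool is recorded automatically.
    call_tool = make_recording_dispatcher(run_ctx)
    hits = call_tool("search", {"query": input["query"]})
    return {"action": "answer", "hits": hits}


ctx = FakeRunContext()
output = my_agent({"query": "refund policy"}, ctx)
print(ctx.calls)
```

With this shape, adding a new tool means adding an entry to the registry; the recording step cannot be forgotten at an individual call site.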

Grading

replayd does not grade on exact output matching. LLMs are non-deterministic: the same correct behavior can produce different output text from run to run, so exact matching creates false failures. Whether the wrong tool was called, however, is a fact. replayd grades on facts.

Failure type	Grading method
Wrong tool called, wrong argument, wrong state	Deterministic assertion — no LLM needed, never flaky
Policy violated, wrong reasoning, bad decision	LLM-as-judge via `grader_prompt`

The structural check always runs first. If a forbidden action fires, the test fails immediately without calling the LLM.
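The ordering described above can be sketched as follows. This is a simplified model under stated assumptions, not replayd's grader: Verdict, grade, and fake_judge are hypothetical names, and the judge is modeled as a plain callable that returns True when it finds a violation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    passed: bool
    reason: str


def grade(
    called_actions: List[str],
    forbidden_actions: List[str],
    expected_action: Optional[str] = None,
    judge_finds_violation: Optional[Callable[[], bool]] = None,
) -> Verdict:
    # 1. Deterministic check: any forbidden action fails immediately.
    for action in called_actions:
        if action in forbidden_actions:
            return Verdict(False, f"Forbidden action '{action}' was called during replay.")
    # 2. Deterministic check: the expected action must be present.
    if expected_action is not None and expected_action not in called_actions:
        return Verdict(False, f"Expected action '{expected_action}' was not called.")
    # 3. Only now consult the (hypothetical) LLM judge, if one is configured.
    if judge_finds_violation is not None and judge_finds_violation():
        return Verdict(False, "Judge flagged the output.")
    return Verdict(True, "No forbidden actions called; all expected actions present.")


judge_calls: List[int] = []


def fake_judge() -> bool:
    # Stand-in for an LLM call; counts invocations and reports no violation.
    judge_calls.append(1)
    return False


buggy = grade(["approve_refund"], ["approve_refund"], "escalate", fake_judge)
fixed = grade(["escalate"], ["approve_refund"], "escalate", fake_judge)
print(buggy.passed, buggy.reason)
print(fixed.passed, fixed.reason)
print(len(judge_calls))
```

The judge runs only for the second replay: the first one fails on the structural check before any judge call is made.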

Semantic grading

For failures that can only be evaluated by reading the output:

rp.save_test(
    run.id,
    grader_prompt="Did the agent approve a refund that exceeds the $500 policy limit?",
)

Requires:

pip install "replayd[semantic]"
export ANTHROPIC_API_KEY=sk-...

Storage

Runs and tests are stored as JSON files in .replayd/ in your working directory:

.replayd/
  runs/<run-id>.json    <- full record of each captured run
  tests/<test-id>.json  <- saved regression tests

No database. No hosted backend. Check .replayd/tests/ into version control to share tests with your team. The .gitignore included in this repo excludes .replayd/ by default, so adjust the ignore rule (or use git add -f) to commit only the tests/ subfolder, not captured runs.
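Because everything is plain JSON on disk, saved tests can be inspected with the standard library alone. The sketch below builds a throwaway directory with the layout shown above and reads it back. The load_tests helper and the field names inside the JSON file are assumptions for illustration; the real file schema is not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_tests(root: Path) -> List[Dict[str, Any]]:
    """Load every JSON file under <root>/tests, sorted by file name."""
    tests_dir = root / "tests"
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(tests_dir.glob("*.json"))
    ]


# Build a temporary .replayd/ directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".replayd"
    (root / "runs").mkdir(parents=True)
    (root / "tests").mkdir()
    # Field names are illustrative only, not replayd's real schema.
    example = {
        "id": "t1",
        "forbidden_actions": ["approve_refund"],
        "expected_action": "escalate",
    }
    (root / "tests" / "t1.json").write_text(json.dumps(example, indent=2), encoding="utf-8")
    loaded = load_tests(root)

print(loaded[0]["forbidden_actions"])
```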

CI integration

A ready-to-use script is included at scripts/regression_check.py. Copy it into your repo, replace the agent import, and add this to your workflow:

# .github/workflows/regression.yml
- name: Run regression tests
  run: python scripts/regression_check.py
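A CI step blocks a release only if the script exits with a non-zero status. The sketch below shows that gate logic in isolation. It is not the contents of scripts/regression_check.py: Result and exit_code are hypothetical names, and the "PASS"/"FAIL" verdict strings are assumed from the sample output shown earlier.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Result:
    """Hypothetical stand-in exposing the two fields printed in the Quickstart."""

    verdict: str
    reason: str


def exit_code(results: Iterable[Result]) -> int:
    """Print one line per result; return 1 if any replay did not pass, else 0."""
    failed = 0
    for r in results:
        print(f"[{r.verdict}] {r.reason}")
        if r.verdict != "PASS":
            failed += 1
    return 1 if failed else 0


# In a real script these would come from rp.replay_all(agent=...).
demo = [
    Result("PASS", "No forbidden actions called; all expected actions present."),
    Result("FAIL", "Forbidden action 'approve_refund' was called during replay."),
]
code = exit_code(demo)
print("exit code:", code)
# A CI script would finish with sys.exit(code) so the workflow step fails.
```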

What replayd is not

replayd is not an observability tool. LangSmith, Braintrust, and Arize tell you what happened after the fact. replayd is an active release gate — it replays known failures before you ship. Passive vs active. That is the distinction.

Part of TAQ by Stonepath Labs

replayd is the open source core of TAQ — the full AI release control platform.

TAQ adds: a dashboard, hosted backend, team access controls, release gate enforcement, and audit logs. replayd gets your team started with the concept. TAQ is what you run it on in production.

stonepathlab.net

Contributing

Bug reports and pull requests are welcome. Open an issue on GitHub to discuss the change before sending a large PR.

No separate build tooling is required; pip install -e ".[dev]" installs everything needed to run the tests:

pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.

Release history

0.1.3: Jun 3, 2026 (this version)
0.1.2: May 30, 2026
0.1.1: May 30, 2026
0.1.0: May 30, 2026

