Structural trajectory regression testing for AI agents - diff what your agent did, not just its score.

These details have not been verified by PyPI

Project description

tracediff

Structural trajectory regression testing for AI agents. Diff what your agent did, not just its score.

Every eval tool can tell you "accuracy dropped 3%." tracediff tells you why:

2 regression(s), 1 cost regression(s), 1 behavior change(s) across 4 task(s)

[REGRESSION] summarize-meeting
    - pass rate 100% -> 0%
    - calls read_file instead of read_file at position 0
    - read_file args drifted: path
[BEHAVIOR CHANGE] refund-order
    - issue_refund args drifted: amount   ({"amount": 49.99} -> {"amount": 499.99})
[COST REGRESSION] capital-question
    - now calls search at position 1
    - mean cost $0.0012 -> $0.0029 (2.42x)

That third one is the kind of bug that never shows up in a score: the agent still answers correctly — it just silently started calling search twice and your bill doubled. The second one is worse: output unchanged, refund amount 10x. Score-level diffing misses both.

Why

Scores hide behavior. Pass/fail diffs and LLM-judge verdicts can't tell you "step 4's tool args drifted between commits."
Agents are stochastic. A single run is a sample, not a measurement. tracediff runs repeats and reports variance, so you can tell drift from noise.
Cost is a first-class metric. Research (Kapoor et al., NeurIPS 2024) showed accuracy-only evals reward agents that cost 50x more for the same results.
Benchmarks leak. Most agent benchmarks have no holdout discipline. tracediff suites have a built-in dev/holdout split with a reveal budget — evaluating the holdout more than N times per suite version requires an explicit, recorded override.
BYOK by construction. tracediff never calls a model provider. Your agent runs with your keys; tracediff scores the traces.

Install

pip install tracediff

Quickstart (60 seconds, no API keys)

The repo ships a deterministic demo agent. Run the baseline, "change the code" (set an env var), re-run, and diff:

git clone https://github.com/Abhishekpundir23/tracediff && cd tracediff/examples

tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out baseline.json

# simulate a code change that subtly breaks the agent
TRACEDIFF_DEMO_VARIANT=b tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out current.json

tracediff diff baseline.json current.json --md report.md

Wiring up your agent

Expose one function. It gets the task input and returns a trace in any of three shapes:

# my_agent.py
def run(task_input):
    messages, usage = my_agent_loop(task_input)   # your existing code, your keys
    return {"messages": messages, "usage": usage}  # OpenAI-style messages work as-is

Accepted return shapes:

OpenAI-style message list (assistant tool_calls + tool-role results) — works directly with most frameworks' message history.
tracediff.Trace — build it natively for full control (per-step tokens/cost).
Serialized dict — {"steps": [...], "final_output": "..."}.

Writing a suite

suite: my-agent-suite
seed: 7
holdout_fraction: 0.25     # deterministic split by task-id hash
max_holdout_reveals: 5     # holdout governance: budgeted, recorded reveals

tasks:
  - id: refund-order
    input: { topic: refund, order_id: A-100 }
    expect:
      tools: [lookup_order, issue_refund]   # expected tool trajectory
      mode: strict                          # strict | unordered | subset
      args:
        issue_refund: { order_id: A-100, amount: 49.99 }
      max_tool_calls: 4                     # budgets are first-class
      max_cost_usd: 0.01
    checks:
      - type: output_contains               # output_contains | output_not_contains
        value: refund                       # | output_equals | output_regex

The suite version is a content hash — edit any task and you get a new version. Diffs warn when results from different suite versions are compared, and each new version gets a fresh holdout budget.

tracediff suite suite.yaml      # inspect version hash + dev/holdout split
tracediff run --suite suite.yaml --agent my_agent:run --split holdout   # budgeted

CI: structural diffs on every PR

# .github/workflows/tracediff.yml
name: tracediff
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }

      # restore the baseline produced on main (artifact, cache, or committed file)
      - name: Restore baseline
        run: cp .tracediff/baseline.json baseline.json

      - name: Run + diff
        run: |
          pip install tracediff
          tracediff run --suite evals/suite.yaml --agent my_agent:run --repeats 3 --out current.json
          tracediff diff baseline.json current.json --md report.md   # exits 1 on regressions

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: fs.readFileSync('report.md', 'utf8'),
            });

A composite action wrapping these steps lives in action/.

What gets detected

Category	Example finding
regression	pass rate 100% → 33% on `summarize-meeting`
behavior change	`issue_refund` args drifted: `amount` 49.99 → 499.99 (output unchanged!)
cost regression	now calls `search` twice; mean cost 2.4x baseline
improvement	pass rate 50% → 100%

Plus: tools added/removed/replaced/reordered with positions, step-count drift, trajectory variance across repeats (flakiness), tasks added/removed, suite-version mismatch warnings.

Roadmap

v0.1 (this): trace ingestion, structural scoring + budgets, repeat variance, structural diff, CLI, CI action, holdout governance
v0.2: adapters for LangGraph / OpenAI Agents SDK / Claude Agent SDK trace exports, OpenTelemetry GenAI spans
v0.3: automated benchmark construction — generate decontaminated, holdout-split task suites from your domain

License

Apache-2.0

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.2.0

Jun 11, 2026

This version

0.1.0

Jun 11, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tracediff-0.1.0.tar.gz (25.4 kB view details)

Uploaded Jun 11, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

tracediff-0.1.0-py3-none-any.whl (22.8 kB view details)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file tracediff-0.1.0.tar.gz.

File metadata

Download URL: tracediff-0.1.0.tar.gz
Upload date: Jun 11, 2026
Size: 25.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracediff-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`bc6d300c3d6624ccddbd91a78f633f9a590ff247b2ca4922fc8a1b5d6a92f311`
MD5	`a2c9b50791621f219fe81f0c9582f465`
BLAKE2b-256	`77425afe05f7d8585b06ae6340d3e8f8b67c6891a0b7679fa8b88c94c4be921e`

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracediff-0.1.0.tar.gz:

Publisher: publish.yml on Abhishekpundir23/tracediff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracediff-0.1.0.tar.gz
- Subject digest: bc6d300c3d6624ccddbd91a78f633f9a590ff247b2ca4922fc8a1b5d6a92f311
- Sigstore transparency entry: 1789361331
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: Abhishekpundir23/tracediff@ad625318ac0c5dc6c77be711b3fa767233f18eaf
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Abhishekpundir23
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ad625318ac0c5dc6c77be711b3fa767233f18eaf
- Trigger Event: release

File details

Details for the file tracediff-0.1.0-py3-none-any.whl.

File metadata

Download URL: tracediff-0.1.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 22.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracediff-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2b2063c89a129ef73dd4d21850452e300f1abcb1d86844133895fd33b409761c`
MD5	`f5a9bb8a008559ac8ff2a78e16b6e084`
BLAKE2b-256	`a91813916c69a7db33a4a3e9f3fc41a930b251320184139ebf2b5daebafa6a06`

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracediff-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Abhishekpundir23/tracediff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracediff-0.1.0-py3-none-any.whl
- Subject digest: 2b2063c89a129ef73dd4d21850452e300f1abcb1d86844133895fd33b409761c
- Sigstore transparency entry: 1789361374
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: Abhishekpundir23/tracediff@ad625318ac0c5dc6c77be711b3fa767233f18eaf
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Abhishekpundir23
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ad625318ac0c5dc6c77be711b3fa767233f18eaf
- Trigger Event: release

tracediff 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

tracediff

Why

Install

Quickstart (60 seconds, no API keys)

Wiring up your agent

Writing a suite

CI: structural diffs on every PR

What gets detected

Roadmap

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance