Skip to main content

Structural trajectory regression testing for AI agents - diff what your agent did, not just its score.

Project description

tracediff

Structural trajectory regression testing for AI agents. Diff what your agent did, not just its score.

Every eval tool can tell you "accuracy dropped 3%." tracediff tells you why:

2 regression(s), 1 cost regression(s), 1 behavior change(s) across 4 task(s)

[REGRESSION] summarize-meeting
    - pass rate 100% -> 0%
    - calls read_file instead of read_file at position 0
    - read_file args drifted: path
[BEHAVIOR CHANGE] refund-order
    - issue_refund args drifted: amount   ({"amount": 49.99} -> {"amount": 499.99})
[COST REGRESSION] capital-question
    - now calls search at position 1
    - mean cost $0.0012 -> $0.0029 (2.42x)

That third one is the kind of bug that never shows up in a score: the agent still answers correctly — it just silently started calling search twice and your bill doubled. The second one is worse: output unchanged, refund amount 10x. Score-level diffing misses both.

Why

  • Scores hide behavior. Pass/fail diffs and LLM-judge verdicts can't tell you "step 4's tool args drifted between commits."
  • Agents are stochastic. A single run is a sample, not a measurement. tracediff runs repeats and reports variance, so you can tell drift from noise.
  • Cost is a first-class metric. Research (Kapoor et al., NeurIPS 2024) showed accuracy-only evals reward agents that cost 50x more for the same results.
  • Benchmarks leak. Most agent benchmarks have no holdout discipline. tracediff suites have a built-in dev/holdout split with a reveal budget — evaluating the holdout more than N times per suite version requires an explicit, recorded override.
  • BYOK by construction. tracediff never calls a model provider. Your agent runs with your keys; tracediff scores the traces.

Install

pip install tracediff

Quickstart (60 seconds, no API keys)

The repo ships a deterministic demo agent. Run the baseline, "change the code" (set an env var), re-run, and diff:

git clone https://github.com/Abhishekpundir23/tracediff && cd tracediff/examples

tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out baseline.json

# simulate a code change that subtly breaks the agent
TRACEDIFF_DEMO_VARIANT=b tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out current.json

tracediff diff baseline.json current.json --md report.md

Wiring up your agent

Expose one function. It gets the task input and returns a trace in any of three shapes:

# my_agent.py
def run(task_input):
    messages, usage = my_agent_loop(task_input)   # your existing code, your keys
    return {"messages": messages, "usage": usage}  # OpenAI-style messages work as-is

Accepted return shapes:

  1. OpenAI-style message list (assistant tool_calls + tool-role results) — works directly with most frameworks' message history.
  2. tracediff.Trace — build it natively for full control (per-step tokens/cost).
  3. Serialized dict{"steps": [...], "final_output": "..."}.

Writing a suite

suite: my-agent-suite
seed: 7
holdout_fraction: 0.25     # deterministic split by task-id hash
max_holdout_reveals: 5     # holdout governance: budgeted, recorded reveals

tasks:
  - id: refund-order
    input: { topic: refund, order_id: A-100 }
    expect:
      tools: [lookup_order, issue_refund]   # expected tool trajectory
      mode: strict                          # strict | unordered | subset
      args:
        issue_refund: { order_id: A-100, amount: 49.99 }
      max_tool_calls: 4                     # budgets are first-class
      max_cost_usd: 0.01
    checks:
      - type: output_contains               # output_contains | output_not_contains
        value: refund                       # | output_equals | output_regex

The suite version is a content hash — edit any task and you get a new version. Diffs warn when results from different suite versions are compared, and each new version gets a fresh holdout budget.

tracediff suite suite.yaml      # inspect version hash + dev/holdout split
tracediff run --suite suite.yaml --agent my_agent:run --split holdout   # budgeted

CI: structural diffs on every PR

# .github/workflows/tracediff.yml
name: tracediff
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }

      # restore the baseline produced on main (artifact, cache, or committed file)
      - name: Restore baseline
        run: cp .tracediff/baseline.json baseline.json

      - name: Run + diff
        run: |
          pip install tracediff
          tracediff run --suite evals/suite.yaml --agent my_agent:run --repeats 3 --out current.json
          tracediff diff baseline.json current.json --md report.md   # exits 1 on regressions

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: fs.readFileSync('report.md', 'utf8'),
            });

A composite action wrapping these steps lives in action/.

What gets detected

Category Example finding
regression pass rate 100% → 33% on summarize-meeting
behavior change issue_refund args drifted: amount 49.99 → 499.99 (output unchanged!)
cost regression now calls search twice; mean cost 2.4x baseline
improvement pass rate 50% → 100%

Plus: tools added/removed/replaced/reordered with positions, step-count drift, trajectory variance across repeats (flakiness), tasks added/removed, suite-version mismatch warnings.

Roadmap

  • v0.1 (this): trace ingestion, structural scoring + budgets, repeat variance, structural diff, CLI, CI action, holdout governance
  • v0.2: adapters for LangGraph / OpenAI Agents SDK / Claude Agent SDK trace exports, OpenTelemetry GenAI spans
  • v0.3: automated benchmark construction — generate decontaminated, holdout-split task suites from your domain

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tracediff-0.1.0.tar.gz (25.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tracediff-0.1.0-py3-none-any.whl (22.8 kB view details)

Uploaded Python 3

File details

Details for the file tracediff-0.1.0.tar.gz.

File metadata

  • Download URL: tracediff-0.1.0.tar.gz
  • Upload date:
  • Size: 25.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracediff-0.1.0.tar.gz
Algorithm Hash digest
SHA256 bc6d300c3d6624ccddbd91a78f633f9a590ff247b2ca4922fc8a1b5d6a92f311
MD5 a2c9b50791621f219fe81f0c9582f465
BLAKE2b-256 77425afe05f7d8585b06ae6340d3e8f8b67c6891a0b7679fa8b88c94c4be921e

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracediff-0.1.0.tar.gz:

Publisher: publish.yml on Abhishekpundir23/tracediff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tracediff-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tracediff-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 22.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracediff-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2b2063c89a129ef73dd4d21850452e300f1abcb1d86844133895fd33b409761c
MD5 f5a9bb8a008559ac8ff2a78e16b6e084
BLAKE2b-256 a91813916c69a7db33a4a3e9f3fc41a930b251320184139ebf2b5daebafa6a06

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracediff-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Abhishekpundir23/tracediff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page