Skip to main content

Structural trajectory regression testing for AI agents - diff what your agent did, not just its score.

Project description

tracediff

tests PyPI Python License

Structural trajectory regression testing for AI agents. Diff what your agent did, not just its score.

Every eval tool can tell you "accuracy dropped 3%." tracediff tells you why:

2 regression(s), 1 cost regression(s), 1 behavior change(s) across 4 task(s)

[REGRESSION] summarize-meeting
    - pass rate 100% -> 0%
    - calls read_file instead of read_file at position 0
    - read_file args drifted: path
[BEHAVIOR CHANGE] refund-order
    - issue_refund args drifted: amount   ({"amount": 49.99} -> {"amount": 499.99})
[COST REGRESSION] capital-question
    - now calls search at position 1
    - mean cost $0.0012 -> $0.0029 (2.42x)

That third one is the kind of bug that never shows up in a score: the agent still answers correctly — it just silently started calling search twice and your bill doubled. The second one is worse: output unchanged, refund amount 10x. Score-level diffing misses both.

Why

  • Scores hide behavior. Pass/fail diffs and LLM-judge verdicts can't tell you "step 4's tool args drifted between commits."
  • Agents are stochastic. A single run is a sample, not a measurement. tracediff runs repeats and reports variance, so you can tell drift from noise.
  • Cost is a first-class metric. Research (Kapoor et al., NeurIPS 2024) showed accuracy-only evals reward agents that cost 50x more for the same results.
  • Benchmarks leak. Most agent benchmarks have no holdout discipline. tracediff suites have a built-in dev/holdout split with a reveal budget — evaluating the holdout more than N times per suite version requires an explicit, recorded override.
  • BYOK by construction. tracediff never calls a model provider. Your agent runs with your keys; tracediff scores the traces.

Install

pip install tracediff

Quickstart (60 seconds, no API keys)

The repo ships a deterministic demo agent. Run the baseline, "change the code" (set an env var), re-run, and diff:

git clone https://github.com/Abhishekpundir23/tracediff && cd tracediff/examples

tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out baseline.json

# simulate a code change that subtly breaks the agent
TRACEDIFF_DEMO_VARIANT=b tracediff run --suite suite.yaml --agent demo_agent:run --repeats 3 --out current.json

tracediff diff baseline.json current.json --md report.md

Wiring up your agent

Expose one function. It gets the task input and returns a trace in any of three shapes:

# my_agent.py
def run(task_input):
    messages, usage = my_agent_loop(task_input)   # your existing code, your keys
    return {"messages": messages, "usage": usage}  # OpenAI-style messages work as-is

Accepted return shapes:

  1. OpenAI-style message list (assistant tool_calls + tool-role results) — works directly with most frameworks' message history.
  2. tracediff.Trace — build it natively for full control (per-step tokens/cost).
  3. Serialized dict{"steps": [...], "final_output": "..."}.

Framework adapters

One-line conversion for the major agent frameworks — duck-typed, so tracediff has no dependency on any of them, and dict-serialized traces work too:

from tracediff import from_langgraph, from_openai_agents, from_claude_agent_sdk

# LangGraph / LangChain: pass the final state (or its messages list)
def run(task_input):
    state = graph.invoke({"messages": [("user", task_input["question"])]})
    return from_langgraph(state, pricing=(3.0, 15.0))   # $/M input, $/M output tokens

# OpenAI Agents SDK: pass the RunResult
def run(task_input):
    result = Runner.run_sync(agent, task_input["question"])
    return from_openai_agents(result, pricing=(2.5, 10.0))

# Claude Agent SDK: pass the collected message list
async def run(task_input):
    messages = [m async for m in query(prompt=task_input["question"])]
    return from_claude_agent_sdk(messages)   # uses the SDK's own total_cost_usd

The optional pricing=(input_usd_per_mtok, output_usd_per_mtok) turns the framework's token counts into the cost metric used by max_cost_usd budgets and cost-regression detection. The Claude Agent SDK reports cost directly, so no pricing is needed there.

Writing a suite

suite: my-agent-suite
seed: 7
holdout_fraction: 0.25     # deterministic split by task-id hash
max_holdout_reveals: 5     # holdout governance: budgeted, recorded reveals

tasks:
  - id: refund-order
    input: { topic: refund, order_id: A-100 }
    expect:
      tools: [lookup_order, issue_refund]   # expected tool trajectory
      mode: strict                          # strict | unordered | subset
      args:
        issue_refund: { order_id: A-100, amount: 49.99 }
      max_tool_calls: 4                     # budgets are first-class
      max_cost_usd: 0.01
    checks:
      - type: output_contains               # output_contains | output_not_contains
        value: refund                       # | output_equals | output_regex

The suite version is a content hash — edit any task and you get a new version. Diffs warn when results from different suite versions are compared, and each new version gets a fresh holdout budget.

tracediff suite suite.yaml      # inspect version hash + dev/holdout split
tracediff run --suite suite.yaml --agent my_agent:run --split holdout   # budgeted

CI: structural diffs on every PR

# .github/workflows/tracediff.yml
name: tracediff
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }

      # restore the baseline produced on main (artifact, cache, or committed file)
      - name: Restore baseline
        run: cp .tracediff/baseline.json baseline.json

      - name: Run + diff
        run: |
          pip install tracediff
          tracediff run --suite evals/suite.yaml --agent my_agent:run --repeats 3 --out current.json
          tracediff diff baseline.json current.json --md report.md   # exits 1 on regressions

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: fs.readFileSync('report.md', 'utf8'),
            });

A composite action wrapping these steps lives in action/.

What gets detected

Category Example finding
regression pass rate 100% → 33% on summarize-meeting
behavior change issue_refund args drifted: amount 49.99 → 499.99 (output unchanged!)
cost regression now calls search twice; mean cost 2.4x baseline
improvement pass rate 50% → 100%

Plus: tools added/removed/replaced/reordered with positions, step-count drift, trajectory variance across repeats (flakiness), tasks added/removed, suite-version mismatch warnings.

Roadmap

  • v0.1: trace ingestion, structural scoring + budgets, repeat variance, structural diff, CLI, CI action, holdout governance
  • v0.2 (this): adapters for LangGraph / OpenAI Agents SDK / Claude Agent SDK
  • v0.3: OpenTelemetry GenAI span ingestion; richer flakiness analysis
  • v0.4: automated benchmark construction — generate decontaminated, holdout-split task suites from your domain

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tracediff-0.2.0.tar.gz (29.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tracediff-0.2.0-py3-none-any.whl (25.9 kB view details)

Uploaded Python 3

File details

Details for the file tracediff-0.2.0.tar.gz.

File metadata

  • Download URL: tracediff-0.2.0.tar.gz
  • Upload date:
  • Size: 29.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracediff-0.2.0.tar.gz
Algorithm Hash digest
SHA256 366ba68144f5df23e12cba2f1f223985357b31b7271827ea11adc1afc9270f8b
MD5 499a4039e0ad264004ce3495c0c78cb6
BLAKE2b-256 f5b0a149dca0539048f40dfa7205fe106d7bf804b6ec1a75f75ccdaa936420ee

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracediff-0.2.0.tar.gz:

Publisher: publish.yml on Abhishekpundir23/tracediff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tracediff-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: tracediff-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 25.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracediff-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d692aa30b060ce3dfe3921b057e6e149d0f56d4bb4db11f94b0114dafb492757
MD5 995cf75f5ef94201a9cb3eb8fc0d0192
BLAKE2b-256 c7473ceae1a5e266d5baa0550e3e2f0710f78bb6d660d88b8d2b646eab771e21

See more details on using hashes here.

Provenance

The following attestation bundles were made for tracediff-0.2.0-py3-none-any.whl:

Publisher: publish.yml on Abhishekpundir23/tracediff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page