
agent-run-diff


Compare baseline vs current agent runs and surface regressions as structured reasons, not vibes. It answers "why does the agent feel worse?" with seven named signals:

  1. Success loss — cases that passed baseline, fail current
  2. New error signatures — error kinds present in current but absent in baseline
  3. Tool failure rises — tools that failed more often in current than baseline
  4. Output drift — for cases that passed in both runs, how much the final text changed (token-F1)
  5. Step bloat — step count ratio above threshold
  6. Latency bloat — wall-clock ratio above threshold
  7. Cost bloat — USD ratio above threshold
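Signal 4's token-F1 can be sketched as a bag-of-tokens overlap score. This is an illustrative re-implementation of the idea, not the library's actual code:

```python
from collections import Counter

def token_f1(baseline: str, current: str) -> float:
    """Bag-of-tokens F1 between two output strings (whitespace-tokenized)."""
    b, c = Counter(baseline.split()), Counter(current.split())
    overlap = sum((b & c).values())  # multiset intersection: shared token count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)
```

A score below the drift threshold (default 0.7) would flag the case; identical outputs score 1.0, fully disjoint outputs score 0.0.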

Zero runtime dependencies. Works on any agent framework — you hand it JSONL traces, it tells you what changed.

Install

pip install agent-run-diff

Quick start

agent-run-diff baseline.jsonl current.jsonl

Exits 0 on clean, 1 on any regression, 2 on bad input. The markdown report prints to stdout:

# Agent Regression Report

Pass rate: baseline **92%** → current **78%** (-14pp)
Total latency: 48,200 ms → 81,400 ms
Total cost:    $1.2400 → $2.1800

## Summary
- Success losses:    **3**
- New error kinds:   **1**
- Tool failure rises: **2**
- Step bloat cases:  **4**
- Latency bloat:     **6**
- Cost bloat:        **5**

## Success losses (was passing, now not)
- `login-flow-happy-path` — status: success → failed
...

## Tool failures rising
| Tool | Baseline | Current | Sample errors |
| --- | ---: | ---: | --- |
| browser.click | 0 | 4 | `Element not found: #submit` |
...

Trace format

A run is one JSON object. The parser is lenient — it accepts snake_case or camelCase keys, and several common aliases:

{
  "run_id": "run-2026-04-24-abc",
  "case_id": "login-flow-happy-path",
  "status": "success",
  "final_output": "Logged in as alice.",
  "steps": [
    {
      "type": "tool_call",
      "tool_name": "browser.click",
      "tool_args": {"selector": "#submit"},
      "error": null,
      "latency_ms": 420,
      "cost_usd": 0.001
    }
  ],
  "total_latency_ms": 2300,
  "total_cost_usd": 0.08
}
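Any framework can emit this shape with the standard json module — a minimal sketch, one run object per line:

```python
import json

# Hypothetical emitter: append one run object per line (JSONL).
run = {
    "run_id": "run-2026-04-24-abc",
    "case_id": "login-flow-happy-path",
    "status": "success",
    "final_output": "Logged in as alice.",
    "steps": [
        {
            "type": "tool_call",
            "tool_name": "browser.click",
            "tool_args": {"selector": "#submit"},
            "error": None,
            "latency_ms": 420,
            "cost_usd": 0.001,
        }
    ],
}

with open("current.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(run) + "\n")
```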

Aliases recognized:

  • case_id | caseId | test_id | testId (falls back to run_id if none present)
  • status | outcome | result.status — values like pass/passed/ok/success all normalize to "success"
  • final_output | finalOutput | output
  • steps | trace | events
  • per-step: tool_name/toolName/name, error/error.message, latency_ms/latencyMs/duration_ms
  • totals: total_cost_usd/totalCostUsd, total_latency_ms/totalLatencyMs (auto-summed from steps if absent)

Runs are matched between baseline and current by case_id.
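A normalizer in the spirit of the status aliasing might look like this — a sketch: the exact value set and the fallback to "failed" are assumptions, not the library's code:

```python
def normalize_status(raw) -> "str | None":
    """Map common status spellings (pass/passed/ok/success/...) to 'success'."""
    if isinstance(raw, dict):  # handle nested result.status
        raw = raw.get("status")
    if raw is None:
        return None
    value = str(raw).strip().lower()
    if value in {"pass", "passed", "ok", "success", "succeeded"}:
        return "success"
    return "failed"  # assumed: anything else counts as a failure
```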

Configuring thresholds

Defaults are transparent (1.3× for the bloat ratios, 0.7 for the output-drift F1) and overridable:

agent-run-diff base.jsonl curr.jsonl \
  --step-ratio 1.5 \
  --latency-ratio 2.0 \
  --cost-ratio 1.2 \
  --output-drift-f1 0.6 \
  --min-latency-ms 200 \
  --min-cost-usd 0.01

The --min-latency-ms and --min-cost-usd floors prevent noise on trivially-short or trivially-cheap runs.
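The floor logic can be sketched as below. The function and its edge-case handling are illustrative assumptions; only the 1.3 default ratio and 200 ms floor come from the documentation above:

```python
def latency_regressed(base_ms: float, curr_ms: float,
                      ratio: float = 1.3, floor_ms: float = 200.0) -> bool:
    """Flag latency bloat only above the noise floor (sketch of the floor idea)."""
    if base_ms < floor_ms and curr_ms < floor_ms:
        return False  # both runs trivially short: skip to avoid noise
    if base_ms <= 0:
        return curr_ms >= floor_ms  # zero/missing baseline: flag if current is notable
    return curr_ms / base_ms > ratio
```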

CI usage

- name: Compare against baseline
  run: |
    agent-run-diff \
      baselines/2026-04-20.jsonl \
      runs/$GITHUB_SHA.jsonl \
      --format json > regression.json

The workflow step fails when any regression is detected (exit 1). Attach regression.json as an artifact to make the seven signals browsable in the PR review.

Honest scope

  • Not semantic similarity. Output drift is token-F1, not embeddings. A flag means "the text changed a lot," not "the meaning changed," and it is labeled as such in every code path.
  • Not a framework. The tool never touches your agent, never calls an LLM, never runs your tests. It only reads traces you already have.
  • No pricing built-in. If your traces include cost_usd we use it; otherwise cost bloat is skipped. Pricing is your framework's job, not this tool's.

Library API

from agent_run_diff import parse_runs_file, analyze, render_markdown, Thresholds

baseline = parse_runs_file("baseline.jsonl")
current = parse_runs_file("current.jsonl")
report = analyze(baseline, current, thresholds=Thresholds(latency_ratio=1.5))

print(render_markdown(report))
if report.has_regressions:
    raise SystemExit(1)

License

MIT.


Download files


Source Distribution

agent_run_diff-0.1.0.tar.gz (14.7 kB)


Built Distribution


agent_run_diff-0.1.0-py3-none-any.whl (15.0 kB)


File details

Details for the file agent_run_diff-0.1.0.tar.gz.

File metadata

  • Download URL: agent_run_diff-0.1.0.tar.gz
  • Size: 14.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_run_diff-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `a1d3bcbcc1bdb8fafc3196c190c2e9b2e89b61e01abd54558adac4efa02388ff` |
| MD5 | `bfcb602b4e1cd7e9e87ac5b89822ab7a` |
| BLAKE2b-256 | `5a8d9e3d84f2059b17f5f7a4a8b63bec250b37a28b910eee52ddf9187a1f9ba8` |


Provenance

The following attestation bundles were made for agent_run_diff-0.1.0.tar.gz:

Publisher: publish.yml on MukundaKatta/agent-run-diff


File details

Details for the file agent_run_diff-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: agent_run_diff-0.1.0-py3-none-any.whl
  • Size: 15.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_run_diff-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `dcbf66f0f0c497b1a5f83d2584d34e18156f3dc714e698870f97f70289584a9c` |
| MD5 | `97441f6cb33bd9a16a58f49b9a192875` |
| BLAKE2b-256 | `977e5752ee3c2af9e5a36a948471a322d73633b4caddb88b9dec08239f42dc94` |


Provenance

The following attestation bundles were made for agent_run_diff-0.1.0-py3-none-any.whl:

Publisher: publish.yml on MukundaKatta/agent-run-diff

