agent-run-diff
Compare baseline vs current agent runs and surface regressions as structured reasons, not vibes. Answers "why does the agent feel worse?" with seven named signals:
- Success loss — cases that passed baseline, fail current
- New error signatures — error kinds present in current but absent in baseline
- Tool failure rises — tools that failed more often in current than baseline
- Output drift — for cases both passed, how much the final text changed (token-F1)
- Step bloat — step count ratio above threshold
- Latency bloat — wall-clock ratio above threshold
- Cost bloat — USD ratio above threshold
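The first three signals reduce to set and counter comparisons. The sketch below is not the package's implementation. It assumes runs are already normalized dicts with `case_id`, `status`, and `steps`, and it treats the raw error string as the "error kind" (the real tool may group errors differently). All function names are hypothetical.

```python
from collections import Counter

def success_losses(baseline, current):
    """Case ids that succeeded in baseline but not in current."""
    base_ok = {r["case_id"] for r in baseline if r["status"] == "success"}
    return sorted(
        r["case_id"] for r in current
        if r["case_id"] in base_ok and r["status"] != "success"
    )

def error_kinds(runs):
    """Distinct error strings seen in any step (assumption: raw string = kind)."""
    return {s["error"] for r in runs for s in r.get("steps", []) if s.get("error")}

def new_error_kinds(baseline, current):
    """Errors present in current but absent in baseline."""
    return sorted(error_kinds(current) - error_kinds(baseline))

def tool_failures(runs):
    """Count failed steps per tool name."""
    return Counter(
        s["tool_name"] for r in runs for s in r.get("steps", []) if s.get("error")
    )

def tool_failure_rises(baseline, current):
    """Map tool -> (baseline failures, current failures) where the count rose."""
    base, curr = tool_failures(baseline), tool_failures(current)
    return {tool: (base[tool], n) for tool, n in curr.items() if n > base[tool]}

baseline = [{"case_id": "login", "status": "success",
             "steps": [{"tool_name": "browser.click", "error": None}]}]
current = [{"case_id": "login", "status": "failed",
            "steps": [{"tool_name": "browser.click",
                       "error": "Element not found: #submit"}]}]

print(success_losses(baseline, current))
print(new_error_kinds(baseline, current))
print(tool_failure_rises(baseline, current))
```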
Zero runtime dependencies. Works on any agent framework — you hand it JSONL traces, it tells you what changed.
Install
pip install agent-run-diff
Quick start
agent-run-diff baseline.jsonl current.jsonl
Exits 0 on clean, 1 on any regression, 2 on bad input. The markdown report prints to stdout:
# Agent Regression Report
Pass rate: baseline **92%** → current **78%** (-14pp)
Total latency: 48,200 ms → 81,400 ms
Total cost: $1.2400 → $2.1800
## Summary
- Success losses: **3**
- New error kinds: **1**
- Tool failure rises: **2**
- Step bloat cases: **4**
- Latency bloat: **6**
- Cost bloat: **5**
## Success losses (was passing, now not)
- `login-flow-happy-path` — status: success → failed
...
## Tool failures rising
| Tool | Baseline | Current | Sample errors |
| --- | ---: | ---: | --- |
| browser.click | 0 | 4 | `Element not found: #submit` |
...
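If you drive the CLI from a script rather than a shell, the three exit codes described above (0 clean, 1 regression, 2 bad input) can be mapped to labels. This is a minimal sketch: `describe_exit` and `run_diff` are hypothetical helper names, and `run_diff` assumes `agent-run-diff` is installed and on your PATH.

```python
import subprocess

# Exit codes as documented in the quick start.
EXIT_MEANINGS = {0: "clean", 1: "regression", 2: "bad input"}

def describe_exit(code):
    """Translate a CLI exit code into a short label."""
    return EXIT_MEANINGS.get(code, "unknown")

def run_diff(baseline_path, current_path):
    """Run the CLI and return (label, markdown report from stdout)."""
    proc = subprocess.run(
        ["agent-run-diff", baseline_path, current_path],
        capture_output=True,
        text=True,
    )
    return describe_exit(proc.returncode), proc.stdout

# Usage (requires the CLI to be installed):
#   label, report = run_diff("baseline.jsonl", "current.jsonl")
```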
Trace format
A run is one JSON object. The shape is lenient — accepts snake_case or camelCase, and several common aliases:
{
  "run_id": "run-2026-04-24-abc",
  "case_id": "login-flow-happy-path",
  "status": "success",
  "final_output": "Logged in as alice.",
  "steps": [
    {
      "type": "tool_call",
      "tool_name": "browser.click",
      "tool_args": {"selector": "#submit"},
      "error": null,
      "latency_ms": 420,
      "cost_usd": 0.001
    }
  ],
  "total_latency_ms": 2300,
  "total_cost_usd": 0.08
}
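To produce a trace file from your own harness, you can serialize each run as one JSON object per line. The sketch below assumes the usual JSONL convention (one object per line); `to_jsonl` and `write_runs` are hypothetical helpers, not part of this package.

```python
import json

def to_jsonl(runs):
    """Serialize run dicts as JSONL text: one compact JSON object per line."""
    return "".join(json.dumps(run) + "\n" for run in runs)

def write_runs(path, runs):
    """Write runs to a .jsonl file that can be passed to the CLI."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(to_jsonl(runs))

runs = [
    {"case_id": "login-flow-happy-path", "status": "success",
     "final_output": "Logged in as alice.", "steps": []},
    {"case_id": "checkout", "status": "failed",
     "final_output": "", "steps": []},
]

text = to_jsonl(runs)
print(text, end="")
# write_runs("current.jsonl", runs)  # uncomment to write the file
```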
Aliases recognized:
- case_id | caseId | test_id | testId (falls back to run_id if none present)
- status | outcome | result.status (values like pass/passed/ok/success all normalize to "success")
- final_output | finalOutput | output
- steps | trace | events
- per-step: tool_name/toolName/name, error/error.message, latency_ms/latencyMs/duration_ms
- totals: total_cost_usd/totalCostUsd, total_latency_ms/totalLatencyMs (auto-summed from steps if absent)
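The alias handling above can be pictured as a "first key present wins" lookup. This is an illustrative sketch, not the package's parser: `first_present` and `normalize_run` are hypothetical names, `result.status` is read as a nested `result` object (an assumption), and non-success statuses are simply lowercased and passed through.

```python
SUCCESS_VALUES = {"pass", "passed", "ok", "success"}

def first_present(obj, *keys):
    """Return the value of the first key that exists and is not None."""
    for key in keys:
        if key in obj and obj[key] is not None:
            return obj[key]
    return None

def normalize_run(raw):
    """Map aliased keys onto one canonical shape."""
    status = first_present(raw, "status", "outcome")
    if status is None and isinstance(raw.get("result"), dict):
        # Assumption: "result.status" means a nested {"result": {"status": ...}}.
        status = raw["result"].get("status")
    if status is not None:
        status = str(status).lower()
    return {
        # run_id is the documented fallback when no case id alias is present.
        "case_id": first_present(raw, "case_id", "caseId", "test_id", "testId", "run_id"),
        "status": "success" if status in SUCCESS_VALUES else status,
        "final_output": first_present(raw, "final_output", "finalOutput", "output"),
        "steps": first_present(raw, "steps", "trace", "events") or [],
    }

print(normalize_run({"testId": "t1", "outcome": "Passed", "output": "hi", "events": []}))
```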
Runs are matched between baseline and current by case_id.
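Matching by case_id amounts to a keyed join. The sketch below shows one way to pair runs. How the tool itself treats cases that appear on only one side, or duplicate case ids, is not documented here, so the handling below (report them separately, last duplicate wins) is an assumption, and `match_runs` is a hypothetical name.

```python
def match_runs(baseline, current):
    """Pair runs by case_id; also report ids found on only one side."""
    base_by_id = {r["case_id"]: r for r in baseline}  # last duplicate wins
    curr_by_id = {r["case_id"]: r for r in current}
    shared = sorted(base_by_id.keys() & curr_by_id.keys())
    pairs = [(base_by_id[c], curr_by_id[c]) for c in shared]
    only_baseline = sorted(base_by_id.keys() - curr_by_id.keys())
    only_current = sorted(curr_by_id.keys() - base_by_id.keys())
    return pairs, only_baseline, only_current

baseline = [{"case_id": "a", "status": "success"}, {"case_id": "b", "status": "success"}]
current = [{"case_id": "b", "status": "failed"}, {"case_id": "c", "status": "success"}]

pairs, only_baseline, only_current = match_runs(baseline, current)
print([(b["case_id"], b["status"], c["status"]) for b, c in pairs])
print(only_baseline, only_current)
```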
Configuring thresholds
Defaults are transparent (1.3x bloat, 0.7 output-drift F1) and overridable:
agent-run-diff base.jsonl curr.jsonl \
  --step-ratio 1.5 \
  --latency-ratio 2.0 \
  --cost-ratio 1.2 \
  --output-drift-f1 0.6 \
  --min-latency-ms 200 \
  --min-cost-usd 0.01
The --min-latency-ms and --min-cost-usd floors prevent noise on trivially-short or trivially-cheap runs.
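A ratio check with a floor might look like the following. This is a guess at the logic, not the tool's code: it assumes the floor is compared against the current value and that "above threshold" means strictly greater than the ratio. `is_bloat` is a hypothetical name.

```python
def is_bloat(baseline_value, current_value, ratio=1.3, floor=0.0):
    """Flag when current exceeds baseline by more than `ratio`.

    Assumption: values under `floor` are ignored so that tiny runs
    (a few ms, a fraction of a cent) do not trigger on noise.
    """
    if baseline_value <= 0 or current_value < floor:
        return False
    return current_value / baseline_value > ratio

print(is_bloat(100, 150, ratio=1.3, floor=200))    # under the floor, ignored
print(is_bloat(1000, 1500, ratio=1.3, floor=200))  # 1.5x, flagged
print(is_bloat(1000, 1200, ratio=1.3, floor=200))  # 1.2x, within threshold
```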
CI usage
- name: Compare against baseline
  run: |
    agent-run-diff \
      baselines/2026-04-20.jsonl \
      runs/$GITHUB_SHA.jsonl \
      --format json > regression.json
The workflow step fails when any regression is detected (exit 1). Attach regression.json as an artifact to make the seven signals browsable in the PR review.
Honest scope
- Not semantic similarity. Output drift is token-F1, not embeddings. Flagging is "text changed a lot," not "meaning changed." Labeled as such in every code path.
- Not a framework. The tool never touches your agent, never calls an LLM, never runs your tests. It only reads traces you already have.
- No pricing built-in. If your traces include cost_usd we use it; otherwise cost bloat is skipped. Pricing is your framework's job, not this tool's.
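To make the "token-F1, not embeddings" point concrete, here is a minimal sketch of a token-level F1 between two outputs. It assumes lowercased whitespace tokenization, which may differ from what the tool does, and `token_f1` is a hypothetical name.

```python
from collections import Counter

def token_f1(reference, candidate):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens and not cand_tokens:
        return 1.0  # two empty outputs are treated as identical
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

same = token_f1("Logged in as alice.", "Logged in as alice.")
near = token_f1("logged in as alice", "logged in as bob")
far = token_f1("logged in", "request timed out")
print(same, near, far)
```

Note what this measures: "logged in as alice" and "logged in as bob" share three of four tokens and score 0.75, even though the meaning differs in the part that matters. That is the sense in which the signal reports that the text changed, not that the meaning changed.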
Library API
from pathlib import Path
from agent_run_diff import parse_runs_file, analyze, render_markdown, Thresholds
baseline = parse_runs_file("baseline.jsonl")
current = parse_runs_file("current.jsonl")
report = analyze(baseline, current, thresholds=Thresholds(latency_ratio=1.5))
print(render_markdown(report))
if report.has_regressions:
    raise SystemExit(1)
License
MIT.