
agent-run-diff


Compare baseline vs current agent runs and surface regressions as structured reasons, not vibes. It answers "why does the agent feel worse?" with seven named signals:

  1. Success loss — cases that passed baseline, fail current
  2. New error signatures — error kinds present in current but absent in baseline
  3. Tool failure rises — tools that failed more often in current than baseline
  4. Output drift — for cases that passed in both runs, how much the final text changed (token-F1)
  5. Step bloat — step count ratio above threshold
  6. Latency bloat — wall-clock ratio above threshold
  7. Cost bloat — USD ratio above threshold
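Signal 4's token-F1 can be sketched as a bag-of-tokens overlap score. This is an illustrative re-implementation of the idea, not the library's actual code:

```python
from collections import Counter

def token_f1(baseline: str, current: str) -> float:
    """Bag-of-tokens F1 between two output strings (whitespace-tokenized)."""
    b, c = Counter(baseline.split()), Counter(current.split())
    overlap = sum((b & c).values())  # multiset intersection: shared token count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)
```

A score below the drift threshold (default 0.7) would flag the case; identical outputs score 1.0, fully disjoint outputs score 0.0.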

Zero runtime dependencies. Works on any agent framework — you hand it JSONL traces, it tells you what changed.

Install

pip install agent-run-diff

Quick start

agent-run-diff baseline.jsonl current.jsonl

Exits 0 on clean, 1 on any regression, 2 on bad input. The markdown report prints to stdout:

# Agent Regression Report

Pass rate: baseline **92%** → current **78%** (-14pp)
Total latency: 48,200 ms → 81,400 ms
Total cost:    $1.2400 → $2.1800

## Summary
- Success losses:    **3**
- New error kinds:   **1**
- Tool failure rises: **2**
- Step bloat cases:  **4**
- Latency bloat:     **6**
- Cost bloat:        **5**

## Success losses (was passing, now not)
- `login-flow-happy-path` — status: success → failed
...

## Tool failures rising
| Tool | Baseline | Current | Sample errors |
| --- | ---: | ---: | --- |
| browser.click | 0 | 4 | `Element not found: #submit` |
...

Trace format

A run is one JSON object. The parser is lenient — it accepts snake_case or camelCase keys, and several common aliases:

{
  "run_id": "run-2026-04-24-abc",
  "case_id": "login-flow-happy-path",
  "status": "success",
  "final_output": "Logged in as alice.",
  "steps": [
    {
      "type": "tool_call",
      "tool_name": "browser.click",
      "tool_args": {"selector": "#submit"},
      "error": null,
      "latency_ms": 420,
      "cost_usd": 0.001
    }
  ],
  "total_latency_ms": 2300,
  "total_cost_usd": 0.08
}
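Any framework can emit this shape with the standard json module — a minimal sketch, one run object per line:

```python
import json

# Hypothetical emitter: append one run object per line (JSONL).
run = {
    "run_id": "run-2026-04-24-abc",
    "case_id": "login-flow-happy-path",
    "status": "success",
    "final_output": "Logged in as alice.",
    "steps": [
        {
            "type": "tool_call",
            "tool_name": "browser.click",
            "tool_args": {"selector": "#submit"},
            "error": None,
            "latency_ms": 420,
            "cost_usd": 0.001,
        }
    ],
}

with open("current.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(run) + "\n")
```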

Aliases recognized:

  • case_id | caseId | test_id | testId (falls back to run_id if none present)
  • status | outcome | result.status — values like pass/passed/ok/success all normalize to "success"
  • final_output | finalOutput | output
  • steps | trace | events
  • per-step: tool_name/toolName/name, error/error.message, latency_ms/latencyMs/duration_ms
  • totals: total_cost_usd/totalCostUsd, total_latency_ms/totalLatencyMs (auto-summed from steps if absent)

Runs are matched between baseline and current by case_id.
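A normalizer in the spirit of the status aliasing might look like this — a sketch: the exact value set and the fallback to "failed" are assumptions, not the library's code:

```python
def normalize_status(raw) -> "str | None":
    """Map common status spellings (pass/passed/ok/success/...) to 'success'."""
    if isinstance(raw, dict):  # handle nested result.status
        raw = raw.get("status")
    if raw is None:
        return None
    value = str(raw).strip().lower()
    if value in {"pass", "passed", "ok", "success", "succeeded"}:
        return "success"
    return "failed"  # assumed: anything else counts as a failure
```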

Configuring thresholds

Defaults are transparent (1.3× for the bloat ratios, 0.7 for the output-drift F1) and overridable:

agent-run-diff base.jsonl curr.jsonl \
  --step-ratio 1.5 \
  --latency-ratio 2.0 \
  --cost-ratio 1.2 \
  --output-drift-f1 0.6 \
  --min-latency-ms 200 \
  --min-cost-usd 0.01

The --min-latency-ms and --min-cost-usd floors prevent noise on trivially-short or trivially-cheap runs.
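The floor logic can be sketched as below. The function and its edge-case handling are illustrative assumptions; only the 1.3 default ratio and 200 ms floor come from the documentation above:

```python
def latency_regressed(base_ms: float, curr_ms: float,
                      ratio: float = 1.3, floor_ms: float = 200.0) -> bool:
    """Flag latency bloat only above the noise floor (sketch of the floor idea)."""
    if base_ms < floor_ms and curr_ms < floor_ms:
        return False  # both runs trivially short: skip to avoid noise
    if base_ms <= 0:
        return curr_ms >= floor_ms  # zero/missing baseline: flag if current is notable
    return curr_ms / base_ms > ratio
```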

CI usage

- name: Compare against baseline
  run: |
    agent-run-diff \
      baselines/2026-04-20.jsonl \
      runs/$GITHUB_SHA.jsonl \
      --format json > regression.json

The workflow step fails when any regression is detected (exit 1). Attach regression.json as an artifact to make the seven signals browsable in the PR review.

Honest scope

  • Not semantic similarity. Output drift is token-F1, not embeddings. A flag means "the text changed a lot," not "the meaning changed," and it is labeled as such in every code path.
  • Not a framework. The tool never touches your agent, never calls an LLM, never runs your tests. It only reads traces you already have.
  • No pricing built-in. If your traces include cost_usd we use it; otherwise cost bloat is skipped. Pricing is your framework's job, not this tool's.

Library API

from agent_run_diff import parse_runs_file, analyze, render_markdown, Thresholds

baseline = parse_runs_file("baseline.jsonl")
current = parse_runs_file("current.jsonl")
report = analyze(baseline, current, thresholds=Thresholds(latency_ratio=1.5))

print(render_markdown(report))
if report.has_regressions:
    raise SystemExit(1)

License

MIT.


Download files


Source Distribution

agent_run_diff-0.1.0.tar.gz (14.7 kB)


Built Distribution


agent_run_diff-0.1.0-py3-none-any.whl (15.0 kB)


File details

Details for the file agent_run_diff-0.1.0.tar.gz.

File metadata

  • Download URL: agent_run_diff-0.1.0.tar.gz
  • Size: 14.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_run_diff-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `a1d3bcbcc1bdb8fafc3196c190c2e9b2e89b61e01abd54558adac4efa02388ff` |
| MD5 | `bfcb602b4e1cd7e9e87ac5b89822ab7a` |
| BLAKE2b-256 | `5a8d9e3d84f2059b17f5f7a4a8b63bec250b37a28b910eee52ddf9187a1f9ba8` |


Provenance

The following attestation bundles were made for agent_run_diff-0.1.0.tar.gz:

Publisher: publish.yml on MukundaKatta/agent-run-diff


File details

Details for the file agent_run_diff-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: agent_run_diff-0.1.0-py3-none-any.whl
  • Size: 15.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_run_diff-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `dcbf66f0f0c497b1a5f83d2584d34e18156f3dc714e698870f97f70289584a9c` |
| MD5 | `97441f6cb33bd9a16a58f49b9a192875` |
| BLAKE2b-256 | `977e5752ee3c2af9e5a36a948471a322d73633b4caddb88b9dec08239f42dc94` |


Provenance

The following attestation bundles were made for agent_run_diff-0.1.0-py3-none-any.whl:

Publisher: publish.yml on MukundaKatta/agent-run-diff

