Skip to main content

Semantic diff engine for AI agent behavior — find the exact decision fork between two runs

Project description

agentdelta — semantic diff for AI agent behavior

git diff for how your AI agent thinks.

Detect the exact step where two agent runs diverged — which tool it switched to, when its reasoning changed, what prompt edit caused the fork. Built for CI/CD on AI agents.

CI PyPI version Python 3.10+ Downloads License: MIT codecov Typed

Quick Start · How It Works · CLI Reference · GitHub Action · vs. Alternatives · Contributing · Changelog


[!NOTE] agentdelta evaluates behavior, not output. Two runs can produce identical final answers while the agent took completely different paths — calling different tools, in different orders, with different reasoning chains. agentdelta catches that.


Why

Most LLM evaluations check: did the agent get the right answer? They miss the harder question: did it get there the same way?

  • Prompt changes are invisible — tweaking a system prompt can silently flip which tool an agent calls first, changing latency, cost, and reliability without touching the output
  • Model upgrades change behavior — moving from GPT-4o-mini to GPT-4o or Claude 3.5 Sonnet → Opus changes reasoning paths even when benchmark scores stay flat
  • Tool-calling regressions are silent — an agent that starts calling web_search instead of read_database may produce correct answers today and fail tomorrow when the web page moves

agentdelta gives every agent deployment a behavioral fingerprint so you can detect divergence in CI before it reaches production.


Quick start

Install:

pip install agentdelta
# or zero-install with pipx:
pipx run agentdelta --help

Note: First install downloads PyTorch and CUDA libs (~2.5 GB). For CPU-only systems: pip install agentdelta --extra-index-url https://download.pytorch.org/whl/cpu

With LangChain/LangGraph:

pip install "agentdelta[langchain]"

The [langchain] extra is only needed if instrumenting a real LangChain agent. AgentdeltaCallback can be used directly with any framework.

Capture two runs (self-contained — no LangChain required):

from agentdelta.instrument import AgentdeltaCallback
from agentdelta import AgentTrace, diff_traces
from agentdelta.report import print_diff

class _FakeLLMResponse:
    def __init__(self, text):
        self.generations = [[type("G", (), {"text": text})()]]

def build_trace(run_id, tool_name):
    cb = AgentdeltaCallback(run_id=run_id)
    cb.on_chain_start({}, {"input": "What is the weather in Tokyo?"})
    cb.on_llm_end(_FakeLLMResponse("I should look up the weather."))
    cb.on_tool_start({"name": tool_name}, "location='Tokyo'")
    cb.on_tool_end('{"temp": 22, "condition": "sunny"}')
    cb.on_llm_end(_FakeLLMResponse("The weather in Tokyo is 22C and sunny."))
    cb.on_chain_end({"output": "Tokyo: 22C, sunny."})
    return cb.trace

# Baseline run (before your change)
trace_a = build_trace("v1.0", tool_name="get_weather")
trace_a.save("baseline.jsonl")

# Candidate run (after your change — switched to a different tool)
trace_b = build_trace("v1.1", tool_name="web_search")
trace_b.save("candidate.jsonl")

Diff them:

agentdelta diff baseline.jsonl candidate.jsonl
╭───────────────────────────────────────────────╮
│ agentdelta  v1.0 vs v1.1                      │
╰───────────────────────────────────────────────╯
  🔴 REGRESSION DETECTED  3/6 steps matched (50.0%)  1 changed  +1 added  -1 removed

╭────────────────────── Fork Point ──────────────────────╮
│ ⚡ First fork at step 3                               │
│ Tool selection changed: 'get_weather' → 'web_search'  │
│                                                        │
│   Before: get_weather(location='Tokyo')                │
│   After:  web_search(query='Tokyo weather today')      │
╰────────────────────────────────────────────────────────╯

 Step   Status    Type          Detail
    3   CHANGED   🔧 tool_call  Tool selection changed: 'get_weather' → 'web_search'
    4   REMOVED   ↩ tool_return  - [tool_return] {"temp": 22, "condition": "sunny"}
    5   CHANGED   🧠 llm        Reasoning path diverged (similarity: 0.85)

How it works

flowchart LR
    A[Agent Run A\nbaseline.jsonl] --> E[embed_trace\nall-MiniLM-L6-v2]
    B[Agent Run B\ncandidate.jsonl] --> E
    E --> AL[align_traces\nsliding-window cosine similarity]
    AL --> D[diff_traces\nfork threshold = 0.70]
    D --> FP[ForkPoint\nfirst divergent step]
    D --> R[Report\nRich · JSON · Markdown]
  1. Embed — each node's content (LLM reasoning, tool calls, tool outputs) is embedded with all-MiniLM-L6-v2 (22M params, runs locally, no API key)
  2. Align — sliding-window cosine similarity matches nodes by meaning, not by position — insertions and deletions are handled gracefully
  3. Fork — the first aligned pair whose similarity falls below fork_threshold (default 0.70) becomes the ForkPoint
  4. Report — Rich terminal table, JSON for programmatic use, or Markdown for GitHub PR comments

See ARCHITECTURE.md for the full data flow and algorithm details.


Features

Feature Description
Semantic step alignment Matches steps by meaning, not index — handles insertions and deletions
Fork point detection Pinpoints the first divergent step with a human-readable explanation
Tool change detection Identifies when the agent switched tools, even with identical arguments
Reasoning path diff Detects LLM reasoning divergence, not just output changes
LangChain instrumentation One-line record() context manager — no agent code changes
Offline inference Runs entirely locally — no OpenAI/Anthropic API calls for the diff itself
CI/CD integration --exit-code flag for pipeline failures; GitHub Action available
Multiple output formats Rich terminal · JSON · GitHub PR Markdown
JSONL trace format Human-readable, git-diffable, framework-agnostic
Content-addressed IDs Same reasoning step → same node ID across runs

Python API

from agentdelta import AgentTrace, diff_traces
from agentdelta.report import print_diff, to_json, to_markdown

trace_a = AgentTrace.load("baseline.jsonl")
trace_b = AgentTrace.load("candidate.jsonl")

result = diff_traces(trace_a, trace_b, fork_threshold=0.70, match_threshold=0.85)

# Terminal output
print_diff(result)

# Programmatic access
if result.has_regression:
    fp = result.fork_point
    print(f"Fork at step {fp.step_a}: {fp.description}")
    print(f"Similarity: {fp.similarity:.2f}")

# CI/CD JSON
json_str = to_json(result)

# GitHub PR comment
markdown_str = to_markdown(result)

CLI Reference

agentdelta diff TRACE_A TRACE_B [OPTIONS]
Option Default Description
--format rich Output format: rich | json | markdown
--fork-threshold 0.70 Similarity below this marks a fork point
--match-threshold 0.85 Similarity above this is a match (no change)
--show-matches false Include unchanged steps in terminal output
--exit-code false Exit 1 if regression detected (for CI)
agentdelta inspect TRACE_FILE

Prints a step-by-step summary of a single trace file.


Trace format

Traces are .jsonl files — one JSON object per line. Human-readable and git-diffable.

{"type": "trace_meta", "run_id": "v1.0"}
{"type": "node", "step": 1, "node_type": "start",      "content": "What is the weather in Tokyo?", ...}
{"type": "node", "step": 2, "node_type": "llm",        "content": "I should look up the current weather.", ...}
{"type": "node", "step": 3, "node_type": "tool_call",  "content": "get_weather(location='Tokyo')", ...}
{"type": "node", "step": 4, "node_type": "tool_return","content": "{\"temp\": 22, \"condition\": \"sunny\"}", ...}
{"type": "edge", "source_step": 1, "target_step": 2, "edge_type": "sequence", ...}

You can generate traces from any agent framework by writing nodes and edges directly, or use the LangChain callback for automatic capture.


GitHub Action

Use agentdelta directly in your GitHub Actions workflow:

# .github/workflows/agent-regression.yml
- name: Behavioral diff
  uses: sandeep-alluru/agentdelta@v0.1.0
  with:
    baseline: traces/baseline.jsonl
    candidate: traces/candidate.jsonl
    fail-on-regression: "true"

- name: Post diff as PR comment
  uses: marocchino/sticky-pull-request-comment@v2
  with:
    path: agentdelta-diff.md

Or use the CLI directly:

- name: Install agentdelta
  run: pip install agentdelta

- name: Behavioral diff
  run: |
    agentdelta diff traces/baseline.jsonl traces/candidate.jsonl \
      --format markdown --exit-code > diff.md

- name: Post comment
  uses: marocchino/sticky-pull-request-comment@v2
  with:
    path: diff.md

OpenAI integration

Codex CLI — the CODEX.md file at repo root gives OpenAI Codex full project context (architecture, invariants, build commands). Clone the repo and Codex is immediately project-aware.

Assistants API / Responses API — paste tools/openai-tools.json directly into your assistant definition to give it diff_traces, inspect_trace, and record_snippet as callable functions:

import json, openai

tools = json.loads(open("tools/openai-tools.json").read())

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Did my agent regress?"}],
    tools=tools,
)

GPT Actions / Custom GPTs — the openapi.yaml at repo root is a complete OpenAPI 3.1 spec. To register agentdelta as a ChatGPT Action:

  1. Run pip install "agentdelta[api]" && uvicorn agentdelta.api:app (or deploy to any host)
  2. In ChatGPT → My GPTs → Create → Add Action → import from openapi.yaml

Claude / MCP integration

Install the MCP server to use agentdelta as a native Claude tool — no CLI needed:

pip install "agentdelta[mcp]"

Add to your Claude Desktop config (~/.config/claude/claude_desktop_config.json on Linux, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "agentdelta": {
      "command": "agentdelta-mcp"
    }
  }
}

Claude then has three tools: diff_traces, inspect_trace, record_snippet — callable directly in conversation with no shell commands.

Claude Code slash commands are included in the repo. After cloning, type /project: to see:

Command What it does
/project:diff Diff two trace files and explain the fork
/project:inspect Summarise a single trace's execution path
/project:record Generate copy-paste recording boilerplate
/project:add-adapter Scaffold a new framework instrumentation adapter
/project:pr-prep Run lint + types + tests + CHANGELOG check

vs. Alternatives

agentdelta LangSmith Arize / Phoenix Weave (W&B)
Behavioral diff (two runs) ✅ core feature
Fork point detection ✅ step-level
Offline / local ✅ no API key ❌ SaaS ❌ SaaS ❌ SaaS
CI exit code on regression --exit-code
Git-diffable trace format ✅ JSONL ❌ proprietary ❌ proprietary ❌ proprietary
GitHub Action
Trace collection LangChain/custom ✅ full platform ✅ full platform ✅ full platform
Eval / scoring planned
Cost free / MIT free tier + paid free tier + paid free tier + paid

agentdelta is not an observability platform — it is a diff tool. Use it alongside LangSmith or Phoenix for collection and scoring, and agentdelta for behavioral regression detection in CI.


Case Studies

See how teams are using agentdelta in production:


Repository structure

agentdelta/
├── src/agentdelta/
│   ├── trace.py          # Data model: TraceNode, TraceEdge, AgentTrace
│   ├── embed.py          # Embeddings + sliding-window alignment
│   ├── diff.py           # Fork detection → DiffResult, ForkPoint
│   ├── instrument.py     # LangChain callback + record() context manager
│   ├── report.py         # Rich / JSON / Markdown output formatters
│   └── cli.py            # Click CLI (diff, inspect)
├── tests/                # 43 unit tests — pytest
├── examples/
│   └── demo.py           # Runnable end-to-end demo
├── assets/               # Logo, banner, demo GIF
├── .github/
│   ├── workflows/
│   │   ├── ci.yml        # Lint + test + coverage on push/PR
│   │   └── release.yml   # PyPI publish on tag push
│   ├── ISSUE_TEMPLATE/   # Bug report + feature request templates
│   └── PULL_REQUEST_TEMPLATE.md
├── action.yml            # Use agentdelta as a GitHub Action
├── ARCHITECTURE.md       # Full data flow + algorithm details
├── CONTRIBUTING.md       # How to contribute
├── CHANGELOG.md          # Release history
└── SECURITY.md           # Vulnerability reporting

Development

git clone https://github.com/sandeep-alluru/agentdelta
cd agentdelta
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pre-commit install

make test       # run the full test suite (43 tests)
make lint       # ruff check + format
make typecheck  # mypy
make all        # lint + typecheck + test

See CONTRIBUTING.md for the full guide including how to add output formats and instrumentation adapters.


License

MIT — see LICENSE.


GitHub Topics

If you're adding this repo to GitHub, set these topics for maximum discoverability:

llm  agents  langchain  diff  regression-testing  mcp  behavioral-testing  ci-cd  openai  python

Star History Chart

If agentdelta saved you from a silent behavioral regression, consider giving it a ⭐

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentdelta-0.1.1.tar.gz (2.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

agentdelta-0.1.1-py3-none-any.whl (31.8 kB view details)

Uploaded Python 3

File details

Details for the file agentdelta-0.1.1.tar.gz.

File metadata

  • Download URL: agentdelta-0.1.1.tar.gz
  • Upload date:
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agentdelta-0.1.1.tar.gz
Algorithm Hash digest
SHA256 c81e1e6834c25fc032dbf8101ae69947706d32ec62fe009c8a65a636d086350c
MD5 f0966338e9669e3c49d9e04fc6f79900
BLAKE2b-256 269fc8ac2b4b3800488cc938e7472cb12bbf4f7639c70c94eb0a3a918172b826

See more details on using hashes here.

File details

Details for the file agentdelta-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: agentdelta-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 31.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agentdelta-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 adab1d14d257e692f1d00ce297f0a1691407b43d17b4ef800650541830a99748
MD5 8103cc60b195bc59656a8ce0d77046df
BLAKE2b-256 e622458128e50c9da0f7fd4120d93e54285c581cb560b12e5a82c763df3a6c3c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page