Local-first debugging, grading, and replay tooling for AI agent traces.

These details have not been verified by PyPI

Project links

Project description

agent-xray

Reconstruct what your AI agent saw at each decision point — so you can debug behavioral failures that produce no error logs.

Why agent-xray?

LLM agents fail silently. There are no stack traces. No error logs. The agent just does the wrong thing, and you are left reading a raw trace trying to figure out why.

Existing observability tools show you what happened — traces, spans, latencies. agent-xray shows you why — by reconstructing the decision surface at each step: what was in the prompt, which tools were available, what the model was reasoning about, what the page looked like, and how context pressure was shaping choices.

agent-xray surfaces evidence and keeps agents on track. It does not make determinations for you.

Local-first: inspect traces on your machine, in CI, or fully offline. No account, no telemetry, no network calls.
Framework-agnostic: normalize traces from OpenAI, Anthropic, LangChain, CrewAI, OpenTelemetry, or plain JSONL into one schema.
Zero dependencies: the core library has no required dependencies. Optional extras add framework-specific instrumentation.

Use your tracing stack to collect runs. Use agent-xray to understand why a specific decision went wrong.

The 5-Tool Workflow

Most investigations follow the same five steps. This is what 90% of users need:

triage  →  surface_task  →  grade  →  root_cause  →  inspect_task

Step	Tool	What it does
1	`triage`	One call to see the full picture: structural grade distribution, the worst failure surfaced step-by-step, and a prioritized fix plan. Start here.
2	`surface_task`	Reconstructs the complete decision surface for a single task — prompt context, tool set, conversation history, browser state, model reasoning, corrections. This is the evidence you need to understand why the agent chose what it chose.
3	`grade`	Assigns structural execution grades to all tasks in a trace set. Useful as a triage signal to separate clean executions from problematic ones.
4	`root_cause`	Classifies likely failure modes (spin, tool bug, early abort, routing bug, etc.) for tasks that graded poorly. Points you toward the right subsystem to fix.
5	`inspect_task`	All-in-one deep dive on a single task: structural grade + root cause + full surface + reasoning chain in one call. Use this when you know which task to investigate.

# The typical investigation
agent-xray triage ./traces
agent-xray surface task-123 --log-dir ./traces
agent-xray inspect task-123 ./traces

What Grades Mean

Grades measure execution structure, not output correctness. Default rules are generic starting points — create custom rulesets that encode your product's definition of quality before running optimization campaigns. See Adapt the Rules Before You Optimize.

This distinction matters. Structural grades are triage signals, not verdicts:

Grade	Meaning
GOLDEN	Structurally clean execution: good tool diversity, no loops, low error rate, task completed
GOOD	Minor structural issues but generally sound execution
OK	Some structural concerns worth investigating
WEAK	Significant structural problems detected
BROKEN	Structural failures: spinning, high error rate, tool bugs, early aborts

Grades evaluate how the agent executed — tool diversity, loop resistance, error rate, completion signals. They do not evaluate whether the agent achieved the right outcome or produced correct output.

A GOLDEN task could still have the wrong answer.
A BROKEN task might have eventually succeeded through a messy path.
Use surface_task to verify what actually happened at each decision point.
Use task_bank_validate with curated success criteria to check correctness.

Surface evidence, let the agent decide.

The Flywheel

The flywheel is the disciplined improvement loop that turns individual investigations into systematic gains:

triage
  → identify worst failure
    → surface the decision point
      → fix root cause
        → enforce_check (controlled experiment)
          → compare_runs (quantify impact across ALL tasks)
            → repeat

Each step produces evidence for the next. The enforce workflow provides controlled experiments with gaming detection — so you know whether the fix actually improved things or just papered over the failure. compare_runs quantifies before/after impact across your entire task set, not just the one you fixed.

# Run the full flywheel in one shot
agent-xray flywheel ./traces --fixture-dir ./fixtures --baseline ./last-flywheel.json

# Or step by step:
agent-xray triage ./traces                                    # Find the worst problem
agent-xray surface broken-task --log-dir ./traces              # See what the agent saw
# ... fix the root cause ...
agent-xray enforce init --test "pytest tests/ -q"              # Baseline before change
agent-xray enforce check                                       # Evaluate the fix
agent-xray compare ./traces-before ./traces-after              # Quantify across all tasks

The flywheel also detects evaluator drift — if your rules or replay logic change mid-run, it raises EVALUATION_DRIFT rather than mixing grades from different rule versions.

Quick Start

pip install agent-xray

Generate a demo trace set and walkthrough:

agent-xray quickstart

Or paste this inline example into ./traces/demo.jsonl:

mkdir -p traces
cat > traces/demo.jsonl <<'JSONL'
{"task_id":"demo-checkout","step":1,"tool_name":"browser_navigate","tool_input":{"url":"https://demo-shop.example.test"},"tool_result":"Homepage loaded.","timestamp":"2026-03-27T12:00:00Z","duration_ms":900,"user_text":"Buy the blue mug on demo-shop.example.test and stop once checkout is visible.","task_category":"commerce","model_name":"gpt-5-mini","temperature":0.0,"tool_choice":"auto","tools_available":["browser_navigate","browser_click","browser_fill_ref","browser_snapshot"],"message_count":1,"llm_reasoning":"Open the storefront first.","page_url":"https://demo-shop.example.test/"}
{"task_id":"demo-checkout","step":2,"tool_name":"browser_click","tool_input":{"ref":"product-blue-mug"},"tool_result":"Product page opened.","timestamp":"2026-03-27T12:00:04Z","duration_ms":420,"model_name":"gpt-5-mini","temperature":0.0,"tool_choice":"auto","tools_available":["browser_click","browser_fill_ref","browser_snapshot"],"message_count":2,"llm_reasoning":"Open the mug detail page.","page_url":"https://demo-shop.example.test/products/blue-mug"}
{"task_id":"demo-checkout","step":3,"tool_name":"browser_fill_ref","tool_input":{"ref":"shipping-form","fields":["email","zip"],"text":"alex@example.test 60601"},"tool_result":"Shipping details accepted.","timestamp":"2026-03-27T12:00:09Z","duration_ms":610,"model_name":"gpt-5-mini","temperature":0.0,"tool_choice":"auto","tools_available":["browser_click","browser_fill_ref","browser_snapshot"],"message_count":3,"llm_reasoning":"Provide the minimum details required to continue.","page_url":"https://demo-shop.example.test/checkout"}
{"task_id":"demo-checkout","step":4,"tool_name":"browser_snapshot","tool_input":{},"tool_result":"Checkout page is visible with the order summary.","timestamp":"2026-03-27T12:00:12Z","duration_ms":250,"model_name":"gpt-5-mini","temperature":0.0,"tool_choice":"auto","tools_available":["browser_click","browser_snapshot"],"message_count":4,"llm_reasoning":"Verify that checkout is now visible.","page_url":"https://demo-shop.example.test/checkout"}
{"event":"task_complete","task_id":"demo-checkout","status":"success","final_answer":"Checkout page is open for the blue mug.","total_steps":4,"total_duration_s":2.18,"timestamp":"2026-03-27T12:00:12Z"}
JSONL

# Start here — one call to see the full picture
agent-xray triage ./traces

# Reconstruct the decision surface for the task
agent-xray surface demo-checkout --log-dir ./traces

# Structural grade
agent-xray grade ./traces

The tutorial walks through the full expected output and explains how to read each section.

Supported Frameworks

Framework	Format Flag	Support Level
Generic JSONL	`--format generic`	Full — native format, all features
OpenAI Agents / Responses-style traces	`--format openai`	Full
Anthropic Messages traces	`--format anthropic`	Full
LangChain / LangGraph traces	`--format langchain`	Good — callback-based, covers most workflows
OpenTelemetry GenAI spans	`--format otel`	Good — experimental, GenAI semantic conventions
CrewAI traces	`--format crewai`	Basic — fixture/replay support, limited live tracing
Auto-detect	`--format auto`	Stable

Planned: deeper LangGraph state-graph support, AutoGPT adapter.

Use agent-xray format-detect <file> to check what format your traces are in and get a confidence score.

Install

pip install agent-xray

Optional extras for specific integrations:

pip install "agent-xray[openai]"       # OpenAI SDK auto-instrumentation
pip install "agent-xray[anthropic]"    # Anthropic SDK auto-instrumentation
pip install "agent-xray[langchain]"    # LangChain callback handler
pip install "agent-xray[otel]"         # OpenTelemetry adapter
pip install "agent-xray[tui]"          # Interactive terminal inspector
pip install "agent-xray[mcp]"          # MCP server and proxy
pip install "agent-xray[all]"          # Everything

Zero required dependencies for the core library. Python 3.10+.

Full Tool Reference

CLI Quick Reference

agent-xray analyze ./traces
agent-xray capture task-123 ./traces
agent-xray compare ./traces-before ./traces-after
agent-xray completeness ./traces
agent-xray diagnose ./traces
agent-xray diff task-a task-b ./traces
agent-xray flywheel ./traces
agent-xray grade ./traces
agent-xray inspect task-123 ./traces
agent-xray quickstart
agent-xray reasoning task-123 ./traces
agent-xray record -- python your_agent.py
agent-xray replay ./fixture.json ./traces
agent-xray report ./traces
agent-xray root-cause ./traces
agent-xray search "checkout" ./traces
agent-xray signal-detect task-123 ./traces
agent-xray surface task-123 --log-dir ./traces
agent-xray tree ./traces
agent-xray triage ./traces
agent-xray tui ./traces
agent-xray validate-targets ./fix-plan.json
agent-xray watch ./traces/demo.jsonl
agent-xray enforce check
agent-xray rules list
agent-xray pricing list
agent-xray golden rank ./traces
agent-xray baseline capture ./traces
agent-xray task-bank validate ./task-bank.json

Core 5 — The Investigation Workflow

Tool	What it does
`triage`	One-call investigation: structural grade distribution, worst failure surfaced step-by-step, prioritized fix plan
`surface_task`	Reconstructs the full decision surface for one task: prompt, tools, reasoning, browser state, corrections
`grade`	Structural execution grades for all tasks with configurable JSON rulesets
`root_cause`	Classifies failure modes using a 19-category cascade classifier with evidence
`inspect_task`	All-in-one single-task deep dive: structural grade + root cause + surface + reasoning

Investigation (10 tools)

Tool	What it does
`analyze`	High-level trace analysis with per-task metrics and grade summary
`reasoning`	Extracts the model's reasoning chain for a specific task
`search_tasks`	Finds tasks by user_text substring with optional grade filter
`diff_tasks`	Compares two tasks side by side: tool calls, timing, outcomes, evidence
`compare_runs`	Side-by-side comparison of two trace directories across matched tasks
`tree`	Day/site/task hierarchy with optional grade enrichment
`diagnose`	Structured diagnostics with tool gaps, approval blocks, and fix plan
`completeness`	Audits trace coverage across 14 observability dimensions
`signal_detect`	Runs domain signal detectors (commerce, coding, research, etc.) on a single task
`format_detect`	Auto-detects trace format of a log file with confidence score

Fixtures and Replay (4 tools)

Tool	What it does
`capture_task`	Captures a sanitized golden fixture for replay and regression testing
`replay`	Compares saved golden fixture against current traces (IMPROVED/REGRESSION/STABLE)
`match_task`	Fuzzy-matches a task to the best task bank entry
`flywheel`	End-to-end quality loop: grading + root cause + baseline comparison + drift detection

Golden and Baseline (7 tools)

Tool	What it does
`golden_rank`	Ranks golden/good runs by efficiency within each site
`golden_best`	Best exemplar per site for baseline/fixture capture
`golden_compare`	Compares current runs against golden fixtures for regressions
`golden_capture`	Captures a golden exemplar for future comparison
`golden_profiles`	Shows optimization profiles with weight distributions
`baseline_capture`	Captures task metrics as a baseline for overhead measurement
`baseline_list`	Lists saved baselines with their metrics
`baseline_generate`	Generates a naked prompt for baseline comparison

Enforce — Controlled Experiments (10 tools)

Tool	What it does
`enforce_init`	Starts an enforcement session and captures baseline test results
`enforce_check`	Evaluates current changes: recommends commit, revert, or reject with evidence
`enforce_diff`	Previews current diff size and rejection status
`enforce_plan`	Registers a hypothesis and expected test outcomes before a change
`enforce_guard`	Detects out-of-band changes that bypassed the enforce pipeline
`enforce_status`	Returns the current session state
`enforce_quick`	Runs a single enforce init+check cycle in one call and returns the combined review payload
`enforce_challenge`	Adversarial review across iterations: flip-flops, gaming, scope creep
`enforce_reset`	Abandons the active session
`enforce_report`	Full session report in JSON, text, or Markdown
`preflight_diff`	Checks git diff against project guardrails before an enforce iteration

Rules, Pricing, Task Bank, and Reports (13 tools)

Tool	What it does
`rules_list`	Lists built-in rulesets
`rules_show`	Shows a ruleset's signals, thresholds, and requirements
`rules_init`	Scaffolds a custom ruleset from an existing one
`pricing_list`	Lists all known models with token pricing
`pricing_show`	Per-token pricing for a specific model
`pricing_update`	Refreshes local pricing cache from GitHub
`task_bank_validate`	Validates task bank schema and criterion names
`task_bank_list`	Lists all task bank entries
`task_bank_show`	Shows a single task bank entry by ID or prefix
`report`	16 report types: health, golden, broken, tools, flows, cost, spins, timeline, and more
`validate_targets`	Checks that fix-plan file-path targets exist on disk
`gaming_audit`	Runs 9 gaming detectors on a diff (test-gaming, hardcoded values, mock insertion)

Total: 49 MCP tools available via python -m agent_xray.mcp_server or the agent-xray-mcp entrypoint.

Structural Grading System

Structural grades are produced by configurable JSON rulesets. Bundled rulesets:

Ruleset	Domain
`default`	General reliability and loop-detection signals
`browser_flow`	Browser and commerce progression signals
`coding_agent`	File-edit, test, lint, and shell behavior
`research_agent`	Browsing and evidence-gathering signals

Rules use a field/operator format:

{
  "name": "browser_flow",
  "signals": [
    {
      "label": "high_error_rate",
      "field": "error_rate",
      "op": "gte",
      "value": 0.5,
      "points": -3,
      "reason": "error rate is too high"
    },
    {
      "label": "good_tool_diversity",
      "field": "unique_tools",
      "op": "gte",
      "value": 3,
      "points": 2,
      "reason": "used multiple tools"
    }
  ],
  "grade_thresholds": {
    "GOLDEN": 8,
    "GOOD": 5,
    "OK": 2,
    "WEAK": 0
  }
}

Create custom rulesets:

agent-xray rules init --base default > my_rules.json
agent-xray grade ./traces --rules my_rules.json

Root Cause Classification

When a task grades poorly (structural signal), root_cause classifies the likely failure mode using a 19-category cascade classifier. Each classification includes evidence from the decision surface:

Root Cause	Evidence pattern
`spin`	Same action repeated without forward movement
`tool_bug`	Tool call failed or returned unusable output
`early_abort`	Run ended before enough work was attempted
`routing_bug`	Task never got the right tool exposure
`approval_block`	Policy or permission gates blocked progress
`stuck_loop`	Kept acting without meaningful state progress
`tool_selection_bug`	Right tool existed but model chose poorly
`reasoning_bug`	Strategy was wrong even with adequate tools
`test_failure_loop`	Kept rerunning failing tests without changing approach
`environment_drift`	Target environment changed underneath the agent
`memory_overload`	Context pressure degraded later reasoning
`delegation_failure`	Multi-agent workflow failed at delegation boundary
`insufficient_sources`	Research task answered before gathering enough evidence
`tool_rejection_mismatch`	Needed tool actively rejected by policy
`prompt_bug`	Instructions likely nudged the model toward failure
`model_limit`	Task appears to exceed model capability
`valid_alternative_path`	Goal achieved through an unexpected but valid path
`consultative_success`	Well-reasoned answer without browser interaction
`unclassified`	No specific root cause identified from available evidence

The classifier also detects soft errors — logical failures in tool results (e.g., "NOT ON A PAYMENT PAGE") even when the tool reported technical success.

Task Bank — Correctness Validation

Structural grades assess execution quality. For correctness, use the task bank system with curated success criteria:

{
  "tasks": [
    {
      "id": "checkout-wireless-headset",
      "user_text": "Buy the wireless headset and complete checkout on shop.example.test.",
      "site": "shop.example.test",
      "category": "commerce",
      "success_criteria": {
        "must_reach_checkout": true,
        "must_reach_url": "order/confirmation",
        "must_use_tools": ["browser_click", "browser_fill_ref"]
      }
    }
  ]
}

14 supported criteria: must_answer_contains, answer_type, must_reach_url, must_fill_fields, min_urls, max_steps, payment_fields_visible, must_not_fill_payment, must_reach_cart, must_reach_checkout, must_use_tools, no_browser_needed, must_have_answer, min_tool_count.

agent-xray task-bank validate ./task_bank.json
agent-xray grade ./traces --task-bank ./task_bank.json

Integration

Minimal: write JSONL directly

import json, time
from pathlib import Path

trace = Path("traces/run.jsonl")
trace.parent.mkdir(exist_ok=True)

# In your agent loop, after each tool call:
with trace.open("a") as f:
    f.write(json.dumps({
        "task_id": "my-task",
        "step": step,
        "tool_name": tool_name,
        "tool_input": tool_args,
        "tool_result": result_text,
        "model_name": "gpt-4.1-mini",
    }) + "\n")

Auto-instrumentation

from agent_xray.instrument import auto_instrument

# Patches installed OpenAI/Anthropic SDKs automatically
auto_instrument(output_dir="./traces", task_id="my-task")

SDK / Integration	Instrumentor	How it works
OpenAI Python SDK	`OpenAIInstrumentor`	Monkey-patches chat completion calls and records tool calls
Anthropic Python SDK	`AnthropicInstrumentor`	Monkey-patches `messages.create` and records tool-use blocks
LangChain / LangGraph	`XRayCallbackHandler`	Callback handler that records tool start/end events
MCP clients	`XRayMCPProxy`	Wraps `call_tool` / `list_tools` and logs every tool request

Working examples for each framework are in examples/:

Framework	Example
Anthropic SDK	`instrument_anthropic.py`
OpenAI SDK	`instrument_openai.py`
MCP (Playwright, etc.)	`instrument_mcp.py`
LangChain	`instrument_langchain.py`

See the full integration guide for the JSONL format reference, field descriptions, and common pitfalls.

Record mode

Capture output from any JSON-emitting subprocess:

agent-xray record --output-dir ./traces -- python my_agent.py

Pytest plugin

agent-xray ships a pytest plugin (registered via the pytest11 entry point) that provides an xray fixture for asserting on structural execution quality inside test suites:

def test_checkout_agent(xray):
    steps = [{"step": 1, "tool_name": "navigate", "tool_result": "ok"}, ...]
    report = xray.analyze(steps)
    assert report.grade in ("GOLDEN", "GOOD")
    assert report.error_rate < 0.1

pytest --xray-rules my_rules.json tests/

Enforce Mode — Controlled Experiments

Enforce mode turns agent-xray into an advisor for AI agents making code changes. It enforces one-change-at-a-time discipline, runs A/B testing against baseline, detects gaming patterns, and surfaces evidence for commit-or-revert decisions.

Why enforce mode?

AI agents left to fix failing tests will often game them: weaken assertions, insert hardcoded values, swallow exceptions, or add special-case branches. Enforce mode catches this automatically and surfaces the evidence so you can revert bad changes before they accumulate.

The enforce loop

Each enforce check iteration:

Captures the git diff and parses it into hunks
Runs the test suite and compares results against baseline
Detects gaming signals (8 heuristic detectors)
Classifies change quality (productive, refactor, test-only, mixed, suspicious)
Flags changes that exceed size limits
Checks project-specific rules if configured
Surfaces likely root causes for any regressions with evidence
Reports a recommendation: RECOMMEND_COMMIT, RECOMMEND_REVERT, or REJECTED — always with evidence

The recommendation is not a verdict. The calling agent or user decides whether to act on it.

Gaming detection

Eight heuristic detectors catch common gaming patterns:

Detector	What it catches
`detect_test_file_modification`	Agent edits test files to make them pass
`detect_hardcoded_values`	Hardcoded return values that bypass real logic
`detect_special_case_branching`	`if test_mode` or `if x == expected_value` branches
`detect_mock_insertion`	Mock objects added to bypass real dependencies
`detect_assertion_weakening`	Replacing strict assertions with permissive ones
`detect_exception_swallowing`	Broad `except: pass` blocks hiding failures
`detect_early_return`	Short-circuit returns before real logic executes
`detect_import_removal`	Removing imports to silence import errors

Combined confidence: VALID (<0.3), SUSPICIOUS (0.3-0.6), GAMING (>0.6).

Adversarial challenges

enforce challenge runs 9 cross-iteration checks: flip-flop detection, dependency risk, coverage gaps, cumulative gaming, assertion erosion, diminishing returns, persistent failures, and scope creep.

agent-xray enforce init --test "pytest tests/ -q"
agent-xray enforce check --hypothesis "missing await"
agent-xray enforce challenge
agent-xray enforce report --format markdown

Signal Detector Packs

Six built-in detector packs. Their metrics merge into TaskAnalysis.metrics() and can be referenced directly in rules:

Detector	What it measures
Commerce	`reached_payment`, `reached_checkout`, `reached_cart`, fill counts, payment-field visibility
Coding	File edits, test/build/lint/git/shell activity, error counts, `test_to_edit_ratio`
Research	Searches, reads, source diversity, citations, synthesis steps, URL references
Planning	Plans created, steps executed, revisions, completion rate
Memory	Memory store/recall activity, recall hit rate, RAG queries, unique keys
Multi-agent	Delegation count, unique agents, delegation success rate, depth

Concrete detector classes: CommerceDetector, CodingDetector, ResearchDetector, PlanningDetector, MemoryDetector, MultiAgentDetector.

Custom signal plugins

[project.entry-points."agent_xray.signals"]
my_detector = "my_package.detectors:MyDetector"

from agent_xray.schema import AgentStep, AgentTask

class MyDetector:
    name = "my_detector"

    def detect_step(self, step: AgentStep) -> dict[str, bool]:
        return {"is_special": "special" in step.tool_name.lower()}

    def summarize(self, task: AgentTask, step_signals: list[dict[str, bool]]) -> dict[str, int]:
        return {"special_steps": sum(1 for item in step_signals if item["is_special"])}

Library Usage

from agent_xray import AgentStep, AgentTask, surface_for_task, classify_task, grade_task, load_rules

records = [
    {"task_id": "task-1", "step": 1, "tool_name": "browser_navigate", "tool_input": {"url": "https://example.test"}},
    {"task_id": "task-1", "step": 2, "tool_name": "browser_snapshot", "tool_input": {}, "tool_result": "checkout"},
]

steps = [AgentStep.from_dict(record) for record in records]
task = AgentTask.from_steps(steps, task_text="Inspect the checkout flow")

# Structural grade — a triage signal, not a verdict
grade = grade_task(task, load_rules("browser_flow"))

# Decision surface — the evidence
surface = surface_for_task(task)

print(grade.grade, grade.score)
print(surface["steps"][0]["tools_available_names"])

# Root cause — classified from evidence in the decision surface
if grade.grade in {"WEAK", "BROKEN"}:
    cause = classify_task(task, grade)
    print(cause.root_cause, cause.evidence)

Architecture

JSONL / framework trace files
  -> adapters/            normalize source formats into AgentStep
  -> schema.py            AgentStep / AgentTask / typed contexts
  -> signals/             detector packs (commerce, coding, research, planning, memory, multi-agent)
  -> analyzer.py          task metrics, cost tracking, soft-error detection
  -> surface.py           decision-surface reconstruction, reasoning chains, tree, diff
  -> grader.py            configurable JSON rulesets (structural execution grades)
  -> root_cause.py        19-category failure classifier with cascade ordering
  -> contrib/task_bank.py correctness validation with fuzzy task matching
  -> capture.py           sanitized golden fixtures
  -> replay.py            fixture-vs-run comparison
  -> flywheel.py          end-to-end quality loop, baseline comparison, drift detection
  -> enforce.py           controlled experiment loop (session, check, auto)
  -> enforce_audit.py     gaming detection and adversarial challenges
  -> enforce_report.py    enforce session reports (text, JSON, markdown)
  -> comparison.py        model-vs-model divergence analysis
  -> reports.py           16 report types (health, tools, flows, cost, spins, timeline, etc.)
  -> watch.py             live-tail grading as JSONL files grow
  -> pricing.py           pricing database, aliases, cache, and live refresh
  -> baseline.py          naked prompt generation and overhead measurement
  -> instrument/          OpenAI, Anthropic, LangChain, and MCP auto-instrumentation
  -> runner.py            TaskRunner protocol plus HTTP runner
  -> protocols.py         ToolRegistry / PromptBuilder / StepAdapter protocols
  -> mcp_server.py        49 MCP tools for investigation, enforce, and analysis

How agent-xray Compares

Feature	agent-xray	LangSmith	Langfuse	Arize Phoenix	Braintrust	AgentOps
Local-first	Yes	No	Self-host option	Self-host option	No	No
Fully offline	Yes	No	No	Partial	No	No
Open source	MIT	Proprietary	MIT (server)	Apache 2.0	Proprietary	Proprietary
No account needed	Yes	No	No	No	No	No
Zero dependencies	Yes	Many	Many	Many	Many	Many
Framework agnostic	Yes	LangChain-first	LangChain-first	Yes	Yes	Yes
Decision surface reconstruction	Yes	No	No	No	No	No
Root-cause classification	Yes	No	No	No	No	No
Enforce mode (controlled experiments)	Yes	No	No	No	No	No
Golden fixture regression	Yes	Dataset comparison	No	No	Dataset comparison	No
Gaming detection	Yes	No	No	No	No	No
pytest plugin	Yes	No	No	No	No	No
Pricing	Free forever	Free tier + paid	Free tier + paid	Free tier + paid	Free tier + paid	Free tier + paid

agent-xray is complementary, not competing. Use LangSmith or Langfuse to collect production traces at scale. Use agent-xray to deeply debug why a specific agent decision went wrong — locally, offline, with no account required. Think of it as the pytest to their Sauce Labs.

Documentation

Integration guide — connect your agent in 5 minutes
Tutorial — instrument a simple agent and analyze the output
Architecture overview
Custom rules guide
Contribution guide

Contributing

agent-xray is designed to be extended. Add a detector, adapter, ruleset, or CLI improvement and keep the behavior observable.

See CONTRIBUTING.md for:

Local setup
Your first signal detector
Writing a new adapter
Adding a ruleset
Style, test, and PR expectations

Public API Index

Public exports referenced by tests and available for library users include:

StaticToolRegistry, StaticPromptBuilder, check_completeness, DefaultTargetResolver, build_fix_plan, format_fix_plan_text, list_all_targets, register_target_resolver, validate_fix_targets, load_task_bank, rank_golden_runs, find_exemplars, explain_efficiency_gap, capture_exemplar, generate_naked_prompt, build_baseline, measure_overhead, group_by_prompt_hash, suggest_baseline_capture, SignalDetector, discover_detectors, EnforceConfig, build_enforce_report, format_enforce_text.

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.25.4

Apr 5, 2026

1.25.3

Mar 30, 2026

1.25.2

Mar 30, 2026

1.25.1

Mar 30, 2026

1.25.0

Mar 30, 2026

1.24.0

Mar 30, 2026

1.23.1

Mar 30, 2026

1.23.0

Mar 30, 2026

1.22.0

Mar 30, 2026

1.21.0

Mar 30, 2026

1.20.0

Mar 30, 2026

1.19.0

Mar 30, 2026

1.18.0

Mar 30, 2026

1.17.0

Mar 30, 2026

1.16.0

Mar 30, 2026

1.15.0

Mar 30, 2026

1.14.0

Mar 30, 2026

1.13.0

Mar 30, 2026

1.12.1

Mar 28, 2026

1.12.0

Mar 28, 2026

1.11.0

Mar 28, 2026

1.10.0

Mar 28, 2026

1.9.0

Mar 28, 2026

1.8.0

Mar 28, 2026

1.7.2

Mar 28, 2026

1.7.1

Mar 28, 2026

1.7.0

Mar 28, 2026

1.6.0

Mar 28, 2026

1.5.1

Mar 28, 2026

1.5.0

Mar 28, 2026

1.4.0

Mar 28, 2026

1.3.0

Mar 28, 2026

1.2.6

Mar 28, 2026

1.2.5

Mar 28, 2026

1.2.4

Mar 28, 2026

1.2.3

Mar 28, 2026

1.2.2

Mar 28, 2026

1.2.1

Mar 28, 2026

1.2.0

Mar 28, 2026

1.1.0

Mar 28, 2026

1.0.8

Mar 28, 2026

1.0.7

Mar 28, 2026

1.0.6

Mar 28, 2026

1.0.5

Mar 28, 2026

1.0.4

Mar 28, 2026

1.0.3

Mar 28, 2026

1.0.2

Mar 27, 2026

1.0.1

Mar 27, 2026

1.0.0

Mar 27, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agent_xray-1.25.4.tar.gz (416.3 kB view details)

Uploaded Apr 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

agent_xray-1.25.4-py3-none-any.whl (285.6 kB view details)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file agent_xray-1.25.4.tar.gz.

File metadata

Download URL: agent_xray-1.25.4.tar.gz
Upload date: Apr 5, 2026
Size: 416.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for agent_xray-1.25.4.tar.gz
Algorithm	Hash digest
SHA256	`ae4116bd1fd991b87d56462afc0c12a5782f23bb158735b6523ae5040832bb4c`
MD5	`f512c375951e0350669ec82a7900e6bf`
BLAKE2b-256	`1679578e5365a8df84d3fc8b7a40608e9da2edeacbe0d746fdf18ea519937936`

See more details on using hashes here.

File details

Details for the file agent_xray-1.25.4-py3-none-any.whl.

File metadata

Download URL: agent_xray-1.25.4-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 285.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for agent_xray-1.25.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`51d0504f3185ea2de7da17c7e11ac9d9dec671bade244115803ea28aaadc919b`
MD5	`fa9388b0de7fd46b3bf4e86d6ef3e1d8`
BLAKE2b-256	`c63106b832268be12f430a3bfee1a1ff9501b006e106ab5da195ed6dbedd1c20`

See more details on using hashes here.

agent-xray 1.25.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

agent-xray

Why agent-xray?

The 5-Tool Workflow

What Grades Mean

The Flywheel

Quick Start

Supported Frameworks

Install

Full Tool Reference

CLI Quick Reference

Core 5 — The Investigation Workflow

Investigation (10 tools)

Fixtures and Replay (4 tools)

Golden and Baseline (7 tools)

Enforce — Controlled Experiments (10 tools)

Rules, Pricing, Task Bank, and Reports (13 tools)

Structural Grading System

Root Cause Classification

Task Bank — Correctness Validation

Integration

Minimal: write JSONL directly

Auto-instrumentation

Record mode

Pytest plugin

Enforce Mode — Controlled Experiments

Why enforce mode?

The enforce loop

Gaming detection

Adversarial challenges

Signal Detector Packs

Custom signal plugins

Library Usage

Architecture

How agent-xray Compares

Documentation

Contributing

Public API Index

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes