Agent Reliability Observatory — a behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail.

Agent Diagnostics

A behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail on benchmark tasks.

Install

pip install agent-diagnostics

Optional extras:

pip install "agent-diagnostics[llm]"        # LLM annotation (anthropic SDK)
pip install "agent-diagnostics[validation]" # JSON schema validation
pip install "agent-diagnostics[dev]"        # pytest, ruff, coverage

Quick Start

1. Annotate a trial from signals

from agent_diagnostics import annotate_trial, TrialSignals

# Build a signal dict (e.g. from your agent's logs)
signals: TrialSignals = {
    "task_id": "django__django-16527",
    "reward": 0.0,
    "passed": False,
    "search_tool_calls": 0,
    "unique_files_read": 0,
    "tool_calls_total": 3,
    "exception_crashed": False,
    "rate_limited": False,
}

categories = annotate_trial(signals)
for c in categories:
    print(f"  {c.name} (confidence={c.confidence}): {c.evidence}")
# retrieval_failure (confidence=0.9): No search tool calls and no files read
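
To make the idea of a heuristic rule concrete, here is a minimal sketch of how a rule like retrieval_failure could be written over a signal dict. This is not the package's implementation: the Assignment class, the retrieval_failure_rule function, and the 0.9 confidence are illustrative assumptions modeled on the example output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    """Hypothetical stand-in for the package's category assignment type."""
    name: str
    confidence: float
    evidence: str


def retrieval_failure_rule(signals: dict) -> Optional[Assignment]:
    """Fire when a failed trial never searched and never read a file."""
    if signals.get("passed"):
        return None  # successful trials are not failure candidates
    no_search = signals.get("search_tool_calls", 0) == 0
    no_reads = signals.get("unique_files_read", 0) == 0
    if no_search and no_reads:
        return Assignment(
            name="retrieval_failure",
            confidence=0.9,  # assumed value, copied from the example output
            evidence="No search tool calls and no files read",
        )
    return None


example = {"passed": False, "search_tool_calls": 0, "unique_files_read": 0}
result = retrieval_failure_rule(example)
print(result)
```

A full annotator would presumably run many such rules and collect every non-None result.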

2. Extract signals from a trial directory

from agent_diagnostics import extract_signals

# Point at a directory containing result.json + trajectory.json
signals = extract_signals("path/to/trial_dir")
print(signals["reward"], signals["tool_calls_total"])

3. Generate a reliability report

import json
from agent_diagnostics import generate_report

with open("annotations.json") as f:
    annotations = json.load(f)

md_path, json_path = generate_report(annotations, output_dir="reports/")
print(f"Report: {md_path}")
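
Before generating a full report, it can help to eyeball category frequencies. The sketch below tallies category names across annotations. It assumes each annotation is a dict with a "categories" list whose entries carry a "name" key; that shape is a guess for illustration, so check it against your actual annotations.json.

```python
from collections import Counter


def category_counts(annotations: list[dict]) -> Counter:
    """Count how often each category name appears across annotations."""
    counts: Counter = Counter()
    for ann in annotations:
        # "categories" and "name" are assumed keys, not a documented schema
        for cat in ann.get("categories", []):
            counts[cat["name"]] += 1
    return counts


# Hand-written sample data in the assumed shape
sample = [
    {"task_id": "t1", "categories": [{"name": "retrieval_failure"}]},
    {"task_id": "t2", "categories": [{"name": "retrieval_failure"},
                                     {"name": "query_churn"}]},
    {"task_id": "t3", "categories": []},
]

counts = category_counts(sample)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```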

CLI

The package ships a CLI with 8 subcommands:

# Extract signals from trial directories
observatory extract --runs-dir runs/_raw --output signals.json

# Heuristic annotation
observatory annotate --signals signals.json --output heuristic.json

# LLM annotation (sample of trials)
observatory llm-annotate --signals signals.json --output llm.json \
    --sample-size 50 --model haiku --backend claude-code

# Train a classifier from LLM labels
observatory train --labels llm.json --signals signals.json --output model.json

# Predict with trained classifier
observatory predict --model model.json --signals signals.json --output predictions.json

# Two-tier ensemble (heuristic + classifier)
observatory ensemble --signals signals.json --model model.json --output ensemble.json

# Generate Markdown + JSON report
observatory report --annotations ensemble.json --output reports/

# Validate annotations against schema + taxonomy
observatory validate --annotations ensemble.json

Or via python -m:

python -m agent_diagnostics --help
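
The subcommands chain naturally into a pipeline. The sketch below drives the heuristic-only path (extract, annotate, validate, report) from Python. It uses only the flags shown above; the pipeline_commands and run_pipeline helpers and the out/ layout are illustrative, and feeding heuristic.json to validate and report assumes it shares the format of ensemble.json.

```python
import shlex
import subprocess


def pipeline_commands(runs_dir: str, out_dir: str) -> list[list[str]]:
    """Build the argv lists for a heuristic-only run."""
    signals = f"{out_dir}/signals.json"
    heuristic = f"{out_dir}/heuristic.json"
    return [
        ["observatory", "extract", "--runs-dir", runs_dir, "--output", signals],
        ["observatory", "annotate", "--signals", signals, "--output", heuristic],
        ["observatory", "validate", "--annotations", heuristic],
        ["observatory", "report", "--annotations", heuristic,
         "--output", f"{out_dir}/reports/"],
    ]


def run_pipeline(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(cmd, check=True)


commands = pipeline_commands("runs/_raw", "out")
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Pass dry_run=False once the printed commands look right.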

Taxonomy

23 categories across three polarities:

Polarity  Count  Examples
failure   15     retrieval_failure, query_churn, decomposition_failure, edit_verify_loop_failure
success   5      success_via_code_nav, success_via_semantic_search, success_via_decomposition
neutral   3      rate_limited_run, task_ambiguity, insufficient_provenance

from agent_diagnostics import load_taxonomy, valid_category_names

taxonomy = load_taxonomy()  # v1 (flat) by default
names = valid_category_names()  # set of 23 category name strings

Both v1 (flat) and v2 (hierarchical by dimension) formats are bundled.
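
A common use of the name set is catching typos in hand-written or LLM-produced labels. The sketch below is self-contained: KNOWN_NAMES is a stand-in holding only the category names listed in this section, and unknown_categories is a hypothetical helper. In real use you would pass the result of valid_category_names() instead.

```python
# Stand-in for valid_category_names(); the real set has 23 names,
# only those quoted in this README are listed here.
KNOWN_NAMES = {
    "retrieval_failure", "query_churn", "decomposition_failure",
    "edit_verify_loop_failure", "success_via_code_nav",
    "success_via_semantic_search", "success_via_decomposition",
    "rate_limited_run", "task_ambiguity", "insufficient_provenance",
}


def unknown_categories(assigned, valid):
    """Return the assigned names that are not in the valid set, sorted."""
    return sorted(set(assigned) - set(valid))


labels = ["retrieval_failure", "query_chrun", "task_ambiguity"]  # note the typo
bad = unknown_categories(labels, KNOWN_NAMES)
print(bad)
```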

Architecture

agent_diagnostics/
  taxonomy.py        # Taxonomy loader (v1/v2 YAML)
  types.py           # TrialSignals TypedDict, CategoryAssignment, Annotation
  tool_registry.py   # Injectable tool name registry (DEFAULT_REGISTRY)
  signals.py         # Extract signals from trial directories
  annotator.py       # 23-rule heuristic annotator
  classifier.py      # Pure-Python logistic regression (no numpy)
  ensemble.py        # Two-tier ensemble (heuristic + classifier)
  calibrate.py       # Agreement analysis, Cohen's kappa
  blend_labels.py    # LLM + heuristic label blending
  llm_annotator.py   # LLM annotation (claude-code + API backends)
  report.py          # Markdown + JSON report generator
  cli.py             # CLI entrypoint
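
For readers unfamiliar with the agreement metric mentioned for calibrate.py, here is a minimal pure-Python sketch of Cohen's kappa between two label sequences. It is a textbook formulation, not the module's code; treating each category as a 0/1 "assigned or not" label per trial is an assumption about how one might compare heuristic and LLM annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# 1 = category assigned to the trial, 0 = not assigned (made-up data)
heuristic = [1, 1, 0, 0, 1, 0, 1, 0]
llm = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(heuristic, llm)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 6 of 8 trials (0.75) against a chance agreement of 0.5, giving a kappa of 0.5.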

Custom Tool Registry

By default, tool classification uses Claude Code + Sourcegraph MCP tool names. For other agent harnesses, pass a custom ToolRegistry:

from agent_diagnostics import annotate_trial, ToolRegistry

my_registry = ToolRegistry(
    search_tools=frozenset({"grep", "find", "rg"}),
    edit_tools=frozenset({"file_write", "patch"}),
    code_nav_tools=frozenset({"goto_def", "find_refs"}),
    semantic_search_tools=frozenset({"semantic_search"}),
)

# signals is a TrialSignals dict, e.g. the one built in Quick Start
categories = annotate_trial(signals, tool_registry=my_registry)
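
To show what a registry makes possible, the sketch below buckets a sequence of tool names into per-kind counts. SketchRegistry is a hypothetical stand-in that mirrors the four ToolRegistry fields shown above, and count_tool_calls is illustrative; how the package itself derives signals from the registry is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRegistry:
    """Hypothetical stand-in mirroring the ToolRegistry fields shown above."""
    search_tools: frozenset = frozenset()
    edit_tools: frozenset = frozenset()
    code_nav_tools: frozenset = frozenset()
    semantic_search_tools: frozenset = frozenset()


def count_tool_calls(tool_names, registry):
    """Bucket each tool call by the registry set that contains its name."""
    counts = {"search": 0, "edit": 0, "code_nav": 0,
              "semantic_search": 0, "other": 0}
    for name in tool_names:
        if name in registry.search_tools:
            counts["search"] += 1
        elif name in registry.edit_tools:
            counts["edit"] += 1
        elif name in registry.code_nav_tools:
            counts["code_nav"] += 1
        elif name in registry.semantic_search_tools:
            counts["semantic_search"] += 1
        else:
            counts["other"] += 1  # e.g. shell commands the registry ignores
    return counts


registry = SketchRegistry(
    search_tools=frozenset({"grep", "find", "rg"}),
    edit_tools=frozenset({"file_write", "patch"}),
    code_nav_tools=frozenset({"goto_def", "find_refs"}),
    semantic_search_tools=frozenset({"semantic_search"}),
)
tool_counts = count_tool_calls(["rg", "rg", "goto_def", "patch", "bash"], registry)
print(tool_counts)
```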

Extending Signal Extraction

For benchmark-specific metadata (suite detection, model resolution), inject mappings and callables:

from agent_diagnostics import extract_signals

trial_dir = "path/to/trial_dir"  # directory containing result.json + trajectory.json

signals = extract_signals(
    trial_dir,
    suite_mapping={"csb_swebench": "swebench_lite"},
    benchmark_resolver=lambda path: "my_benchmark",
    model_keywords={"haiku": "claude-haiku-4-5"},
)
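
One plausible reading of model_keywords is a keyword-to-canonical-name lookup applied to a run path or raw model string. The sketch below implements that reading with case-insensitive substring matching. The matching rule, the resolve_model helper, and the "unknown" fallback are all assumptions, not documented behavior of extract_signals.

```python
def resolve_model(raw_name, model_keywords, default="unknown"):
    """Return the canonical name for the first keyword found in raw_name."""
    lowered = raw_name.lower()
    for keyword, canonical in model_keywords.items():
        # Assumed rule: case-insensitive substring match, first hit wins
        if keyword.lower() in lowered:
            return canonical
    return default


keywords = {"haiku": "claude-haiku-4-5"}
resolved = resolve_model("runs/Haiku_baseline/trial_001", keywords)
fallback = resolve_model("runs/other_model/trial_001", keywords)
print(resolved, fallback)
```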

License

Apache-2.0
