Agent Reliability Observatory — a behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail.

Agent Diagnostics

A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.

11,995 trials. 4 models. 61 benchmarks. 40 failure categories across 11 dimensions.

What this does

Coding agents can pass benchmarks for the wrong reasons and fail them for the wrong reasons: pass/fail scores hide reward hacking, flawed tests, and lucky patches. This project extracts structured signals from agent trajectories, classifies failure modes, and provides a queryable dataset backend so you can see why a trial passed or failed.

Install

pip install agent-diagnostics

Quick start: label your own traces

Point the tool at a directory of agent trial outputs. Each trial needs a result.json (and optionally a trajectory.json) in its own directory:

my-runs/
  trial-001/
    result.json          # must have: task_id or task_name, reward/score
    trajectory.json      # optional: list of agent steps with tool calls
  trial-002/
    result.json
    agent/
      trajectory.json    # also checks agent/ subdirectory
  ...
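
Before ingesting, the layout above can be sanity-checked with a few lines of Python. This is a minimal sketch, not the tool's own discovery logic: the helper names find_trajectory and discover_trials are hypothetical, and it assumes trial directories sit directly under the runs directory.

```python
from pathlib import Path
from typing import Optional


def find_trajectory(trial_dir: Path) -> Optional[Path]:
    """Return the trajectory path for a trial, or None if it has none."""
    candidates = (
        trial_dir / "trajectory.json",            # trial root
        trial_dir / "agent" / "trajectory.json",  # agent/ subdirectory
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def discover_trials(runs_dir: Path) -> dict[str, Optional[Path]]:
    """Map each trial directory that has a result.json to its trajectory path."""
    trials = {}
    for child in sorted(runs_dir.iterdir()):
        if child.is_dir() and (child / "result.json").is_file():
            trials[child.name] = find_trajectory(child)
    return trials
```

Calling discover_trials(Path("my-runs")) returns one entry per usable trial directory. A value of None means the trial has a result.json but no trajectory.json in either location.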

Then run the pipeline:

# Step 1: Extract signals from all trial directories into JSONL
agent-diagnostics ingest --runs-dir my-runs/ --output data/signals.jsonl

# Step 2: Classify failure modes (heuristic rules, instant)
agent-diagnostics annotate --signals data/signals.jsonl --output data/heuristic.json

# Step 3: LLM-assisted classification (reads actual trajectories)
agent-diagnostics llm-annotate --signals data/signals.jsonl --output data/llm.json \
    --sample-size 50 --model haiku --backend batch

# Step 4: Query results with SQL
agent-diagnostics query "SELECT model, count(*) as n,
  round(avg(CASE WHEN passed THEN 1.0 ELSE 0.0 END)*100, 1) as pass_rate
  FROM signals GROUP BY model ORDER BY pass_rate DESC"

What result.json looks like

The tool reads standard benchmark harness output. At minimum:

{
  "task_name": "django__django-16527",
  "reward": 1.0,
  "agent_info": { "name": "claude-code" },
  "started_at": "2026-01-15T10:00:00Z",
  "finished_at": "2026-01-15T10:04:32Z"
}

Works out of the box with SWE-bench, OpenHands, and similar harnesses that write result.json per trial.
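
To check a result.json by hand, a sketch along these lines may help. It assumes the requirements stated above (task_id or task_name, plus reward or score) and treats the two timestamps as optional. The function names are hypothetical, and deriving a duration from the timestamps is an assumption about how such a file could be summarized, not a description of the tool's internals.

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def summarize_result(result: dict) -> dict:
    """Check the minimum fields and compute wall-clock duration when possible."""
    task = result.get("task_id") or result.get("task_name")
    if task is None:
        raise ValueError("result.json needs task_id or task_name")
    reward = result.get("reward", result.get("score"))
    if reward is None:
        raise ValueError("result.json needs reward or score")
    duration = None
    if "started_at" in result and "finished_at" in result:
        delta = parse_utc(result["finished_at"]) - parse_utc(result["started_at"])
        duration = delta.total_seconds()
    return {"task": task, "reward": float(reward), "duration_seconds": duration}


# The minimal example from above.
example = {
    "task_name": "django__django-16527",
    "reward": 1.0,
    "agent_info": {"name": "claude-code"},
    "started_at": "2026-01-15T10:00:00Z",
    "finished_at": "2026-01-15T10:04:32Z",
}
summary = summarize_result(example)
```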

What you get back

signals.jsonl — one row per trial with 31 structured fields:

{
  "trial_id": "726c23ceb1ce7cf2...",
  "task_id": "django__django-16527",
  "model": "claude-sonnet-4-6",
  "reward": 1.0,
  "passed": true,
  "total_turns": 57,
  "tool_calls_total": 32,
  "search_tool_calls": 8,
  "edit_tool_calls": 12,
  "unique_files_read": 5,
  "unique_files_edited": 2,
  "duration_seconds": 272.0,
  "exception_crashed": false,
  "tool_call_sequence": ["read_file", "search", "edit_file", "..."],
  "..."
}
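
Because signals.jsonl is plain JSONL, it can also be read without the CLI. The sketch below computes a per-model pass rate using only the model and passed fields shown above. The function name and the model-a/model-b rows are made up for illustration.

```python
import json
from collections import defaultdict


def pass_rate_by_model(lines) -> dict[str, float]:
    """Return {model: pass rate in percent} from an iterable of JSONL lines."""
    counts = defaultdict(lambda: [0, 0])  # model -> [passed, total]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if row["passed"]:
            counts[row["model"]][0] += 1
        counts[row["model"]][1] += 1
    return {
        model: round(100.0 * passed / total, 1)
        for model, (passed, total) in counts.items()
    }


# Illustrative rows only; real rows carry all 31 fields.
sample = [
    '{"model": "model-a", "passed": true}',
    '{"model": "model-a", "passed": false}',
    '{"model": "model-b", "passed": true}',
]
rates = pass_rate_by_model(sample)
```

On a real file, pass an open file handle instead of the sample list, for example pass_rate_by_model(open("data/signals.jsonl")).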

Included dataset

The repo ships a Parquet export of 11,995 trials in data/export/ (~1.5 MB):

Model              Trials  Pass rate
Claude Haiku 4.5    6,443      79.1%
Claude Sonnet 4.6   4,564      73.2%
Claude Opus 4.6       677      84.5%
Claude Opus 4.5       253      71.9%

Query it immediately after cloning:

agent-diagnostics query "SELECT model, count(*) as trials FROM signals GROUP BY model"

# Or load directly
python3 -c "import pandas as pd; print(pd.read_parquet('data/export/signals.parquet').describe())"

Pre-built queries

Five analysis queries are in docs/queries/:

agent-diagnostics query "$(cat docs/queries/per_model_outcomes.sql)"
agent-diagnostics query "$(cat docs/queries/benchmark_model_matrix.sql)"
agent-diagnostics query "$(cat docs/queries/annotation_cooccurrence.sql)"
agent-diagnostics query "$(cat docs/queries/tool_sequence_patterns.sql)"
agent-diagnostics query "$(cat docs/queries/eval_subset_export.sql)"

Export your own Parquet

agent-diagnostics export --format parquet --out data/export/

Produces zstd-compressed Parquet with native list<string> columns, plus MANIFEST.json with schema version, row counts, SHA256 checksums, and source commit.
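
A consumer of the export might want to re-check those SHA256 checksums. The layout of MANIFEST.json is not shown here, so this sketch takes the expected digests as a plain {file name: hex digest} mapping, which is an assumption. The function names are hypothetical too.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(export_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return the file names whose SHA256 differs from the expected digest."""
    return [
        name
        for name, digest in expected.items()
        if sha256_of(export_dir / name) != digest.lower()
    ]
```

An empty list from mismatched_files means every listed file matches its recorded digest.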

Schema introspection

agent-diagnostics db schema --format markdown
agent-diagnostics db schema --format json

Taxonomy

40 categories across 11 behavioral dimensions (v3):

Dimension      Categories  Examples
Retrieval      3           retrieval_failure, query_churn, context_window_overflow
ToolUse        4           wrong_tool_selection, tool_argument_error, tool_misinterpretation
Reasoning      3           decomposition_failure, incorrect_root_cause, overconfident_diagnosis
Execution      5           edit_verify_loop_failure, syntax_error_loop, incomplete_implementation
Environment    4           exception_crash, rate_limited_run, environment_mismatch
Faithfulness   2           task_misunderstanding, scope_drift
Metacognition  5           premature_submission, excessive_exploration, sunk_cost_persistence
Integrity      2           test_file_modification, reward_hacking
Safety         3           data_exfiltration_attempt, sandbox_escape, destructive_operation
Strategy       6           success_via_code_nav, success_via_semantic_search, success_via_decomposition
Observability  3           insufficient_provenance, task_ambiguity, unreproducible_result

from agent_diagnostics import load_taxonomy, valid_category_names

taxonomy = load_taxonomy()
names = valid_category_names()
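
As a quick consistency check on the table above, the per-dimension counts can be summed. The numbers below are copied by hand from the table rather than read from load_taxonomy(), whose return shape is not documented here.

```python
# Category counts per dimension, copied from the taxonomy table (v3).
categories_per_dimension = {
    "Retrieval": 3,
    "ToolUse": 4,
    "Reasoning": 3,
    "Execution": 5,
    "Environment": 4,
    "Faithfulness": 2,
    "Metacognition": 5,
    "Integrity": 2,
    "Safety": 3,
    "Strategy": 6,
    "Observability": 3,
}

total_dimensions = len(categories_per_dimension)
total_categories = sum(categories_per_dimension.values())
print(total_dimensions, total_categories)
```

The totals agree with the headline figures: 11 dimensions and 40 categories.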

Annotation pipeline

Heuristic annotation

23 rule-based classifiers that fire on signal patterns (e.g., retrieval_failure when search calls = 0 and files read = 0):

agent-diagnostics annotate --signals data/signals.jsonl --output heuristic.json
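
To illustrate what one such rule might look like, here is a minimal sketch of the retrieval_failure example. The function name is hypothetical, the field names come from the signals.jsonl sample, and the package's actual rule may check further conditions (for instance, whether the trial failed).

```python
def looks_like_retrieval_failure(signals: dict) -> bool:
    """Fire when the agent neither searched nor read any file."""
    return signals["search_tool_calls"] == 0 and signals["unique_files_read"] == 0


# Values from the signals.jsonl sample: the agent searched and read files.
sample_row = {"search_tool_calls": 8, "unique_files_read": 5}

# A hypothetical trial where the agent edited blind.
blind_row = {"search_tool_calls": 0, "unique_files_read": 0}

fired = [looks_like_retrieval_failure(row) for row in (sample_row, blind_row)]
```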

LLM annotation

Reads actual trajectories and classifies with Claude. Supports claude-code, api, and batch (Message Batches API, 50% cheaper) backends:

agent-diagnostics llm-annotate --signals data/signals.jsonl --output llm.json \
    --sample-size 50 --model haiku --backend batch

Ensemble (heuristic + classifier)

Two-tier: heuristic rules for structural categories, trained classifier for learned categories:

agent-diagnostics train --labels llm.json --signals data/signals.jsonl --output model.json
agent-diagnostics ensemble --signals data/signals.jsonl --model model.json --output ensemble.json
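
One way to picture the two-tier split is a merge that trusts each source only for its own tier. This is a sketch under stated assumptions: the function name is hypothetical, and which categories count as structural is not specified here, so the split below is invented for illustration.

```python
def ensemble_labels(
    heuristic: set[str], classifier: set[str], structural: frozenset[str]
) -> set[str]:
    """Keep heuristic labels for structural categories, classifier labels for the rest."""
    from_rules = {name for name in heuristic if name in structural}
    from_model = {name for name in classifier if name not in structural}
    return from_rules | from_model


# Assumed split: treat these two Environment categories as structural.
STRUCTURAL = frozenset({"exception_crash", "rate_limited_run"})

merged = ensemble_labels(
    heuristic={"exception_crash", "incorrect_root_cause"},
    classifier={"rate_limited_run", "scope_drift"},
    structural=STRUCTURAL,
)
```

Here the heuristic's incorrect_root_cause and the classifier's rate_limited_run are both dropped, because each falls outside the tier its source is trusted for.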

Annotation store

All annotation writers can route through a shared AnnotationStore that enforces primary key uniqueness, atomic writes, and version consistency:

agent-diagnostics annotate --signals data/signals.jsonl --output heuristic.json \
    --annotations-out data/annotations.jsonl

agent-diagnostics ensemble --signals data/signals.jsonl --model model.json \
    --output ensemble.json --annotations-out data/annotations.jsonl

The store uses PK (trial_id, category_name, annotator_type, annotator_identity, taxonomy_version) so multiple annotators (heuristic, LLM, classifier, ensemble, human) can label the same trial without collision.
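
The effect of that composite key can be sketched with a dictionary keyed by the five PK fields. TinyAnnotationStore is a hypothetical stand-in, not the package's AnnotationStore. In particular, rejecting duplicates with an error is an assumption, since the real store might upsert instead.

```python
PK_FIELDS = (
    "trial_id",
    "category_name",
    "annotator_type",
    "annotator_identity",
    "taxonomy_version",
)


class TinyAnnotationStore:
    """In-memory sketch: one row per primary key, duplicates rejected."""

    def __init__(self) -> None:
        self._rows: dict[tuple, dict] = {}

    def add(self, row: dict) -> None:
        key = tuple(row[field] for field in PK_FIELDS)
        if key in self._rows:
            raise KeyError(f"duplicate annotation: {key}")
        self._rows[key] = row

    def __len__(self) -> int:
        return len(self._rows)


store = TinyAnnotationStore()
heuristic_row = {
    "trial_id": "726c23ceb1ce7cf2...",
    "category_name": "retrieval_failure",
    "annotator_type": "heuristic",
    "annotator_identity": "heuristic:rule-engine",
    "taxonomy_version": "3.0.0",
}
# Same trial and category, different annotator: no collision.
llm_row = {**heuristic_row, "annotator_type": "llm", "annotator_identity": "llm:haiku-4"}
store.add(heuristic_row)
store.add(llm_row)
```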

Data formats

signals.jsonl

One JSON object per line. 31 fields per trial including trial_id (stable SHA256-based), model, benchmark, reward, pass/fail, tool call counts/sequences, files read/edited, duration, error counts, and patch size.

annotations.jsonl

Narrow-tall schema: one row per (trial, category, annotator, taxonomy version):

Column              Description
trial_id            SHA256-based stable identifier
category_name       Taxonomy category (e.g., retrieval_failure)
confidence          0.0 to 1.0
evidence            Free-text explanation
annotator_type      heuristic, llm, classifier, ensemble, or human
annotator_identity  e.g., heuristic:rule-engine, llm:haiku-4
taxonomy_version    e.g., 3.0.0
annotated_at        ISO 8601 timestamp
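
The column constraints listed above can be checked with a short function. This is a sketch, not the logic behind agent-diagnostics validate: row_problems is a hypothetical name, and the evidence text and timestamp in the example row are illustrative values.

```python
from datetime import datetime

ANNOTATOR_TYPES = {"heuristic", "llm", "classifier", "ensemble", "human"}
REQUIRED_COLUMNS = (
    "trial_id", "category_name", "confidence", "evidence",
    "annotator_type", "annotator_identity", "taxonomy_version", "annotated_at",
)


def row_problems(row: dict) -> list[str]:
    """Return human-readable problems; an empty list means the row looks valid."""
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in row]
    if problems:
        return problems
    if not 0.0 <= row["confidence"] <= 1.0:
        problems.append("confidence must be between 0.0 and 1.0")
    if row["annotator_type"] not in ANNOTATOR_TYPES:
        problems.append(f"unknown annotator_type: {row['annotator_type']}")
    try:
        # A trailing 'Z' is rewritten so older Pythons can parse it.
        datetime.fromisoformat(row["annotated_at"].replace("Z", "+00:00"))
    except ValueError:
        problems.append("annotated_at is not an ISO 8601 timestamp")
    return problems


example_row = {
    "trial_id": "726c23ceb1ce7cf2...",
    "category_name": "retrieval_failure",
    "confidence": 0.9,
    "evidence": "search calls = 0 and files read = 0",
    "annotator_type": "heuristic",
    "annotator_identity": "heuristic:rule-engine",
    "taxonomy_version": "3.0.0",
    "annotated_at": "2026-01-15T10:05:00Z",
}
problems = row_problems(example_row)
```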

CLI reference

agent-diagnostics ingest           Ingest trial directories into signals.jsonl
agent-diagnostics extract          Extract signals from a single trial directory
agent-diagnostics annotate         Heuristic annotation
agent-diagnostics llm-annotate     LLM-assisted annotation
agent-diagnostics train            Train per-category classifiers
agent-diagnostics predict          Predict with trained classifier
agent-diagnostics ensemble         Two-tier ensemble annotation
agent-diagnostics report           Generate Markdown + JSON report
agent-diagnostics validate         Validate annotations against schema
agent-diagnostics query            Run SQL against the dataset (DuckDB)
agent-diagnostics export           Export to Parquet with MANIFEST.json
agent-diagnostics manifest refresh Rewrite manifests.jsonl
agent-diagnostics db schema        Inspect table schemas

Contributing

We welcome contributions of agent trace data, new benchmark integrations, taxonomy refinements, and annotation tooling. If you're building evaluation infrastructure for coding agents, we'd love to talk.

License

Apache-2.0
