Agent Reliability Observatory — a behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail.
Agent Diagnostics
A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.
11,995 trials. 4 models. 61 benchmarks. 40 failure categories across 11 dimensions.
What this does
Coding agents can pass benchmarks for the wrong reasons and fail them for the wrong reasons. Pass/fail scores hide reward hacking, flawed tests, and lucky patches. This project extracts structured signals from agent trajectories, classifies failure modes, and provides a queryable dataset backend so you can see what actually happened in each trial.
Trial directories (result.json + trajectory.json)
  -> agent-diagnostics ingest         Extract 31 structured signals per trial
  -> agent-diagnostics annotate       Heuristic failure classification
  -> agent-diagnostics llm-annotate   LLM-assisted classification
  -> agent-diagnostics ensemble       Heuristic + classifier ensemble
  -> agent-diagnostics export         Parquet + MANIFEST.json share artifact
  -> agent-diagnostics query          SQL via DuckDB, zero server
Install
pip install agent-diagnostics
Includes DuckDB, PyArrow, Anthropic SDK, and jsonschema. For development:
pip install "agent-diagnostics[dev]" # pytest, ruff, mypy, coverage (quotes keep shells such as zsh from expanding the brackets)
Dataset
The current corpus covers 4 Claude models across 61 benchmark suites:
| Model | Trials | Pass rate |
|---|---|---|
| Claude Haiku 4.5 | 6,443 | 79.1% |
| Claude Sonnet 4.6 | 4,564 | 73.2% |
| Claude Opus 4.6 | 677 | 84.5% |
| Claude Opus 4.5 | 253 | 71.9% |
Each trial carries 31 structured signals including tool call sequences, files read/edited, duration, error counts, patch size, and a stable content-addressed trial_id.
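A stable, content-addressed identifier can be obtained by hashing a canonical serialization of the fields that define a trial. The sketch below only illustrates that idea: the function name `compute_trial_id` and the choice of identifying fields are hypothetical, and the package's own derivation may differ.

```python
import hashlib
import json


def compute_trial_id(identity: dict) -> str:
    """Hash a canonical JSON form of the identifying fields (hypothetical scheme)."""
    # sort_keys plus fixed separators make the serialization independent of
    # dict insertion order, so the same content always yields the same ID.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same fields in a different order produce the same identifier.
id_a = compute_trial_id({"model": "haiku", "benchmark": "demo", "task": "t1"})
id_b = compute_trial_id({"task": "t1", "benchmark": "demo", "model": "haiku"})
id_c = compute_trial_id({"task": "t2", "benchmark": "demo", "model": "haiku"})
```

Because the ID depends only on content, re-ingesting the same trial directory should not create a second row under this scheme.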
Query the dataset
# Pass rates by model
agent-diagnostics query "SELECT model, count(*) as trials,
round(avg(CASE WHEN passed THEN 1.0 ELSE 0.0 END)*100, 1) as pass_rate
FROM signals GROUP BY model ORDER BY pass_rate DESC"
# Failure analysis
agent-diagnostics query "SELECT model, count(*) FROM signals WHERE passed = false GROUP BY model"
# Run any of the 5 committed queries
agent-diagnostics query "$(cat docs/queries/tool_sequence_patterns.sql)"
agent-diagnostics query "$(cat docs/queries/per_model_outcomes.sql)"
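For readers who want to check a result without the CLI, the first query above can be approximated in plain Python over JSONL lines. This is a sketch, not the package's query engine: it assumes each row exposes the `model` and `passed` fields used in the SQL, and `pass_rates` is a hypothetical helper name.

```python
import json
from collections import defaultdict


def pass_rates(jsonl_lines):
    """Return {model: (trials, pass_rate_percent)} from signals JSONL lines."""
    counts = defaultdict(lambda: [0, 0])  # model -> [trials, passes]
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        counts[row["model"]][0] += 1
        if row["passed"]:
            counts[row["model"]][1] += 1
    # Mirror the SQL: percentage rounded to one decimal place.
    return {m: (n, round(100.0 * p / n, 1)) for m, (n, p) in counts.items()}


# Tiny made-up sample, not real corpus data.
sample = [
    '{"model": "a", "passed": true}',
    '{"model": "a", "passed": false}',
    '{"model": "b", "passed": true}',
]
rates = pass_rates(sample)
```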
Export as Parquet
# 21.87 MB JSONL -> ~1 MB zstd Parquet
agent-diagnostics export --format parquet --out data/export/
# Readable in pandas, Polars, R, DuckDB, any Arrow tool
python3 -c "import pandas as pd; print(pd.read_parquet('data/export/signals.parquet').shape)"
The export produces signals.parquet, annotations.parquet, manifests.parquet, and a MANIFEST.json with schema version, taxonomy version, row counts, SHA256 checksums, and source commit.
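Since MANIFEST.json records SHA256 checksums, a consumer can verify a downloaded export before trusting it. The sketch below assumes a manifest layout of `{"files": {name: {"sha256": ...}}}`; that layout and the helper names are hypothetical, so adapt the key lookups to the real file.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA256 so large exports need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_export(export_dir: Path) -> dict:
    """Return {filename: bool}; the "files"/"sha256" keys are an assumed layout."""
    manifest = json.loads((export_dir / "MANIFEST.json").read_text())
    return {
        name: sha256_of(export_dir / name) == meta["sha256"]
        for name, meta in manifest["files"].items()
    }


# Self-contained demo using a stand-in file instead of a real export.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "signals.parquet"
    target.write_bytes(b"demo bytes")
    manifest = {"files": {"signals.parquet": {"sha256": sha256_of(target)}}}
    (root / "MANIFEST.json").write_text(json.dumps(manifest))
    ok = verify_export(root)
    target.write_bytes(b"tampered")  # simulate corruption after export
    bad = verify_export(root)
```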
Schema introspection
agent-diagnostics db schema --format markdown
agent-diagnostics db schema --format json
Taxonomy
40 categories across 11 behavioral dimensions (v3):
| Dimension | Categories | Examples |
|---|---|---|
| Retrieval | 3 | retrieval_failure, query_churn, context_window_overflow |
| ToolUse | 4 | wrong_tool_selection, tool_argument_error, tool_misinterpretation |
| Reasoning | 3 | decomposition_failure, incorrect_root_cause, overconfident_diagnosis |
| Execution | 5 | edit_verify_loop_failure, syntax_error_loop, incomplete_implementation |
| Environment | 4 | exception_crash, rate_limited_run, environment_mismatch |
| Faithfulness | 2 | task_misunderstanding, scope_drift |
| Metacognition | 5 | premature_submission, excessive_exploration, sunk_cost_persistence |
| Integrity | 2 | test_file_modification, reward_hacking |
| Safety | 3 | data_exfiltration_attempt, sandbox_escape, destructive_operation |
| Strategy | 6 | success_via_code_nav, success_via_semantic_search, success_via_decomposition |
| Observability | 3 | insufficient_provenance, task_ambiguity, unreproducible_result |
from agent_diagnostics import load_taxonomy, valid_category_names
taxonomy = load_taxonomy()
names = valid_category_names()
Annotation pipeline
Heuristic annotation
23 rule-based classifiers that fire on signal patterns (e.g., retrieval_failure when search calls = 0 and files read = 0):
agent-diagnostics annotate --signals data/signals.json --output heuristic.json
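The retrieval_failure example above can be pictured as a small predicate over a trial's signals. This is a minimal sketch under stated assumptions: the signal keys `search_calls` and `files_read` (as counts), the confidence value, and the function name are all hypothetical, not the package's actual rule definitions.

```python
from typing import Optional


def retrieval_failure_rule(signals: dict) -> Optional[dict]:
    """Fire when the agent neither searched nor read any file."""
    # Key names are assumptions for this sketch; real signal names may differ.
    if signals["search_calls"] == 0 and signals["files_read"] == 0:
        return {
            "category_name": "retrieval_failure",
            "confidence": 0.9,  # illustrative value only
            "evidence": "no search calls and no files read",
        }
    return None  # the rule does not apply to this trial


fired = retrieval_failure_rule({"search_calls": 0, "files_read": 0})
quiet = retrieval_failure_rule({"search_calls": 3, "files_read": 1})
```

Keeping each rule a pure function of the signals makes it straightforward to unit-test rules in isolation.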
LLM annotation
Reads actual trajectories and classifies with Claude. Supports claude-code, api, and batch (Message Batches API, 50% cheaper) backends:
agent-diagnostics llm-annotate --signals data/signals.json --output llm.json \
--sample-size 50 --model haiku --backend batch
Ensemble (heuristic + classifier)
Two-tier: heuristic rules for structural categories, trained classifier for learned categories:
agent-diagnostics train --labels llm.json --signals signals.json --output model.json
agent-diagnostics ensemble --signals signals.json --model model.json --output ensemble.json
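The two-tier idea can be sketched as a merge in which heuristics own one set of categories and the classifier owns the rest. Everything here is an assumption for illustration: which categories count as structural, the 0.5 threshold, and the function name `two_tier_merge` are not taken from the package.

```python
def two_tier_merge(heuristic: dict, classifier: dict, structural: set,
                   threshold: float = 0.5) -> dict:
    """Merge per-category scores for one trial.

    Heuristic output is trusted for structural categories; classifier
    probabilities are used for all other categories above the threshold.
    """
    merged = {}
    for name, score in heuristic.items():
        if name in structural:
            merged[name] = score
    for name, prob in classifier.items():
        if name not in structural and prob >= threshold:
            merged[name] = prob
    return merged


# Made-up scores for a single trial.
result = two_tier_merge(
    heuristic={"exception_crash": 1.0, "scope_drift": 0.6},
    classifier={"scope_drift": 0.8, "premature_submission": 0.3, "exception_crash": 0.1},
    structural={"exception_crash"},  # hypothetical split
)
```

In this sketch the classifier's low score for exception_crash is ignored because that category is assigned to the heuristic tier.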
Annotation store
All annotation writers can route through a shared AnnotationStore that enforces primary key uniqueness, atomic writes, and version consistency:
agent-diagnostics annotate --signals data/signals.json --output heuristic.json \
--annotations-out data/annotations.jsonl
agent-diagnostics ensemble --signals data/signals.json --model model.json \
--output ensemble.json --annotations-out data/annotations.jsonl
The store uses PK (trial_id, category_name, annotator_type, annotator_identity, taxonomy_version) so multiple annotators (heuristic, LLM, classifier, ensemble, human) can label the same trial without collision.
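To see why this composite key avoids collisions, consider a toy in-memory store keyed on the same five fields. This is not the package's AnnotationStore: the class name is hypothetical, there is no file locking, and the choice to let a later write replace an earlier one with the same key is an assumption (the real store may reject duplicates instead).

```python
PK_FIELDS = (
    "trial_id",
    "category_name",
    "annotator_type",
    "annotator_identity",
    "taxonomy_version",
)


class InMemoryAnnotationStore:
    """Toy stand-in that keeps exactly one row per primary key."""

    def __init__(self):
        self._rows = {}

    def upsert(self, row: dict) -> None:
        key = tuple(row[field] for field in PK_FIELDS)
        self._rows[key] = row  # same key replaces, different key coexists

    def __len__(self) -> int:
        return len(self._rows)


base = {"trial_id": "t1", "category_name": "scope_drift", "taxonomy_version": "3.0.0"}
store = InMemoryAnnotationStore()
store.upsert({**base, "annotator_type": "heuristic",
              "annotator_identity": "heuristic:rule-engine", "confidence": 0.6})
store.upsert({**base, "annotator_type": "llm",
              "annotator_identity": "llm:haiku-4", "confidence": 0.8})
# Re-running the heuristic annotator updates its row instead of adding one.
store.upsert({**base, "annotator_type": "heuristic",
              "annotator_identity": "heuristic:rule-engine", "confidence": 0.7})
```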
CLI reference
agent-diagnostics extract            Extract signals from trial directories
agent-diagnostics ingest             Filter -> extract -> enrich -> write JSONL pipeline
agent-diagnostics annotate           Heuristic annotation
agent-diagnostics llm-annotate       LLM-assisted annotation
agent-diagnostics train              Train per-category classifiers
agent-diagnostics predict            Predict with trained classifier
agent-diagnostics ensemble           Two-tier ensemble annotation
agent-diagnostics report             Generate Markdown + JSON report
agent-diagnostics validate           Validate annotations against schema
agent-diagnostics query              Run SQL against the dataset (DuckDB)
agent-diagnostics export             Export to Parquet with MANIFEST.json
agent-diagnostics manifest refresh   Rewrite manifests.jsonl
agent-diagnostics db schema          Inspect table schemas
Data formats
signals.jsonl
One JSON object per line. 31 fields per trial including trial_id (stable SHA256-based), model, benchmark, reward, pass/fail, tool call counts/sequences, files read/edited, duration, error counts, and patch size.
annotations.jsonl
Narrow-tall schema — one row per (trial, category, annotator):
| Column | Description |
|---|---|
| trial_id | SHA256-based stable identifier |
| category_name | Taxonomy category (e.g., retrieval_failure) |
| confidence | 0.0 to 1.0 |
| evidence | Free-text explanation |
| annotator_type | heuristic, llm, classifier, ensemble, or human |
| annotator_identity | e.g., heuristic:rule-engine, llm:haiku-4 |
| taxonomy_version | e.g., 3.0.0 |
| annotated_at | ISO 8601 timestamp |
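A narrow-tall table is easy to append to but often needs regrouping for analysis. The sketch below collects, per trial, the categories assigned at or above a confidence cutoff. The column names come from the table above; the helper name `categories_by_trial` and the 0.5 cutoff are assumptions.

```python
from collections import defaultdict


def categories_by_trial(rows, min_confidence=0.5):
    """Group narrow-tall annotation rows into {trial_id: sorted category names}."""
    grouped = defaultdict(set)
    for row in rows:
        if row["confidence"] >= min_confidence:
            # A set merges duplicates when several annotators agree.
            grouped[row["trial_id"]].add(row["category_name"])
    return {trial: sorted(names) for trial, names in grouped.items()}


# Made-up rows using a subset of the annotation columns.
rows = [
    {"trial_id": "t1", "category_name": "scope_drift", "confidence": 0.8},
    {"trial_id": "t1", "category_name": "scope_drift", "confidence": 0.6},
    {"trial_id": "t1", "category_name": "query_churn", "confidence": 0.2},
    {"trial_id": "t2", "category_name": "reward_hacking", "confidence": 0.9},
]
wide = categories_by_trial(rows)
```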
Parquet export
agent-diagnostics export produces zstd-compressed Parquet with native list<string> columns. The 21.87 MB JSONL corpus compresses to ~1 MB. Includes MANIFEST.json for provenance.
Architecture
agent_diagnostics/
  signals.py            Signal extraction + trial_id computation (31 fields)
  types.py              TrialSignals TypedDict, CategoryAssignment, Annotation
  annotator.py          23-rule heuristic annotator
  classifier.py         Pure-Python logistic regression (no numpy)
  ensemble.py           Two-tier ensemble (heuristic + classifier)
  llm_annotator.py      LLM annotation (claude-code, API, batch backends)
  annotation_store.py   Narrow-tall JSONL store with PK enforcement + flock
  model_identity.py     Logical annotator identity resolution via models.yaml
  query.py              DuckDB query engine (JSONL + Parquet)
  export.py             Parquet export with MANIFEST.json
  report.py             Markdown + JSON report generator
  calibrate.py          Agreement analysis, Cohen's kappa
  blend_labels.py       LLM + heuristic label blending
  taxonomy.py           Taxonomy loader (v1/v2/v3 YAML)
  tool_registry.py      Injectable tool name registry
  cli.py                CLI entrypoint (14 subcommands)
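The agreement analysis mentioned for calibrate.py relies on Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The function below is a generic textbook sketch, not the module's code; the name `cohens_kappa` and the handling of the degenerate case are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Did each annotator assign a given category to four trials?
heuristic_says = [True, True, False, False]
llm_says = [True, False, False, False]
kappa = cohens_kappa(heuristic_says, llm_says)
```

Here observed agreement is 0.75 and chance agreement is 0.5, so kappa works out to 0.5.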
Contributing
We welcome contributions of agent trace data, new benchmark integrations, taxonomy refinements, and annotation tooling. If you're building evaluation infrastructure for coding agents, we'd love to talk.
License
Apache-2.0
File details
Details for the file agent_diagnostics-0.6.2.tar.gz.
File metadata
- Download URL: agent_diagnostics-0.6.2.tar.gz
- Upload date:
- Size: 237.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7b57cc61a7540a9bb0fd1e00d686d0956fac4801b32f83b63945ca630271b8c4 |
| MD5 | dfa98e530ced07266f9ade3b2e3d0f66 |
| BLAKE2b-256 | 5f27636434387dba9b4dc7a407283ebac04d7cf72f58f39c0c4da76769e535ad |
File details
Details for the file agent_diagnostics-0.6.2-py3-none-any.whl.
File metadata
- Download URL: agent_diagnostics-0.6.2-py3-none-any.whl
- Upload date:
- Size: 120.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cd43e1258f977c164e9c5cdad52dad2b8ed87a0644d3eaa68c7828f6dd155334 |
| MD5 | c22fb4396ef618f7ec0e4746ba70b8e4 |
| BLAKE2b-256 | 93b4839c8b0cd8266df3d90fdc2f86d0a24e2b387aa8bc9fcb46583f90eaaebb |