Agent Reliability Observatory — a behavioral taxonomy and annotation framework for analyzing why coding agents succeed or fail.

Agent Diagnostics

A behavioral taxonomy, annotation framework, and shareable dataset backend for analyzing why coding agents succeed or fail on benchmark tasks.

11,995 trials. 4 models. 61 benchmarks. 40 failure categories across 11 dimensions.

What this does

Coding agents can pass benchmarks for the wrong reasons and fail them for the wrong reasons: pass/fail scores hide reward hacking, flawed tests, and lucky patches. This project extracts structured signals from agent trajectories, classifies failure modes, and provides a queryable dataset backend, so you can see what happened in each trial rather than only whether it passed.

Trial directories (result.json + trajectory.json)
  -> agent-diagnostics ingest        Extract 31 structured signals per trial
  -> agent-diagnostics annotate      Heuristic failure classification
  -> agent-diagnostics llm-annotate  LLM-assisted classification
  -> agent-diagnostics ensemble      Heuristic + classifier ensemble
  -> agent-diagnostics export        Parquet + MANIFEST.json share artifact
  -> agent-diagnostics query         SQL via DuckDB, zero server

Install

pip install agent-diagnostics

Installing the package also pulls in DuckDB, PyArrow, the Anthropic SDK, and jsonschema. For development (the quotes keep shells such as zsh from expanding the brackets):

pip install "agent-diagnostics[dev]"   # pytest, ruff, mypy, coverage

Dataset

The current corpus covers 4 Claude models across 61 benchmark suites:

Model	Trials	Pass rate
Claude Haiku 4.5	6,443	79.1%
Claude Sonnet 4.6	4,564	73.2%
Claude Opus 4.6	677	84.5%
Claude Opus 4.5	253	71.9%

Each trial carries 31 structured signals including tool call sequences, files read/edited, duration, error counts, patch size, and a stable content-addressed trial_id.
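
To illustrate what "content-addressed" means for trial_id, here is a minimal sketch that hashes a canonical JSON encoding of a trial's identifying fields with SHA256. The function name `compute_trial_id` and the fields `benchmark`, `task`, and `model` are assumptions made for this example; the fields the package actually hashes are not documented here.

```python
import hashlib
import json


def compute_trial_id(identity: dict) -> str:
    """Return a SHA256 hex digest of a canonical JSON encoding.

    ASSUMPTION: hypothetical helper, not the package's real implementation.
    Sorting keys and fixing separators makes the encoding independent of
    dict insertion order, so equal content always yields an equal ID.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content in a different key order produces the same ID.
id_a = compute_trial_id({"benchmark": "example-bench", "task": "task-001", "model": "example-model"})
id_b = compute_trial_id({"model": "example-model", "task": "task-001", "benchmark": "example-bench"})
# Different content produces a different ID.
id_c = compute_trial_id({"benchmark": "example-bench", "task": "task-002", "model": "example-model"})

print(id_a == id_b, id_a == id_c)
```

Because the ID depends only on content, re-ingesting the same trial should map to the same row instead of creating a duplicate.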

Query the dataset

# Pass rates by model
agent-diagnostics query "SELECT model, count(*) as trials,
  round(avg(CASE WHEN passed THEN 1.0 ELSE 0.0 END)*100, 1) as pass_rate
  FROM signals GROUP BY model ORDER BY pass_rate DESC"

# Failure analysis
agent-diagnostics query "SELECT model, count(*) FROM signals WHERE passed = false GROUP BY model"

# Run any of the 5 committed queries
agent-diagnostics query "$(cat docs/queries/tool_sequence_patterns.sql)"
agent-diagnostics query "$(cat docs/queries/per_model_outcomes.sql)"
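
For readers less familiar with SQL, the "pass rates by model" query above can be sketched in plain Python over JSONL rows. This is only an illustration of the aggregation, not how the package runs queries (it uses DuckDB). The column names `model` and `passed` come from the query itself; the sample rows and model names are made up.

```python
import json
from collections import defaultdict


def pass_rates(jsonl_lines):
    """Group rows by model; return (model, trials, pass_rate %) sorted by rate, descending."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for line in jsonl_lines:
        row = json.loads(line)
        totals[row["model"]] += 1
        if row["passed"]:
            passes[row["model"]] += 1
    rows = [
        (model, totals[model], round(100.0 * passes[model] / totals[model], 1))
        for model in totals
    ]
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Illustrative rows only; real signals rows carry many more fields.
sample = [
    '{"model": "model-a", "passed": true}',
    '{"model": "model-a", "passed": false}',
    '{"model": "model-b", "passed": true}',
]
result = pass_rates(sample)
print(result)
```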

Export as Parquet

# 21.87 MB JSONL -> ~1 MB zstd Parquet
agent-diagnostics export --format parquet --out data/export/

# Readable in pandas, Polars, R, DuckDB, any Arrow tool
python3 -c "import pandas as pd; print(pd.read_parquet('data/export/signals.parquet').shape)"

The export produces signals.parquet, annotations.parquet, manifests.parquet, and a MANIFEST.json with schema version, taxonomy version, row counts, SHA256 checksums, and source commit.
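
A consumer of the share artifact would typically check the SHA256 checksums before trusting the files. The sketch below shows one way to do that with the standard library. The manifest layout it reads (`{"files": {"<name>": {"sha256": "<hex>"}}}`) is an assumption for illustration; inspect a real MANIFEST.json for the actual key names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_export(export_dir: Path) -> dict:
    """Map each listed file name to True if its digest matches the manifest."""
    manifest = json.loads((export_dir / "MANIFEST.json").read_text())
    # ASSUMPTION: hypothetical layout {"files": {"<name>": {"sha256": "<hex>"}}}
    return {
        name: sha256_of(export_dir / name) == entry["sha256"]
        for name, entry in manifest["files"].items()
    }


# Self-contained demo on a throwaway directory with a stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp)
    data_file = export_dir / "signals.parquet"
    data_file.write_bytes(b"stand-in bytes, not real Parquet")
    manifest = {"files": {"signals.parquet": {"sha256": sha256_of(data_file)}}}
    (export_dir / "MANIFEST.json").write_text(json.dumps(manifest))
    results = verify_export(export_dir)

print(results)
```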

Schema introspection

agent-diagnostics db schema --format markdown
agent-diagnostics db schema --format json

Taxonomy

40 categories across 11 behavioral dimensions (v3):

Dimension	Categories	Examples
Retrieval	3	`retrieval_failure`, `query_churn`, `context_window_overflow`
ToolUse	4	`wrong_tool_selection`, `tool_argument_error`, `tool_misinterpretation`
Reasoning	3	`decomposition_failure`, `incorrect_root_cause`, `overconfident_diagnosis`
Execution	5	`edit_verify_loop_failure`, `syntax_error_loop`, `incomplete_implementation`
Environment	4	`exception_crash`, `rate_limited_run`, `environment_mismatch`
Faithfulness	2	`task_misunderstanding`, `scope_drift`
Metacognition	5	`premature_submission`, `excessive_exploration`, `sunk_cost_persistence`
Integrity	2	`test_file_modification`, `reward_hacking`
Safety	3	`data_exfiltration_attempt`, `sandbox_escape`, `destructive_operation`
Strategy	6	`success_via_code_nav`, `success_via_semantic_search`, `success_via_decomposition`
Observability	3	`insufficient_provenance`, `task_ambiguity`, `unreproducible_result`

from agent_diagnostics import load_taxonomy, valid_category_names

taxonomy = load_taxonomy()
names = valid_category_names()

Annotation pipeline

Heuristic annotation

23 rule-based classifiers that fire on signal patterns (e.g., retrieval_failure when search calls = 0 and files read = 0):

agent-diagnostics annotate --signals data/signals.json --output heuristic.json
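
As a rough picture of what one such rule might look like, the sketch below encodes the retrieval_failure example from the text: fire when a trial made no search calls and read no files. The signal field names `search_tool_calls` and `files_read`, the confidence value, and the return shape are all assumptions for this example, not the package's actual rule engine.

```python
from typing import Optional


def retrieval_failure_rule(signals: dict) -> Optional[dict]:
    """Return an annotation dict if the rule fires, else None.

    ASSUMPTION: `search_tool_calls` and `files_read` are hypothetical
    count fields; the real signal names may differ.
    """
    no_search = signals.get("search_tool_calls", 0) == 0
    no_reads = signals.get("files_read", 0) == 0
    if no_search and no_reads:
        return {
            "category_name": "retrieval_failure",
            "confidence": 0.9,  # illustrative value only
            "evidence": "no search calls and no files read",
        }
    return None


fired = retrieval_failure_rule({"search_tool_calls": 0, "files_read": 0})
quiet = retrieval_failure_rule({"search_tool_calls": 3, "files_read": 5})
print(fired, quiet)
```

Rules of this shape are cheap to run over every trial, which is why they suit categories that are visible directly in the structured signals.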

LLM annotation

Reads the full trajectories and classifies them with Claude. Three backends are supported: claude-code, api, and batch (the Message Batches API, 50% cheaper than standard API calls):

agent-diagnostics llm-annotate --signals data/signals.json --output llm.json \
    --sample-size 50 --model haiku --backend batch

Ensemble (heuristic + classifier)

Two-tier: heuristic rules for structural categories, trained classifier for learned categories:

agent-diagnostics train --labels llm.json --signals data/signals.json --output model.json
agent-diagnostics ensemble --signals data/signals.json --model model.json --output ensemble.json
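
The two-tier idea can be sketched as a routing step: structural categories are decided only by heuristic rules, and every other category is decided only by classifier probability. Everything specific here is an assumption for illustration: the function name, the 0.5 threshold, which categories count as structural, and the `rate_limit_errors` signal field.

```python
def ensemble_annotate(signals, heuristic_rules, classifier_probs, structural, threshold=0.5):
    """Return {category: source}, where source is 'heuristic' or 'classifier'.

    ASSUMPTION: hypothetical sketch, not the package's ensemble code.
    """
    labels = {}
    # Tier 1: heuristic rules decide structural categories.
    for rule in heuristic_rules:
        category = rule(signals)
        if category is not None and category in structural:
            labels[category] = "heuristic"
    # Tier 2: the trained classifier decides the remaining (learned) categories.
    for category, prob in classifier_probs(signals).items():
        if category not in structural and prob >= threshold:
            labels[category] = "classifier"
    return labels


# Illustrative inputs only.
structural = {"rate_limited_run"}
rules = [lambda s: "rate_limited_run" if s["rate_limit_errors"] > 0 else None]
fake_classifier = lambda s: {
    "premature_submission": 0.8,
    "scope_drift": 0.2,
    "rate_limited_run": 0.9,  # ignored: structural categories belong to tier 1
}

labels = ensemble_annotate({"rate_limit_errors": 2}, rules, fake_classifier, structural)
print(labels)
```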

Annotation store

All annotation writers can route through a shared AnnotationStore that enforces primary key uniqueness, atomic writes, and version consistency:

agent-diagnostics annotate --signals data/signals.json --output heuristic.json \
    --annotations-out data/annotations.jsonl

agent-diagnostics ensemble --signals data/signals.json --model model.json \
    --output ensemble.json --annotations-out data/annotations.jsonl

The store's primary key is (trial_id, category_name, annotator_type, annotator_identity, taxonomy_version), so multiple annotators (heuristic, LLM, classifier, ensemble, human) can label the same trial without collision.
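
The effect of that composite key can be shown with an in-memory dict keyed by the five fields. This is a sketch of the idea only: the `upsert` helper is hypothetical, and whether the real AnnotationStore overwrites or rejects a repeated key is not stated here, so the overwrite behavior below is an assumption.

```python
PK_FIELDS = (
    "trial_id",
    "category_name",
    "annotator_type",
    "annotator_identity",
    "taxonomy_version",
)


def upsert(store: dict, row: dict) -> None:
    """Insert a row, replacing any existing row with the same primary key.

    ASSUMPTION: replace-on-conflict is chosen for illustration only.
    """
    key = tuple(row[field] for field in PK_FIELDS)
    store[key] = row


store = {}
base = {"trial_id": "abc123", "category_name": "retrieval_failure", "taxonomy_version": "3.0.0"}

# Two different annotators label the same trial and category: no collision.
upsert(store, {**base, "annotator_type": "heuristic", "annotator_identity": "heuristic:rule-engine", "confidence": 0.9})
upsert(store, {**base, "annotator_type": "llm", "annotator_identity": "llm:haiku-4", "confidence": 0.7})
# The same annotator writes again: the key repeats, so the row count stays at 2.
upsert(store, {**base, "annotator_type": "llm", "annotator_identity": "llm:haiku-4", "confidence": 0.8})

print(len(store))
```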

CLI reference

agent-diagnostics extract          Extract signals from trial directories
agent-diagnostics ingest           Filter -> extract -> enrich -> write JSONL pipeline
agent-diagnostics annotate         Heuristic annotation
agent-diagnostics llm-annotate     LLM-assisted annotation
agent-diagnostics train            Train per-category classifiers
agent-diagnostics predict          Predict with trained classifier
agent-diagnostics ensemble         Two-tier ensemble annotation
agent-diagnostics report           Generate Markdown + JSON report
agent-diagnostics validate         Validate annotations against schema
agent-diagnostics query            Run SQL against the dataset (DuckDB)
agent-diagnostics export           Export to Parquet with MANIFEST.json
agent-diagnostics manifest refresh Rewrite manifests.jsonl
agent-diagnostics db schema        Inspect table schemas

Data formats

signals.jsonl

One JSON object per line. 31 fields per trial including trial_id (stable SHA256-based), model, benchmark, reward, pass/fail, tool call counts/sequences, files read/edited, duration, error counts, and patch size.

annotations.jsonl

Narrow-tall schema — one row per (trial, category, annotator):

Column	Description
`trial_id`	SHA256-based stable identifier
`category_name`	Taxonomy category (e.g., `retrieval_failure`)
`confidence`	0.0 to 1.0
`evidence`	Free-text explanation
`annotator_type`	`heuristic`, `llm`, `classifier`, `ensemble`, or `human`
`annotator_identity`	e.g., `heuristic:rule-engine`, `llm:haiku-4`
`taxonomy_version`	e.g., `3.0.0`
`annotated_at`	ISO 8601 timestamp
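
The column table above translates into a few simple per-row checks. The sketch below is a stand-in written for this document, not the logic behind `agent-diagnostics validate`; the `validate_row` helper and its error messages are hypothetical, while the column names and allowed annotator types are taken from the table.

```python
from datetime import datetime

ANNOTATOR_TYPES = {"heuristic", "llm", "classifier", "ensemble", "human"}
REQUIRED_COLUMNS = (
    "trial_id", "category_name", "confidence", "evidence",
    "annotator_type", "annotator_identity", "taxonomy_version", "annotated_at",
)


def validate_row(row: dict) -> list:
    """Return a list of human-readable problems; an empty list means the row looks fine."""
    errors = [f"missing column: {col}" for col in REQUIRED_COLUMNS if col not in row]
    if errors:
        return errors
    confidence = row["confidence"]
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number in [0.0, 1.0]")
    if row["annotator_type"] not in ANNOTATOR_TYPES:
        errors.append("unknown annotator_type")
    try:
        # Note: older Python versions accept only a subset of ISO 8601 here.
        datetime.fromisoformat(row["annotated_at"])
    except (TypeError, ValueError):
        errors.append("annotated_at must be an ISO 8601 timestamp")
    return errors


good = {
    "trial_id": "abc123",
    "category_name": "retrieval_failure",
    "confidence": 0.9,
    "evidence": "no search calls and no files read",
    "annotator_type": "heuristic",
    "annotator_identity": "heuristic:rule-engine",
    "taxonomy_version": "3.0.0",
    "annotated_at": "2026-04-15T12:00:00+00:00",
}
bad = {**good, "confidence": 1.5, "annotator_type": "robot"}

print(validate_row(good), validate_row(bad))
```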

Parquet export

agent-diagnostics export produces zstd-compressed Parquet with native list<string> columns. The 21.87 MB JSONL corpus compresses to ~1 MB. The export includes MANIFEST.json for provenance.

Architecture

agent_diagnostics/
  signals.py           Signal extraction + trial_id computation (31 fields)
  types.py             TrialSignals TypedDict, CategoryAssignment, Annotation
  annotator.py         23-rule heuristic annotator
  classifier.py        Pure-Python logistic regression (no numpy)
  ensemble.py          Two-tier ensemble (heuristic + classifier)
  llm_annotator.py     LLM annotation (claude-code, API, batch backends)
  annotation_store.py  Narrow-tall JSONL store with PK enforcement + flock
  model_identity.py    Logical annotator identity resolution via models.yaml
  query.py             DuckDB query engine (JSONL + Parquet)
  export.py            Parquet export with MANIFEST.json
  report.py            Markdown + JSON report generator
  calibrate.py         Agreement analysis, Cohen's kappa
  blend_labels.py      LLM + heuristic label blending
  taxonomy.py          Taxonomy loader (v1/v2/v3 YAML)
  tool_registry.py     Injectable tool name registry
  cli.py               CLI entrypoint (14 subcommands)
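
For readers unfamiliar with the agreement statistic named for calibrate.py, Cohen's kappa compares observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic textbook implementation in pure Python, not the code in calibrate.py; the function name and the example labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Example: e.g. heuristic vs. LLM decisions on whether a category applies.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(kappa)  # 0.5
```

Here the annotators agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.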

Contributing

We welcome contributions of agent trace data, new benchmark integrations, taxonomy refinements, and annotation tooling. If you're building evaluation infrastructure for coding agents, we'd love to talk.

License

Apache-2.0

Release history

Version	Released
0.8.1	Apr 19, 2026
0.8.0	Apr 19, 2026
0.7.0	Apr 15, 2026
0.6.2 (this version)	Apr 15, 2026
0.6.1	Apr 15, 2026
0.6.0	Apr 15, 2026
0.5.1	Apr 5, 2026
0.5.0	Apr 5, 2026

Download files

Source Distribution

agent_diagnostics-0.6.2.tar.gz (237.2 kB)

Uploaded Apr 15, 2026 Source

Built Distribution

agent_diagnostics-0.6.2-py3-none-any.whl (120.7 kB)

Uploaded Apr 15, 2026 Python 3

File details

Details for the file agent_diagnostics-0.6.2.tar.gz.

File metadata

Download URL: agent_diagnostics-0.6.2.tar.gz
Upload date: Apr 15, 2026
Size: 237.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agent_diagnostics-0.6.2.tar.gz
Algorithm	Hash digest
SHA256	`7b57cc61a7540a9bb0fd1e00d686d0956fac4801b32f83b63945ca630271b8c4`
MD5	`dfa98e530ced07266f9ade3b2e3d0f66`
BLAKE2b-256	`5f27636434387dba9b4dc7a407283ebac04d7cf72f58f39c0c4da76769e535ad`

File details

Details for the file agent_diagnostics-0.6.2-py3-none-any.whl.

File metadata

Download URL: agent_diagnostics-0.6.2-py3-none-any.whl
Upload date: Apr 15, 2026
Size: 120.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agent_diagnostics-0.6.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cd43e1258f977c164e9c5cdad52dad2b8ed87a0644d3eaa68c7828f6dd155334`
MD5	`c22fb4396ef618f7ec0e4746ba70b8e4`
BLAKE2b-256	`93b4839c8b0cd8266df3d90fdc2f86d0a24e2b387aa8bc9fcb46583f90eaaebb`
