
Rank multi-agent trace failures by severity. CLI + web dashboard + OTLP receiver, with optional LLM root-cause analysis.


agent-triage

Rank multi-agent failures by what actually matters — not by what showed up first in the log.


agent-triage ingests trace files from multi-agent systems (NDJSON or OpenTelemetry) and produces a ranked "what actually needs your attention this morning" severity report. It ships a CLI, a web dashboard, an OTLP receiver, and optional LLM-generated root-cause narratives.


Why this exists

When multi-agent systems run at scale, they generate thousands of trace events per day. Most observability tools dump everything into a dashboard and leave triage to the human. That is the wrong default.

The signal you actually need at 9 AM is not "here are 847 events from last night." It's: "here are the three failure patterns that matter, ranked by how bad they are and whether the agents recovered."

agent-triage is that tool. It reads agent trace data (NDJSON natively, OpenTelemetry spans via the built-in adapter, and any custom format you can plug in), clusters events into incident patterns, scores each pattern across three dimensions (frequency, severity, recovery), and prints a ranked short list with plain-English explanations and a suggested next action for each. It also has a compare mode for diffing two batches.


Features

  • CLI — pipe NDJSON traces in, get a markdown morning report out
  • Web dashboard — interactive UI with severity charts, 7-day trend lines, and live updates via Server-Sent Events
  • OTLP receiver — accept OpenTelemetry spans directly from production agents
  • LLM root-cause analysis — optional narratives generated by Claude Haiku 4.5, with prompt caching to keep costs low
  • Persistent SQLite storage — trace data survives restarts; supports time-series queries
  • Configurable scoring — tune severity weights, recovery window, and composite formula via triage.toml
  • Webhook alerting — fire Slack-compatible notifications when a pattern crosses your severity threshold
  • Docker-ready — docker compose up and you have a dashboard
  • PyPI-published — pip install agent-triage
  • Production quality — 220+ tests, 93%+ coverage, mypy strict, ruff clean

Quick start

Option 1: Docker (recommended for trying it out)

git clone https://github.com/thebharathkumar/agent-triage.git
cd agent-triage
docker compose up

Then open http://localhost:8000 and upload a trace file.

Option 2: pip

pip install "agent-triage[server,ai]"

# CLI
triage report runs/phase4/events_seed42.ndjson

# Web dashboard
triage-serve

Option 3: Stream OTLP from a running agent

# Start the receiver
triage-serve

# In another shell, send some demo spans
python examples/emit_otlp.py

The dashboard at http://localhost:8000 will populate in real time.


CLI commands

triage report — morning severity report

triage report runs/phase4/*.ndjson --top 5

Add --ai-analysis to enrich the top incidents with LLM-generated root-cause narratives:

export ANTHROPIC_API_KEY=sk-ant-...
triage report runs/phase4/*.ndjson --ai-analysis

Write to a file instead of stdout:

triage report runs/phase4/events_seed42.ndjson --output examples/seed42-report.md

See examples/seed42-report.md for a real generated report.

triage compare — diff two batches

triage compare answers a different question than the morning report: not "what should I look at" but "did the change between these two batches make agent behavior better or worse, and where". Useful when diffing pre- and post-architecture-change runs, A/B prompts, or two model versions.

triage compare runs/before.ndjson runs/after.ndjson

Each argument can be a single .ndjson file or a directory; passing a directory loads every .ndjson in it (non-recursive), so the same command works whether your CI produces one trace file or a directory of them:

triage compare runs/before/ runs/after/

Produces:

  • a Score Summary panel — pattern count, total failure events, unrecovered events, coordination-failure events, top final score, and mean final score, with deltas
  • incident frequency deltas per classification (down 39%, new, resolved, stable, etc.)
  • recovery latency changes — median turns-to-first-success for each classification, before vs after
  • unrecovered-count changes — number of failures that did not recover within the window
  • newly emerging patterns and resolved patterns
  • persisting patterns — signatures present in both, with their before/after counts

Changes based on small samples (fewer than 5 occurrences on the larger side) are tagged (tentative) so a thin-evidence delta is not read as a verified one. See examples/compare-before-after.md.

See all options

triage --help
triage report --help
triage compare --help

Trace formats and adapters

triage is plugin-shaped on the input side. Two adapters ship in the box:

Adapter  Extensions       Notes
ndjson   .ndjson, .jsonl  One TraceEvent per line. The native schema is documented below.
otel     .json            Minimal OpenTelemetry-spans adapter; maps span.name → tool_name, status.code → action_succeeded, attributes["agent.id"] → agent_id, attributes["failure_classification"] → failure_classification, attributes["divergence_fields"] → divergence_fields.
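
Each line of a native trace file is one JSON object. The authoritative schema comes from dungeon-traces (see Schema source below); an illustrative event, with field names taken from the mappings in this README and value shapes assumed, looks roughly like:

{"run_id": "run-001", "turn": 12, "agent_id": "navigator", "tool_name": "move", "action_succeeded": false, "failure_classification": "coordination_failure", "divergence_fields": ["position", "heading"]}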

The format is auto-detected by extension and can be overridden:

triage report --format otel spans.json
triage compare --format ndjson before.weird after.weird

Adding your own adapter

The TraceAdapter protocol in src/triage/adapters.py is intentionally small:

from pathlib import Path
from typing import ClassVar, Protocol

class TraceAdapter(Protocol):
    name: ClassVar[str]
    extensions: ClassVar[tuple[str, ...]]
    def load(self, path: Path) -> tuple[list[TraceEvent], list[str]]:
        ...

Any class that satisfies it can be added to ADAPTERS (or registered externally) and immediately works in both report and compare. The adapter only owns parsing — scoring, grouping, comparison, and reporting are agnostic to the source format.
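
For example, a CSV adapter could look like the sketch below. Everything in it is illustrative: the column names, the TraceEvent constructor fields (guessed from the mapping tables above), the import path, and the assumption that the second return value carries parse warnings.

import csv
from pathlib import Path
from typing import ClassVar

from triage.adapters import TraceEvent  # import path assumed

class CsvAdapter:
    """Hypothetical adapter: one trace event per CSV row."""

    name: ClassVar[str] = "csv"
    extensions: ClassVar[tuple[str, ...]] = (".csv",)

    def load(self, path: Path) -> tuple[list[TraceEvent], list[str]]:
        events: list[TraceEvent] = []
        warnings: list[str] = []
        with path.open(newline="") as fh:
            # start=2: row 1 of the file is the CSV header
            for lineno, row in enumerate(csv.DictReader(fh), start=2):
                try:
                    events.append(TraceEvent(
                        agent_id=row["agent_id"],
                        tool_name=row["tool_name"],
                        action_succeeded=row["action_succeeded"] == "true",
                        failure_classification=row.get("failure_classification") or None,
                    ))
                except KeyError as exc:
                    warnings.append(f"{path}:{lineno}: missing column {exc}")
        return events, warnings

Append an instance to ADAPTERS (or register it externally) and both triage report --format csv and triage compare --format csv pick it up.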


Dashboard

triage-serve --host 0.0.0.0 --port 8000

The dashboard supports drag-and-drop NDJSON upload, live severity charts, 7-day trend lines, and per-incident AI narratives.


OTLP receiver — for production agents

Configure your agent to send spans to http://your-host:8000/otlp/v1/traces using the standard OTLP/HTTP JSON format. Map these span attributes:

Attribute               Type    Description
agent.id                string  Which agent acted
run.id                  string  Which run this event belongs to
turn                    int     Turn number within the run
action.tool             string  Tool/action name
action.succeeded        bool    Whether the action succeeded
failure.classification  string  One of: coordination_failure, agent_error, information_lag, environment_constraint
divergence.fields       string  Comma-separated belief-divergence fields

See examples/emit_otlp.py for a working end-to-end example.
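
To poke the receiver without a full OpenTelemetry SDK, a raw OTLP/HTTP JSON POST is enough. A minimal sketch (trace/span IDs, timestamps, and values are invented; requests is a third-party library):

import requests  # pip install requests

span = {
    "traceId": "0af7651916cd43dd8448eb211c80319c",
    "spanId": "b7ad6b7169203331",
    "name": "move",
    "startTimeUnixNano": "1700000000000000000",  # OTLP/JSON encodes int64 as strings
    "endTimeUnixNano": "1700000001000000000",
    "status": {"code": 2},  # STATUS_CODE_ERROR: the action failed
    "attributes": [
        {"key": "agent.id", "value": {"stringValue": "navigator"}},
        {"key": "run.id", "value": {"stringValue": "run-001"}},
        {"key": "turn", "value": {"intValue": "12"}},
        {"key": "action.tool", "value": {"stringValue": "move"}},
        {"key": "action.succeeded", "value": {"boolValue": False}},
        {"key": "failure.classification", "value": {"stringValue": "coordination_failure"}},
        {"key": "divergence.fields", "value": {"stringValue": "position,heading"}},
    ],
}
payload = {"resourceSpans": [{"scopeSpans": [{"spans": [span]}]}]}

requests.post("http://localhost:8000/otlp/v1/traces", json=payload, timeout=5)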


Configuration

Create a triage.toml to override scoring weights, persistence, and alerting (all sections optional):

[scoring]
recovery_window = 5
no_recovery_multiplier = 2.0

[scoring.weights]
coordination_failure = 1.0
agent_error = 0.8

[storage]
db_path = "./triage.db"   # use ":memory:" for ephemeral runs

[alerting]
webhook_url = "https://hooks.slack.com/services/T0000/B0000/XXX"
threshold = 8.0           # fire when final_score crosses this
cooldown_seconds = 1800   # don't refire same pattern within this window

Then pass it to either entry point:

triage report runs/*.ndjson --config triage.toml
triage-serve --config triage.toml --db ./triage.db

A full annotated example is at triage.example.toml.


Live updates

The dashboard subscribes to a Server-Sent Events stream at /api/stream. Whenever new spans arrive (via upload or OTLP), connected browsers refresh automatically — no polling, no manual reload. The 7-day trend chart is also re-rendered, so you can see severity climb in real time during an incident.
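
Anything that speaks SSE can subscribe alongside the dashboard. A minimal consumer sketch using the third-party requests library (the event payload format is not documented here, so this just prints raw frames):

import requests  # pip install requests

resp = requests.get("http://localhost:8000/api/stream", stream=True, timeout=None)
for line in resp.iter_lines(decode_unicode=True):
    if line:  # SSE frames are separated by blank lines
        print(line)  # e.g. "data: {...}"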


Webhook alerts

When [alerting] is configured, every /api/report evaluation checks for patterns whose final_score exceeds threshold and POSTs a Slack-compatible JSON payload to your webhook. A per-pattern cooldown prevents flooding during a continuous incident:

{
  "text": ":rotating_light: Triage alert — score 12.40\n>*[navigator] move / agent_error / position*\n>...",
  "pattern_id": "navigator-move-agent_error",
  "agent_id": "navigator",
  "tool_name": "move",
  "classification": "agent_error",
  "final_score": 12.4,
  "severity_score": 14.0,
  "recovery_rate": 0.0
}

Compatible with Slack incoming webhooks, Discord webhooks, PagerDuty Events API v2 (with minor adapter), and any HTTP endpoint that accepts JSON.
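
Conceptually, the cooldown is one timestamp per pattern. A sketch of the idea (not the actual alerting.py; defaults mirror the triage.toml example above):

import time

last_fired: dict[str, float] = {}

def should_fire(pattern_id: str, final_score: float,
                threshold: float = 8.0, cooldown_seconds: float = 1800.0) -> bool:
    """Fire above threshold, at most once per cooldown window per pattern."""
    if final_score < threshold:
        return False
    now = time.monotonic()
    if now - last_fired.get(pattern_id, float("-inf")) < cooldown_seconds:
        return False
    last_fired[pattern_id] = now
    return True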


Architecture

                                  ┌─────────────────┐
   NDJSON file ─────────┐         │                 │
                        ├──►──────┤                 │
   OTLP/HTTP spans ─────┘         │   Loader        │
                                  │   ↓             │
                                  │   Grouper       │
                                  │   ↓             │
                                  │   Scorer        │──►── Markdown report (CLI)
                                  │   ↓             │
                                  │   Reporter      │──►── JSON API (dashboard)
                                  │                 │
                                  └─────────────────┘
                                          │
                                          ↓ (optional)
                                  ┌─────────────────┐
                                  │  Claude Haiku   │
                                  │  (cached)       │──►── Root-cause narrative
                                  └─────────────────┘
Module        Responsibility
loader.py     Parse NDJSON / OTLP into validated TraceEvent objects
grouper.py    Cluster events into IncidentPattern buckets by signature
scorer.py     Score patterns by frequency, severity, and recovery rate
reporter.py   Render the morning markdown report
analyst.py    Optional Claude-generated root-cause narratives
store.py      SQLite-backed persistent event store + time-series queries
streaming.py  In-process pub/sub for Server-Sent Events
alerting.py   Threshold-based webhook alerter with per-pattern cooldown
config.py     TOML config loader (triage.toml)
server.py     FastAPI app — dashboard, OTLP receiver, SSE stream, REST API
cli.py        Click-based command-line entry point

Scoring model

Each incident pattern is scored across three dimensions, then combined.

Frequency (40% weight) — how many times the pattern appeared, normalized 0–10 against the most frequent pattern in the batch.

Severity (60% weight) — weighted by failure classification:

Classification          Weight
coordination_failure    1.0
agent_error             0.7
information_lag         0.5
environment_constraint  0.2

Recovery — did the agent succeed within 3 turns after the failure? If zero occurrences recovered, severity is multiplied by 1.5. This captures the difference between a transient hiccup and a pattern that gets agents stuck.

final_score = (frequency_score * 0.4) + (severity_score * 0.6)
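
The composite is simple enough to sketch in a few lines. This is illustrative arithmetic only; the real implementation (and the 0-10 severity scaling assumed here) lives in src/triage/scorer.py:

def sketch_final_score(
    count: int,            # occurrences of this pattern in the batch
    max_count: int,        # occurrences of the most frequent pattern
    weight: float,         # classification weight, e.g. 1.0 for coordination_failure
    any_recovered: bool,   # did at least one occurrence recover within the window?
) -> float:
    frequency_score = 10 * count / max_count  # normalized 0-10 vs the busiest pattern
    severity_score = 10 * weight              # assumes severity is also on a 0-10 scale
    if not any_recovered:
        severity_score *= 1.5                 # no-recovery multiplier (default)
    return frequency_score * 0.4 + severity_score * 0.6

# An unrecovered coordination failure seen 4 times, busiest pattern 8 times:
# sketch_final_score(4, 8, 1.0, False) == 0.4 * 5.0 + 0.6 * 15.0 == 11.0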

Why these weights?

Severity carries more weight than frequency because unrecovered coordination failures propagate downstream — one planner desync can corrupt every subsequent joint action in a run — while frequent environment_constraint events (wall bumps, edge checks) tend to self-resolve within a turn or two. A rare coordination failure is almost always worth more attention than a common environment bump, and the 0.6 / 0.4 split reflects that ordering.

Within severity, the classification weights (coordination_failure 1.0 → environment_constraint 0.2) were chosen so that each category maps to an operational cost tier: requires architectural change (coordination), bug in a single agent (agent_error), stale belief state (information_lag), expected friction (environment_constraint). Empirically, this ordering produced rankings on runs/phase4/events_seed42.ndjson that matched the incidents a human operator would triage first.

These weights are not a universal truth. They are starting defaults for dungeon-trace-style workloads. If your system has different failure economics, override them — CLASSIFICATION_WEIGHTS in src/triage/scorer.py is a plain dict.

Confidence

Each pattern is also tagged with a confidence level derived from how many times it occurred:

confidence = min(1.0, occurrences / 5)

Patterns with fewer than 5 occurrences are labelled low or medium, so that rankings built on thin evidence are surfaced as such rather than presented with false precision. A pattern with a high final score but low confidence should be treated as a hypothesis to investigate, not a conclusion.

Recovery dynamics

In addition to the binary "recovered within 3 turns" rate, the scorer reports two dynamics signals (sketched in code after this list):

  • Median recovery latency — across events that did recover, how many turns it typically took. A pattern that always recovers in 1 turn is very different from one that always recovers in 3.
  • Tail risk — the count of failures that remained unrecovered after 10 turns. This is the "stuck for good" signal, distinct from "slow to recover".
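
Both signals fall out of a list of per-failure recovery latencies. A sketch, assuming latencies are measured in turns and None marks a failure that never recovered:

from statistics import median

def recovery_dynamics(latencies: list[int | None]) -> tuple[float | None, int]:
    """latencies[i]: turns until first success after failure i, or None if it never came."""
    recovered = [t for t in latencies if t is not None]
    median_latency = float(median(recovered)) if recovered else None
    # "stuck for good": still unrecovered after 10 turns (or never recovered at all)
    tail_risk = sum(1 for t in latencies if t is None or t > 10)
    return median_latency, tail_risk

# recovery_dynamics([1, 2, None, 14]) -> (2.0, 2)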

Cross-run recurrence

A pattern that hits once in a single run is noise. A pattern that hits in every run is architectural. Each scored pattern exposes two recurrence signals:

  • Appeared in X/Y runs — how many of the analyzed runs contained this pattern at least once. A coordination failure appearing in 11/12 runs is a different animal from one that spikes in a single run.

  • Trend — when the input has at least 6 runs (2 × TREND_WINDOW_SIZE), the most recent 3 runs are compared against the 3 runs immediately preceding them as a sliding window. Below 6 runs, the function falls back to a split-half partition so the label is still directional rather than always insufficient data. The pattern is labelled one of:

    Label              Meaning
    new                absent in the first half, present in the second — emerging
    increasing         second-half rate ≥ 1.3× first-half rate
    stable             second-half rate within ±30% of first-half rate
    decreasing         second-half rate ≤ 0.7× first-half rate
    resolved           present in the first half, absent in the second
    insufficient data  fewer than 3 runs — no trend emitted

    Trend treats input order as a chronology proxy. If you pass files in a deterministic chronological order (e.g. runs/*.ndjson sorted by filename timestamp), the signal is meaningful; if you shuffle them, it is not. Sliding-window detection is more sensitive to recent changes than split-half: a regression that hits in the last few runs but was absent before will show as increasing or new rather than being averaged out by older history.
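
The thresholds in the table translate directly into code. A sketch of the classification step, assuming per-run occurrence counts arrive oldest-first (window size 3, as described above):

def trend_label(counts: list[int], window: int = 3) -> str:
    """counts[i]: occurrences of this pattern in run i, in chronological order."""
    if len(counts) < window:
        return "insufficient data"
    if len(counts) >= 2 * window:
        first, second = counts[-2 * window:-window], counts[-window:]  # sliding window
    else:
        mid = len(counts) // 2
        first, second = counts[:mid], counts[mid:]                     # split-half fallback
    rate_a = sum(first) / len(first)
    rate_b = sum(second) / len(second)
    if rate_a == 0:
        return "new" if rate_b > 0 else "stable"  # both-zero cannot occur for a scored pattern
    if rate_b == 0:
        return "resolved"
    ratio = rate_b / rate_a
    if ratio >= 1.3:
        return "increasing"
    if ratio <= 0.7:
        return "decreasing"
    return "stable"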

Recurrence is reported separately from severity to distinguish persistent low-impact issues from rare catastrophic failures. Fusing it into the final score would double-count frequency effects and hide the distinction an operator needs to triage the two differently.


When triage is not the right tool

triage is opinionated. It is useful when traces are labelled with failure_classification and action_succeeded, multiple agents are involved, and the same pattern recurs at least a few times. It is the wrong tool when:

  • Single-agent pipelines. The grouper keys on agent_id; with only one agent, incident patterns collapse into "failures by tool_name", which a two-line jq query handles better.
  • Unlabelled traces. If failure_classification is null everywhere, every incident is unclassified and the severity weighting degenerates. Label first, then triage.
  • Streaming real-time observability. triage is a batch tool designed for a morning summary. For sub-second alerting, use a streaming pipeline (OpenTelemetry, Prometheus) and reserve triage for the daily digest on top.
  • Small sample sizes (< 5 occurrences per pattern). Confidence will be low across the board; the ranking is still directional but should be treated as a starting point for investigation, not a verdict.
  • Homogeneous agent architectures. If every agent runs the same code path, coordination_failure vs agent_error stops being a meaningful axis and most of the signal comes from frequency alone, which is easier to get from a grouping query than from this tool.

Observability for Agents Is Not Log Aggregation

Multi-agent systems fail in ways traditional log infrastructure was not designed to surface. A coordination failure between two agents is structurally different from a 500 error: it has a recovery story, a belief-divergence story, and a recurrence story. Treating those as event lines to aggregate loses everything that matters.

Four design choices follow from that stance, and together they are the reason triage looks the way it does:

  • Top-3, not top-100. Operators can act on three things at once. A list of fifty incidents is a dashboard; a dashboard is not triage.
  • Recovery latency, not just recovery. "Did the agent succeed within 3 turns" is a binary. "How long did it take, and is the tail bounded" is a control signal. Timeouts and retry budgets are tuned against the latter, not the former.
  • Recurrence orthogonal to severity. Persistence and impact are different axes. Fusing them would let a high-frequency wall-bump outrank a rare planner desync — exactly the wrong prioritization for a multi-agent system, where coordination errors propagate downstream and environment errors do not.
  • Confidence as a first-class field. A score without an error bar is a guess presented as a number. Surfacing low / medium / high per pattern lets the operator decide whether the ranking is a verdict or a hypothesis worth investigating.

These are not features. They are the modeling stance: measure dynamics and persistence, not just events; show uncertainty rather than hide it; rank by what an operator can act on, not by what is statistically anomalous.


Design decisions

Why severity scoring instead of anomaly detection?

Anomaly detection tells you what is unusual. Severity scoring tells you what matters. A coordination_failure that happens on every run is not an anomaly — it's a chronic problem. Scoring surfaces chronic problems alongside rare-but-catastrophic ones using weights that reflect operational cost.

Why prompt caching for the LLM?

Each pattern triggers one Claude call, but the system prompt is identical across calls. With Anthropic's cache_control: ephemeral, the system prompt is cached server-side after the first call and reused on subsequent ones, cutting per-call cost ~90% for batches of 3+ patterns.
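
In SDK terms, the trick is a single cache_control marker on the shared system block. A sketch with the Anthropic Python SDK (not the project's actual analyst.py; the prompt text, pattern summary, and model id are placeholders):

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You analyze multi-agent incident patterns..."  # placeholder text
pattern_summary = "[navigator] move / agent_error / 4 occurrences, 0% recovery"  # illustrative

response = client.messages.create(
    model="claude-haiku-4-5",  # model id assumed from the feature list; adjust as needed
    max_tokens=512,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,  # identical across calls, so it is cached after the first
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": pattern_summary}],
)
print(response.content[0].text)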

Why an embedded HTML dashboard instead of React/Next.js?

A pitch tool should run in 30 seconds, not 30 minutes. The dashboard ships as a single HTML file with Tailwind + Chart.js via CDN — no build step, no node_modules, no npm install. The template lives at src/triage/templates/dashboard.html and is loaded at server startup. Upgrading to a separate frontend later is straightforward; the backend already returns clean JSON.

Why three top incidents?

Three is the number of things a person can hold in working memory while still being able to act on them. A list of 10 is noise. A list of 1 is overconfident. Three forces the model to prioritize, which is the core value proposition.


Example reports

Four reports are checked in:

  • examples/monday-report.md — a synthetic but realistic "Monday morning" report across three weekend runs. Shows the full set of signals: a high-confidence coordination failure with 0% recovery, a self-healing information lag, and a low-confidence tool error flagged for investigation. Generated from runs/examples/monday.ndjson.
  • examples/seed42-report.md — a report generated from runs/phase4/events_seed42.ndjson (real dungeon-traces output).
  • examples/compare-before-after.md — a triage compare regression report between runs/examples/before.ndjson and runs/examples/monday.ndjson.
  • examples/otel-report.md — same triage report invocation, but reading from runs/examples/otel.json (a small OpenTelemetry-spans file) to demonstrate that the adapter layer feeds the same scoring pipeline regardless of source format.

Development

git clone https://github.com/thebharathkumar/agent-triage.git
cd agent-triage
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest --cov=triage --cov-report=term-missing   # tests
ruff check src/triage tests                     # lint
mypy src/triage                                 # type check

Quality gates enforced in CI:

  • 220+ tests, 93%+ coverage
  • ruff check clean
  • mypy --strict clean

See CONTRIBUTING.md for PR guidelines.


Schema source

The original NDJSON trace format is compatible with dungeon-traces (feature/viewer-enhanced branch), a multi-agent simulation that produces NDJSON event logs for dungeon-navigation runs.


License

MIT © 2026 Bharath Kumar R
