agent-triage
Rank multi-agent failures by what actually matters — not by what showed up first in the log.
agent-triage ingests trace files from multi-agent systems (NDJSON or OpenTelemetry) and produces a ranked "what actually needs your attention this morning" severity report. It ships a CLI, a web dashboard, an OTLP receiver, and optional LLM-generated root-cause narratives.
Why this exists
When multi-agent systems run at scale, they generate thousands of trace events per day. Most observability tools dump everything into a dashboard and leave triage to the human. That is the wrong default.
The signal you actually need at 9 AM is not "here are 847 events from last night." It's: "here are the three failure patterns that matter, ranked by how bad they are and whether the agents recovered."
agent-triage is that tool. It reads agent trace data (NDJSON natively, OpenTelemetry spans via the built-in adapter, and any custom format you can plug in), clusters events into incident patterns, scores each pattern across three dimensions (frequency, severity, recovery), and prints a ranked short list with plain-English explanations and a suggested next action for each. It also has a compare mode for diffing two batches.
Features
- CLI — pipe NDJSON traces in, get a markdown morning report out
- Web dashboard — interactive UI with severity charts, 7-day trend lines, and live updates via Server-Sent Events
- OTLP receiver — accept OpenTelemetry spans directly from production agents
- LLM root-cause analysis — optional Claude Haiku 4.5-generated narratives with prompt caching to keep cost low
- Persistent SQLite storage — trace data survives restarts; supports time-series queries
- Configurable scoring — tune severity weights, recovery window, and composite formula via triage.toml
- Webhook alerting — fire Slack-compatible notifications when a pattern crosses your severity threshold
- Docker-ready — docker compose up and you have a dashboard
- PyPI-published — pip install agent-triage
- Production quality — 220+ tests, 93%+ coverage, mypy strict, ruff clean
Quick start
Option 1: Docker (recommended for trying it out)
git clone https://github.com/thebharathkumar/agent-triage.git
cd agent-triage
docker compose up
Then open http://localhost:8000 and upload a trace file.
Option 2: pip
pip install "agent-triage[server,ai]"
# CLI
triage report runs/phase4/events_seed42.ndjson
# Web dashboard
triage-serve
Option 3: Stream OTLP from a running agent
# Start the receiver
triage-serve
# In another shell, send some demo spans
python examples/emit_otlp.py
The dashboard at http://localhost:8000 will populate in real time.
CLI commands
triage report — morning severity report
triage report runs/phase4/*.ndjson --top 5
Add --ai-analysis to enrich the top incidents with LLM-generated root-cause narratives:
export ANTHROPIC_API_KEY=sk-ant-...
triage report runs/phase4/*.ndjson --ai-analysis
Write to a file instead of stdout:
triage report runs/phase4/events_seed42.ndjson --output examples/seed42-report.md
See examples/seed42-report.md for a real generated report.
triage compare — diff two batches
triage compare answers a different question than the morning report:
not "what should I look at" but "did the change between these two
batches make agent behavior better or worse, and where". Useful when
diffing pre- and post-architecture-change runs, A/B prompts, or two
model versions.
triage compare runs/before.ndjson runs/after.ndjson
Each argument can be a single .ndjson file or a directory; passing a
directory loads every .ndjson in it (non-recursive), so the command
is symmetric across CI shapes:
triage compare runs/before/ runs/after/
Produces:
- a Score Summary panel — pattern count, total failure events, unrecovered events, coordination-failure events, top final score, and mean final score, with deltas
- incident frequency deltas per classification (down 39%, new, resolved, stable, etc.)
- recovery latency changes — median turns-to-first-success for each classification, before vs after
- unrecovered-count changes — number of failures that did not recover within the window
- newly emerging patterns and resolved patterns
- persisting patterns — signatures present in both, with their before/after counts
Changes based on small samples (fewer than 5 occurrences on the larger side) are tagged (tentative) so a thin-evidence delta is not read as a verified one. See examples/compare-before-after.md.
See all options
triage --help
triage report --help
triage compare --help
Trace formats and adapters
triage is plugin-shaped on the input side. Two adapters ship in the box:
| Adapter | Extensions | Notes |
|---|---|---|
| ndjson | .ndjson, .jsonl | One TraceEvent per line. The native schema, documented below. |
| otel | .json | Minimal OpenTelemetry-spans adapter; maps span.name → tool_name, status.code → action_succeeded, attributes["agent.id"] → agent_id, attributes["failure_classification"] → failure_classification, attributes["divergence_fields"] → divergence_fields. |
The format is auto-detected by extension and can be overridden:
triage report --format otel spans.json
triage compare --format ndjson before.weird after.weird
Adding your own adapter
The TraceAdapter protocol in src/triage/adapters.py is intentionally
small:
class TraceAdapter(Protocol):
name: ClassVar[str]
extensions: ClassVar[tuple[str, ...]]
def load(self, path: Path) -> tuple[list[TraceEvent], list[str]]:
...
Any class that satisfies it can be added to ADAPTERS (or registered
externally) and immediately works in both report and compare. The
adapter only owns parsing — scoring, grouping, comparison, and
reporting are agnostic to the source format.
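As a sketch of what satisfying the protocol looks like, here is a hypothetical CSV adapter. The TraceEvent stand-in below is illustrative (the real dataclass lives in src/triage and has more fields), but the shape of load — events plus a parallel list of parse-error strings — follows the protocol above:

```python
import csv
from dataclasses import dataclass
from pathlib import Path

# Stand-in for triage's TraceEvent -- fields here are illustrative,
# not the package's real dataclass.
@dataclass
class TraceEvent:
    agent_id: str
    tool_name: str
    action_succeeded: bool

class CsvAdapter:
    """Hypothetical adapter satisfying the TraceAdapter protocol."""
    name = "csv"
    extensions = (".csv",)

    def load(self, path: Path) -> tuple[list[TraceEvent], list[str]]:
        events: list[TraceEvent] = []
        errors: list[str] = []
        with path.open(newline="") as fh:
            # start=2: row 1 is the header line
            for lineno, row in enumerate(csv.DictReader(fh), start=2):
                try:
                    events.append(TraceEvent(
                        agent_id=row["agent_id"],
                        tool_name=row["tool_name"],
                        action_succeeded=row["ok"] == "1",
                    ))
                except KeyError as exc:
                    errors.append(f"line {lineno}: missing column {exc}")
        return events, errors
```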
Dashboard
triage-serve --host 0.0.0.0 --port 8000
The dashboard supports drag-and-drop NDJSON upload, live severity charts, 7-day trend lines, and per-incident AI narratives.
OTLP receiver — for production agents
Configure your agent to send spans to http://your-host:8000/otlp/v1/traces using the standard OTLP/HTTP JSON format. Map these span attributes:
| Attribute | Type | Description |
|---|---|---|
| agent.id | string | Which agent acted |
| run.id | string | Which run this event belongs to |
| turn | int | Turn number within the run |
| action.tool | string | Tool/action name |
| action.succeeded | bool | Whether the action succeeded |
| failure.classification | string | One of: coordination_failure, agent_error, information_lag, environment_constraint |
| divergence.fields | string | Comma-separated belief-divergence fields |
See examples/emit_otlp.py for a working end-to-end example.
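For orientation, this is roughly what an OTLP/HTTP JSON body carrying those attributes looks like. The envelope structure (resourceSpans → scopeSpans → spans) and the attribute value encoding follow the OTLP JSON spec; the values themselves are illustrative, and no request is sent here:

```python
import json

def attr(key: str, value) -> dict:
    """Wrap a Python value in OTLP JSON attribute form."""
    if isinstance(value, bool):          # check bool before int:
        v = {"boolValue": value}         # bool is a subclass of int
    elif isinstance(value, int):
        v = {"intValue": str(value)}     # OTLP JSON encodes ints as strings
    else:
        v = {"stringValue": str(value)}
    return {"key": key, "value": v}

payload = {
    "resourceSpans": [{
        "scopeSpans": [{
            "spans": [{
                "name": "move",
                "attributes": [
                    attr("agent.id", "navigator"),
                    attr("run.id", "run-042"),
                    attr("turn", 17),
                    attr("action.tool", "move"),
                    attr("action.succeeded", False),
                    attr("failure.classification", "coordination_failure"),
                    attr("divergence.fields", "position,goal"),
                ],
            }],
        }],
    }],
}
body = json.dumps(payload)
# POST this body to http://your-host:8000/otlp/v1/traces
# with Content-Type: application/json.
```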
Configuration
Create a triage.toml to override scoring weights, persistence, and alerting (all sections optional):
[scoring]
recovery_window = 5
no_recovery_multiplier = 2.0
[scoring.weights]
coordination_failure = 1.0
agent_error = 0.8
[storage]
db_path = "./triage.db" # use ":memory:" for ephemeral runs
[alerting]
webhook_url = "https://hooks.slack.com/services/T0000/B0000/XXX"
threshold = 8.0 # fire when final_score crosses this
cooldown_seconds = 1800 # don't refire same pattern within this window
Then pass it to either entry point:
triage report runs/*.ndjson --config triage.toml
triage-serve --config triage.toml --db ./triage.db
A full annotated example is at triage.example.toml.
Live updates
The dashboard subscribes to a Server-Sent Events stream at /api/stream. Whenever new spans arrive (via upload or OTLP), connected browsers refresh automatically — no polling, no manual reload. The 7-day trend chart is also re-rendered, so you can see severity climb in real time during an incident.
Webhook alerts
When [alerting] is configured, every /api/report evaluation checks for patterns whose final_score exceeds threshold and POSTs a Slack-compatible JSON payload to your webhook. A per-pattern cooldown prevents flooding during a continuous incident:
{
"text": ":rotating_light: Triage alert — score 12.40\n>*[navigator] move / agent_error / position*\n>...",
"pattern_id": "navigator-move-agent_error",
"agent_id": "navigator",
"tool_name": "move",
"classification": "agent_error",
"final_score": 12.4,
"severity_score": 14.0,
"recovery_rate": 0.0
}
Compatible with Slack incoming webhooks, Discord webhooks, PagerDuty Events API v2 (with minor adapter), and any HTTP endpoint that accepts JSON.
Architecture
NDJSON file ─────────┐    ┌─────────────────┐
                     ├──►─┤  Loader         │
OTLP/HTTP spans ─────┘    │      ↓          │
                          │  Grouper        │
                          │      ↓          │
                          │  Scorer         │──►── Markdown report (CLI)
                          │      ↓          │
                          │  Reporter       │──►── JSON API (dashboard)
                          └─────────────────┘
                                  │
                                  ↓ (optional)
                          ┌─────────────────┐
                          │  Claude Haiku   │
                          │  (cached)       │──►── Root-cause narrative
                          └─────────────────┘
| Module | Responsibility |
|---|---|
| loader.py | Parse NDJSON / OTLP into validated TraceEvent objects |
| grouper.py | Cluster events into IncidentPattern buckets by signature |
| scorer.py | Score patterns by frequency, severity, and recovery rate |
| reporter.py | Render the morning markdown report |
| analyst.py | Optional Claude-generated root-cause narratives |
| store.py | SQLite-backed persistent event store + time-series queries |
| streaming.py | In-process pub/sub for Server-Sent Events |
| alerting.py | Threshold-based webhook alerter with per-pattern cooldown |
| config.py | TOML config loader (triage.toml) |
| server.py | FastAPI app — dashboard, OTLP receiver, SSE stream, REST API |
| cli.py | Click-based command-line entry point |
Scoring model
Each incident pattern is scored across three dimensions, then combined.
Frequency (40% weight) — how many times the pattern appeared, normalized 0–10 against the most frequent pattern in the batch.
Severity (60% weight) — weighted by failure classification:
| Classification | Weight |
|---|---|
| coordination_failure | 1.0 |
| agent_error | 0.7 |
| information_lag | 0.5 |
| environment_constraint | 0.2 |
Recovery — did the agent succeed within 3 turns after the failure? If zero occurrences recovered, severity is multiplied by 1.5. This captures the difference between a transient hiccup and a pattern that gets agents stuck.
final_score = (frequency_score * 0.4) + (severity_score * 0.6)
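A worked example makes the "rare coordination failure outranks common environment bump" claim concrete. This is a sketch: the classification weights, no-recovery multiplier, and composite formula come from this README, but the 0–10 scaling of severity and the exact frequency normalization are assumptions — the authoritative version is in src/triage/scorer.py:

```python
CLASSIFICATION_WEIGHTS = {
    "coordination_failure": 1.0,
    "agent_error": 0.7,
    "information_lag": 0.5,
    "environment_constraint": 0.2,
}

def score_pattern(occurrences: int, max_occurrences: int,
                  classification: str, any_recovered: bool) -> float:
    # Frequency: normalized 0-10 against the most frequent pattern.
    frequency_score = 10.0 * occurrences / max_occurrences
    # Severity: classification weight on an assumed 0-10 scale,
    # multiplied by 1.5 when nothing recovered within the window.
    severity_score = 10.0 * CLASSIFICATION_WEIGHTS[classification]
    if not any_recovered:
        severity_score *= 1.5
    return frequency_score * 0.4 + severity_score * 0.6

# A rare, unrecovered coordination failure (2 of a 40-event peak)...
rare = score_pattern(2, 40, "coordination_failure", any_recovered=False)
# ...outranks the most frequent, self-resolving environment bump.
common = score_pattern(40, 40, "environment_constraint", any_recovered=True)
```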
Why these weights?
Severity carries more weight than frequency because unrecovered coordination
failures propagate downstream — one planner desync can corrupt every
subsequent joint action in a run — while frequent environment_constraint
events (wall bumps, edge checks) tend to self-resolve within a turn or two.
A rare coordination failure is almost always worth more attention than a
common environment bump, and the 0.6 / 0.4 split reflects that ordering.
Within severity, the classification weights (coordination_failure 1.0 →
environment_constraint 0.2) were chosen so that each category maps to an
operational cost tier: requires architectural change (coordination),
bug in a single agent (agent_error), stale belief state
(information_lag), expected friction (environment_constraint).
Empirically, this ordering produced rankings on
runs/phase4/events_seed42.ndjson that matched the incidents a human
operator would triage first.
These weights are not a universal truth. They are starting defaults for
dungeon-trace-style workloads. If your system has different failure
economics, override them — CLASSIFICATION_WEIGHTS in
src/triage/scorer.py is a plain dict.
Confidence
Each pattern is also tagged with a confidence level derived from how many times it occurred:
confidence = min(1.0, occurrences / 5)
Patterns with fewer than 5 occurrences are labelled low or medium, so
that rankings built on thin evidence are surfaced as such rather than
presented with false precision. A pattern with a high final score but low
confidence should be treated as a hypothesis to investigate, not a
conclusion.
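In code, the confidence computation is a one-liner plus a banding step. The formula is the one stated above; the exact low/medium/high cut-offs below are illustrative assumptions, chosen only to be consistent with "fewer than 5 occurrences are labelled low or medium":

```python
def confidence(occurrences: int) -> tuple[float, str]:
    """Return (score, label); score saturates at 5 occurrences."""
    score = min(1.0, occurrences / 5)
    # Band cut-offs are illustrative -- see src/triage/scorer.py.
    if score < 0.5:
        label = "low"
    elif score < 1.0:
        label = "medium"
    else:
        label = "high"
    return score, label
```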
Recovery dynamics
In addition to a binary "recovered within 3 turns" rate, the scorer reports two dynamics signals:
- Median recovery latency — across events that did recover, how many turns it typically took. A pattern that always recovers in 1 turn is very different from one that always recovers in 3.
- Tail risk — the count of failures that remained unrecovered after 10 turns. This is the "stuck for good" signal, distinct from "slow to recover".
Cross-run recurrence
A pattern that hits once in a single run is noise. A pattern that hits in every run is architectural. Each scored pattern exposes two recurrence signals:
- Appeared in X/Y runs — how many of the analyzed runs contained this pattern at least once. A coordination failure appearing in 11/12 runs is a different animal from one that spikes in a single run.
- Trend — when the input has at least 6 runs (2 × TREND_WINDOW_SIZE), the most recent 3 runs are compared against the 3 runs immediately preceding them as a sliding window. Below 6 runs, the function falls back to a split-half partition so the label is still directional rather than always insufficient data. The pattern is labelled one of:

| Label | Meaning |
|---|---|
| new | absent in the first half, present in the second — emerging |
| increasing | second-half rate ≥ 1.3× first-half rate |
| stable | second-half rate within ±30% of first-half rate |
| decreasing | second-half rate ≤ 0.7× first-half rate |
| resolved | present in the first half, absent in the second |
| insufficient data | fewer than 3 runs — no trend emitted |

Trend treats input order as a chronology proxy. If you pass files in a deterministic chronological order (e.g. runs/*.ndjson sorted by filename timestamp), the signal is meaningful; if you shuffle them, it is not. Sliding-window detection is more sensitive to recent changes than split-half: a regression that hits in the last few runs but was absent before will show as increasing or new rather than being averaged out by older history.
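The labelling rules reduce to a small decision function over the two half-rates. A sketch using the 1.3×/0.7× thresholds stated above (window selection and rate computation omitted):

```python
def trend_label(first_half_rate: float, second_half_rate: float) -> str:
    """Label a pattern's per-run occurrence rates across the two windows."""
    if first_half_rate == 0:
        # Absent before: emerging if present now, otherwise nothing to say.
        return "new" if second_half_rate > 0 else "stable"
    if second_half_rate == 0:
        return "resolved"
    if second_half_rate >= 1.3 * first_half_rate:
        return "increasing"
    if second_half_rate <= 0.7 * first_half_rate:
        return "decreasing"
    return "stable"   # within +/-30% of the earlier rate
```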
Recurrence is reported separately from severity to distinguish persistent low-impact issues from rare catastrophic failures. Fusing it into the final score would double-count frequency effects and hide the distinction an operator needs to triage the two differently.
When triage is not the right tool
triage is opinionated. It is useful when traces are labelled with
failure_classification and action_succeeded, multiple agents are
involved, and the same pattern recurs at least a few times. It is the
wrong tool when:
- Single-agent pipelines. The grouper keys on agent_id; with only one agent, incident patterns collapse into "failures by tool_name", which a two-line jq query handles better.
- Unlabelled traces. If failure_classification is null everywhere, every incident is unclassified and the severity weighting degenerates. Label first, then triage.
- Streaming real-time observability. triage is a batch tool designed for a morning summary. For sub-second alerting, use a streaming pipeline (OpenTelemetry, Prometheus) and reserve triage for the daily digest on top.
- Small sample sizes (< 5 occurrences per pattern). Confidence will be low across the board; the ranking is still directional but should be treated as a starting point for investigation, not a verdict.
- Homogeneous agent architectures. If every agent runs the same code path, coordination_failure vs agent_error stops being a meaningful axis and most of the signal comes from frequency alone, which is easier to get from a grouping query than from this tool.
Observability for Agents Is Not Log Aggregation
Multi-agent systems fail in ways traditional log infrastructure was not designed to surface. A coordination failure between two agents is structurally different from a 500 error: it has a recovery story, a belief-divergence story, and a recurrence story. Treating those as event lines to aggregate loses everything that matters.
Four design choices follow from that stance, and together they are the
reason triage looks the way it does:
- Top-3, not top-100. Operators can act on three things at once. A list of fifty incidents is a dashboard; a dashboard is not triage.
- Recovery latency, not just recovery. "Did the agent succeed within 3 turns" is a binary. "How long did it take, and is the tail bounded" is a control signal. Timeouts and retry budgets are tuned against the latter, not the former.
- Recurrence orthogonal to severity. Persistence and impact are different axes. Fusing them would let a high-frequency wall-bump outrank a rare planner desync — exactly the wrong prioritization for a multi-agent system, where coordination errors propagate downstream and environment errors do not.
- Confidence as a first-class field. A score without an error bar is a guess presented as a number. Surfacing low/medium/high per pattern lets the operator decide whether the ranking is a verdict or a hypothesis worth investigating.
These are not features. They are the modeling stance: measure dynamics and persistence, not just events; show uncertainty rather than hide it; rank by what an operator can act on, not by what is statistically anomalous.
Design decisions
Why severity scoring instead of anomaly detection?
Anomaly detection tells you what is unusual. Severity scoring tells you what matters. A coordination_failure that happens on every run is not an anomaly — it's a chronic problem. Scoring surfaces chronic problems alongside rare-but-catastrophic ones using weights that reflect operational cost.
Why prompt caching for the LLM?
Each pattern triggers one Claude call, but the system prompt is identical across calls. With Anthropic's cache_control: ephemeral, the system prompt is cached server-side after the first call and reused on subsequent ones, cutting per-call cost ~90% for batches of 3+ patterns.
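In request terms, that means attaching cache_control to the shared system block. The sketch below only builds the parameters dict (no API call is made); the model name and prompt text are illustrative, and the dict would be passed to the Anthropic SDK's messages.create:

```python
# Illustrative system prompt -- the real one lives in analyst.py.
SYSTEM_PROMPT = "You are a root-cause analyst for multi-agent trace failures."

def build_request(pattern_summary: str) -> dict:
    """Parameters for anthropic.Anthropic().messages.create(**params)."""
    return {
        "model": "claude-haiku-4-5",   # illustrative model name
        "max_tokens": 512,
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Cached server-side after the first call; subsequent calls
            # in the batch reuse the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [
            {"role": "user", "content": pattern_summary},
        ],
    }
```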
Why an embedded HTML dashboard instead of React/Next.js?
A pitch tool should run in 30 seconds, not 30 minutes. The dashboard ships as a single HTML file with Tailwind + Chart.js via CDN — no build step, no node_modules, no npm install. The template lives at src/triage/templates/dashboard.html and is loaded at server startup. Upgrading to a separate frontend later is straightforward; the backend already returns clean JSON.
Why three top incidents?
Three is the number of things a person can hold in working memory while still being able to act on them. A list of 10 is noise. A list of 1 is overconfident. Three forces the model to prioritize, which is the core value proposition.
Example reports
Four reports are checked in:
- examples/monday-report.md — a synthetic but realistic "Monday morning" report across three weekend runs. Shows the full set of signals: a high-confidence coordination failure with 0% recovery, a self-healing information lag, and a low-confidence tool error flagged for investigation. Generated from runs/examples/monday.ndjson.
- examples/seed42-report.md — a report generated from runs/phase4/events_seed42.ndjson (real dungeon-traces output).
- examples/compare-before-after.md — a triage compare regression report between runs/examples/before.ndjson and runs/examples/monday.ndjson.
- examples/otel-report.md — the same triage report invocation, but reading from runs/examples/otel.json (a small OpenTelemetry-spans file) to demonstrate that the adapter layer feeds the same scoring pipeline regardless of source format.
Development
git clone https://github.com/thebharathkumar/agent-triage.git
cd agent-triage
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest --cov=triage --cov-report=term-missing # tests
ruff check src/triage tests # lint
mypy src/triage # type check
Quality gates enforced in CI:
- 220+ tests, 93%+ coverage
- ruff check clean
- mypy --strict clean
See CONTRIBUTING.md for PR guidelines.
Schema source
The original NDJSON trace format is compatible with dungeon-traces (feature/viewer-enhanced branch), a multi-agent simulation that produces NDJSON event logs for dungeon-navigation runs.
License
MIT © 2026 Bharath kumar R