Observability-platform-agnostic triage runtime for LLM agent traces
docket
An observability-platform-agnostic triage runtime for LLM agent traces.
docket reads traces from your existing observability backend (Phoenix, Langfuse, LangSmith), classifies each one against a YAML failure-mode taxonomy you write, clusters similar failures together, and drafts issues into your tracker (Jira, Linear, GitHub Issues). It is not a new observability backend, an eval framework, or a web UI — it's a thin agent that sits above what you already have.
Human-in-the-loop is the default: drafts queue locally or open in your
$EDITOR for review before they post. Auto-posting requires an explicit
opt-in (auto_post_threshold).
Quickstart (5 minutes)
The fastest path to a working setup: a local Phoenix backend + a GitHub-Issues tracker.
1. Install
pip install docket-runtime
# or: uv pip install docket-runtime
2. Bring up Phoenix
docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
Send your agent's traces to http://localhost:6006 via the
OpenInference instrumentation of your choice (any OTLP-compatible
instrumentation works — see docs/local-phoenix.md for ingestion
recipes).
3. Configure credentials
export ANTHROPIC_API_KEY="sk-ant-..."  # for the llm_judge detectors
export OPENAI_API_KEY="sk-..."         # for clustering embeddings (required even with an Anthropic classifier)
export GITHUB_TOKEN="ghp_..."          # PAT with Issues write
4. Run
docket run \
--backend phoenix \
--phoenix-url http://localhost:6006 \
--tracker github \
--github-owner YOUR_GH_USER \
--github-repo docket-issues \
--rubric docket.dev/builtin/agents/v1 \
--since 1h
That's it. The pipeline:
- Pulls the last hour of traces from Phoenix.
- Runs each one through the agents/v1 failure-mode rubric.
- Clusters positive classifications per mode.
- Drafts one issue per cluster into ~/.docket/queued-issues/<run-id>/.
- Looks at your GitHub repo for matching open issues (dedup by labels + embedded provenance) and comments on existing issues that grew, or leaves new ones in the local queue for --review.
- Prints a markdown report.
Add --review to walk each queued draft through $EDITOR + accept/reject + post. Add --auto-post-threshold high to auto-post critical and high severity drafts. Add --dry-run to price a window before committing to it.
For scheduled triage, swap run for the daemon:
docket serve --interval 1h ... # same flags as run
Each tick processes exactly the window since the last successful tick —
no gaps, no overlap — and a failed tick retries its window instead of
dropping it. (Plain cron + docket run works too; serve just
does the window bookkeeping for you.)
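The window bookkeeping can be pictured with a minimal sketch. This is an illustration under stated assumptions, not docket's implementation: `WindowTracker` and its in-memory cursor are hypothetical names, and how serve actually persists the last successful tick is not described here.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Tuple

Window = Tuple[datetime, datetime]


class WindowTracker:
    """Hypothetical helper: remembers where the last successful window ended."""

    def __init__(self, start: datetime) -> None:
        self.last_success_end = start

    def tick(
        self, now: datetime, process: Callable[[datetime, datetime], None]
    ) -> Optional[Window]:
        """Process [last_success_end, now) and advance the cursor only on success."""
        window = (self.last_success_end, now)
        try:
            process(*window)
        except Exception:
            # The cursor is left untouched, so the next tick covers this span again.
            return None
        self.last_success_end = now
        return window
```

Because the cursor moves only after a successful tick, consecutive windows share an endpoint (no gap, no overlap), and a window that failed is folded into the next attempt rather than dropped.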
For other backends and trackers, see docs/quickstart.md (full matrix:
Phoenix/Langfuse/LangSmith × Jira/Linear/GitHub).
What it does
docket runs a small pipeline of LLM-driven subagents over your existing traces:
┌──────────────────────────┐
│ Phoenix / Langfuse / │
│ LangSmith trace backend │ <- you already have this
└────────────┬─────────────┘
│ trace fetch (read-only by default)
▼
┌─────────────────────┐
│ classifier subagent │ rubric: YAML failure-mode taxonomy
└──────────┬──────────┘ (built-in or your own)
▼
┌─────────────────────┐
│ clusterer subagent │ embeddings + HDBSCAN per mode
└──────────┬──────────┘
▼
┌─────────────────────┐
│ drafter subagent │ one IssueDraft per cluster, with
└──────────┬──────────┘ embedded provenance for dedup
▼
┌─────────────────────┐
│ poster subagent │ dedup against tracker, then
└──────────┬──────────┘ comment / create / queue
▼
┌──────────────────────────┐
│ Jira / Linear / GitHub │
└──────────────────────────┘
Read-only by default. Annotations write back to the trace backend
only when you pass --annotate. Issues post to the tracker only when
their severity meets auto_post_threshold (default: never) or when
you opt in via --review.
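The auto-post rule can be sketched as a small predicate. This is an illustration rather than docket's code: `should_auto_post` is a hypothetical name, the severity levels below `high` are assumed (only `critical` and `high` are named in this README), and the `listing_truncated` argument mirrors the dedup safeguard described under "Bounded by default".

```python
from typing import Optional

# Assumed severity ladder, most severe first. Only "critical" and "high"
# are named in the README; "medium" and "low" are placeholders.
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def should_auto_post(
    severity: str,
    threshold: Optional[str],
    listing_truncated: bool = False,
) -> bool:
    """Return True only when a draft may be posted without human review."""
    if threshold is None:
        # Default: never auto-post; drafts stay in the local queue.
        return False
    if listing_truncated:
        # "No duplicate found" was not proven, so fall back to review.
        return False
    # A lower index means a more severe level.
    return SEVERITY_ORDER.index(severity) <= SEVERITY_ORDER.index(threshold)
```

With a threshold of `high`, both `critical` and `high` drafts pass; everything else stays queued for review.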
Bounded by default. Every run is capped by max_traces_per_run
(default 1000, measured after sampling and checkpoint subtraction);
exceeding the cap aborts loudly before any trace is fetched — never a
silent truncation. An optional max_estimated_cost_usd adds a dollar
ceiling on the pre-flight cost estimate. --dry-run reports both gates
and exits non-zero iff the real run would abort, so CI can use it as a
preflight check. For production-scale windows, --sample N bounds the
work with --strategy uniform, --strategy errors-only (root-errored
traces, filter pushed down to the backend), or --strategy stratified --stratify-by status|latency_bucket|tag:<key> (equal allocation so rare
strata — errors, small tenants, tail latencies — get seen). Adapters
flag truncated listings — trace and tracker alike — instead of silently
stopping at their pagination ceiling; when the open-issue listing is
truncated during dedup, drafts are queued for review instead of
auto-posted, since "no duplicate found" was not proven.
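To see why equal allocation helps rare strata, here is a minimal sketch of stratified sampling with an equal per-stratum budget. It is not docket's sampler: `stratified_sample` is a hypothetical function, and redistributing the unused budget of small strata to larger ones is an assumption about how leftovers might be handled.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def stratified_sample(
    items: Sequence[T],
    key: Callable[[T], Hashable],
    n: int,
    seed: int = 0,
) -> List[T]:
    """Draw up to n items, splitting the budget equally across strata.

    Strata smaller than their share are taken whole, and the unused budget
    is redistributed to the strata that remain (an assumed policy).
    """
    rng = random.Random(seed)
    strata: Dict[Hashable, List[T]] = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    picked: List[T] = []
    remaining = n
    # Visit the smallest strata first so their leftover budget flows onward.
    ordered = sorted(strata.values(), key=len)
    for i, group in enumerate(ordered):
        share = remaining // (len(ordered) - i)
        take = min(share, len(group))
        picked.extend(rng.sample(group, take))
        remaining -= take
    return picked
```

For 95 successful traces and 5 errored ones, a budget of 20 stratified by status keeps all 5 errors, whereas a uniform draw of 20 would contain about one on average.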
State lives in the backends, not here. docket doesn't own a
database. Annotations key off (trace_id, run_id, rubric_version, mode_id) in the observability backend; issues key off labels +
HTML-comment provenance in the tracker. Re-running the same window is
idempotent.
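A rough sketch of both keying schemes follows. Everything named here is hypothetical: `FakeBackend` is an in-memory stand-in for an observability backend, and the `docket-provenance` marker and its JSON payload are an assumed format for the HTML-comment provenance, not the one docket actually writes.

```python
import json
import re
from typing import Dict, NamedTuple, Optional


class AnnotationKey(NamedTuple):
    trace_id: str
    run_id: str
    rubric_version: str
    mode_id: str


class FakeBackend:
    """In-memory stand-in for a backend's annotation store."""

    def __init__(self) -> None:
        self.annotations: Dict[AnnotationKey, dict] = {}

    def upsert(self, key: AnnotationKey, payload: dict) -> None:
        # Writing the same key twice overwrites instead of duplicating.
        self.annotations[key] = payload


# Assumed marker format; an HTML comment is invisible in rendered issues.
PROVENANCE_RE = re.compile(
    r"<!--\s*docket-provenance:\s*(\{.*?\})\s*-->", re.DOTALL
)


def embed_provenance(body: str, provenance: dict) -> str:
    """Append provenance to an issue body as a hidden HTML comment."""
    payload = json.dumps(provenance, sort_keys=True)
    return f"{body}\n\n<!-- docket-provenance: {payload} -->"


def extract_provenance(body: str) -> Optional[dict]:
    """Recover embedded provenance, or None if the body carries none."""
    match = PROVENANCE_RE.search(body)
    return json.loads(match.group(1)) if match else None
```

Since every write is addressed by a stable key held in the external system, replaying the same writes leaves one annotation per key, and a later run can recognize an issue it drafted earlier by reading the marker back.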
Built-in rubrics
Four reference rubrics ship with the package; each is a starting point intended to be imported into a domain-specific rubric you maintain.
| URI | Modes |
|---|---|
| docket.dev/builtin/agents/v1 | 6: hallucination, infinite loop, premature termination, unsafe tool call, refusal leakage, bad handoff |
| docket.dev/builtin/rag/v1 | 4: off-corpus answer, missing citation, stale retrieval, context overflow |
| docket.dev/builtin/routing/v1 | 4: wrong-skill routing, capability mismatch, dead-end transfer, oscillation |
| docket.dev/builtin/multi-agent/v1 | 4: handoff context loss, conflicting instructions, role drift, shared-memory corruption |
Reference them by URI on the CLI (--rubric docket.dev/builtin/rag/v1)
or import them into your own rubric:
apiVersion: docket.dev/v1
kind: Rubric
metadata:
  name: my-prod-agents
  version: 1.0.0
imports:
  - docket.dev/builtin/agents/v1
  - docket.dev/builtin/rag/v1
modes:
  - id: refund-without-confirmation
    severity: critical
    detection:
      type: tool_call
      tool_calls: [process_refund]
  # ... your modes go here
Validate with docket validate ./my-rubric.yaml. Smoke-test the
examples with docket self-test ./my-rubric.yaml.
Architecture overview
- OpenInference is the canonical trace schema. Adapters normalize to it; the runtime never sees backend-specific shapes.
- MCP is the integration protocol for both trace backends and trackers. The CLI ships one MCP server binary per adapter (docket-adapter-phoenix, docket-adapter-jira, …) that you can run standalone or invoke through docket run.
- deepagents is the agent harness; we don't reimplement planning, virtual filesystems, or subagent delegation.
- Stateless runtime. Annotations live in the backend; issues live in the tracker. No local database.
- Pydantic v2 + httpx + asyncio throughout. No bespoke SDK dependency per backend — every adapter is plain HTTP.
Execution modes
docket ships two execution modes over the same six pipeline stages
(list_traces → classify_traces → annotate_classifications →
cluster_classifications → draft_issues → write_report):
- Deterministic pipeline (default). Stages run in a fixed order from plain Python. Predictable cost, reproducible across runs, easy to debug. Use this for batch / cron / CI, anywhere SLOs and cost forecasting matter.
- deepagents harness (--agent). Same six stages exposed as tools to a top-level planning LLM. Use this for exploratory / debugging runs today; the harness is the substrate the project commits to for future interactive surfaces (chat-driven triage, incident investigation, rubric authoring). The tools and entry points for those surfaces are post-v1.0 work; see docs/design.md §4.2 and §7 (Phases 14–15).
Both modes share the same subagents, the same run_id, and the same
annotation idempotency, so investments in one benefit the other.
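The deterministic mode's shape can be sketched in a few lines. The stage names come from the list above; everything else (`make_stub`, `run_pipeline`, the dict-based context) is a hypothetical stand-in, and the stage bodies are stubs that only record that they ran.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def make_stub(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(ctx: Context) -> Context:
        return {**ctx, "ran": ctx.get("ran", []) + [name]}

    return stage


# Stage names as listed in the README; the implementations are stubs.
STAGE_NAMES = [
    "list_traces",
    "classify_traces",
    "annotate_classifications",
    "cluster_classifications",
    "draft_issues",
    "write_report",
]
PIPELINE: List[Tuple[str, Stage]] = [(n, make_stub(n)) for n in STAGE_NAMES]


def run_pipeline(ctx: Context) -> Context:
    """Run every stage once, in declaration order, threading the context through."""
    for _name, stage in PIPELINE:
        ctx = stage(ctx)
    return ctx
```

The order is fixed by a plain list rather than chosen by a planner at run time, which is what makes cost and behavior predictable from one run to the next.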
For the full design, see docs/design.md. Per-backend
and per-tracker setup is covered in docs/quickstart.md.
Documentation
Start at the docs index.
Guides
- Quickstart — every backend × tracker pair
- Concepts — the vocabulary in five minutes
- Adapters — the integration contracts + how to add a backend or tracker
- Benchmarks — wall time and cost for a 1000-trace run
- Design document — every architectural decision, with rationale
API reference
- CLI: every command, flag, and exit code for run, serve, validate, self-test, and the adapter binaries
- Configuration: docket.yaml schema, all env vars, precedence rules, defaults
- Python API: embed the pipeline as a library (run_triage_pipeline, adapters, providers, models, errors)
- MCP servers: tool contracts for driving the adapters from any MCP client
- Rubric DSL: the complete taxonomy spec, with a worked example rubric
Status
v1.0. Three trace-backend adapters and three tracker adapters at
parity, four built-in rubrics, deterministic + agent-harness execution
modes, daemon mode, budget guardrails and sampling. The
changelog has the full feature list. Post-1.0 roadmap
(streaming, sharding, interactive surfaces) lives in
docs/design.md §7.
Contributing
Rubrics and adapters are the highest-leverage contributions, and both have step-by-step guides: see CONTRIBUTING.md. Bug reports and adapter proposals have issue templates. Security issues go through SECURITY.md — never a public issue.
License
Apache 2.0. See LICENSE.