Observability-platform-agnostic triage runtime for LLM agent traces

docket

CI · License: Apache 2.0 · Python 3.11+

An observability-platform-agnostic triage runtime for LLM agent traces.

docket reads traces from your existing observability backend (Phoenix, Langfuse, LangSmith), classifies each one against a YAML failure-mode taxonomy you write, clusters similar failures together, and drafts issues into your tracker (Jira, Linear, GitHub Issues). It is not a new observability backend, an eval framework, or a web UI — it's a thin agent that sits above what you already have.

Human-in-the-loop is the default: drafts queue locally or open in your $EDITOR for review before they post. Auto-posting requires an explicit opt-in (auto_post_threshold).


Quickstart (5 minutes)

The fastest path to a working setup: a local Phoenix backend + a GitHub-Issues tracker.

1. Install

pip install docket-runtime
# or:  uv pip install docket-runtime

2. Bring up Phoenix

docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest

Send your agent's traces to http://localhost:6006 via the OpenInference instrumentation of your choice (any OTLP-compatible instrumentation works — see docs/local-phoenix.md for ingestion recipes).

3. Configure credentials

export ANTHROPIC_API_KEY="sk-ant-..."         # for the llm_judge detectors
export OPENAI_API_KEY="sk-..."                # for clustering embeddings
                                              # (required even with an
                                              # Anthropic classifier)
export GITHUB_TOKEN="ghp_..."                 # PAT with Issues write

4. Run

docket run \
  --backend phoenix \
  --phoenix-url http://localhost:6006 \
  --tracker github \
  --github-owner YOUR_GH_USER \
  --github-repo docket-issues \
  --rubric docket.dev/builtin/agents/v1 \
  --since 1h

That's it. The pipeline:

  1. Pulls the last hour of traces from Phoenix.
  2. Runs each one through the agents/v1 failure-mode rubric.
  3. Clusters positive classifications per mode.
  4. Drafts one issue per cluster into ~/.docket/queued-issues/<run-id>/.
  5. Looks at your GitHub repo for matching open issues (dedup by labels + embedded provenance) and comments on existing issues that grew, or leaves new ones in the local queue for --review.
  6. Prints a markdown report.

Add --review to walk each queued draft through $EDITOR + accept/reject + post. Add --auto-post-threshold high to auto-post critical and high severity drafts. Add --dry-run to price a window before committing to it.

For scheduled triage, swap run for the daemon:

docket serve --interval 1h ...   # same flags as run

Each tick processes exactly the window since the last successful tick — no gaps, no overlap — and a failed tick retries its window instead of dropping it. (Plain cron + docket run works too; serve just does the window bookkeeping for you.)
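
The window bookkeeping can be pictured as a cursor that advances only on success. The sketch below is an assumption about the general technique, not docket's code: `WindowCursor` and its in-memory `last_success` field are hypothetical names, and where the real daemon keeps its cursor is not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


class WindowCursor:
    """Tracks the end of the last successfully processed window."""

    def __init__(self, start: datetime) -> None:
        self.last_success = start

    def tick(self, now: datetime, process: Callable[[datetime, datetime], None]) -> bool:
        window_start, window_end = self.last_success, now
        try:
            process(window_start, window_end)
        except Exception:
            # Cursor is left unchanged, so the next tick starts from the
            # same point and the failed window is covered again.
            return False
        self.last_success = window_end
        return True


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
    cursor = WindowCursor(t0)
    cursor.tick(t0 + timedelta(hours=1), lambda start, end: print(start, "->", end))
```

Because each window starts exactly where the previous successful one ended, consecutive windows neither overlap nor leave gaps.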

For other backends and trackers, see docs/quickstart.md (full matrix: Phoenix/Langfuse/LangSmith × Jira/Linear/GitHub).


What it does

docket runs a small pipeline of LLM-driven subagents over your existing traces:

┌──────────────────────────┐
│ Phoenix / Langfuse /     │
│ LangSmith trace backend  │  <- you already have this
└────────────┬─────────────┘
             │ trace fetch (read-only by default)
             ▼
   ┌─────────────────────┐
   │ classifier subagent │  rubric: YAML failure-mode taxonomy
   └──────────┬──────────┘     (built-in or your own)
              ▼
   ┌─────────────────────┐
   │ clusterer subagent  │  embeddings + HDBSCAN per mode
   └──────────┬──────────┘
              ▼
   ┌─────────────────────┐
   │ drafter subagent    │  one IssueDraft per cluster, with
   └──────────┬──────────┘     embedded provenance for dedup
              ▼
   ┌─────────────────────┐
   │ poster subagent     │  dedup against tracker, then
   └──────────┬──────────┘     comment / create / queue
              ▼
┌──────────────────────────┐
│ Jira / Linear / GitHub   │
└──────────────────────────┘

Read-only by default. Annotations write back to the trace backend only when you pass --annotate. Issues post to the tracker only when their severity meets auto_post_threshold (default: never) or when you opt in via --review.

Bounded by default. Every run is capped by max_traces_per_run (default 1000, measured after sampling and checkpoint subtraction); exceeding the cap aborts loudly before any trace is fetched — never a silent truncation. An optional max_estimated_cost_usd adds a dollar ceiling on the pre-flight cost estimate. --dry-run reports both gates and exits non-zero iff the real run would abort, so CI can use it as a preflight check. For production-scale windows, --sample N bounds the work with --strategy uniform, --strategy errors-only (root-errored traces, filter pushed down to the backend), or --strategy stratified --stratify-by status|latency_bucket|tag:<key> (equal allocation so rare strata — errors, small tenants, tail latencies — get seen). Adapters flag truncated listings — trace and tracker alike — instead of silently stopping at their pagination ceiling; when the open-issue listing is truncated during dedup, drafts are queued for review instead of auto-posted, since "no duplicate found" was not proven.
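
To make equal allocation and the cap concrete, here is a minimal sketch under stated assumptions. `stratified_sample`, `check_cap`, and `RunTooLarge` are hypothetical names, and the round-robin quota scheme is one plausible way to implement equal allocation, not necessarily the one docket uses.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable


class RunTooLarge(RuntimeError):
    """Raised before any fetch when the planned run exceeds the cap."""


def check_cap(planned: int, max_traces_per_run: int = 1000) -> None:
    if planned > max_traces_per_run:
        raise RunTooLarge(f"{planned} traces planned, cap is {max_traces_per_run}")


def stratified_sample(
    traces: list[dict], key: Callable[[dict], Hashable], n: int, seed: int = 0
) -> list[dict]:
    rng = random.Random(seed)
    strata: dict[Hashable, list[dict]] = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)

    # Hand out slots one at a time, round-robin, so every stratum gets an
    # equal share until it runs out of members.
    quota = {name: 0 for name in strata}
    remaining = n
    while remaining > 0:
        progressed = False
        for name in strata:
            if remaining == 0:
                break
            if quota[name] < len(strata[name]):
                quota[name] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every stratum is exhausted
            break

    sample: list[dict] = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, quota[name]))
    return sample
```

With 95 "ok" traces and 5 "error" traces, a sample of 10 takes all 5 errors, where a uniform sample would be expected to take well under one.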

State lives in the backends, not here. docket doesn't own a database. Annotations key off (trace_id, run_id, rubric_version, mode_id) in the observability backend; issues key off labels + HTML-comment provenance in the tracker. Re-running the same window is idempotent.


Built-in rubrics

Four reference rubrics ship with the package; each is a starting point intended to be imported into a domain-specific rubric you maintain.

URI                                 Count  Modes
docket.dev/builtin/agents/v1        6      hallucination, infinite loop, premature termination, unsafe tool call, refusal leakage, bad handoff
docket.dev/builtin/rag/v1           4      off-corpus answer, missing citation, stale retrieval, context overflow
docket.dev/builtin/routing/v1       4      wrong-skill routing, capability mismatch, dead-end transfer, oscillation
docket.dev/builtin/multi-agent/v1   4      handoff context loss, conflicting instructions, role drift, shared-memory corruption

Reference them by URI on the CLI (--rubric docket.dev/builtin/rag/v1) or import them into your own rubric:

apiVersion: docket.dev/v1
kind: Rubric
metadata:
  name: my-prod-agents
  version: 1.0.0
imports:
  - docket.dev/builtin/agents/v1
  - docket.dev/builtin/rag/v1
modes:
  - id: refund-without-confirmation
    severity: critical
    detection:
      type: tool_call
      tool_calls: [process_refund]
    # ... your modes go here

Validate with docket validate ./my-rubric.yaml. Smoke-test the examples with docket self-test ./my-rubric.yaml.
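
As a rough mental model of a tool_call detection, the sketch below treats a mode as firing when any listed tool appears among the tools a trace called. That matching rule is an assumption, as are the `ToolCallMode` class and the `matches` helper; the Rubric DSL reference defines the real semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCallMode:
    id: str
    severity: str
    tool_calls: tuple[str, ...]


def matches(mode: ToolCallMode, called_tools: list[str]) -> bool:
    """Assumed rule: the mode fires if any listed tool was called."""
    return any(name in mode.tool_calls for name in called_tools)


# Mirrors the mode from the YAML example above.
refund_mode = ToolCallMode(
    id="refund-without-confirmation",
    severity="critical",
    tool_calls=("process_refund",),
)
```

A deterministic check like this needs no LLM call, which is presumably why a rubric can mix it with llm_judge detectors.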


Architecture overview

  • OpenInference is the canonical trace schema. Adapters normalize to it; the runtime never sees backend-specific shapes.
  • MCP is the integration protocol for both trace backends and trackers. The CLI ships one MCP server binary per adapter (docket-adapter-phoenix, docket-adapter-jira, …) that you can run standalone or invoke through docket run.
  • deepagents is the agent harness; we don't reimplement planning, virtual filesystems, or subagent delegation.
  • Stateless runtime. Annotations live in the backend; issues live in the tracker. No local database.
  • Pydantic v2 + httpx + asyncio throughout. No bespoke SDK dependency per backend — every adapter is plain HTTP.

Execution modes

docket ships two execution modes over the same six pipeline stages (list_traces → classify_traces → annotate_classifications → cluster_classifications → draft_issues → write_report):

  • Deterministic pipeline (default). Stages run in a fixed order from plain Python. Predictable cost, reproducible across runs, easy to debug. Use this for batch / cron / CI, anywhere SLOs and cost forecasting matter.
  • deepagents harness (--agent). Same six stages exposed as tools to a top-level planning LLM. Use this for exploratory / debugging runs today; the harness is the substrate the project commits to for future interactive surfaces (chat-driven triage, incident investigation, rubric authoring). The tools and entry points for those surfaces are post-v1.0 work — see docs/design.md §4.2 and §7 (Phases 14–15).

Both modes share the same subagents, the same run_id, and the same annotation idempotency, so investments in one benefit the other.
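
The deterministic mode amounts to calling the stages in a fixed order from plain Python. The sketch below shows that shape only: the stage names come from this README, while `run_pipeline`, the dict-based state, and the `completed` list are hypothetical and unrelated to the real run_triage_pipeline signature.

```python
from typing import Callable

Stage = Callable[[dict], dict]

STAGE_ORDER = [
    "list_traces",
    "classify_traces",
    "annotate_classifications",
    "cluster_classifications",
    "draft_issues",
    "write_report",
]


def run_pipeline(stages: dict[str, Stage], state: dict) -> dict:
    """Run every stage once, in the fixed order, threading state through."""
    for name in STAGE_ORDER:
        state = stages[name](state)  # KeyError if a stage is missing
        state.setdefault("completed", []).append(name)
    return state
```

In agent mode the same callables would be offered to a planning LLM as tools, so the order becomes a decision rather than a constant.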

For the full design, see docs/design.md. Per-backend and per-tracker setup is covered in docs/quickstart.md.


Documentation

Start at the docs index.

Guides

  • Quickstart — every backend × tracker pair
  • Concepts — the vocabulary in five minutes
  • Adapters — the integration contracts + how to add a backend or tracker
  • Benchmarks — wall time and cost for a 1000-trace run
  • Design document — every architectural decision, with rationale

API reference

  • CLI — every command, flag, and exit code for run, serve, validate, self-test, and the adapter binaries
  • Configuration — docket.yaml schema, all env vars, precedence rules, defaults
  • Python API — embed the pipeline as a library: run_triage_pipeline, adapters, providers, models, errors
  • MCP servers — tool contracts for driving the adapters from any MCP client
  • Rubric DSL — the complete taxonomy spec, with a worked example rubric

Status

v1.0. Three trace-backend adapters and three tracker adapters at parity, four built-in rubrics, deterministic + agent-harness execution modes, daemon mode, budget guardrails and sampling. The changelog has the full feature list. Post-1.0 roadmap (streaming, sharding, interactive surfaces) lives in docs/design.md §7.

Contributing

Rubrics and adapters are the highest-leverage contributions, and both have step-by-step guides: see CONTRIBUTING.md. Bug reports and adapter proposals have issue templates. Security issues go through SECURITY.md — never a public issue.


License

Apache 2.0. See LICENSE.


