trace-eval

Tell me why this agent run went wrong and what to change next.

A deterministic-first CLI for evaluating AI agent traces. No dashboards, no LLM-as-judge, no cloud dependency. Built for solo builders and small AI-native teams using coding/CLI agents.

What It Does

Run trace-eval on an agent trace file and get:

  • A scorecard — 0-100 across 5 dimensions
  • Root causes — critical and high-severity issues surfaced first
  • Actionable suggestions — what to fix, not just that something broke
  • Before/after comparison — see if your changes actually improved things

See It in Action

# 1. Install (uv or pip)
uv sync --all-extras

# 2. Validate a trace file
trace-eval validate trace.jsonl
# Schema validation PASSED — 8 events, field coverage bars printed

# 3. Run a scorecard
trace-eval run trace.jsonl
# ============================================================
#   TRACE-EVAL SCORECARD  Total: 32.4/100
# ============================================================
#
# LIKELY ROOT CAUSES:
#   - Use canonical retrieval entrypoint
#   - Stop accessing deprecated files
#   - Context pressure exceeded 90% — reduce prompt size
#
# DIMENSION SCORES:
#   reliability             5.0  (high)
#   efficiency             77.4  (medium)
#   retrieval               0.0  (high)
#   tool_discipline        80.0  (high)
#   context                32.0  (high)

# 4. Compare before vs after a fix
trace-eval compare before.jsonl after.jsonl
# Total score: 67.5 -> 99.3  Change: +31.9 (improved)
#
# FLAG CHANGES:
#   [RESOLVED] reliability_errors
#   [RESOLVED] retrieval_no_entrypoint
#   [RESOLVED] tool_retries

# 5. CI gate — fails the build when the score is below a threshold
trace-eval ci trace.jsonl --min-score 80
# PASS (exit 0) or FAIL (exit 1)

Good Run

trace-eval run examples/hermes_good.jsonl
============================================================
  TRACE-EVAL SCORECARD  Total: 98.9/100
============================================================

DIMENSION SCORES:
  reliability           100.0  (high)
  efficiency             94.5  (medium)
  retrieval             100.0  (high)
  tool_discipline       100.0  (high)
  context               100.0  (high)

Bad Run

trace-eval run examples/hermes_bad.jsonl
============================================================
  TRACE-EVAL SCORECARD  Total: 32.4/100
============================================================

LIKELY ROOT CAUSES:
  - Use canonical retrieval entrypoint
  - Stop accessing deprecated files
  - Context pressure exceeded 90% — reduce prompt size

DIMENSION SCORES:
  reliability             5.0  (high)
  efficiency             77.4  (medium)
  retrieval               0.0  (high)
  tool_discipline        80.0  (high)
  context                32.0  (high)

FRICTION FLAGS (sorted by severity):
  [CRITICAL] retrieval_no_entrypoint
    -> Use canonical retrieval entrypoint
  [CRITICAL] retrieval_deprecated_file @event 9
    -> Stop accessing deprecated files
  [CRITICAL] context_pressure_critical
    -> Context pressure exceeded 90% — reduce prompt size
  [HIGH] retrieval_fallback_search
    -> Avoid fallback search -- use primary retrieval
  [HIGH] tool_timeout @event 5
    -> 1 tool call(s) timed out
  [MEDIUM] reliability_errors @event 3
    -> Review 3 error(s) at event indices [3, 4, 8]
  [MEDIUM] context_compression
    -> Context compression triggered 1 time(s)

Compare

trace-eval compare examples/before.jsonl examples/after.jsonl
COMPARISON: before vs after
=======================================================
  Total score:   67.5 ->   99.3
  Change:      +31.9 (improved)

  reliability            45.0 ->  100.0  ^ +55.0
  efficiency             93.5 ->   96.8  ^ +3.2
  retrieval              50.0 ->  100.0  ^ +50.0
  tool_discipline        90.0 ->  100.0  ^ +10.0
  context                95.0 ->  100.0  ^ +5.0

  FLAG CHANGES:
    [RESOLVED] reliability_errors
    [RESOLVED] reliability_terminal_partial
    [RESOLVED] retrieval_no_entrypoint
    [RESOLVED] tool_retries

Quick Start

# Install
pip install -e .
# Or with uv:
uv sync --all-extras

# Validate a trace
trace-eval validate examples/hermes_good.jsonl

# Run a scorecard
trace-eval run examples/hermes_good.jsonl

# Machine-readable output (for agents)
trace-eval run examples/hermes_bad.jsonl --format json

# Compare before/after
trace-eval compare examples/before.jsonl examples/after.jsonl

# CI gate
trace-eval ci examples/hermes_good.jsonl --min-score 80

Scoring Dimensions

Dimension        Weight  What It Measures
Reliability      35%     Did it succeed? Errors, timeouts, partial results
Efficiency       20%     Token usage, cost, tool call density
Retrieval        20%     Canonical entrypoint, deprecated files, fallback search
Tool Discipline  15%     Retries, redundant calls, timeouts
Context          10%     Context pressure, warnings, compression events

Weights are configurable. When a dimension is unscorable, its weight is redistributed proportionally across the remaining scorable dimensions.
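The redistribution rule can be sketched as follows (a minimal illustration, not the tool's actual implementation; only the default weights come from the table above):

```python
def redistribute_weights(weights, unscorable):
    """Proportionally redistribute weight from unscorable dimensions
    across the remaining scorable ones, so weights still sum to 1."""
    scorable = {d: w for d, w in weights.items() if d not in unscorable}
    total = sum(scorable.values())
    return {d: w / total for d, w in scorable.items()}

# Default weights from the table above.
DEFAULT_WEIGHTS = {
    "reliability": 0.35,
    "efficiency": 0.20,
    "retrieval": 0.20,
    "tool_discipline": 0.15,
    "context": 0.10,
}

# If the Context judge is unscorable, its 10% spreads over the other 90%:
# reliability becomes 0.35 / 0.90, and so on.
adjusted = redistribute_weights(DEFAULT_WEIGHTS, {"context"})
```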

Examples

Trace                 Source                     Events  Score  File
Good run (synthetic)  Modeled after Hermes       8       98.9   examples/hermes_good.jsonl
Bad run (synthetic)   Modeled after Hermes       11      32.4   examples/hermes_bad.jsonl
Real Claude Code      Stillness project session  3,694   60.3   examples/claude_code_real.jsonl
OpenClaw before       Real OpenClaw session      158     44.3   examples/openclaw_before.jsonl
OpenClaw after        Simulated fixes            158     57.3   examples/openclaw_after.jsonl
Compare demo          before vs after                    +66.5  examples/case_study.md

See examples/case_study.md for a complete walkthrough of bad run → diagnosis → fix → compare.

Adapters

Format               Adapter        Capability
JSONL (.jsonl)       Generic JSONL  Full — all fields available if present in file
Hermes SQLite (.db)  Hermes         Honest/lossy — populates what exists, nulls what doesn't

Adding your own adapter? The adapter interface is simple: implement load(path) -> Trace and capability_report(trace) -> dict.

Agent Integration (--format json)

The --format json flag produces stable, machine-readable output designed for agent consumption. An AI agent that just completed a task can pipe its trace through trace-eval and use the results to self-diagnose and guide remediation.

How an agent uses it

{
  "total_score": 32.43,
  "dimension_scores": {
    "reliability": 5.0,
    "efficiency": 77.42,
    "retrieval": 0.0,
    "tool_discipline": 80.0,
    "context": 32.0
  },
  "friction_flags": [
    {
      "id": "retrieval_no_entrypoint",
      "severity": "critical",
      "dimension": "retrieval",
      "event_index": null,
      "suggestion": "Use canonical retrieval entrypoint"
    }
  ],
  "likely_causes": [
    "Use canonical retrieval entrypoint",
    "Stop accessing deprecated files"
  ],
  "suggestions": [
    "Use canonical retrieval entrypoint",
    "Stop accessing deprecated files"
  ],
  "scorable_dimensions": ["reliability", "efficiency", "retrieval", "tool_discipline", "context"],
  "unscorable_dimensions": [],
  "judge_coverage": { "...": "per-judge scorable + confidence" },
  "adapter_capability_report": { "...": "field availability from adapter" },
  "failed_thresholds": []
}

Agent remediation pattern

  1. Run trace-eval run trace.jsonl --format json
  2. Parse likely_causes — these are the root-cause hypotheses
  3. Parse suggestions — each one maps to a concrete fix
  4. Apply fixes, re-run the agent, compare the new trace
  5. Use trace-eval compare old.jsonl new.jsonl --format json to quantify improvement

Fields are stable and typed. suggestions is a plain string array designed for direct use in agent prompts or remediation logic.
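In Python, the parsing step of that loop might look like this (a sketch: the payload below mirrors the documented schema, and in practice it would come from running trace-eval with --format json rather than a literal string):

```python
import json

# In practice, capture the CLI's stdout, e.g. via subprocess, and parse it.
# Here a literal payload mirroring the documented schema stands in for it.
payload = '''{
  "total_score": 32.43,
  "likely_causes": ["Use canonical retrieval entrypoint"],
  "suggestions": ["Use canonical retrieval entrypoint",
                  "Stop accessing deprecated files"],
  "friction_flags": [{"id": "retrieval_no_entrypoint",
                      "severity": "critical",
                      "dimension": "retrieval",
                      "event_index": null,
                      "suggestion": "Use canonical retrieval entrypoint"}]
}'''

report = json.loads(payload)

# Triage: handle critical flags first, then feed suggestions to the agent.
critical = [f for f in report["friction_flags"] if f["severity"] == "critical"]
fix_prompt = "Apply these fixes:\n" + "\n".join(
    f"- {s}" for s in report["suggestions"]
)
```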

What's Coming

  • More adapters (OpenAI traces, LangSmith, LangGraph, custom formats)
  • Score profiles (balanced, reliability_first, cost_conscious)
  • Baseline comparison (cost vs similar tasks)
  • Parallelization analysis in Tool Discipline

Known Limitations

Hermes SQLite Adapter (lossy by design)

The Hermes adapter maps the real Hermes DB schema (sessions + messages tables) to the canonical trace format. It is honest and lossy — it populates what exists and nulls what doesn't. It does not synthesize span IDs, fabricate relationships, or guess at missing data.

Fields the Hermes schema does not provide:

Missing field                          Impact on scoring
error_type                             Reliability judge counts errors but can't classify them
retrieval_entrypoint, retrieval_steps  Retrieval judge always scores 50.0 (no entrypoint)
context_pressure_pct, context_tokens   Context judge returns unscorable — weights redistributed
latency_ms                             Efficiency can't penalize slow tool calls
span_id, parent_span_id                No trace-level deduplication or call tree analysis
cost_estimate                          Efficiency cost sub-score unavailable (Hermes stores cost in sessions, not events)

What this means for Hermes users:

  • Retrieval score will always be 50.0 (the no-entrypoint baseline) — this is correct behavior, not a bug
  • Context judge will show N/A (low) — the weight (10%) redistributes to other judges
  • Errors embedded in tool result content (e.g., "error": "File not found") are not parsed by the adapter — you must add status: "error" to the JSONL if you want them scored

Generic JSONL Adapter

  • Requires the canonical field names (event_type, status, tool_name, etc.)
  • Fields not present are simply absent — no synthetic values
  • Does not parse proprietary trace formats (OpenAI, LangSmith) — use those formats' native tools to export to canonical JSONL first
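Exporting another system's events to canonical JSONL might look like this (a sketch: the foreign-event shape is invented, and only the canonical field names event_type, status, and tool_name come from the list above):

```python
import json

def to_canonical(events):
    """Map hypothetical foreign trace events onto canonical field names,
    emitting one JSON object per line (JSONL)."""
    lines = []
    for e in events:
        canonical = {
            "event_type": "tool_call" if e.get("kind") == "tool" else "message",
            "status": "error" if e.get("failed") else "ok",
        }
        if "name" in e:
            canonical["tool_name"] = e["name"]
        # Fields the source does not provide stay absent: no synthetic values.
        lines.append(json.dumps(canonical))
    return "\n".join(lines)

foreign = [
    {"kind": "tool", "name": "read_file", "failed": False},
    {"kind": "tool", "name": "read_file", "failed": True},
]
jsonl = to_canonical(foreign)
```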

Scoring

  • Scores are relative to the trace content, not to a baseline of "similar tasks"
  • Tool Discipline does not yet analyze parallel tool calls
  • Weight redistribution is proportional — some users may want custom profiles

What This Is NOT

  • A dashboard — CLI-first, local-first
  • An LLM judge — all scoring is deterministic
  • An auto-fix tool — it tells you what to fix, not how
  • A broad observability platform — focused on agent trace evaluation

License

MIT
