trace-eval

Tell me why this agent run went wrong and what to change next.

A deterministic-first CLI for evaluating AI agent traces. No dashboards, no LLM-as-judge, no cloud dependency. Built for solo builders and small AI-native teams using coding/CLI agents.

What It Does

Run trace-eval on an agent trace file and get:

  • A scorecard — 0-100 across 5 dimensions
  • Root causes — critical and high-severity issues surfaced first
  • Actionable suggestions — what to fix, not just that something broke
  • Before/after comparison — see if your changes actually improved things

See It in Action

# 1. Install (uv or pip)
uv sync --all-extras

# 2. Validate a trace file
trace-eval validate trace.jsonl
# Schema validation PASSED — 8 events, field coverage bars printed

# 3. Run a scorecard
trace-eval run trace.jsonl
# ============================================================
#   TRACE-EVAL SCORECARD  Total: 32.4/100
# ============================================================
#
# LIKELY ROOT CAUSES:
#   - Use canonical retrieval entrypoint
#   - Stop accessing deprecated files
#   - Context pressure exceeded 90% — reduce prompt size
#
# DIMENSION SCORES:
#   reliability             5.0  (high)
#   efficiency             77.4  (medium)
#   retrieval               0.0  (high)
#   tool_discipline        80.0  (high)
#   context                32.0  (high)

# 4. Compare before vs after a fix
trace-eval compare before.jsonl after.jsonl
# Total score: 67.5 -> 99.3  Change: +31.9 (improved)
#
# FLAG CHANGES:
#   [RESOLVED] reliability_errors
#   [RESOLVED] retrieval_no_entrypoint
#   [RESOLVED] tool_retries

# 5. CI gate — fails the build when the score is below a threshold
trace-eval ci trace.jsonl --min-score 80
# PASS (exit 0) or FAIL (exit 1)

Good Run

trace-eval run examples/hermes_good.jsonl
============================================================
  TRACE-EVAL SCORECARD  Total: 98.9/100
============================================================

DIMENSION SCORES:
  reliability           100.0  (high)
  efficiency             94.5  (medium)
  retrieval             100.0  (high)
  tool_discipline       100.0  (high)
  context               100.0  (high)

Bad Run

trace-eval run examples/hermes_bad.jsonl
============================================================
  TRACE-EVAL SCORECARD  Total: 32.4/100
============================================================

LIKELY ROOT CAUSES:
  - Use canonical retrieval entrypoint
  - Stop accessing deprecated files
  - Context pressure exceeded 90% — reduce prompt size

DIMENSION SCORES:
  reliability             5.0  (high)
  efficiency             77.4  (medium)
  retrieval               0.0  (high)
  tool_discipline        80.0  (high)
  context                32.0  (high)

FRICTION FLAGS (sorted by severity):
  [CRITICAL] retrieval_no_entrypoint
    -> Use canonical retrieval entrypoint
  [CRITICAL] retrieval_deprecated_file @event 9
    -> Stop accessing deprecated files
  [CRITICAL] context_pressure_critical
    -> Context pressure exceeded 90% — reduce prompt size
  [HIGH] retrieval_fallback_search
    -> Avoid fallback search -- use primary retrieval
  [HIGH] tool_timeout @event 5
    -> 1 tool call(s) timed out
  [MEDIUM] reliability_errors @event 3
    -> Review 3 error(s) at event indices [3, 4, 8]
  [MEDIUM] context_compression
    -> Context compression triggered 1 time(s)

Compare

trace-eval compare examples/before.jsonl examples/after.jsonl
COMPARISON: before vs after
=======================================================
  Total score:   67.5 ->   99.3
  Change:      +31.9 (improved)

  reliability            45.0 ->  100.0  ^ +55.0
  efficiency             93.5 ->   96.8  ^ +3.2
  retrieval              50.0 ->  100.0  ^ +50.0
  tool_discipline        90.0 ->  100.0  ^ +10.0
  context                95.0 ->  100.0  ^ +5.0

  FLAG CHANGES:
    [RESOLVED] reliability_errors
    [RESOLVED] reliability_terminal_partial
    [RESOLVED] retrieval_no_entrypoint
    [RESOLVED] tool_retries

Quick Start

# Install
pip install -e .
# Or with uv:
uv sync --all-extras

# Validate a trace
trace-eval validate examples/hermes_good.jsonl

# Run a scorecard
trace-eval run examples/hermes_good.jsonl

# Machine-readable output (for agents)
trace-eval run examples/hermes_bad.jsonl --format json

# Compare before/after
trace-eval compare examples/before.jsonl examples/after.jsonl

# CI gate
trace-eval ci examples/hermes_good.jsonl --min-score 80

Scoring Dimensions

Dimension        Weight  What It Measures
Reliability      35%     Did it succeed? Errors, timeouts, partial results
Efficiency       20%     Token usage, cost, tool call density
Retrieval        20%     Canonical entrypoint, deprecated files, fallback search
Tool Discipline  15%     Retries, redundant calls, timeouts
Context          10%     Context pressure, warnings, compression events

Weights are configurable. When a dimension is unscorable, its weight is redistributed proportionally across the remaining scorable dimensions.
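The redistribution rule can be sketched as follows (a minimal illustration, not the tool's actual implementation; only the default weights come from the table above):

```python
def redistribute_weights(weights, unscorable):
    """Proportionally redistribute weight from unscorable dimensions
    across the remaining scorable ones, so weights still sum to 1."""
    scorable = {d: w for d, w in weights.items() if d not in unscorable}
    total = sum(scorable.values())
    return {d: w / total for d, w in scorable.items()}

# Default weights from the table above.
DEFAULT_WEIGHTS = {
    "reliability": 0.35,
    "efficiency": 0.20,
    "retrieval": 0.20,
    "tool_discipline": 0.15,
    "context": 0.10,
}

# If the Context judge is unscorable, its 10% spreads over the other 90%:
# reliability becomes 0.35 / 0.90, and so on.
adjusted = redistribute_weights(DEFAULT_WEIGHTS, {"context"})
```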

Examples

Trace                 Source                     Events  Score  File
Good run (synthetic)  Modeled after Hermes       8       98.9   examples/hermes_good.jsonl
Bad run (synthetic)   Modeled after Hermes       11      32.4   examples/hermes_bad.jsonl
Real Claude Code      Stillness project session  3,694   60.3   examples/claude_code_real.jsonl
OpenClaw before       Real OpenClaw session      158     44.3   examples/openclaw_before.jsonl
OpenClaw after        Simulated fixes            158     57.3   examples/openclaw_after.jsonl
Compare demo          before vs after                    +66.5  examples/case_study.md

See examples/case_study.md for a complete walkthrough of bad run → diagnosis → fix → compare.

Adapters

Format               Adapter        Capability
JSONL (.jsonl)       Generic JSONL  Full — all fields available if present in file
Hermes SQLite (.db)  Hermes         Honest/lossy — populates what exists, nulls what doesn't

Adding your own adapter? The adapter interface is simple: implement load(path) -> Trace and capability_report(trace) -> dict.

Agent Integration (--format json)

The --format json flag produces stable, machine-readable output designed for agent consumption. An AI agent that just completed a task can pipe its trace through trace-eval and use the results to self-diagnose and guide remediation.

How an agent uses it

{
  "total_score": 32.43,
  "dimension_scores": {
    "reliability": 5.0,
    "efficiency": 77.42,
    "retrieval": 0.0,
    "tool_discipline": 80.0,
    "context": 32.0
  },
  "friction_flags": [
    {
      "id": "retrieval_no_entrypoint",
      "severity": "critical",
      "dimension": "retrieval",
      "event_index": null,
      "suggestion": "Use canonical retrieval entrypoint"
    }
  ],
  "likely_causes": [
    "Use canonical retrieval entrypoint",
    "Stop accessing deprecated files"
  ],
  "suggestions": [
    "Use canonical retrieval entrypoint",
    "Stop accessing deprecated files"
  ],
  "scorable_dimensions": ["reliability", "efficiency", "retrieval", "tool_discipline", "context"],
  "unscorable_dimensions": [],
  "judge_coverage": { "...": "per-judge scorable + confidence" },
  "adapter_capability_report": { "...": "field availability from adapter" },
  "failed_thresholds": []
}

Agent remediation pattern

  1. Run trace-eval run trace.jsonl --format json
  2. Parse likely_causes — these are the root-cause hypotheses
  3. Parse suggestions — each one maps to a concrete fix
  4. Apply fixes, re-run the agent, compare the new trace
  5. Use trace-eval compare old.jsonl new.jsonl --format json to quantify improvement

Fields are stable and typed. suggestions is a plain string array designed for direct use in agent prompts or remediation logic.
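In Python, the parsing step of that loop might look like this (a sketch: the payload below mirrors the documented schema, and in practice it would come from running trace-eval with --format json rather than a literal string):

```python
import json

# In practice, capture the CLI's stdout, e.g. via subprocess, and parse it.
# Here a literal payload mirroring the documented schema stands in for it.
payload = '''{
  "total_score": 32.43,
  "likely_causes": ["Use canonical retrieval entrypoint"],
  "suggestions": ["Use canonical retrieval entrypoint",
                  "Stop accessing deprecated files"],
  "friction_flags": [{"id": "retrieval_no_entrypoint",
                      "severity": "critical",
                      "dimension": "retrieval",
                      "event_index": null,
                      "suggestion": "Use canonical retrieval entrypoint"}]
}'''

report = json.loads(payload)

# Triage: handle critical flags first, then feed suggestions to the agent.
critical = [f for f in report["friction_flags"] if f["severity"] == "critical"]
fix_prompt = "Apply these fixes:\n" + "\n".join(
    f"- {s}" for s in report["suggestions"]
)
```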

What's Coming

  • More adapters (OpenAI traces, LangSmith, LangGraph, custom formats)
  • Score profiles (balanced, reliability_first, cost_conscious)
  • Baseline comparison (cost vs similar tasks)
  • Parallelization analysis in Tool Discipline

Known Limitations

Hermes SQLite Adapter (lossy by design)

The Hermes adapter maps the real Hermes DB schema (sessions + messages tables) to the canonical trace format. It is honest and lossy — it populates what exists and nulls what doesn't. It does not synthesize span IDs, fabricate relationships, or guess at missing data.

Fields the Hermes schema does not provide:

Missing field                          Impact on scoring
error_type                             Reliability judge counts errors but can't classify them
retrieval_entrypoint, retrieval_steps  Retrieval judge always scores 50.0 (no entrypoint)
context_pressure_pct, context_tokens   Context judge returns unscorable — weights redistributed
latency_ms                             Efficiency can't penalize slow tool calls
span_id, parent_span_id                No trace-level deduplication or call tree analysis
cost_estimate                          Efficiency cost sub-score unavailable (Hermes stores cost in sessions, not events)

What this means for Hermes users:

  • Retrieval score will always be 50.0 (the no-entrypoint baseline) — this is correct behavior, not a bug
  • Context judge will show N/A (low) — the weight (10%) redistributes to other judges
  • Errors embedded in tool result content (e.g., "error": "File not found") are not parsed by the adapter — you must add status: "error" to the JSONL if you want them scored

Generic JSONL Adapter

  • Requires the canonical field names (event_type, status, tool_name, etc.)
  • Fields not present are simply absent — no synthetic values
  • Does not parse proprietary trace formats (OpenAI, LangSmith) — use those formats' native tools to export to canonical JSONL first
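Exporting another system's events to canonical JSONL might look like this (a sketch: the foreign-event shape is invented, and only the canonical field names event_type, status, and tool_name come from the list above):

```python
import json

def to_canonical(events):
    """Map hypothetical foreign trace events onto canonical field names,
    emitting one JSON object per line (JSONL)."""
    lines = []
    for e in events:
        canonical = {
            "event_type": "tool_call" if e.get("kind") == "tool" else "message",
            "status": "error" if e.get("failed") else "ok",
        }
        if "name" in e:
            canonical["tool_name"] = e["name"]
        # Fields the source does not provide stay absent: no synthetic values.
        lines.append(json.dumps(canonical))
    return "\n".join(lines)

foreign = [
    {"kind": "tool", "name": "read_file", "failed": False},
    {"kind": "tool", "name": "read_file", "failed": True},
]
jsonl = to_canonical(foreign)
```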

Scoring

  • Scores are relative to the trace content, not to a baseline of "similar tasks"
  • Tool Discipline does not yet analyze parallel tool calls
  • Weight redistribution is proportional — some users may want custom profiles

What This Is NOT

  • A dashboard — CLI-first, local-first
  • An LLM judge — all scoring is deterministic
  • An auto-fix tool — it tells you what to fix, not how
  • A broad observability platform — focused on agent trace evaluation

License

MIT
