# trace-eval

> Tell me why this agent run went wrong and what to change next.

A deterministic-first CLI for evaluating AI agent traces. No dashboards, no LLM-as-judge, no cloud dependency. Built for solo builders and small AI-native teams working with coding/CLI agents.
## What It Does
Run `trace-eval` on an agent trace file and get:

- **A scorecard** — 0-100 across 5 dimensions
- **Root causes** — critical and high-severity issues surfaced first
- **Actionable suggestions** — what to fix, not just that something broke
- **Before/after comparison** — see if your changes actually improved things
## See It in Action

```bash
# 1. Install (uv or pip)
uv sync --all-extras

# 2. Validate a trace file
trace-eval validate trace.jsonl
# Schema validation PASSED — 8 events, field coverage bars printed

# 3. Run a scorecard
trace-eval run trace.jsonl
# ============================================================
# TRACE-EVAL SCORECARD                        Total: 32.4/100
# ============================================================
#
# LIKELY ROOT CAUSES:
#   - Use canonical retrieval entrypoint
#   - Stop accessing deprecated files
#   - Context pressure exceeded 90% — reduce prompt size
#
# DIMENSION SCORES:
#   reliability        5.0  (high)
#   efficiency        77.4  (medium)
#   retrieval          0.0  (high)
#   tool_discipline   80.0  (high)
#   context           32.0  (high)

# 4. Compare before vs after a fix
trace-eval compare before.jsonl after.jsonl
# Total score: 67.5 -> 99.3    Change: +31.9 (improved)
#
# FLAG CHANGES:
#   [RESOLVED] reliability_errors
#   [RESOLVED] retrieval_no_entrypoint
#   [RESOLVED] tool_retries

# 5. CI gate — fails the build below a threshold
trace-eval ci trace.jsonl --min-score 80
# PASS (exit 0) or FAIL (exit 1)
```
### Good Run

```bash
trace-eval run examples/hermes_good.jsonl
```

```
============================================================
TRACE-EVAL SCORECARD                         Total: 98.9/100
============================================================

DIMENSION SCORES:
  reliability       100.0  (high)
  efficiency         94.5  (medium)
  retrieval         100.0  (high)
  tool_discipline   100.0  (high)
  context           100.0  (high)
```
### Bad Run

```bash
trace-eval run examples/hermes_bad.jsonl
```

```
============================================================
TRACE-EVAL SCORECARD                         Total: 32.4/100
============================================================

LIKELY ROOT CAUSES:
  - Use canonical retrieval entrypoint
  - Stop accessing deprecated files
  - Context pressure exceeded 90% — reduce prompt size

DIMENSION SCORES:
  reliability        5.0  (high)
  efficiency        77.4  (medium)
  retrieval          0.0  (high)
  tool_discipline   80.0  (high)
  context           32.0  (high)

FRICTION FLAGS (sorted by severity):
  [CRITICAL] retrieval_no_entrypoint
    -> Use canonical retrieval entrypoint
  [CRITICAL] retrieval_deprecated_file @event 9
    -> Stop accessing deprecated files
  [CRITICAL] context_pressure_critical
    -> Context pressure exceeded 90% — reduce prompt size
  [HIGH] retrieval_fallback_search
    -> Avoid fallback search -- use primary retrieval
  [HIGH] tool_timeout @event 5
    -> 1 tool call(s) timed out
  [MEDIUM] reliability_errors @event 3
    -> Review 3 error(s) at event indices [3, 4, 8]
  [MEDIUM] context_compression
    -> Context compression triggered 1 time(s)
```
### Compare

```bash
trace-eval compare examples/before.jsonl examples/after.jsonl
```

```
COMPARISON: before vs after
=======================================================
Total score: 67.5 -> 99.3
Change: +31.9 (improved)

  reliability       45.0 -> 100.0   ^ +55.0
  efficiency        93.5 ->  96.8   ^  +3.2
  retrieval         50.0 -> 100.0   ^ +50.0
  tool_discipline   90.0 -> 100.0   ^ +10.0
  context           95.0 -> 100.0   ^  +5.0

FLAG CHANGES:
  [RESOLVED] reliability_errors
  [RESOLVED] reliability_terminal_partial
  [RESOLVED] retrieval_no_entrypoint
  [RESOLVED] tool_retries
```
## Quick Start

```bash
# Install
pip install -e .
# Or with uv:
uv sync --all-extras

# Validate a trace
trace-eval validate examples/hermes_good.jsonl

# Run a scorecard
trace-eval run examples/hermes_good.jsonl

# Machine-readable output (for agents)
trace-eval run examples/hermes_bad.jsonl --format json

# Compare before/after
trace-eval compare examples/before.jsonl examples/after.jsonl

# CI gate
trace-eval ci examples/hermes_good.jsonl --min-score 80
```
## Scoring Dimensions
| Dimension | Weight | What It Measures |
|---|---|---|
| Reliability | 35% | Did it succeed? Errors, timeouts, partial results |
| Efficiency | 20% | Token usage, cost, tool call density |
| Retrieval | 20% | Canonical entrypoint, deprecated files, fallback search |
| Tool Discipline | 15% | Retries, redundant calls, timeouts |
| Context | 10% | Context pressure, warnings, compression events |
Weights are configurable. Unscorable dimensions redistribute proportionally.
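The proportional redistribution can be sketched in a few lines of Python. This is a hypothetical `redistribute` helper illustrating the arithmetic described above, not the tool's internal code:

```python
def redistribute(weights, unscorable):
    """Drop unscorable dimensions, then rescale the remaining weights
    proportionally so they still sum to 1.0."""
    kept = {d: w for d, w in weights.items() if d not in unscorable}
    total = sum(kept.values())
    return {d: w / total for d, w in kept.items()}

# Default weights from the table above.
WEIGHTS = {"reliability": 0.35, "efficiency": 0.20, "retrieval": 0.20,
           "tool_discipline": 0.15, "context": 0.10}

# If the context judge is unscorable, its 10% spreads across the rest:
# reliability, for example, becomes 0.35 / 0.90.
adjusted = redistribute(WEIGHTS, {"context"})
```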
## Adapters

| Format | Adapter | Capability |
|---|---|---|
| JSONL (`.jsonl`) | Generic JSONL | Full — all fields available if present in file |
| Hermes SQLite (`.db`) | Hermes | Honest/lossy — populates what exists, nulls what doesn't |
Adding your own adapter? The adapter interface is simple: implement `load(path) -> Trace` and `capability_report(trace) -> dict`.
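As a rough sketch of that interface (the package's real `Trace` type and adapter registration aren't shown on this page, so the `Trace` stand-in below is hypothetical), a minimal JSONL adapter might look like:

```python
import json
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Stand-in for trace-eval's canonical trace type."""
    events: list = field(default_factory=list)


class MyJsonlAdapter:
    """Minimal adapter shape: load(path) -> Trace, capability_report(trace) -> dict."""

    def load(self, path):
        # One JSON object per non-empty line.
        with open(path) as f:
            events = [json.loads(line) for line in f if line.strip()]
        return Trace(events=events)

    def capability_report(self, trace):
        # Report which canonical fields actually appear in the trace;
        # absent fields stay absent -- no synthetic values.
        seen = set()
        for event in trace.events:
            seen.update(k for k, v in event.items() if v is not None)
        return {name: name in seen for name in ("event_type", "status", "tool_name")}
```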
## Agent Integration (`--format json`)

The `--format json` flag produces stable, machine-readable output designed for agent consumption. An AI agent that has just completed a task can pipe its trace through `trace-eval` and use the results to self-diagnose and guide remediation.
### Example output

```json
{
  "total_score": 32.43,
  "dimension_scores": {
    "reliability": 5.0,
    "efficiency": 77.42,
    "retrieval": 0.0,
    "tool_discipline": 80.0,
    "context": 32.0
  },
  "friction_flags": [
    {
      "id": "retrieval_no_entrypoint",
      "severity": "critical",
      "dimension": "retrieval",
      "event_index": null,
      "suggestion": "Use canonical retrieval entrypoint"
    }
  ],
  "likely_causes": [
    "Use canonical retrieval entrypoint",
    "Stop accessing deprecated files"
  ],
  "suggestions": [
    "Use canonical retrieval entrypoint",
    "Stop accessing deprecated files"
  ],
  "scorable_dimensions": ["reliability", "efficiency", "retrieval", "tool_discipline", "context"],
  "unscorable_dimensions": [],
  "judge_coverage": { "...": "per-judge scorable + confidence" },
  "adapter_capability_report": { "...": "field availability from adapter" },
  "failed_thresholds": []
}
```
### Agent remediation pattern

1. Run `trace-eval run trace.jsonl --format json`
2. Parse `likely_causes` — these are the root-cause hypotheses
3. Parse `suggestions` — each one maps to a concrete fix
4. Apply fixes, re-run the agent, compare the new trace
5. Use `trace-eval compare old.jsonl new.jsonl --format json` to quantify improvement

Fields are stable and typed. `suggestions` is a plain string array designed for direct use in agent prompts or remediation logic.
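The triage step of that loop might look like the following sketch. `triage` is a hypothetical helper, and the sample report is inlined here rather than produced by a live `trace-eval run ... --format json` call, but it has the same shape as the output above:

```python
import json

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def triage(report, min_score=80.0):
    """Return suggestions ordered by severity, or [] if the score passes."""
    if report["total_score"] >= min_score:
        return []
    flags = sorted(report["friction_flags"],
                   key=lambda f: SEVERITY_ORDER.get(f["severity"], 4))
    return [f["suggestion"] for f in flags]


# In a real loop this string would come from something like:
#   subprocess.run(["trace-eval", "run", "trace.jsonl", "--format", "json"],
#                  capture_output=True, text=True).stdout
raw = json.dumps({
    "total_score": 32.43,
    "friction_flags": [
        {"id": "tool_timeout", "severity": "high",
         "suggestion": "1 tool call(s) timed out"},
        {"id": "retrieval_no_entrypoint", "severity": "critical",
         "suggestion": "Use canonical retrieval entrypoint"},
    ],
})
fixes = triage(json.loads(raw))  # critical suggestions come first
```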
## What's Coming
- More adapters (OpenAI traces, LangSmith, LangGraph, custom formats)
- Score profiles (balanced, reliability_first, cost_conscious)
- Baseline comparison (cost vs similar tasks)
- Parallelization analysis in Tool Discipline
## Known Limitations

### Hermes SQLite Adapter (lossy by design)

The Hermes adapter maps the real Hermes DB schema (`sessions` + `messages` tables) to the canonical trace format. It is honest and lossy — it populates what exists and nulls what doesn't. It does not synthesize span IDs, fabricate relationships, or guess at missing data.
Fields the Hermes schema does not provide:

| Missing field | Impact on scoring |
|---|---|
| `error_type` | Reliability judge counts errors but can't classify them |
| `retrieval_entrypoint`, `retrieval_steps` | Retrieval judge always scores 50.0 (no entrypoint) |
| `context_pressure_pct`, `context_tokens` | Context judge returns unscorable — weights redistributed |
| `latency_ms` | Efficiency can't penalize slow tool calls |
| `span_id`, `parent_span_id` | No trace-level deduplication or call tree analysis |
| `cost_estimate` (Hermes stores it in `sessions`, not events) | Efficiency cost sub-score unavailable |
What this means for Hermes users:

- Retrieval score will always be 50.0 (the no-entrypoint baseline) — this is correct behavior, not a bug
- Context judge will show `N/A (low)` — its weight (10%) redistributes to other judges
- Errors embedded in tool result content (e.g., `"error": "File not found"`) are not parsed by the adapter — you must add `status: "error"` to the JSONL if you want them scored
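One way to do that retagging is a small post-processing pass over the exported JSONL. This sketch uses a deliberately crude heuristic (it flags any event whose `content` string embeds an `"error"` key); adapt the detection to your own traces:

```python
import json


def tag_embedded_errors(lines):
    """Add status: "error" to events whose content embeds an "error"
    key but that carry no explicit status field."""
    tagged = []
    for line in lines:
        event = json.loads(line)
        content = event.get("content") or ""
        if "status" not in event and '"error"' in content:
            event["status"] = "error"
        tagged.append(json.dumps(event))
    return tagged


# A tool result whose error is buried inside the content string.
lines = ['{"event_type": "tool_result", '
         '"content": "{\\"error\\": \\"File not found\\"}"}']
fixed = tag_embedded_errors(lines)
```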
### Generic JSONL Adapter

- Requires the canonical field names (`event_type`, `status`, `tool_name`, etc.)
- Fields not present are simply absent — no synthetic values
- Does not parse proprietary trace formats (OpenAI, LangSmith) — use those formats' native tools to export to canonical JSONL first
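Exporting from a custom format to canonical JSONL is just one JSON object per line. The field names below are the ones this README mentions; run `trace-eval validate` on the result to check field coverage:

```python
import json

# Canonical field names mentioned in this README; fields your exporter
# cannot supply are simply left out -- no synthetic values.
events = [
    {"event_type": "tool_call", "tool_name": "read_file", "status": "ok"},
    {"event_type": "tool_call", "tool_name": "search", "status": "error"},
]

# One JSON object per line; write this to trace.jsonl and run
# `trace-eval validate trace.jsonl` on it.
jsonl = "\n".join(json.dumps(event) for event in events) + "\n"
```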
### Scoring
- Scores are relative to the trace content, not to a baseline of "similar tasks"
- Tool Discipline does not yet analyze parallel tool calls
- Weight redistribution is proportional — some users may want custom profiles
## What This Is NOT
- A dashboard — CLI-first, local-first
- An LLM judge — all scoring is deterministic
- An auto-fix tool — it tells you what to fix, not how
- A broad observability platform — focused on agent trace evaluation
## License
MIT
## Download files
### Source distribution: `trace_eval-0.1.0.tar.gz`

- Size: 49.9 kB
- Uploaded via: uv/0.10.6
- Uploaded using Trusted Publishing? No

| Algorithm | Hash digest |
|---|---|
| SHA256 | `96f5c596fc394f291bbbe3f01e87e7b4c6af3a8020e94b6625776c018b968cf7` |
| MD5 | `1b2c797573728bbf51ec0f4ed83ec11e` |
| BLAKE2b-256 | `1b9b00f225622f779f791afb6d8c8e1f9f46462de2262ddeb67211f3dabe2786` |
### Built distribution: `trace_eval-0.1.0-py3-none-any.whl`

- Size: 24.0 kB
- Tags: Python 3
- Uploaded via: uv/0.10.6
- Uploaded using Trusted Publishing? No

| Algorithm | Hash digest |
|---|---|
| SHA256 | `36a979eb7045da9549e21004b4a365102beb4fca32ade28cb5dc428b7250393a` |
| MD5 | `ebd00fafe5882d6d2481d841ed21c765` |
| BLAKE2b-256 | `b12c9dcb1fc97837ba3341f848884d04f60afc9e2700b923202a710bd37bea48` |