Tracecheck
Evaluate multi-step AI agent traces, not just single LLM responses.
Why
Most eval tools score one input/output pair at a time. A production agent might make 7 tool calls, retry, hit context limits, and only then produce an answer. The trace is the unit you actually need to evaluate. tracecheck ingests full agent traces and scores them on tool accuracy, context handling, step efficiency, failure modes, and final output quality.
Quick start
pip install tracecheck          # core, deterministic evaluators only
pip install "tracecheck[llm]"   # adds the LLM-as-judge extras (anthropic SDK); quoted so the shell does not expand the brackets
tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml
# write a self-contained HTML report you can open in a browser
tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml \
  --output html --out report.html
Sample text output:
Trace: support_001_pass (support_agent)
  [PASS] tool_accuracy: Tool sequence matches expected (3 calls).
  [PASS] step_efficiency: Efficient: 3 tool call(s), no loops or retries.
  [PASS] failure_modes: No failure modes detected.
  [PASS] context_drift: Stayed on topic.
  [PASS] output_quality: Reply confirms refund and dollar amount.
  -> PASS
Trace: support_002_fail (support_agent)
  [FAIL] tool_accuracy: Expected [get_order, verify_item_mismatch, issue_refund],
         got [get_order, get_order, get_order, search_products].
  [FAIL] step_efficiency: Inefficient: 2 consecutive duplicate tool call(s),
         1 excess tool call(s) over expected.
  [FAIL] failure_modes: Detected: infinite_loop, context_window_overflow.
  [PASS] context_drift: Stayed on topic but failed mechanically.
  [SKIP] output_quality: Skipped: other evaluators failed on this trace.
  -> FAIL
Aggregate: 1/2 traces passed
exit code: 1
Or programmatically:
from tracecheck import load_traces, run_evals
from tracecheck.report import to_text
traces = load_traces("traces.jsonl")
reports = run_evals(traces, "evals.yaml")
print(to_text(reports))
The CLI exits with code 1 if any trace fails, so you can drop it into CI.
Evaluators
| Evaluator | What it checks | Status |
|---|---|---|
| tool_accuracy | Did the agent call the right tools in the right order? | Built |
| step_efficiency | Tool calls vs expected; flags consecutive duplicate calls and retries | Built |
| failure_modes | Tags traces with known failure shapes (loops, context overflow, tool errors) | Built |
| context_drift | Does the agent stay on topic across steps? (LLM as judge) | Built |
| output_quality | Final reply scored against a rubric, only after others pass (LLM as judge) | Built |
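To make the step_efficiency row concrete, here is a minimal sketch of how consecutive duplicates and excess calls could be counted. The function name and the returned keys are hypothetical and are not taken from tracecheck's source; the inputs are the tool sequences from the support_002_fail sample above.

```python
from typing import Sequence


def step_efficiency(called: Sequence[str], expected: Sequence[str]) -> dict:
    """Count consecutive duplicate calls and calls beyond the expected count.

    Hypothetical helper for illustration; not tracecheck's implementation.
    """
    # A call is a consecutive duplicate when it repeats the call just before it.
    duplicates = sum(1 for prev, cur in zip(called, called[1:]) if prev == cur)
    # Excess is how many more calls were made than the expected sequence contains.
    excess = max(0, len(called) - len(expected))
    return {
        "consecutive_duplicates": duplicates,
        "excess_calls": excess,
        "passed": duplicates == 0 and excess == 0,
    }


result = step_efficiency(
    ["get_order", "get_order", "get_order", "search_products"],
    ["get_order", "verify_item_mismatch", "issue_refund"],
)
print(result)
```

On the sample sequences this counts 2 consecutive duplicates and 1 excess call, matching the numbers in the sample report.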
output_quality is deferred: the runner only invokes the judge after every other evaluator on the same trace passes. Already-broken traces do not burn judge tokens, and the report renders [SKIP] for them.
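The deferral can be pictured roughly as follows. This is a sketch under assumptions, not the actual runner: the function name, the (passed, message) result shape, and the "tools" key on the toy traces are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical result shape: (passed, message).
Result = Tuple[bool, str]
Evaluator = Callable[[dict], Result]


def run_with_deferred_judge(
    trace: dict,
    other_evaluators: Dict[str, Evaluator],
    output_quality: Evaluator,
) -> List[Tuple[str, str, str]]:
    """Run every other evaluator first; call output_quality only if all pass."""
    rows = []
    all_passed = True
    for name, evaluate in other_evaluators.items():
        passed, message = evaluate(trace)
        rows.append((name, "PASS" if passed else "FAIL", message))
        all_passed = all_passed and passed
    if all_passed:
        passed, message = output_quality(trace)
        rows.append(("output_quality", "PASS" if passed else "FAIL", message))
    else:
        # The judge is never invoked here, so no judge tokens are spent.
        rows.append(("output_quality", "SKIP", "Skipped: other evaluators failed on this trace."))
    return rows


judge_calls = []


def fake_judge(trace: dict) -> Result:
    # Records which traces reached the judge.
    judge_calls.append(trace["trace_id"])
    return True, "Reply matches rubric."


others = {
    # Toy check on a made-up "tools" key, for illustration only.
    "tool_accuracy": lambda t: (t["tools"] == t["expected_tools"], "compared tool sequences"),
}
good = {"trace_id": "a", "tools": ["get_order"], "expected_tools": ["get_order"]}
bad = {"trace_id": "b", "tools": ["search_products"], "expected_tools": ["get_order"]}
good_rows = run_with_deferred_judge(good, others, fake_judge)
bad_rows = run_with_deferred_judge(bad, others, fake_judge)
print(judge_calls)
```

Only the passing trace reaches the judge; the failing one gets a SKIP row without a judge call.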
LLM-as-judge configuration
The two judge-based evaluators (context_drift, output_quality) need a backend. Configure it in your YAML:
evaluators:
  - tool_accuracy
  - step_efficiency
  - failure_modes
  - context_drift
  - output_quality
judge:
  provider: anthropic   # or "fake" for tests
  model: claude-sonnet-4-5
  max_tokens: 1024
output_quality:
  rubric: |
    The reply must accurately reflect the tool-call outcomes.
    Confirm dollar amounts when a refund was issued.
The rubric is resolved per trace first, then falls back to the YAML default. Set trace.metadata.rubric to override it for a specific scenario.
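The lookup order can be sketched in a few lines. The helper name is hypothetical, and treating an empty per-trace rubric as "not set" is an assumption rather than documented behavior.

```python
from typing import Optional


def resolve_rubric(trace_metadata: Optional[dict], yaml_default: Optional[str]) -> Optional[str]:
    """Return the per-trace rubric if one is set, otherwise the YAML default.

    Hypothetical helper for illustration; not tracecheck's implementation.
    """
    per_trace = (trace_metadata or {}).get("rubric")
    # Assumption: a missing or empty per-trace rubric falls through to the default.
    return per_trace if per_trace else yaml_default


default = "The reply must accurately reflect the tool-call outcomes."
overridden = resolve_rubric({"rubric": "Reply must mention the tracking number."}, default)
fallback = resolve_rubric({}, default)
print(overridden)
print(fallback)
```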
The AnthropicJudge enables prompt caching on the system block, so across N judged traces the system prompt costs roughly 1 + 0.1 * (N − 1) times the price of sending it once uncached, rather than N times. See tracecheck/judges/ for the protocol and the FakeJudge used in tests.
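As a worked example of that estimate, the sketch below evaluates the formula for a batch of traces. The function name is hypothetical, the 0.1 factor is the one quoted above, and the calculation ignores any surcharge on the initial cache write and any cache expiry between calls.

```python
def relative_system_prompt_cost(n_traces: int, cache_read_factor: float = 0.1) -> float:
    """System-prompt cost for n_traces, in units of one uncached system prompt.

    Hypothetical helper: the first call pays full price, later calls pay the
    cache-read fraction. Cache-write surcharges and expiry are not modeled.
    """
    if n_traces <= 0:
        return 0.0
    return 1.0 + cache_read_factor * (n_traces - 1)


cached = relative_system_prompt_cost(100)
uncached = 100.0  # without caching, every call pays for the full system prompt
print(f"cached: {cached:.1f}x, uncached: {uncached:.1f}x")
```

For 100 traces the estimate comes to about 10.9 system prompts' worth of input cost instead of 100.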
Trace format
A trace is a JSON object with an ordered list of steps:
{
  "trace_id": "support_001",
  "agent_name": "support_agent",
  "user_input": "I got the wrong color sweater, refund please",
  "expected_tools": ["get_order", "verify_item_mismatch", "issue_refund"],
  "steps": [
    {"type": "tool_call", "name": "get_order",
     "input": {"user_id": "u_123"}, "output": {"order_id": "o_99"}},
    {"type": "tool_call", "name": "verify_item_mismatch",
     "input": {"order_id": "o_99"}, "output": {"mismatch": true}},
    {"type": "tool_call", "name": "issue_refund",
     "input": {"order_id": "o_99"}, "output": {"status": "ok"}},
    {"type": "llm_call", "output": "Refunded $49.99 to your card."}
  ]
}
Step types: llm_call, tool_call, retry, error. Each step may carry latency_ms, tokens, timestamp, and an error message. See tracecheck/schema.py for the full Pydantic spec.
A .jsonl file is one trace per line. A .json file may be a single trace or an array.
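If you want to inspect trace files without tracecheck, the two layouts can be read with the standard library alone. This is a sketch, not the library's load_traces: read_trace_dicts is a hypothetical name, and it returns plain dicts with no schema validation.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def read_trace_dicts(path: str) -> List[dict]:
    """Read raw trace dicts from a .jsonl or .json file (no validation)."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    if p.suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # A .json file may hold a single trace or an array of traces.
    return data if isinstance(data, list) else [data]


# Small demonstration using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "traces.jsonl"
    jsonl_path.write_text('{"trace_id": "a"}\n{"trace_id": "b"}\n', encoding="utf-8")
    json_path = Path(tmp) / "single.json"
    json_path.write_text('{"trace_id": "c"}', encoding="utf-8")
    from_jsonl = read_trace_dicts(str(jsonl_path))
    from_json = read_trace_dicts(str(json_path))

print([t["trace_id"] for t in from_jsonl], [t["trace_id"] for t in from_json])
```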
Roadmap
- Trace ingestion (JSON, JSONL)
- Tool call accuracy evaluator
- Step efficiency evaluator
- Failure mode detection
- Context drift evaluator
- Output quality evaluator
- Static HTML report (--output html)
- OpenTelemetry span ingest
- Pydantic AI native integration
- LangGraph trace adapter
- PyPI release
Examples
See examples/:
- basic_usage.py — programmatic API
- sample_traces.jsonl — three traces (pass, fail, edge case)
- evals.yaml — minimal deterministic config (no LLM key needed)
- evals_with_judge.yaml — full config with the LLM-as-judge evaluators enabled
Architecture
traces.jsonl ──► ingest ──► Trace[] ──► runner ──► [Evaluator.evaluate(trace)] ──► report
                                          ▲
                                      evals.yaml
The library is intentionally small: a Pydantic schema, a loader, a registry of evaluators, and a runner that walks every trace through every configured evaluator. CLI and library share the same code path.
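The registry-plus-runner shape described above might look something like this. It is a sketch under assumptions: REGISTRY, register, run, and the two toy evaluators are hypothetical names invented for illustration, and only the step types (llm_call, error) come from the trace format.

```python
from typing import Callable, Dict, List

# Hypothetical registry: evaluator name -> function returning True on pass.
REGISTRY: Dict[str, Callable[[dict], bool]] = {}


def register(name: str):
    """Decorator that adds an evaluator function to the registry."""
    def decorator(fn: Callable[[dict], bool]) -> Callable[[dict], bool]:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("has_final_reply")
def has_final_reply(trace: dict) -> bool:
    # Passes when the last step is an llm_call.
    steps = trace.get("steps", [])
    return bool(steps) and steps[-1].get("type") == "llm_call"


@register("no_error_steps")
def no_error_steps(trace: dict) -> bool:
    # Passes when no step has type "error".
    return all(step.get("type") != "error" for step in trace.get("steps", []))


def run(traces: List[dict], evaluator_names: List[str]) -> Dict[str, Dict[str, bool]]:
    """Walk every trace through every configured evaluator."""
    return {
        t["trace_id"]: {name: REGISTRY[name](t) for name in evaluator_names}
        for t in traces
    }


traces = [
    {"trace_id": "ok", "steps": [{"type": "tool_call", "name": "get_order"},
                                 {"type": "llm_call", "output": "Done."}]},
    {"trace_id": "broken", "steps": [{"type": "error"}]},
]
report = run(traces, ["has_final_reply", "no_error_steps"])
# Mirror the CLI convention: non-zero exit code if any trace fails.
exit_code = 0 if all(all(r.values()) for r in report.values()) else 1
print(report, exit_code)
```

Keeping evaluators as plain callables behind a name lookup is one way the CLI and the library could share a single code path: both would only need to hand the runner a list of traces and a list of names.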
License
MIT