Evaluate multi step AI agent traces, not just single LLM responses.
Project description
Tracecheck
Evaluate multi step AI agent traces, not just single LLM responses.
Why
Most eval tools score one input/output pair at a time. Production agents take 7 tool calls, retry, hit context limits, and then produce an answer. The trace is the unit you actually need to evaluate. tracecheck ingests full agent traces and scores them on tool accuracy, context handling, step efficiency, failure modes, and final output quality.
Quick start
pip install tracecheck # core, deterministic evaluators only
pip install tracecheck[llm] # adds the LLM-as-judge extras (anthropic SDK)
tracecheck run \
--traces examples/sample_traces.jsonl \
--config examples/evals.yaml
# write a self-contained HTML report you can open in a browser
tracecheck run \
--traces examples/sample_traces.jsonl \
--config examples/evals.yaml \
--output html --out report.html
Sample text output:
Trace: support_001_pass (support_agent)
[PASS] tool_accuracy: Tool sequence matches expected (3 calls).
[PASS] step_efficiency: Efficient: 3 tool call(s), no loops or retries.
[PASS] failure_modes: No failure modes detected.
[PASS] context_drift: Stayed on topic.
[PASS] output_quality: Reply confirms refund and dollar amount.
-> PASS
Trace: support_002_fail (support_agent)
[FAIL] tool_accuracy: Expected [get_order, verify_item_mismatch, issue_refund],
got [get_order, get_order, get_order, search_products].
[FAIL] step_efficiency: Inefficient: 2 consecutive duplicate tool call(s),
1 excess tool call(s) over expected.
[FAIL] failure_modes: Detected: infinite_loop, context_window_overflow.
[PASS] context_drift: Stayed on topic but failed mechanically.
[SKIP] output_quality: Skipped: other evaluators failed on this trace.
-> FAIL
Aggregate: 1/2 traces passed
exit code: 1
Or programmatically:
from tracecheck import load_traces, run_evals
from tracecheck.report import to_text
traces = load_traces("traces.jsonl")
reports = run_evals(traces, "evals.yaml")
print(to_text(reports))
The CLI exits with code 1 if any trace fails, so you can drop it into CI.
Integrating tracecheck into your agent
The trace data has to come from your agent. tracecheck reads traces; it does not produce them. Three steps to wire it in.
1. Log each step from inside your agent
Around each tool call and LLM call, append a step record. About ten lines of code total.
import json, uuid
from datetime import datetime, timezone
from pathlib import Path
steps = []
def log(step_type, **kwargs):
steps.append({
"type": step_type,
"timestamp": datetime.now(timezone.utc).isoformat(),
**kwargs,
})
# inside your handler:
hits = semantic_search(query)
log("tool_call", name="semantic_search", input={"q": query}, output={"n_hits": len(hits)})
reply = llm.generate(query, hits)
log("llm_call", output=reply)
# at end of each request, append one trace line:
Path("traces.jsonl").open("a").write(json.dumps({
"trace_id": str(uuid.uuid4()),
"agent_name": "my_agent",
"user_input": query,
"steps": steps,
}) + "\n")
If you already use a framework with built-in tracing (LangChain callbacks, Pydantic AI events, OpenTelemetry, Logfire, etc.) you can adapt their output to this schema instead of writing logging from scratch.
2. Author a golden test set
One JSON line per scenario you care about, listing the user input and the tools you expect the agent to call, in order. This is your assertion file — same role as assert result == expected in pytest. tracecheck has no opinion on what your agent should do; you tell it.
{"trace_id":"refund_ok", "agent_name":"my_agent", "user_input":"Refund my order", "expected_tools":["get_order","verify","issue_refund"]}
{"trace_id":"already_shipped", "agent_name":"my_agent", "user_input":"Cancel order o_99", "expected_tools":["get_order","check_shipping"]}
3. Replay the golden inputs through your agent, then run tracecheck
A short driver script feeds each user_input through your real agent (with the logger turned on) so the agent appends actual steps to traces.jsonl. Merge the expected_tools from your golden set onto the captured traces.
Then:
tracecheck run --traces traces.jsonl --config evals.yaml
Drop that command into a GitHub Actions step on every PR. Exit code 1 blocks the merge if any trace regresses.
Evaluators
| Evaluator | What it checks | Status |
|---|---|---|
tool_accuracy |
Did the agent call the right tools in the right order? | Built |
step_efficiency |
Tool calls vs expected; flags consecutive duplicate calls and retries | Built |
failure_modes |
Tags traces with known failure shapes (loops, context overflow, tool errors) | Built |
context_drift |
Does the agent stay on topic across steps? (LLM as judge) | Built |
output_quality |
Final reply scored against a rubric, only after others pass (LLM as judge) | Built |
output_quality is deferred: the runner only invokes the judge after every other evaluator on the same trace passes. Already-broken traces do not burn judge tokens, and the report renders [SKIP] for them.
What each evaluator uniquely catches
Stacking the five evaluators gives pinpoint diagnostics, because each catches a different kind of failure. Imagine an agent expected to call tools [A, B, C]:
| Trace shape | tool_accuracy |
step_efficiency |
failure_modes |
context_drift |
output_quality |
|---|---|---|---|---|---|
[A, B, C] — perfect |
PASS | PASS | PASS | PASS | PASS |
[A, A, A, B] — looped on A |
FAIL | FAIL | FAIL infinite_loop |
PASS | SKIP |
[A, B, A, B, A, B] — non-adjacent oscillation |
FAIL | soft pass | FAIL infinite_loop |
PASS | SKIP |
[A, B, search_unrelated, C] — drifted off-topic |
FAIL | FAIL | PASS | FAIL | SKIP |
[A, B, C] + reply "refunded $500" but no refund tool ran |
PASS | PASS | PASS | PASS | FAIL |
[A, B] + step error: "context window exceeded" |
PASS | PASS | FAIL context_overflow |
PASS | SKIP |
The bolded cells are the evaluators that only detect that failure shape. A single evaluator alone would silently let those rows through.
Quick mental shortcut
tool_accuracy — did you call the right things step_efficiency — without thrashing failure_modes — and if you broke, was it a known break context_drift — and you stayed on topic output_quality — and your final answer was actually good
How tracecheck compares to existing eval tools
There are excellent eval tools already. tracecheck does not replace them; it fills a different slot.
| Tool | Unit of evaluation | LLM key needed? | Form factor | Best for |
|---|---|---|---|---|
| tracecheck | Full agent trace (multi step) | Optional — 3 of 5 evaluators are deterministic | Library + CLI, drops into CI | Multi-step agent regression testing |
| Ragas | RAG retrieval + answer | Yes (LLM-judge metrics) | Library | RAG-specific metrics (faithfulness, context precision, etc.) |
| DeepEval | Single test case | Yes for most metrics | pytest-style framework | Single-turn LLM unit testing |
| TruLens | App + feedback functions | Yes | Framework + observability layer | Observability with feedback evaluation |
| Promptfoo | Prompt + provider matrix | Configurable | YAML test runner | Prompt regression across providers |
The shape that makes tracecheck useful when the others fall short:
- Trace as the unit, not a single output. If your agent loops three times, calls a wrong tool, and then produces a plausible reply, the existing tools score the reply. tracecheck scores the path.
- Deterministic by default.
tool_accuracy,step_efficiency, andfailure_modesneed zero LLM calls. You can run hundreds of traces in CI with no API key, no rate limits, no cost. The LLM-as-judge evaluators are opt-in for finer grained checks. - CI-native exit codes. No dashboard, no service.
exit 1blocks the merge. Same form factor aspytest,ruff,mypy.
LLM-as-judge configuration
The two judge-based evaluators (context_drift, output_quality) need a backend. Configure it in your YAML:
evaluators:
- tool_accuracy
- step_efficiency
- failure_modes
- context_drift
- output_quality
judge:
provider: anthropic # or "fake" for tests
model: claude-sonnet-4-5
max_tokens: 1024
output_quality:
rubric: |
The reply must accurately reflect the tool-call outcomes.
Confirm dollar amounts when a refund was issued.
Rubric resolution is per-trace first, then YAML default — set trace.metadata.rubric to override per scenario.
The AnthropicJudge enables prompt caching on the system block, so judging N traces costs roughly 1 + 0.1 * (N − 1) system-prompt tokens. See tracecheck/judges/ for the protocol and the FakeJudge used in tests.
Trace format
A trace is a JSON object with an ordered list of steps:
{
"trace_id": "support_001",
"agent_name": "support_agent",
"user_input": "I got the wrong color sweater, refund please",
"expected_tools": ["get_order", "verify_item_mismatch", "issue_refund"],
"steps": [
{"type": "tool_call", "name": "get_order",
"input": {"user_id": "u_123"}, "output": {"order_id": "o_99"}},
{"type": "tool_call", "name": "verify_item_mismatch",
"input": {"order_id": "o_99"}, "output": {"mismatch": true}},
{"type": "tool_call", "name": "issue_refund",
"input": {"order_id": "o_99"}, "output": {"status": "ok"}},
{"type": "llm_call", "output": "Refunded $49.99 to your card."}
]
}
Step types: llm_call, tool_call, retry, error. Each step may carry latency_ms, tokens, timestamp, and an error message. See tracecheck/schema.py for the full Pydantic spec.
A .jsonl file is one trace per line. A .json file may be a single trace or an array.
OpenTelemetry ingestion
If your agent already emits OpenTelemetry spans (LangChain callbacks, OpenAI/Anthropic instrumentation, Pydantic Logfire, OpenInference, etc.), point tracecheck at the OTLP/JSON span export directly — no hand-rolled logger required:
tracecheck run --traces my_otel_spans.json --config evals.yaml
The loader auto-detects OTLP/JSON (any file with a top-level resourceSpans key) and maps the OpenTelemetry GenAI semantic conventions onto the trace schema:
| OTel attribute | Maps to |
|---|---|
gen_ai.tool.name |
Step.name on a tool_call step |
gen_ai.request.model / gen_ai.system |
Step.name on an llm_call step |
gen_ai.usage.input_tokens (or prompt_tokens) |
Step.tokens.prompt |
gen_ai.usage.output_tokens (or completion_tokens) |
Step.tokens.completion |
span.startTimeUnixNano |
Step.timestamp |
endTime − startTime |
Step.latency_ms |
span.status.code = ERROR |
Step flagged with the error message |
resource.service.name |
Trace.agent_name |
Spans are grouped by traceId and ordered by start time. Spans that don't match a recognised pattern are dropped quietly.
from tracecheck import load_otel_traces
traces = load_otel_traces("spans.json")
One caveat: OTel-ingested traces have no expected_tools — OTel does not carry your test assertions. To run tool_accuracy on them you write a small golden test set and merge expected_tools in. The LLM-as-judge evaluators (context_drift, output_quality) work fine without it.
Pydantic AI integration
If you build with Pydantic AI, the adapter turns an AgentRunResult directly into a tracecheck Trace — no hand-rolled logger, no OTel exporter, just one function call after each run:
from pydantic_ai import Agent
from tracecheck import pydantic_ai_to_trace, run_evals
agent = Agent("anthropic:claude-sonnet-4-5", tools=[get_order, verify, issue_refund])
result = await agent.run("Refund my order")
trace = pydantic_ai_to_trace(
result,
trace_id="refund_happy_path",
agent_name="support_agent",
expected_tools=["get_order", "verify", "issue_refund"],
)
reports = run_evals([trace], "evals.yaml")
What the adapter does:
- Walks the
result.all_messages()list in chronological order - Emits one
llm_callstep perTextPart - Merges each
ToolCallPartwith its matchingToolReturnPart(linked bytool_call_id) into a singletool_callstep —inputfrom the call args,outputfrom the return content RetryPromptPartbecomes aretrystep sostep_efficiencyandfailure_modescan score it- Pulls
user_inputfrom the firstUserPromptPart
The adapter is duck-typed: tracecheck does not import pydantic_ai, so installing tracecheck adds zero new dependencies to your project. You install Pydantic AI; you hand us the result.
LangChain / LangGraph integration
If your agent is built with LangGraph (or any LangChain runnable), collect the events emitted by graph.astream_events(...) and hand them to the adapter:
from tracecheck import langgraph_events_to_trace, run_evals
events = []
async for ev in graph.astream_events({"input": "Refund my order"}, version="v2"):
events.append(ev)
trace = langgraph_events_to_trace(
events,
trace_id="refund_happy_path",
agent_name="support_agent",
expected_tools=["get_order", "verify", "issue_refund"],
)
reports = run_evals([trace], "evals.yaml")
What the adapter does:
- Pairs every
on_*_startevent with its matchingon_*_end(oron_*_error) byrun_id - Emits one
tool_callstep peron_tool_start(input from call args, output from the return) - Emits one
llm_callstep peron_chat_model_start/on_llm_start(model name frommetadata.ls_model_name) - Tool errors and chat-model errors become
errorsteps - Drops unrelated event types (chain, retriever, parser, etc.) so the trace stays focused on the agent's decisions
- Tries to extract
user_inputfrom the first chat call's human message
Duck-typed: tracecheck does not import langchain or langgraph. Same zero-dependency story as the Pydantic AI adapter.
Roadmap
- Trace ingestion (JSON, JSONL)
- Tool call accuracy evaluator
- Step efficiency evaluator
- Failure mode detection
- Context drift evaluator
- Output quality evaluator
- Static HTML report (
--output html) - OpenTelemetry span ingest
- PyPI release
- Pydantic AI native integration
- LangGraph / LangChain adapter
Examples
See examples/:
- basic_usage.py — programmatic API
- sample_traces.jsonl — three traces (pass, fail, edge case)
- evals.yaml — minimal deterministic config (no LLM key needed)
- evals_with_judge.yaml — full config with the LLM-as-judge evaluators enabled
For the fastest possible end-to-end demo, clone the repo and run the example traces straight away:
pip install tracecheck
git clone https://github.com/Mohdtalibakhtar/tracecheck && cd tracecheck/examples
tracecheck run --traces sample_traces.jsonl --config evals.yaml --output html --out report.html
open report.html
Architecture
traces.jsonl ──► ingest ──► Trace[] ──► runner ──► [Evaluator.evaluate(trace)] ──► report
▲
evals.yaml
The library is intentionally small: a Pydantic schema, a loader, a registry of evaluators, and a runner that walks every trace through every configured evaluator. CLI and library share the same code path.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tracecheck-0.7.0.tar.gz.
File metadata
- Download URL: tracecheck-0.7.0.tar.gz
- Upload date:
- Size: 31.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f6d4afafe58b38bafbb3991fd8f08264525d1f9e56d41ee2a748bb5a76249486
|
|
| MD5 |
a2594c48e74c027d544195a334cc9c34
|
|
| BLAKE2b-256 |
79b15ec0936493e839dfc5a194120da5ec42589e984765b0996b2f7837ff4cd1
|
Provenance
The following attestation bundles were made for tracecheck-0.7.0.tar.gz:
Publisher:
release.yml on Mohdtalibakhtar/tracecheck
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tracecheck-0.7.0.tar.gz -
Subject digest:
f6d4afafe58b38bafbb3991fd8f08264525d1f9e56d41ee2a748bb5a76249486 - Sigstore transparency entry: 1506505298
- Sigstore integration time:
-
Permalink:
Mohdtalibakhtar/tracecheck@f0a42e47d58459536374ce884c3976e022fb6859 -
Branch / Tag:
refs/tags/v0.7.0 - Owner: https://github.com/Mohdtalibakhtar
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@f0a42e47d58459536374ce884c3976e022fb6859 -
Trigger Event:
push
-
Statement type:
File details
Details for the file tracecheck-0.7.0-py3-none-any.whl.
File metadata
- Download URL: tracecheck-0.7.0-py3-none-any.whl
- Upload date:
- Size: 36.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e8c9a62e81214d9ba96a4b74af976310f84cbcc0dd5f6197043346c0a9a3b2b6
|
|
| MD5 |
31d508641ad472a46bb8a05e13ed2eba
|
|
| BLAKE2b-256 |
9e68b4af4631a2e666530616dc4c97333a6d41ee1be12d9b38f714376a7e81d9
|
Provenance
The following attestation bundles were made for tracecheck-0.7.0-py3-none-any.whl:
Publisher:
release.yml on Mohdtalibakhtar/tracecheck
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tracecheck-0.7.0-py3-none-any.whl -
Subject digest:
e8c9a62e81214d9ba96a4b74af976310f84cbcc0dd5f6197043346c0a9a3b2b6 - Sigstore transparency entry: 1506505455
- Sigstore integration time:
-
Permalink:
Mohdtalibakhtar/tracecheck@f0a42e47d58459536374ce884c3976e022fb6859 -
Branch / Tag:
refs/tags/v0.7.0 - Owner: https://github.com/Mohdtalibakhtar
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@f0a42e47d58459536374ce884c3976e022fb6859 -
Trigger Event:
push
-
Statement type: