Tracecheck

Evaluate multi-step AI agent traces, not just single LLM responses.

Why

Most eval tools score one input/output pair at a time. A production agent might make seven tool calls, retry, hit a context limit, and only then produce an answer, so the trace is the unit you actually need to evaluate. tracecheck ingests full agent traces and scores them on tool accuracy, context handling, step efficiency, failure modes, and final output quality.

Quick start

pip install tracecheck         # core, deterministic evaluators only
pip install "tracecheck[llm]"  # adds the LLM-as-judge extras (anthropic SDK); quotes keep zsh from globbing the brackets

tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml

# write a self-contained HTML report you can open in a browser
tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml \
  --output html --out report.html

Sample text output:

Trace: support_001_pass (support_agent)
  [PASS] tool_accuracy:    Tool sequence matches expected (3 calls).
  [PASS] step_efficiency:  Efficient: 3 tool call(s), no loops or retries.
  [PASS] failure_modes:    No failure modes detected.
  [PASS] context_drift:    Stayed on topic.
  [PASS] output_quality:   Reply confirms refund and dollar amount.
  -> PASS

Trace: support_002_fail (support_agent)
  [FAIL] tool_accuracy:    Expected [get_order, verify_item_mismatch, issue_refund],
                           got [get_order, get_order, get_order, search_products].
  [FAIL] step_efficiency:  Inefficient: 2 consecutive duplicate tool call(s),
                           1 excess tool call(s) over expected.
  [FAIL] failure_modes:    Detected: infinite_loop, context_window_overflow.
  [PASS] context_drift:    Stayed on topic but failed mechanically.
  [SKIP] output_quality:   Skipped: other evaluators failed on this trace.
  -> FAIL

Aggregate: 1/2 traces passed
exit code: 1

Or programmatically:

from tracecheck import load_traces, run_evals
from tracecheck.report import to_text

traces = load_traces("traces.jsonl")
reports = run_evals(traces, "evals.yaml")
print(to_text(reports))

The CLI exits with code 1 if any trace fails, so you can drop it into CI.

Integrating tracecheck into your agent

The trace data has to come from your agent. tracecheck reads traces; it does not produce them. Three steps to wire it in.

1. Log each step from inside your agent

Around each tool call and LLM call, append a step record. About ten lines of code total.

import json, uuid
from datetime import datetime, timezone
from pathlib import Path

steps = []

def log(step_type, **kwargs):
    steps.append({
        "type": step_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **kwargs,
    })

# inside your handler:
hits = semantic_search(query)
log("tool_call", name="semantic_search", input={"q": query}, output={"n_hits": len(hits)})

reply = llm.generate(query, hits)
log("llm_call", output=reply)

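# at the end of each request, append one trace line, then reset the buffer:
with Path("traces.jsonl").open("a", encoding="utf-8") as f:
    f.write(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "agent_name": "my_agent",
        "user_input": query,
        "steps": steps,
    }) + "\n")
steps.clear()  # otherwise the next request inherits this request's steps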

If you already use a framework with built-in tracing (LangChain callbacks, Pydantic AI events, OpenTelemetry, Logfire, etc.) you can adapt their output to this schema instead of writing logging from scratch.

2. Author a golden test set

One JSON line per scenario you care about, listing the user input and the tools you expect the agent to call, in order. This is your assertion file — same role as assert result == expected in pytest. tracecheck has no opinion on what your agent should do; you tell it.

{"trace_id":"refund_ok",       "agent_name":"my_agent", "user_input":"Refund my order",   "expected_tools":["get_order","verify","issue_refund"]}
{"trace_id":"already_shipped", "agent_name":"my_agent", "user_input":"Cancel order o_99", "expected_tools":["get_order","check_shipping"]}

3. Replay the golden inputs through your agent, then run tracecheck

A short driver script feeds each user_input through your real agent (with the logger turned on) so the agent appends actual steps to traces.jsonl. Merge the expected_tools from your golden set onto the captured traces.
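The merge step might look like the sketch below. It assumes captured traces are matched to golden scenarios by user_input, because the logger above assigns a random trace_id and the IDs will not line up on their own. merge_expected_tools is a hypothetical helper, not part of tracecheck.

```python
import json


def merge_expected_tools(golden, captured):
    """Copy expected_tools from golden scenarios onto captured traces.

    Hypothetical helper: matches on user_input, which assumes each golden
    scenario has a unique input string.
    """
    expected_by_input = {g["user_input"]: g["expected_tools"] for g in golden}
    merged = []
    for trace in captured:
        trace = dict(trace)  # shallow copy so the caller's data is untouched
        if trace["user_input"] in expected_by_input:
            trace["expected_tools"] = expected_by_input[trace["user_input"]]
        merged.append(trace)
    return merged


golden = [
    {"trace_id": "refund_ok", "agent_name": "my_agent",
     "user_input": "Refund my order",
     "expected_tools": ["get_order", "verify", "issue_refund"]},
]
captured = [
    {"trace_id": "3f1c", "agent_name": "my_agent",
     "user_input": "Refund my order",
     "steps": [{"type": "tool_call", "name": "get_order"}]},
]

merged = merge_expected_tools(golden, captured)
# One JSON object per line, ready to write back to traces.jsonl.
jsonl = "\n".join(json.dumps(t) for t in merged)
```

If your driver sets trace_id itself instead of generating a UUID, matching on trace_id is the more robust key.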

Then:

tracecheck run --traces traces.jsonl --config evals.yaml

Drop that command into a GitHub Actions step on every PR. Exit code 1 blocks the merge if any trace regresses.

Evaluators

| Evaluator | What it checks | Status |
| --- | --- | --- |
| tool_accuracy | Did the agent call the right tools in the right order? | Built |
| step_efficiency | Tool calls vs expected; flags consecutive duplicate calls and retries | Built |
| failure_modes | Tags traces with known failure shapes (loops, context overflow, tool errors) | Built |
| context_drift | Does the agent stay on topic across steps? (LLM as judge) | Built |
| output_quality | Final reply scored against a rubric, only after others pass (LLM as judge) | Built |

output_quality is deferred: the runner only invokes the judge after every other evaluator on the same trace passes. Already-broken traces do not burn judge tokens, and the report renders [SKIP] for them.
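The deferral can be pictured with a small sketch. This is not tracecheck's runner: run_with_deferred_judge, the string statuses, and the dict-shaped traces are all assumptions made for illustration.

```python
def run_with_deferred_judge(trace, cheap_evaluators, judge_evaluator):
    """Run the cheap evaluators first; call the judge only if all of them pass."""
    results = {name: fn(trace) for name, fn in cheap_evaluators.items()}
    if all(status == "PASS" for status in results.values()):
        results["output_quality"] = judge_evaluator(trace)
    else:
        results["output_quality"] = "SKIP"  # no judge tokens spent
    return results


judge_calls = []  # records which traces actually reached the judge


def fake_judge(trace):
    judge_calls.append(trace["trace_id"])
    return "PASS"


cheap = {
    "tool_accuracy": lambda t: "PASS" if t["tools"] == t["expected_tools"] else "FAIL",
}

good = {"trace_id": "ok", "tools": ["a", "b"], "expected_tools": ["a", "b"]}
bad = {"trace_id": "broken", "tools": ["a", "a"], "expected_tools": ["a", "b"]}

good_result = run_with_deferred_judge(good, cheap, fake_judge)
bad_result = run_with_deferred_judge(bad, cheap, fake_judge)
```

Only the passing trace reaches fake_judge; the broken one is marked SKIP without a judge call.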

What each evaluator uniquely catches

Stacking the five evaluators gives pinpoint diagnostics, because each catches a different kind of failure. Imagine an agent expected to call tools [A, B, C]:

| Trace shape | tool_accuracy | step_efficiency | failure_modes | context_drift | output_quality |
| --- | --- | --- | --- | --- | --- |
| [A, B, C] (perfect) | PASS | PASS | PASS | PASS | PASS |
| [A, A, A, B] (looped on A) | FAIL | FAIL | FAIL infinite_loop | PASS | SKIP |
| [A, B, A, B, A, B] (non-adjacent oscillation) | FAIL | soft pass | FAIL infinite_loop | PASS | SKIP |
| [A, B, search_unrelated, C] (drifted off-topic) | FAIL | FAIL | PASS | FAIL | SKIP |
| [A, B, C] + reply "refunded $500" but no refund tool ran | PASS | PASS | PASS | PASS | FAIL |
| [A, B] + step error: "context window exceeded" | PASS | PASS | FAIL context_overflow | PASS | SKIP |

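In each failing row, one evaluator pinpoints the failure shape that the others miss or only partly signal. For example, only output_quality fails the row where the reply claims a refund that no tool issued, and only failure_modes fails the context-overflow row. Any single evaluator on its own would silently let some of these rows through.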
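To see why the looped and oscillating rows are scored differently, here is a minimal sketch of two checks. The function names and the threshold of three repeats are assumptions, not tracecheck's implementation. A count of adjacent duplicates flags [A, A, A, B] but sees nothing wrong with [A, B, A, B, A, B], while a repeated-block check catches both.

```python
def consecutive_duplicates(tools):
    """Count calls that repeat the call immediately before them."""
    return sum(1 for prev, cur in zip(tools, tools[1:]) if prev == cur)


def has_repeating_cycle(tools, min_repeats=3):
    """True if some block of one or more calls repeats back-to-back min_repeats times."""
    n = len(tools)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            block = tools[start:start + size]
            if all(tools[start + k * size:start + (k + 1) * size] == block
                   for k in range(min_repeats)):
                return True
    return False


perfect = ["A", "B", "C"]
looped = ["A", "A", "A", "B"]
oscillating = ["A", "B", "A", "B", "A", "B"]

for shape in (perfect, looped, oscillating):
    print(shape, consecutive_duplicates(shape), has_repeating_cycle(shape))
```

The oscillating trace has zero adjacent duplicates, which is why a duplicate-based efficiency check alone would let it through.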

Quick mental shortcut

  • tool_accuracy: did you call the right things
  • step_efficiency: without thrashing
  • failure_modes: and if you broke, was it a known break
  • context_drift: and you stayed on topic
  • output_quality: and your final answer was actually good

How tracecheck compares to existing eval tools

There are excellent eval tools already. tracecheck does not replace them; it fills a different slot.

| Tool | Unit of evaluation | LLM key needed? | Form factor | Best for |
| --- | --- | --- | --- | --- |
| tracecheck | Full agent trace (multi step) | Optional (3 of 5 evaluators are deterministic) | Library + CLI, drops into CI | Multi-step agent regression testing |
| Ragas | RAG retrieval + answer | Yes (LLM-judge metrics) | Library | RAG-specific metrics (faithfulness, context precision, etc.) |
| DeepEval | Single test case | Yes for most metrics | pytest-style framework | Single-turn LLM unit testing |
| TruLens | App + feedback functions | Yes | Framework + observability layer | Observability with feedback evaluation |
| Promptfoo | Prompt + provider matrix | Configurable | YAML test runner | Prompt regression across providers |

What makes tracecheck useful where the others fall short:

  • Trace as the unit, not a single output. If your agent loops three times, calls a wrong tool, and then produces a plausible reply, the existing tools score the reply. tracecheck scores the path.
  • Deterministic by default. tool_accuracy, step_efficiency, and failure_modes need zero LLM calls. You can run hundreds of traces in CI with no API key, no rate limits, no cost. The LLM-as-judge evaluators are opt-in for finer-grained checks.
  • CI-native exit codes. No dashboard, no service. exit 1 blocks the merge. Same form factor as pytest, ruff, mypy.

LLM-as-judge configuration

The two judge-based evaluators (context_drift, output_quality) need a backend. Configure it in your YAML:

evaluators:
  - tool_accuracy
  - step_efficiency
  - failure_modes
  - context_drift
  - output_quality

judge:
  provider: anthropic        # or "fake" for tests
  model: claude-sonnet-4-5
  max_tokens: 1024

output_quality:
  rubric: |
    The reply must accurately reflect the tool-call outcomes.
    Confirm dollar amounts when a refund was issued.

Rubric resolution is per-trace first, then YAML default — set trace.metadata.rubric to override per scenario.
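As an illustration of that lookup order, here is a sketch that uses plain dicts in place of the Trace model and the parsed YAML. resolve_rubric is a hypothetical name and the real resolution code may differ.

```python
def resolve_rubric(trace, config):
    """Return the per-trace rubric if one is set, else the YAML-level default."""
    per_trace = (trace.get("metadata") or {}).get("rubric")
    if per_trace:
        return per_trace
    return (config.get("output_quality") or {}).get("rubric")


config = {"output_quality": {"rubric": "The reply must reflect the tool-call outcomes."}}
trace_with_override = {"trace_id": "x", "metadata": {"rubric": "Must mention the order ID."}}
trace_without = {"trace_id": "y"}

chosen_override = resolve_rubric(trace_with_override, config)
chosen_default = resolve_rubric(trace_without, config)
```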

The AnthropicJudge enables prompt caching on the system block, so judging N traces costs roughly 1 + 0.1 × (N − 1) times the system-prompt token cost instead of N times. See tracecheck/judges/ for the protocol and the FakeJudge used in tests.
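A quick worked example of that estimate, taking the 0.1 cached-read factor from the formula above as given. The function name is made up for illustration, and the result is in units of "one full system prompt", not dollars.

```python
def relative_system_prompt_cost(n_traces, cached_read_factor=0.1):
    """System-prompt cost for n_traces judge calls, in units of one uncached prompt.

    The first call pays full price; each later call pays the cached-read factor.
    """
    if n_traces <= 0:
        return 0.0
    return 1 + cached_read_factor * (n_traces - 1)


with_cache = relative_system_prompt_cost(100)  # about 10.9 prompt-equivalents
without_cache = 100                            # every call pays full price
savings = 1 - with_cache / without_cache       # roughly 89% fewer system-prompt tokens billed
```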

Trace format

A trace is a JSON object with an ordered list of steps:

{
  "trace_id": "support_001",
  "agent_name": "support_agent",
  "user_input": "I got the wrong color sweater, refund please",
  "expected_tools": ["get_order", "verify_item_mismatch", "issue_refund"],
  "steps": [
    {"type": "tool_call", "name": "get_order",
     "input": {"user_id": "u_123"}, "output": {"order_id": "o_99"}},
    {"type": "tool_call", "name": "verify_item_mismatch",
     "input": {"order_id": "o_99"}, "output": {"mismatch": true}},
    {"type": "tool_call", "name": "issue_refund",
     "input": {"order_id": "o_99"}, "output": {"status": "ok"}},
    {"type": "llm_call", "output": "Refunded $49.99 to your card."}
  ]
}

Step types: llm_call, tool_call, retry, error. Each step may carry latency_ms, tokens, timestamp, and an error message. See tracecheck/schema.py for the full Pydantic spec.

A .jsonl file is one trace per line. A .json file may be a single trace or an array.
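Those file conventions can be sketched with the standard library alone. read_traces below is a hypothetical stand-in, not tracecheck's loader: it skips schema validation and OTLP detection, and returns plain dicts.

```python
import json
import tempfile
from pathlib import Path


def read_traces(path):
    """Return a list of trace dicts from a .jsonl or .json file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # One trace per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # A .json file may hold a single trace or an array of traces.
    return data if isinstance(data, list) else [data]


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "traces.jsonl"
    jsonl_path.write_text(
        '{"trace_id": "a", "steps": []}\n{"trace_id": "b", "steps": []}\n',
        encoding="utf-8",
    )
    json_path = Path(tmp) / "one.json"
    json_path.write_text('{"trace_id": "c", "steps": []}', encoding="utf-8")

    from_jsonl = read_traces(jsonl_path)
    from_json = read_traces(json_path)
```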

OpenTelemetry ingestion

If your agent already emits OpenTelemetry spans (LangChain callbacks, OpenAI/Anthropic instrumentation, Pydantic Logfire, OpenInference, etc.), point tracecheck at the OTLP/JSON span export directly — no hand-rolled logger required:

tracecheck run --traces my_otel_spans.json --config evals.yaml

The loader auto-detects OTLP/JSON (any file with a top-level resourceSpans key) and maps the OpenTelemetry GenAI semantic conventions onto the trace schema:

| OTel attribute | Maps to |
| --- | --- |
| gen_ai.tool.name | Step.name on a tool_call step |
| gen_ai.request.model / gen_ai.system | Step.name on an llm_call step |
| gen_ai.usage.input_tokens (or prompt_tokens) | Step.tokens.prompt |
| gen_ai.usage.output_tokens (or completion_tokens) | Step.tokens.completion |
| span.startTimeUnixNano | Step.timestamp |
| endTime − startTime | Step.latency_ms |
| span.status.code = ERROR | Step flagged with the error message |
| resource.service.name | Trace.agent_name |

Spans are grouped by traceId and ordered by start time. Spans that don't match a recognised pattern are dropped quietly.

from tracecheck import load_otel_traces

traces = load_otel_traces("spans.json")
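For a feel of what the mapping involves, here is a rough sketch over an OTLP/JSON payload. It is not the library's loader: spans_to_traces and _attrs are hypothetical names, the output is plain dicts, and token counts and error status are left out for brevity. It relies on two properties of OTLP/JSON: attributes are a list of key/value objects, and nanosecond timestamps are encoded as strings.

```python
from collections import defaultdict


def _attrs(obj):
    """Flatten OTLP/JSON attributes ([{key, value: {stringValue: ...}}]) to a dict."""
    out = {}
    for item in obj.get("attributes", []):
        # Each value object holds one typed field (stringValue, intValue, ...).
        out[item["key"]] = next(iter(item.get("value", {}).values()), None)
    return out


def spans_to_traces(otlp):
    """Group recognised GenAI spans by traceId and order them by start time."""
    steps_by_trace = defaultdict(list)
    agent_by_trace = {}
    for resource_span in otlp.get("resourceSpans", []):
        agent = _attrs(resource_span.get("resource", {})).get("service.name", "unknown")
        for scope in resource_span.get("scopeSpans", []):
            for span in scope.get("spans", []):
                a = _attrs(span)
                if "gen_ai.tool.name" in a:
                    step = {"type": "tool_call", "name": a["gen_ai.tool.name"]}
                elif "gen_ai.request.model" in a or "gen_ai.system" in a:
                    step = {"type": "llm_call",
                            "name": a.get("gen_ai.request.model") or a["gen_ai.system"]}
                else:
                    continue  # unrecognised span: dropped
                start = int(span["startTimeUnixNano"])
                step["latency_ms"] = (int(span["endTimeUnixNano"]) - start) / 1e6
                steps_by_trace[span["traceId"]].append((start, step))
                agent_by_trace.setdefault(span["traceId"], agent)
    return [
        {"trace_id": tid, "agent_name": agent_by_trace[tid],
         "steps": [s for _, s in sorted(items, key=lambda pair: pair[0])]}
        for tid, items in steps_by_trace.items()
    ]


sample = {"resourceSpans": [{
    "resource": {"attributes": [
        {"key": "service.name", "value": {"stringValue": "support_agent"}}]},
    "scopeSpans": [{"spans": [
        {"traceId": "t1", "startTimeUnixNano": "2000000", "endTimeUnixNano": "5000000",
         "attributes": [{"key": "gen_ai.request.model", "value": {"stringValue": "some-model"}}]},
        {"traceId": "t1", "startTimeUnixNano": "1000000", "endTimeUnixNano": "1500000",
         "attributes": [{"key": "gen_ai.tool.name", "value": {"stringValue": "get_order"}}]},
        {"traceId": "t1", "startTimeUnixNano": "1200000", "endTimeUnixNano": "1300000",
         "attributes": []},
    ]}],
}]}

converted = spans_to_traces(sample)
```

The third span carries no GenAI attributes, so it does not appear in the converted trace.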

One caveat: OTel-ingested traces have no expected_tools — OTel does not carry your test assertions. To run tool_accuracy on them you write a small golden test set and merge expected_tools in. The LLM-as-judge evaluators (context_drift, output_quality) work fine without it.

Roadmap

  • Trace ingestion (JSON, JSONL)
  • Tool call accuracy evaluator
  • Step efficiency evaluator
  • Failure mode detection
  • Context drift evaluator
  • Output quality evaluator
  • Static HTML report (--output html)
  • OpenTelemetry span ingest
  • PyPI release
  • Pydantic AI native integration
  • LangGraph trace adapter

Examples

See examples/.

For the fastest possible end-to-end demo, clone the repo and run the example traces straight away:

pip install tracecheck
git clone https://github.com/Mohdtalibakhtar/tracecheck && cd tracecheck/examples
tracecheck run --traces sample_traces.jsonl --config evals.yaml --output html --out report.html
open report.html    # macOS; use xdg-open on Linux

Architecture

traces.jsonl ──► ingest ──► Trace[] ──► runner ──► [Evaluator.evaluate(trace)] ──► report
                                            ▲
                                         evals.yaml

The library is intentionally small: a Pydantic schema, a loader, a registry of evaluators, and a runner that walks every trace through every configured evaluator. CLI and library share the same code path.
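The registry-plus-runner shape described here could look roughly like the following. Every name in this sketch (REGISTRY, evaluator, run, has_final_reply) is an assumption for illustration, and the real evaluators return richer results than booleans.

```python
REGISTRY = {}


def evaluator(name):
    """Decorator that registers an evaluator function under its config name."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register


@evaluator("tool_accuracy")
def tool_accuracy(trace):
    called = [s["name"] for s in trace["steps"] if s["type"] == "tool_call"]
    return called == trace.get("expected_tools", [])


@evaluator("has_final_reply")  # hypothetical evaluator, not one of the five above
def has_final_reply(trace):
    return bool(trace["steps"]) and trace["steps"][-1]["type"] == "llm_call"


def run(traces, evaluator_names):
    """Walk every trace through every configured evaluator."""
    return {
        t["trace_id"]: {name: REGISTRY[name](t) for name in evaluator_names}
        for t in traces
    }


trace = {
    "trace_id": "support_001",
    "expected_tools": ["get_order", "issue_refund"],
    "steps": [
        {"type": "tool_call", "name": "get_order"},
        {"type": "tool_call", "name": "issue_refund"},
        {"type": "llm_call", "output": "Refunded."},
    ],
}
reports = run([trace], ["tool_accuracy", "has_final_reply"])
```

Because the evaluator list is just names looked up in the registry, a CLI and a library entry point can call the same run function with a list read from the config file.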

License

MIT
