Evaluate multi-step AI agent traces, not just single LLM responses.

Tracecheck

Why

Most eval tools score one input/output pair at a time. A production agent might make seven tool calls, retry, hit a context limit, and only then produce an answer. The trace is the unit you actually need to evaluate. tracecheck ingests full agent traces and scores them on tool accuracy, context handling, step efficiency, failure modes, and final output quality.

Quick start

pip install tracecheck           # core, deterministic evaluators only
pip install "tracecheck[llm]"    # adds the LLM-as-judge extras (anthropic SDK); quoted so shells like zsh do not expand the brackets

tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml

# write a self-contained HTML report you can open in a browser
tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml \
  --output html --out report.html

Sample text output:

Trace: support_001_pass (support_agent)
  [PASS] tool_accuracy:    Tool sequence matches expected (3 calls).
  [PASS] step_efficiency:  Efficient: 3 tool call(s), no loops or retries.
  [PASS] failure_modes:    No failure modes detected.
  [PASS] context_drift:    Stayed on topic.
  [PASS] output_quality:   Reply confirms refund and dollar amount.
  -> PASS

Trace: support_002_fail (support_agent)
  [FAIL] tool_accuracy:    Expected [get_order, verify_item_mismatch, issue_refund],
                           got [get_order, get_order, get_order, search_products].
  [FAIL] step_efficiency:  Inefficient: 2 consecutive duplicate tool call(s),
                           1 excess tool call(s) over expected.
  [FAIL] failure_modes:    Detected: infinite_loop, context_window_overflow.
  [PASS] context_drift:    Stayed on topic but failed mechanically.
  [SKIP] output_quality:   Skipped: other evaluators failed on this trace.
  -> FAIL

Aggregate: 1/2 traces passed
exit code: 1

Or programmatically:

from tracecheck import load_traces, run_evals
from tracecheck.report import to_text

traces = load_traces("traces.jsonl")
reports = run_evals(traces, "evals.yaml")
print(to_text(reports))

The CLI exits with code 1 if any trace fails, so you can drop it into CI.

Integrating tracecheck into your agent

The trace data has to come from your agent. tracecheck reads traces; it does not produce them. Three steps to wire it in.

1. Log each step from inside your agent

Around each tool call and LLM call, append a step record. About ten lines of code total.

import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

steps = []  # reset this list at the start of each request

def log(step_type, **kwargs):
    """Append one step record with a UTC timestamp."""
    steps.append({
        "type": step_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **kwargs,
    })

# inside your handler (semantic_search, llm, and query are your own code):
hits = semantic_search(query)
log("tool_call", name="semantic_search", input={"q": query}, output={"n_hits": len(hits)})

reply = llm.generate(query, hits)
log("llm_call", output=reply)

# at the end of each request, append one trace line and close the file:
with Path("traces.jsonl").open("a", encoding="utf-8") as f:
    f.write(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "agent_name": "my_agent",
        "user_input": query,
        "steps": steps,
    }) + "\n")

If you already use a framework with built-in tracing (LangChain callbacks, Pydantic AI events, OpenTelemetry, Logfire, etc.) you can adapt their output to this schema instead of writing logging from scratch.

2. Author a golden test set

One JSON line per scenario you care about, listing the user input and the tools you expect the agent to call, in order. This is your assertion file — same role as assert result == expected in pytest. tracecheck has no opinion on what your agent should do; you tell it.

{"trace_id":"refund_ok",       "agent_name":"my_agent", "user_input":"Refund my order",   "expected_tools":["get_order","verify","issue_refund"]}
{"trace_id":"already_shipped", "agent_name":"my_agent", "user_input":"Cancel order o_99", "expected_tools":["get_order","check_shipping"]}

3. Replay the golden inputs through your agent, then run tracecheck

A short driver script feeds each user_input through your real agent (with the logger turned on) so the agent appends actual steps to traces.jsonl. Merge the expected_tools from your golden set onto the captured traces.
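The merge step could look roughly like the sketch below. It is not part of tracecheck: `merge_expected_tools` is a hypothetical helper, and matching golden scenarios to captured traces on `user_input` is an assumption (the logger above assigns random trace IDs, so there is no shared ID to join on).

```python
import json


def merge_expected_tools(golden, captured):
    """Copy expected_tools from golden scenarios onto captured traces.

    Both arguments are lists of dicts, one per JSONL line. Traces are
    matched to scenarios by user_input (an assumption of this sketch).
    The captured dicts are not modified; new dicts are returned.
    """
    expected_by_input = {g["user_input"]: g["expected_tools"] for g in golden}
    merged = []
    for trace in captured:
        trace = dict(trace)  # shallow copy so the caller's data stays untouched
        if trace["user_input"] in expected_by_input:
            trace["expected_tools"] = expected_by_input[trace["user_input"]]
        merged.append(trace)
    return merged


if __name__ == "__main__":
    golden = [{"trace_id": "refund_ok", "agent_name": "my_agent",
               "user_input": "Refund my order",
               "expected_tools": ["get_order", "verify", "issue_refund"]}]
    captured = [{"trace_id": "generated-id", "agent_name": "my_agent",
                 "user_input": "Refund my order", "steps": []}]
    for trace in merge_expected_tools(golden, captured):
        print(json.dumps(trace))
```

If your driver script sets `trace_id` itself, joining on that field instead would be more robust than joining on the input text.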

Then:

tracecheck run --traces traces.jsonl --config evals.yaml

Drop that command into a GitHub Actions step on every PR. Exit code 1 blocks the merge if any trace regresses.

Evaluators

| Evaluator | What it checks | Status |
| --- | --- | --- |
| `tool_accuracy` | Did the agent call the right tools in the right order? | Built |
| `step_efficiency` | Tool calls vs expected; flags consecutive duplicate calls and retries | Built |
| `failure_modes` | Tags traces with known failure shapes (loops, context overflow, tool errors) | Built |
| `context_drift` | Does the agent stay on topic across steps? (LLM as judge) | Built |
| `output_quality` | Final reply scored against a rubric, only after others pass (LLM as judge) | Built |

output_quality is deferred: the runner only invokes the judge after every other evaluator on the same trace passes. Already-broken traces do not burn judge tokens, and the report renders [SKIP] for them.

What each evaluator uniquely catches

Stacking the five evaluators gives pinpoint diagnostics, because each catches a different kind of failure. Imagine an agent expected to call tools [A, B, C]:

| Trace shape | `tool_accuracy` | `step_efficiency` | `failure_modes` | `context_drift` | `output_quality` |
| --- | --- | --- | --- | --- | --- |
| `[A, B, C]` (perfect) | PASS | PASS | PASS | PASS | PASS |
| `[A, A, A, B]` (looped on A) | FAIL | FAIL | FAIL `infinite_loop` | PASS | SKIP |
| `[A, B, A, B, A, B]` (non-adjacent oscillation) | FAIL | soft pass | FAIL `infinite_loop` | PASS | SKIP |
| `[A, B, search_unrelated, C]` (drifted off-topic) | FAIL | FAIL | PASS | FAIL | SKIP |
| `[A, B, C]` + reply "refunded $500" but no refund tool ran | PASS | PASS | PASS | PASS | FAIL |
| `[A, B]` + step `error: "context window exceeded"` | PASS | PASS | FAIL `context_overflow` | PASS | SKIP |

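Where a row has exactly one FAIL, that evaluator is the only one that detects the failure shape: only output_quality catches the fabricated refund, and only failure_modes catches the context overflow. Any other single evaluator on its own would silently let those rows through.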
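To make the difference between the looped and oscillating rows concrete, here is a minimal sketch of two checks over a list of tool names. It is an illustration only, not tracecheck's implementation: the function names and the `min_repeats` threshold are assumptions.

```python
def consecutive_duplicates(tools):
    """Count calls that repeat the immediately preceding tool name."""
    return sum(1 for prev, cur in zip(tools, tools[1:]) if prev == cur)


def has_repeating_cycle(tools, min_repeats=3):
    """Return True if some block of tool names repeats back to back
    at least min_repeats times (block size 1 covers a plain loop)."""
    n = len(tools)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            block = tools[start:start + size]
            if all(tools[start + k * size:start + (k + 1) * size] == block
                   for k in range(1, min_repeats)):
                return True
    return False


if __name__ == "__main__":
    looped = ["A", "A", "A", "B"]
    oscillating = ["A", "B", "A", "B", "A", "B"]
    print(consecutive_duplicates(looped), has_repeating_cycle(looped))
    print(consecutive_duplicates(oscillating), has_repeating_cycle(oscillating))
```

An adjacent-duplicate count sees nothing wrong with `[A, B, A, B, A, B]`, which is consistent with the "soft pass" in the table, while a cycle check flags both shapes.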

Quick mental shortcut

tool_accuracy: did you call the right things
step_efficiency: without thrashing
failure_modes: and if you broke, was it a known break
context_drift: and you stayed on topic
output_quality: and your final answer was actually good

How tracecheck compares to existing eval tools

There are excellent eval tools already. tracecheck does not replace them; it fills a different slot.

| Tool | Unit of evaluation | LLM key needed? | Form factor | Best for |
| --- | --- | --- | --- | --- |
| tracecheck | Full agent trace (multi-step) | Optional (3 of 5 evaluators are deterministic) | Library + CLI, drops into CI | Multi-step agent regression testing |
| Ragas | RAG retrieval + answer | Yes (LLM-judge metrics) | Library | RAG-specific metrics (faithfulness, context precision, etc.) |
| DeepEval | Single test case | Yes for most metrics | pytest-style framework | Single-turn LLM unit testing |
| TruLens | App + feedback functions | Yes | Framework + observability layer | Observability with feedback evaluation |
| Promptfoo | Prompt + provider matrix | Configurable | YAML test runner | Prompt regression across providers |

The shape that makes tracecheck useful when the others fall short:

- Trace as the unit, not a single output. If your agent loops three times, calls a wrong tool, and then produces a plausible reply, the existing tools score the reply. tracecheck scores the path.
- Deterministic by default. tool_accuracy, step_efficiency, and failure_modes need zero LLM calls. You can run hundreds of traces in CI with no API key, no rate limits, no cost. The LLM-as-judge evaluators are opt-in for finer-grained checks.
- CI-native exit codes. No dashboard, no service. exit 1 blocks the merge. Same form factor as pytest, ruff, mypy.

LLM-as-judge configuration

The two judge-based evaluators (context_drift, output_quality) need a backend. Configure it in your YAML:

evaluators:
  - tool_accuracy
  - step_efficiency
  - failure_modes
  - context_drift
  - output_quality

judge:
  provider: anthropic        # or "fake" for tests
  model: claude-sonnet-4-5
  max_tokens: 1024

output_quality:
  rubric: |
    The reply must accurately reflect the tool-call outcomes.
    Confirm dollar amounts when a refund was issued.

Rubric resolution is per-trace first, then YAML default — set trace.metadata.rubric to override per scenario.
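The resolution order can be sketched with plain dictionaries. This is an illustration under assumptions, not tracecheck's code: `resolve_rubric` is a hypothetical name, and the real library works on Pydantic models rather than dicts.

```python
def resolve_rubric(trace, config):
    """Return the per-trace rubric if one is set, else the YAML default.

    trace  : dict for one trace, optionally with metadata.rubric
    config : dict parsed from evals.yaml, optionally with output_quality.rubric
    Returns None when neither place defines a rubric.
    """
    metadata = trace.get("metadata") or {}
    per_trace = metadata.get("rubric")
    if per_trace:
        return per_trace
    return (config.get("output_quality") or {}).get("rubric")


if __name__ == "__main__":
    config = {"output_quality": {"rubric": "Reply must reflect tool outcomes."}}
    plain = {"trace_id": "a"}
    custom = {"trace_id": "b", "metadata": {"rubric": "Reply must mention shipping status."}}
    print(resolve_rubric(plain, config))
    print(resolve_rubric(custom, config))
```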

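The AnthropicJudge enables prompt caching on the system block, so judging N traces costs roughly 1 + 0.1 × (N − 1) times the input-token price of one system prompt, instead of N times. For N = 100 that is about 10.9 system prompts' worth rather than 100. See tracecheck/judges/ for the protocol and the FakeJudge used in tests.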

Trace format

A trace is a JSON object with an ordered list of steps:

{
  "trace_id": "support_001",
  "agent_name": "support_agent",
  "user_input": "I got the wrong color sweater, refund please",
  "expected_tools": ["get_order", "verify_item_mismatch", "issue_refund"],
  "steps": [
    {"type": "tool_call", "name": "get_order",
     "input": {"user_id": "u_123"}, "output": {"order_id": "o_99"}},
    {"type": "tool_call", "name": "verify_item_mismatch",
     "input": {"order_id": "o_99"}, "output": {"mismatch": true}},
    {"type": "tool_call", "name": "issue_refund",
     "input": {"order_id": "o_99"}, "output": {"status": "ok"}},
    {"type": "llm_call", "output": "Refunded $49.99 to your card."}
  ]
}

Step types: llm_call, tool_call, retry, error. Each step may carry latency_ms, tokens, timestamp, and an error message. See tracecheck/schema.py for the full Pydantic spec.

A .jsonl file is one trace per line. A .json file may be a single trace or an array.
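The file-shape rules above can be sketched with the standard library alone. This is not the tracecheck loader (use `load_traces` for real work, since it also validates against the schema); `read_trace_dicts` is a hypothetical helper that returns raw dictionaries.

```python
import json
from pathlib import Path


def read_trace_dicts(path):
    """Read raw trace dicts from a .jsonl or .json file.

    .jsonl : one JSON object per non-empty line
    .json  : either a single trace object or an array of traces
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```

Always returning a list means downstream code does not need to care which of the three shapes the file had.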

OpenTelemetry ingestion

If your agent already emits OpenTelemetry spans (LangChain callbacks, OpenAI/Anthropic instrumentation, Pydantic Logfire, OpenInference, etc.), point tracecheck at the OTLP/JSON span export directly — no hand-rolled logger required:

tracecheck run --traces my_otel_spans.json --config evals.yaml

The loader auto-detects OTLP/JSON (any file with a top-level resourceSpans key) and maps the OpenTelemetry GenAI semantic conventions onto the trace schema:

| OTel attribute | Maps to |
| --- | --- |
| `gen_ai.tool.name` | `Step.name` on a `tool_call` step |
| `gen_ai.request.model` / `gen_ai.system` | `Step.name` on an `llm_call` step |
| `gen_ai.usage.input_tokens` (or `prompt_tokens`) | `Step.tokens.prompt` |
| `gen_ai.usage.output_tokens` (or `completion_tokens`) | `Step.tokens.completion` |
| `span.startTimeUnixNano` | `Step.timestamp` |
| `endTime − startTime` | `Step.latency_ms` |
| `span.status.code = ERROR` | `Step` flagged with the error message |
| `resource.service.name` | `Trace.agent_name` |

Spans are grouped by traceId and ordered by start time. Spans that don't match a recognised pattern are silently dropped.

from tracecheck import load_otel_traces

traces = load_otel_traces("spans.json")
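For intuition, the grouping, ordering, and latency mapping described in this section can be sketched on a parsed OTLP/JSON document. This is an illustration, not the tracecheck loader: `group_spans` and `latency_ms` are hypothetical helpers, and the sketch assumes the usual OTLP/JSON layout of `resourceSpans` → `scopeSpans` → `spans`, with nanosecond timestamps encoded as strings.

```python
from collections import defaultdict


def group_spans(otlp):
    """Group spans by traceId and sort each group by start time."""
    by_trace = defaultdict(list)
    for resource_span in otlp.get("resourceSpans", []):
        for scope_span in resource_span.get("scopeSpans", []):
            for span in scope_span.get("spans", []):
                by_trace[span["traceId"]].append(span)
    for spans in by_trace.values():
        # timestamps are decimal strings in OTLP/JSON, so convert before sorting
        spans.sort(key=lambda s: int(s["startTimeUnixNano"]))
    return dict(by_trace)


def latency_ms(span):
    """Span duration in milliseconds (end minus start, nanoseconds to ms)."""
    elapsed_ns = int(span["endTimeUnixNano"]) - int(span["startTimeUnixNano"])
    return elapsed_ns / 1_000_000
```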

One caveat: OTel-ingested traces have no expected_tools — OTel does not carry your test assertions. To run tool_accuracy on them you write a small golden test set and merge expected_tools in. The LLM-as-judge evaluators (context_drift, output_quality) work fine without it.

Pydantic AI integration

If you build with Pydantic AI, the adapter turns an AgentRunResult directly into a tracecheck Trace — no hand-rolled logger, no OTel exporter, just one function call after each run:

import asyncio

from pydantic_ai import Agent
from tracecheck import pydantic_ai_to_trace, run_evals

# get_order, verify, and issue_refund are your own tool functions
agent = Agent("anthropic:claude-sonnet-4-5", tools=[get_order, verify, issue_refund])

async def main():
    # agent.run is a coroutine, so await it inside an async function
    result = await agent.run("Refund my order")

    trace = pydantic_ai_to_trace(
        result,
        trace_id="refund_happy_path",
        agent_name="support_agent",
        expected_tools=["get_order", "verify", "issue_refund"],
    )

    return run_evals([trace], "evals.yaml")

reports = asyncio.run(main())

What the adapter does:

- Walks the result.all_messages() list in chronological order
- Emits one llm_call step per TextPart
- Merges each ToolCallPart with its matching ToolReturnPart (linked by tool_call_id) into a single tool_call step, with input from the call args and output from the return content
- Turns each RetryPromptPart into a retry step so step_efficiency and failure_modes can score it
- Pulls user_input from the first UserPromptPart

The adapter is duck-typed: tracecheck does not import pydantic_ai, so installing tracecheck adds zero new dependencies to your project. You install Pydantic AI; you hand us the result.
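A duck-typed merge of this kind might look like the sketch below. It is not the adapter's source: `parts_to_steps` is a hypothetical function, the stand-in objects are built with `SimpleNamespace`, and the attribute names (`part_kind`, `tool_name`, `args`, `content`, `tool_call_id`) are assumed to mirror Pydantic AI's message parts.

```python
from types import SimpleNamespace


def parts_to_steps(parts):
    """Turn a flat, chronological list of message parts into step dicts.

    Only attribute access is used, so any object with the right
    attributes works and no framework import is needed.
    """
    steps = []
    pending = {}  # tool_call_id -> tool_call step still waiting for its return
    for part in parts:
        kind = getattr(part, "part_kind", None)
        if kind == "text":
            steps.append({"type": "llm_call", "output": part.content})
        elif kind == "tool-call":
            step = {"type": "tool_call", "name": part.tool_name,
                    "input": part.args, "output": None}
            pending[part.tool_call_id] = step
            steps.append(step)
        elif kind == "tool-return" and part.tool_call_id in pending:
            # fill in the output on the step created for the matching call
            pending.pop(part.tool_call_id)["output"] = part.content
        elif kind == "retry-prompt":
            steps.append({"type": "retry", "output": part.content})
    return steps


if __name__ == "__main__":
    parts = [
        SimpleNamespace(part_kind="tool-call", tool_name="get_order",
                        args={"user_id": "u_123"}, tool_call_id="c1"),
        SimpleNamespace(part_kind="tool-return", tool_name="get_order",
                        content={"order_id": "o_99"}, tool_call_id="c1"),
        SimpleNamespace(part_kind="text", content="Found your order."),
    ]
    for step in parts_to_steps(parts):
        print(step)
```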

LangChain / LangGraph integration

If your agent is built with LangGraph (or any LangChain runnable), collect the events emitted by graph.astream_events(...) and hand them to the adapter:

import asyncio

from tracecheck import langgraph_events_to_trace, run_evals

async def main():
    # graph is your compiled LangGraph graph (or any LangChain runnable)
    events = []
    async for ev in graph.astream_events({"input": "Refund my order"}, version="v2"):
        events.append(ev)

    trace = langgraph_events_to_trace(
        events,
        trace_id="refund_happy_path",
        agent_name="support_agent",
        expected_tools=["get_order", "verify", "issue_refund"],
    )

    return run_evals([trace], "evals.yaml")

reports = asyncio.run(main())

What the adapter does:

- Pairs every on_*_start event with its matching on_*_end (or on_*_error) by run_id
- Emits one tool_call step per on_tool_start (input from call args, output from the return)
- Emits one llm_call step per on_chat_model_start / on_llm_start (model name from metadata.ls_model_name)
- Turns tool errors and chat-model errors into error steps
- Drops unrelated event types (chain, retriever, parser, etc.) so the trace stays focused on the agent's decisions
- Tries to extract user_input from the first chat call's human message

Duck-typed: tracecheck does not import langchain or langgraph. Same zero-dependency story as the Pydantic AI adapter.
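The filtering and start/end pairing can be illustrated on plain event dictionaries. This sketch is not the adapter's code: `events_to_steps` is a hypothetical function, it ignores error events and model-name extraction, and it assumes each event is a dict with `event`, `name`, `run_id`, and `data` keys. It needs Python 3.9+ for `str.removeprefix`.

```python
# event families worth keeping, mapped to the step type they produce
TRACKED = {"tool": "tool_call", "chat_model": "llm_call", "llm": "llm_call"}


def events_to_steps(events):
    """Pair *_start and *_end events by run_id; drop untracked families."""
    steps = []
    open_runs = {}  # run_id -> step waiting for its end event
    for ev in events:
        # "on_chat_model_start" -> prefix "on_chat_model", phase "start"
        prefix, _, phase = ev["event"].rpartition("_")
        step_type = TRACKED.get(prefix.removeprefix("on_"))
        if step_type is None:
            continue  # chain, retriever, parser, and similar events
        data = ev.get("data", {})
        if phase == "start":
            step = {"type": step_type, "name": ev.get("name"),
                    "input": data.get("input"), "output": None}
            open_runs[ev["run_id"]] = step
            steps.append(step)
        elif phase == "end" and ev["run_id"] in open_runs:
            open_runs.pop(ev["run_id"])["output"] = data.get("output")
    return steps
```

Keying on run_id rather than on position lets interleaved or nested runs pair up correctly.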

Roadmap

- Trace ingestion (JSON, JSONL)
- Tool call accuracy evaluator
- Step efficiency evaluator
- Failure mode detection
- Context drift evaluator
- Output quality evaluator
- Static HTML report (--output html)
- OpenTelemetry span ingest
- PyPI release
- Pydantic AI native integration
- LangGraph / LangChain adapter

Examples

See examples/:

- basic_usage.py: programmatic API
- sample_traces.jsonl: three traces (pass, fail, edge case)
- evals.yaml: minimal deterministic config (no LLM key needed)
- evals_with_judge.yaml: full config with the LLM-as-judge evaluators enabled

For the fastest possible end-to-end demo, clone the repo and run the example traces straight away:

pip install tracecheck
git clone https://github.com/Mohdtalibakhtar/tracecheck && cd tracecheck/examples
tracecheck run --traces sample_traces.jsonl --config evals.yaml --output html --out report.html
open report.html    # macOS; use xdg-open on Linux

Architecture

traces.jsonl ──► ingest ──► Trace[] ──► runner ──► [Evaluator.evaluate(trace)] ──► report
                                            ▲
                                         evals.yaml

The library is intentionally small: a Pydantic schema, a loader, a registry of evaluators, and a runner that walks every trace through every configured evaluator. CLI and library share the same code path.
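The registry-plus-runner shape, including the deferred judge, can be sketched in a few lines. This is a toy model under assumptions, not tracecheck's internals: `REGISTRY`, `register`, `run_trace`, and the stub `output_quality` evaluator are all hypothetical, and traces are plain dicts instead of Pydantic models.

```python
REGISTRY = {}  # evaluator name -> function(trace) returning (status, message)


def register(name):
    """Decorator that adds an evaluator function to the registry."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


@register("tool_accuracy")
def tool_accuracy(trace):
    called = [s["name"] for s in trace["steps"] if s["type"] == "tool_call"]
    expected = trace.get("expected_tools", [])
    if called == expected:
        return "PASS", f"Tool sequence matches expected ({len(called)} calls)."
    return "FAIL", f"Expected {expected}, got {called}."


@register("output_quality")
def output_quality(trace):
    # stand-in for an LLM judge call; always passes in this sketch
    return "PASS", "Judge stub accepted the reply."


def run_trace(trace, names, deferred=("output_quality",)):
    """Run non-deferred evaluators first; run deferred ones only if all passed."""
    results = {}
    for name in names:
        if name not in deferred:
            results[name] = REGISTRY[name](trace)
    all_passed = all(status == "PASS" for status, _ in results.values())
    for name in names:
        if name in deferred:
            if all_passed:
                results[name] = REGISTRY[name](trace)
            else:
                results[name] = ("SKIP", "Skipped: other evaluators failed on this trace.")
    return results
```

Deferring the judge this way is what keeps already-broken traces from spending judge tokens.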

License

MIT

Release history

- 0.7.0 (this version): May 11, 2026
- 0.5.0: May 11, 2026
- 0.4.0: May 9, 2026

Download files

Source Distribution: tracecheck-0.7.0.tar.gz (31.7 kB), uploaded May 11, 2026, Source

Built Distribution: tracecheck-0.7.0-py3-none-any.whl (36.9 kB), uploaded May 11, 2026, Python 3

File details

Details for the file tracecheck-0.7.0.tar.gz.

File metadata

Download URL: tracecheck-0.7.0.tar.gz
Upload date: May 11, 2026
Size: 31.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracecheck-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`f6d4afafe58b38bafbb3991fd8f08264525d1f9e56d41ee2a748bb5a76249486`
MD5	`a2594c48e74c027d544195a334cc9c34`
BLAKE2b-256	`79b15ec0936493e839dfc5a194120da5ec42589e984765b0996b2f7837ff4cd1`

Provenance

The following attestation bundles were made for tracecheck-0.7.0.tar.gz:

Publisher: release.yml on Mohdtalibakhtar/tracecheck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracecheck-0.7.0.tar.gz
- Subject digest: f6d4afafe58b38bafbb3991fd8f08264525d1f9e56d41ee2a748bb5a76249486
- Sigstore transparency entry: 1506505298
- Sigstore integration time: May 11, 2026
Source repository:
- Permalink: Mohdtalibakhtar/tracecheck@f0a42e47d58459536374ce884c3976e022fb6859
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/Mohdtalibakhtar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f0a42e47d58459536374ce884c3976e022fb6859
- Trigger Event: push

File details

Details for the file tracecheck-0.7.0-py3-none-any.whl.

File metadata

Download URL: tracecheck-0.7.0-py3-none-any.whl
Upload date: May 11, 2026
Size: 36.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracecheck-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e8c9a62e81214d9ba96a4b74af976310f84cbcc0dd5f6197043346c0a9a3b2b6`
MD5	`31d508641ad472a46bb8a05e13ed2eba`
BLAKE2b-256	`9e68b4af4631a2e666530616dc4c97333a6d41ee1be12d9b38f714376a7e81d9`

Provenance

The following attestation bundles were made for tracecheck-0.7.0-py3-none-any.whl:

Publisher: release.yml on Mohdtalibakhtar/tracecheck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracecheck-0.7.0-py3-none-any.whl
- Subject digest: e8c9a62e81214d9ba96a4b74af976310f84cbcc0dd5f6197043346c0a9a3b2b6
- Sigstore transparency entry: 1506505455
- Sigstore integration time: May 11, 2026
Source repository:
- Permalink: Mohdtalibakhtar/tracecheck@f0a42e47d58459536374ce884c3976e022fb6859
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/Mohdtalibakhtar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f0a42e47d58459536374ce884c3976e022fb6859
- Trigger Event: push
