Tracecheck

Evaluate multi-step AI agent traces, not just single LLM responses.

Why

Most eval tools score one input/output pair at a time. A production agent may make 7 tool calls, retry, hit context limits, and only then produce an answer, so the trace is the unit you actually need to evaluate. tracecheck ingests full agent traces and scores them on tool accuracy, context drift, step efficiency, failure modes, and final output quality.

Quick start

pip install tracecheck           # core, deterministic evaluators only
pip install "tracecheck[llm]"    # adds the LLM-as-judge extras (anthropic SDK); quoted so shells like zsh do not expand the brackets

tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml

# write a self-contained HTML report you can open in a browser
tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml \
  --output html --out report.html

Sample text output:

Trace: support_001_pass (support_agent)
  [PASS] tool_accuracy:    Tool sequence matches expected (3 calls).
  [PASS] step_efficiency:  Efficient: 3 tool call(s), no loops or retries.
  [PASS] failure_modes:    No failure modes detected.
  [PASS] context_drift:    Stayed on topic.
  [PASS] output_quality:   Reply confirms refund and dollar amount.
  -> PASS

Trace: support_002_fail (support_agent)
  [FAIL] tool_accuracy:    Expected [get_order, verify_item_mismatch, issue_refund],
                           got [get_order, get_order, get_order, search_products].
  [FAIL] step_efficiency:  Inefficient: 2 consecutive duplicate tool call(s),
                           1 excess tool call(s) over expected.
  [FAIL] failure_modes:    Detected: infinite_loop, context_window_overflow.
  [PASS] context_drift:    Stayed on topic but failed mechanically.
  [SKIP] output_quality:   Skipped: other evaluators failed on this trace.
  -> FAIL

Aggregate: 1/2 traces passed
exit code: 1

Or programmatically:

from tracecheck import load_traces, run_evals
from tracecheck.report import to_text

traces = load_traces("traces.jsonl")
reports = run_evals(traces, "evals.yaml")
print(to_text(reports))

The CLI exits with code 1 if any trace fails, so you can drop it into CI.
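The exit-code contract can be sketched without tracecheck itself. The snippet below assumes each per-trace report reduces to a single pass/fail flag; `TraceResult` and `summarize` are hypothetical names for illustration, not part of tracecheck's API.

```python
from dataclasses import dataclass


@dataclass
class TraceResult:
    """Hypothetical stand-in for a per-trace report; not tracecheck's real type."""
    trace_id: str
    passed: bool


def summarize(results: list[TraceResult]) -> tuple[str, int]:
    """Return the aggregate line and a CI-friendly exit code (0 only if all passed)."""
    passed = sum(1 for r in results if r.passed)
    exit_code = 0 if passed == len(results) else 1
    return f"Aggregate: {passed}/{len(results)} traces passed", exit_code


results = [
    TraceResult("support_001_pass", True),
    TraceResult("support_002_fail", False),
]
line, code = summarize(results)
print(line)
print(f"exit code: {code}")
```

A CI job only needs the exit code: any non-zero value fails the step.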

Evaluators

| Evaluator | What it checks | Status |
|---|---|---|
| tool_accuracy | Did the agent call the right tools in the right order? | Built |
| step_efficiency | Tool calls vs expected; flags consecutive duplicate calls and retries | Built |
| failure_modes | Tags traces with known failure shapes (loops, context overflow, tool errors) | Built |
| context_drift | Does the agent stay on topic across steps? (LLM-as-judge) | Built |
| output_quality | Final reply scored against a rubric, only after others pass (LLM-as-judge) | Built |
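As a rough illustration of the two simplest deterministic checks, here is a minimal sketch that works on plain dicts. It assumes exact, order-sensitive matching for tool_accuracy and compares tool names only (ignoring inputs) when counting duplicates; the function names are hypothetical and the real evaluators may apply different rules.

```python
def tool_names(trace: dict) -> list[str]:
    """Names of tool_call steps, in the order they occurred."""
    return [s["name"] for s in trace["steps"] if s["type"] == "tool_call"]


def check_tool_accuracy(trace: dict) -> bool:
    """Pass only when the called tools equal expected_tools, order included."""
    return tool_names(trace) == trace["expected_tools"]


def check_step_efficiency(trace: dict) -> dict:
    """Count consecutive duplicate calls and calls beyond the expected number."""
    called = tool_names(trace)
    # A duplicate is a call whose name equals the call immediately before it.
    duplicates = sum(1 for prev, cur in zip(called, called[1:]) if prev == cur)
    excess = max(0, len(called) - len(trace["expected_tools"]))
    return {
        "duplicates": duplicates,
        "excess": excess,
        "passed": duplicates == 0 and excess == 0,
    }


# Tool sequence taken from the failing sample trace in the quick start.
failing = {
    "expected_tools": ["get_order", "verify_item_mismatch", "issue_refund"],
    "steps": [
        {"type": "tool_call", "name": name}
        for name in ["get_order", "get_order", "get_order", "search_products"]
    ],
}
print(check_tool_accuracy(failing))
print(check_step_efficiency(failing))
```

On this input the sketch reports 2 consecutive duplicates and 1 excess call, matching the counts in the sample text output.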

output_quality is deferred: the runner only invokes the judge after every other evaluator on the same trace passes. Already-broken traces do not burn judge tokens, and the report renders [SKIP] for them.
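The deferral rule can be sketched as a small gate in front of the judge. This is an illustration under assumptions, not the runner's actual code: `run_trace`, the `others` mapping, and `fake_judge` are hypothetical names, and evaluators are modeled as plain functions returning a bool.

```python
from typing import Callable

Evaluator = Callable[[dict], bool]


def run_trace(trace: dict, others: dict[str, Evaluator], judge: Evaluator) -> dict[str, str]:
    """Run every other evaluator first; call the output_quality judge only if all pass."""
    results = {name: ("PASS" if fn(trace) else "FAIL") for name, fn in others.items()}
    if all(status == "PASS" for status in results.values()):
        results["output_quality"] = "PASS" if judge(trace) else "FAIL"
    else:
        # The judge is never invoked here, so no tokens are spent on this trace.
        results["output_quality"] = "SKIP"
    return results


calls: list[str] = []


def fake_judge(trace: dict) -> bool:
    """Records each invocation so the skip behaviour is observable."""
    calls.append(trace["trace_id"])
    return True


broken = run_trace({"trace_id": "support_002_fail"},
                   {"tool_accuracy": lambda t: False}, fake_judge)
print(broken)
print(calls)
```

After the run above, `calls` stays empty because the failing tool_accuracy result short-circuits the judge.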

LLM-as-judge configuration

The two judge-based evaluators (context_drift, output_quality) need a backend. Configure it in your YAML:

evaluators:
  - tool_accuracy
  - step_efficiency
  - failure_modes
  - context_drift
  - output_quality

judge:
  provider: anthropic        # or "fake" for tests
  model: claude-sonnet-4-5
  max_tokens: 1024

output_quality:
  rubric: |
    The reply must accurately reflect the tool-call outcomes.
    Confirm dollar amounts when a refund was issued.

Rubric resolution is per-trace first, then the YAML default: set trace.metadata.rubric to override the rubric for a given scenario.
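A minimal sketch of that lookup order, assuming the trace and the parsed YAML are both plain dicts and that the per-trace rubric lives under a `metadata` key. `resolve_rubric` is a hypothetical helper, not a tracecheck function.

```python
from typing import Optional


def resolve_rubric(trace: dict, config: dict) -> Optional[str]:
    """Prefer the per-trace rubric; fall back to the YAML-level default."""
    per_trace = (trace.get("metadata") or {}).get("rubric")
    if per_trace:
        return per_trace
    return (config.get("output_quality") or {}).get("rubric")


config = {"output_quality": {"rubric": "Confirm dollar amounts when a refund was issued."}}
default_trace = {"trace_id": "support_001"}
custom_trace = {"trace_id": "support_003",
                "metadata": {"rubric": "Reply must mention the return window."}}

print(resolve_rubric(default_trace, config))
print(resolve_rubric(custom_trace, config))
```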

The AnthropicJudge enables prompt caching on the system block, so judging N traces costs roughly 1 + 0.1 × (N − 1) times the price of one uncached system prompt, rather than N times. See tracecheck/judges/ for the protocol and the FakeJudge used in tests.
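To make the caching estimate concrete, the arithmetic can be written out as below. This is a back-of-the-envelope sketch: it takes the 0.1 ratio from the formula above as given, ignores any cache-write surcharge and cache expiry, and covers only the system prompt, not the per-trace input or output tokens. The function name is hypothetical.

```python
def relative_system_prompt_cost(n_traces: int, cache_read_ratio: float = 0.1) -> float:
    """System-prompt cost for n_traces, in units of one uncached system prompt."""
    if n_traces <= 0:
        return 0.0
    # First call pays full price; each later call pays only the cached-read ratio.
    return 1 + cache_read_ratio * (n_traces - 1)


for n in (1, 10, 100):
    cached = relative_system_prompt_cost(n)
    print(f"{n} traces: {cached:.1f}x with caching vs {n}x without")
```

Under these assumptions, 100 traces cost about 10.9 system prompts instead of 100.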

Trace format

A trace is a JSON object with an ordered list of steps:

{
  "trace_id": "support_001",
  "agent_name": "support_agent",
  "user_input": "I got the wrong color sweater, refund please",
  "expected_tools": ["get_order", "verify_item_mismatch", "issue_refund"],
  "steps": [
    {"type": "tool_call", "name": "get_order",
     "input": {"user_id": "u_123"}, "output": {"order_id": "o_99"}},
    {"type": "tool_call", "name": "verify_item_mismatch",
     "input": {"order_id": "o_99"}, "output": {"mismatch": true}},
    {"type": "tool_call", "name": "issue_refund",
     "input": {"order_id": "o_99"}, "output": {"status": "ok"}},
    {"type": "llm_call", "output": "Refunded $49.99 to your card."}
  ]
}

Step types: llm_call, tool_call, retry, error. Each step may carry latency_ms, tokens, timestamp, and an error message. See tracecheck/schema.py for the full Pydantic spec.

A .jsonl file is one trace per line. A .json file may be a single trace or an array.
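The file-format rules can be sketched with the standard library alone. `read_traces` is a hypothetical helper that returns raw dicts; it is not tracecheck's `load_traces`, which validates against the Pydantic schema. Skipping blank lines in .jsonl input is an assumption of this sketch.

```python
import json
import tempfile
from pathlib import Path


def read_traces(path: str) -> list[dict]:
    """Read traces from .jsonl (one per line) or .json (single object or array)."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    if p.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # A .json file may hold one trace or a list of traces; normalize to a list.
    return data if isinstance(data, list) else [data]


with tempfile.TemporaryDirectory() as tmp:
    jsonl = Path(tmp) / "traces.jsonl"
    jsonl.write_text('{"trace_id": "a"}\n{"trace_id": "b"}\n', encoding="utf-8")
    single = Path(tmp) / "one.json"
    single.write_text('{"trace_id": "c"}', encoding="utf-8")
    print([t["trace_id"] for t in read_traces(str(jsonl))])
    print([t["trace_id"] for t in read_traces(str(single))])
```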

Roadmap

  • Trace ingestion (JSON, JSONL)
  • Tool call accuracy evaluator
  • Step efficiency evaluator
  • Failure mode detection
  • Context drift evaluator
  • Output quality evaluator
  • Static HTML report (--output html)
  • OpenTelemetry span ingest
  • Pydantic AI native integration
  • LangGraph trace adapter
  • PyPI release

Examples

See examples/, which includes the sample_traces.jsonl and evals.yaml files used in the quick start.

Architecture

traces.jsonl ──► ingest ──► Trace[] ──► runner ──► [Evaluator.evaluate(trace)] ──► report
                                            ▲
                                         evals.yaml

The library is intentionally small: a Pydantic schema, a loader, a registry of evaluators, and a runner that walks every trace through every configured evaluator. CLI and library share the same code path.
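The registry-plus-runner shape described above might look roughly like the following. All names here (`REGISTRY`, `register`, `run_all`) are hypothetical, evaluators are reduced to functions returning a bool, and the failure_modes check is deliberately simplified to "no error steps"; the real package uses Pydantic models and richer results.

```python
from typing import Callable

Evaluator = Callable[[dict], bool]
REGISTRY: dict[str, Evaluator] = {}


def register(name: str) -> Callable[[Evaluator], Evaluator]:
    """Decorator that adds an evaluator to the registry under the given name."""
    def decorator(fn: Evaluator) -> Evaluator:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("tool_accuracy")
def tool_accuracy(trace: dict) -> bool:
    called = [s["name"] for s in trace["steps"] if s["type"] == "tool_call"]
    return called == trace["expected_tools"]


@register("failure_modes")
def failure_modes(trace: dict) -> bool:
    # Simplification for this sketch: pass when no step has type "error".
    return not any(s["type"] == "error" for s in trace["steps"])


def run_all(traces: list[dict], enabled: list[str]) -> dict[str, dict[str, bool]]:
    """Walk every trace through every configured evaluator, in config order."""
    return {t["trace_id"]: {name: REGISTRY[name](t) for name in enabled} for t in traces}


trace = {
    "trace_id": "support_001",
    "expected_tools": ["get_order"],
    "steps": [{"type": "tool_call", "name": "get_order"},
              {"type": "llm_call", "output": "Done."}],
}
print(run_all([trace], ["tool_accuracy", "failure_modes"]))
```

Keeping evaluators behind a name-keyed registry is what lets the YAML config select them by string.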

License

MIT

Provenance

The following attestation bundles were made for tracecheck-0.4.0.tar.gz:

Publisher: release.yml on Mohdtalibakhtar/tracecheck