Tracecheck

Evaluate multi-step AI agent traces, not just single LLM responses.

Why

Most eval tools score one input/output pair at a time. A production agent might make seven tool calls, retry, hit a context limit, and only then produce an answer, so the trace is the unit you actually need to evaluate. tracecheck ingests full agent traces and scores them on tool accuracy, context handling, step efficiency, failure modes, and final output quality.

Quick start

pip install tracecheck         # core, deterministic evaluators only
pip install "tracecheck[llm]"  # adds the LLM-as-judge extras (anthropic SDK); quoted so shells like zsh do not glob the brackets

tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml

# write a self-contained HTML report you can open in a browser
tracecheck run \
  --traces examples/sample_traces.jsonl \
  --config examples/evals.yaml \
  --output html --out report.html

Sample text output:

Trace: support_001_pass (support_agent)
  [PASS] tool_accuracy:    Tool sequence matches expected (3 calls).
  [PASS] step_efficiency:  Efficient: 3 tool call(s), no loops or retries.
  [PASS] failure_modes:    No failure modes detected.
  [PASS] context_drift:    Stayed on topic.
  [PASS] output_quality:   Reply confirms refund and dollar amount.
  -> PASS

Trace: support_002_fail (support_agent)
  [FAIL] tool_accuracy:    Expected [get_order, verify_item_mismatch, issue_refund],
                           got [get_order, get_order, get_order, search_products].
  [FAIL] step_efficiency:  Inefficient: 2 consecutive duplicate tool call(s),
                           1 excess tool call(s) over expected.
  [FAIL] failure_modes:    Detected: infinite_loop, context_window_overflow.
  [PASS] context_drift:    Stayed on topic but failed mechanically.
  [SKIP] output_quality:   Skipped: other evaluators failed on this trace.
  -> FAIL

Aggregate: 1/2 traces passed
exit code: 1

Or programmatically:

from tracecheck import load_traces, run_evals
from tracecheck.report import to_text

traces = load_traces("traces.jsonl")
reports = run_evals(traces, "evals.yaml")
print(to_text(reports))

The CLI exits with code 1 if any trace fails, so you can drop it into CI.
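If your CI step is itself a Python script, the exit code can be propagated with a thin wrapper. This is a minimal sketch, assuming the `tracecheck` executable is on PATH; it uses only the flags shown in the quick start, and the helper names `build_command` and `gate` are hypothetical, not part of the library.

```python
import subprocess


def build_command(traces_path: str, config_path: str) -> list[str]:
    """Assemble the CLI invocation using only the flags shown in the quick start."""
    return ["tracecheck", "run", "--traces", traces_path, "--config", config_path]


def gate(traces_path: str, config_path: str) -> int:
    """Run the CLI and return its exit code (1 when any trace fails)."""
    completed = subprocess.run(build_command(traces_path, config_path))
    return completed.returncode
```

A CI script could then end with `sys.exit(gate("examples/sample_traces.jsonl", "examples/evals.yaml"))` so the job fails whenever tracecheck does.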

Evaluators

| Evaluator | What it checks | Status |
| --- | --- | --- |
| `tool_accuracy` | Did the agent call the right tools in the right order? | Built |
| `step_efficiency` | Tool calls vs. expected; flags consecutive duplicate calls and retries | Built |
| `failure_modes` | Tags traces with known failure shapes (loops, context overflow, tool errors) | Built |
| `context_drift` | Does the agent stay on topic across steps? (LLM-as-judge) | Built |
| `output_quality` | Final reply scored against a rubric, only after the others pass (LLM-as-judge) | Built |
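To make the deterministic checks concrete, here is a rough sketch of how the numbers in the sample output could be derived from a list of tool names. It is not the library's implementation: the function names are hypothetical, and it assumes duplicates are detected by tool name alone, ignoring inputs.

```python
def tool_sequence_matches(called: list[str], expected: list[str]) -> bool:
    """Exact, order-sensitive comparison of tool names."""
    return called == expected


def consecutive_duplicates(called: list[str]) -> int:
    """Count calls that repeat the immediately preceding tool name."""
    return sum(1 for prev, cur in zip(called, called[1:]) if prev == cur)


def excess_calls(called: list[str], expected: list[str]) -> int:
    """Number of calls beyond the expected count (never negative)."""
    return max(0, len(called) - len(expected))


# Tool sequences taken from the support_002_fail trace in the sample output.
expected = ["get_order", "verify_item_mismatch", "issue_refund"]
called = ["get_order", "get_order", "get_order", "search_products"]

print(tool_sequence_matches(called, expected))  # False
print(consecutive_duplicates(called))           # 2
print(excess_calls(called, expected))           # 1
```

Under these assumptions the sketch reproduces the "2 consecutive duplicate tool call(s), 1 excess tool call(s)" figures reported for the failing trace.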

output_quality is deferred: the runner only invokes the judge after every other evaluator on the same trace passes. Already-broken traces do not burn judge tokens, and the report renders [SKIP] for them.
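The deferral rule can be sketched as follows. This is an illustration under assumptions rather than the runner's actual code: evaluators are modelled as plain callables returning a `(status, message)` pair, and `evaluate_with_deferred_judge`, `fake_judge`, and the `tools_ok` field are hypothetical names.

```python
from typing import Callable

Verdict = tuple[str, str]  # (status, message); status is "PASS", "FAIL", or "SKIP"


def evaluate_with_deferred_judge(
    trace: dict,
    other_checks: dict[str, Callable[[dict], Verdict]],
    judge_check: Callable[[dict], Verdict],
) -> dict[str, Verdict]:
    """Run every other evaluator first; call the judge only if all of them pass."""
    results = {name: check(trace) for name, check in other_checks.items()}
    if all(status == "PASS" for status, _ in results.values()):
        results["output_quality"] = judge_check(trace)
    else:
        # The judge is never invoked here, so no tokens are spent.
        results["output_quality"] = ("SKIP", "Skipped: other evaluators failed on this trace.")
    return results


judge_calls: list[str] = []


def fake_judge(trace: dict) -> Verdict:
    """Stand-in judge that records which traces it was asked to score."""
    judge_calls.append(trace["trace_id"])
    return ("PASS", "Reply meets the rubric.")


checks = {
    "tool_accuracy": lambda t: ("PASS", "ok") if t["tools_ok"] else ("FAIL", "wrong tools"),
}
good = evaluate_with_deferred_judge({"trace_id": "a", "tools_ok": True}, checks, fake_judge)
bad = evaluate_with_deferred_judge({"trace_id": "b", "tools_ok": False}, checks, fake_judge)
```

After both calls, `judge_calls` contains only `"a"`: the failing trace was reported as SKIP without reaching the judge.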

LLM-as-judge configuration

The two judge-based evaluators (context_drift, output_quality) need a backend. Configure it in your YAML:

evaluators:
  - tool_accuracy
  - step_efficiency
  - failure_modes
  - context_drift
  - output_quality

judge:
  provider: anthropic        # or "fake" for tests
  model: claude-sonnet-4-5
  max_tokens: 1024

output_quality:
  rubric: |
    The reply must accurately reflect the tool-call outcomes.
    Confirm dollar amounts when a refund was issued.

The rubric is resolved per trace first and falls back to the YAML default. Set trace.metadata.rubric to override it for a specific scenario.
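The lookup order might look like the sketch below. It assumes the trace and the parsed YAML are available as plain dicts with a `metadata` key and an `output_quality` key respectively; `resolve_rubric` is a hypothetical helper, not a documented function.

```python
from typing import Optional


def resolve_rubric(trace: dict, config: dict) -> Optional[str]:
    """Prefer the trace's own rubric; otherwise fall back to the YAML default."""
    per_trace = (trace.get("metadata") or {}).get("rubric")
    if per_trace:
        return per_trace
    return (config.get("output_quality") or {}).get("rubric")


config = {"output_quality": {"rubric": "Confirm dollar amounts when a refund was issued."}}
plain_trace = {"trace_id": "support_001"}
custom_trace = {"trace_id": "support_003", "metadata": {"rubric": "Reply must apologize."}}

print(resolve_rubric(plain_trace, config))   # falls back to the YAML default
print(resolve_rubric(custom_trace, config))  # uses the per-trace override
```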

The AnthropicJudge enables prompt caching on the system block, so judging N traces costs roughly 1 + 0.1 × (N − 1) times the system-prompt tokens of a single uncached call, rather than N times. See tracecheck/judges/ for the protocol and the FakeJudge used in tests.
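As a quick worked example of that estimate, the sketch below evaluates the formula for a few batch sizes. The 0.1 factor is taken from the text above; the function name is hypothetical, and the calculation ignores any cache-write surcharge and cache expiry, which depend on the provider's pricing.

```python
def relative_system_prompt_cost(n_traces: int, cache_read_factor: float = 0.1) -> float:
    """Cost in units of one uncached system prompt.

    The first call pays full price; each later call pays the cached rate.
    """
    if n_traces <= 0:
        return 0.0
    return 1 + cache_read_factor * (n_traces - 1)


for n in (1, 10, 100):
    print(n, round(relative_system_prompt_cost(n), 2))
```

For 100 traces this works out to about 10.9 system prompts' worth of input tokens, compared with 100 without caching.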

Trace format

A trace is a JSON object with an ordered list of steps:

{
  "trace_id": "support_001",
  "agent_name": "support_agent",
  "user_input": "I got the wrong color sweater, refund please",
  "expected_tools": ["get_order", "verify_item_mismatch", "issue_refund"],
  "steps": [
    {"type": "tool_call", "name": "get_order",
     "input": {"user_id": "u_123"}, "output": {"order_id": "o_99"}},
    {"type": "tool_call", "name": "verify_item_mismatch",
     "input": {"order_id": "o_99"}, "output": {"mismatch": true}},
    {"type": "tool_call", "name": "issue_refund",
     "input": {"order_id": "o_99"}, "output": {"status": "ok"}},
    {"type": "llm_call", "output": "Refunded $49.99 to your card."}
  ]
}

Step types: llm_call, tool_call, retry, error. Each step may carry latency_ms, tokens, timestamp, and an error message. See tracecheck/schema.py for the full Pydantic spec.
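For orientation, a step could be modelled roughly as below. This sketch uses a stdlib dataclass rather than Pydantic so it runs without extra dependencies; the field names come from the text above, but the field types and the validation rule are assumptions, and tracecheck/schema.py remains the authoritative spec.

```python
from dataclasses import dataclass
from typing import Any, Optional

STEP_TYPES = {"llm_call", "tool_call", "retry", "error"}


@dataclass
class Step:
    type: str
    name: Optional[str] = None          # tool name, when type is "tool_call"
    input: Optional[dict] = None
    output: Any = None
    latency_ms: Optional[float] = None  # assumed numeric
    tokens: Optional[int] = None        # assumed integer count
    timestamp: Optional[str] = None     # assumed ISO-8601 string
    error: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject step types outside the four listed above.
        if self.type not in STEP_TYPES:
            raise ValueError(f"unknown step type: {self.type!r}")


step = Step(**{"type": "tool_call", "name": "get_order",
               "input": {"user_id": "u_123"}, "output": {"order_id": "o_99"}})
```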

A .jsonl file is one trace per line. A .json file may be a single trace or an array.
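The two file layouts can be read with a few lines of stdlib code. This is a sketch of the rule just described, not the library's `load_traces`; `read_traces` is a hypothetical name and it returns raw dicts without schema validation.

```python
import json
from pathlib import Path


def read_traces(path: str) -> list[dict]:
    """Return raw trace dicts from a .jsonl or .json file."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".jsonl"):
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # A .json file may hold a single trace or an array of traces.
    return data if isinstance(data, list) else [data]
```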

Roadmap

Trace ingestion (JSON, JSONL)
Tool call accuracy evaluator
Step efficiency evaluator
Failure mode detection
Context drift evaluator
Output quality evaluator
Static HTML report (--output html)
OpenTelemetry span ingest
Pydantic AI native integration
LangGraph trace adapter
PyPI release

Examples

See examples/:

basic_usage.py — programmatic API
sample_traces.jsonl — three traces (pass, fail, edge case)
evals.yaml — minimal deterministic config (no LLM key needed)
evals_with_judge.yaml — full config with the LLM-as-judge evaluators enabled

Architecture

traces.jsonl ──► ingest ──► Trace[] ──► runner ──► [Evaluator.evaluate(trace)] ──► report
                                            ▲
                                         evals.yaml

The library is intentionally small: a Pydantic schema, a loader, a registry of evaluators, and a runner that walks every trace through every configured evaluator. CLI and library share the same code path.
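The registry-plus-runner shape described above can be illustrated in a few lines. This is a simplified sketch, not the package's code: `REGISTRY`, `register`, and `run` are hypothetical names, and evaluators are modelled as functions returning a `(passed, message)` pair instead of objects with an `evaluate` method.

```python
from typing import Callable

Evaluator = Callable[[dict], tuple[bool, str]]
REGISTRY: dict[str, Evaluator] = {}


def register(name: str) -> Callable[[Evaluator], Evaluator]:
    """Decorator that adds an evaluator function to the registry under `name`."""
    def decorator(fn: Evaluator) -> Evaluator:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("tool_accuracy")
def tool_accuracy(trace: dict) -> tuple[bool, str]:
    """Compare the called tool names with the trace's expected_tools, in order."""
    called = [s["name"] for s in trace["steps"] if s["type"] == "tool_call"]
    ok = called == trace["expected_tools"]
    message = "matches expected" if ok else f"got {called}"
    return ok, message


def run(traces: list[dict], configured: list[str]) -> dict[str, dict[str, tuple[bool, str]]]:
    """Walk every trace through every configured evaluator, keyed by trace_id."""
    return {
        t["trace_id"]: {name: REGISTRY[name](t) for name in configured}
        for t in traces
    }


trace = {
    "trace_id": "support_001",
    "expected_tools": ["get_order"],
    "steps": [{"type": "tool_call", "name": "get_order"},
              {"type": "llm_call", "output": "Done."}],
}
reports = run([trace], ["tool_accuracy"])
```

Keeping evaluators behind a name-keyed registry is what lets the YAML `evaluators:` list select them by string.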

License

MIT

Release history

0.7.0: May 11, 2026
0.5.0: May 11, 2026
0.4.0 (this version): May 9, 2026

File details

Details for the file tracecheck-0.4.0.tar.gz.

File metadata

Download URL: tracecheck-0.4.0.tar.gz
Upload date: May 9, 2026
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracecheck-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`d600b418845246d5c0488f0a22cf2df21d680ee2eb0bc12f950ba79fa8579d25`
MD5	`9367c937d9efdafa47d289209e0d70f4`
BLAKE2b-256	`415571db11809acd4e14209a575c3dcfabad9fc5ac810c69054ba15a565cbe7c`

Provenance

The following attestation bundles were made for tracecheck-0.4.0.tar.gz:

Publisher: release.yml on Mohdtalibakhtar/tracecheck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracecheck-0.4.0.tar.gz
- Subject digest: d600b418845246d5c0488f0a22cf2df21d680ee2eb0bc12f950ba79fa8579d25
- Sigstore transparency entry: 1485618968
- Sigstore integration time: May 9, 2026
Source repository:
- Permalink: Mohdtalibakhtar/tracecheck@2edc3d1ff7537635712b5aa9d2a3560363ea23cb
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/Mohdtalibakhtar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2edc3d1ff7537635712b5aa9d2a3560363ea23cb
- Trigger Event: push

File details

Details for the file tracecheck-0.4.0-py3-none-any.whl.

File metadata

Download URL: tracecheck-0.4.0-py3-none-any.whl
Upload date: May 9, 2026
Size: 24.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracecheck-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`483ba765fd7ffef96c57df04c8f18a114a3d613c57e7e7fb53d089c0eb699b69`
MD5	`0180d048c17508d90d91afd51dc2551a`
BLAKE2b-256	`2ee2b26147d86b4bd64b3e2803c9f860714b37b8ef622559d774b61e5363d552`

Provenance

The following attestation bundles were made for tracecheck-0.4.0-py3-none-any.whl:

Publisher: release.yml on Mohdtalibakhtar/tracecheck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracecheck-0.4.0-py3-none-any.whl
- Subject digest: 483ba765fd7ffef96c57df04c8f18a114a3d613c57e7e7fb53d089c0eb699b69
- Sigstore transparency entry: 1485618973
- Sigstore integration time: May 9, 2026
Source repository:
- Permalink: Mohdtalibakhtar/tracecheck@2edc3d1ff7537635712b5aa9d2a3560363ea23cb
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/Mohdtalibakhtar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2edc3d1ff7537635712b5aa9d2a3560363ea23cb
- Trigger Event: push
