tracewright

Replay-driven eval for f3dx and pydantic-ai traces. Take a JSONL trace, swap the model, get a per-case diff.

Trace-replay adapter for pydantic-evals. Take a JSONL trace, get a Dataset you can run any pydantic-evals evaluator against (LLMJudge, EqualsExpected, custom embedding-cosine — pydantic-evals owns the eval shape). The artifact your runtime already emits becomes the regression suite.

pip install tracewright
pip install "tracewright[pydantic-evals]"  # extras group needed for the adapter below

import asyncio
from pydantic_evals.evaluators import EqualsExpected, LLMJudge
from tracewright import to_pydantic_evals_dataset

dataset = to_pydantic_evals_dataset(
    "traces.jsonl",
    name="prod-regression-cw43",
    evaluators=(EqualsExpected(), LLMJudge(rubric="answer is factually correct")),
)

async def my_candidate(prompt: str) -> str:
    # run_my_agent is a placeholder for your own agent call
    return await run_my_agent(prompt)

report = asyncio.run(dataset.evaluate(my_candidate))
report.print()  # markdown summary, per-case pass/fail, scorer rollups

The same path reads pydantic-ai's native logfire-shaped JSONL spans:

dataset = to_pydantic_evals_dataset(
    "logfire_export.jsonl",
    pydantic_ai_logfire=True,
    evaluators=(LLMJudge(rubric="answer is factually correct"),),
)

Lightweight in-process scorers (no pydantic-evals dep) still ship for users who prefer them:

from tracewright import ReplayEngine, parse_jsonl, ExactMatchScorer, PydanticEquivalenceScorer
from tracewright._parse import filter_replayable

rows = filter_replayable(parse_jsonl("traces.jsonl"))
engine = ReplayEngine(candidate_fn=my_candidate, candidate_model="claude-haiku-4")
for result in engine.replay_many(rows):
    if not result.all_passed:
        print(f"divergence: {result.case.prompt[:60]} -> {result.candidate_output[:60]}")

The same replay runs from the command line:

tracewright replay traces.jsonl --candidate myapp.replay:my_candidate \
    --candidate-model claude-haiku-4 \
    --report html=report.html \
    --budget "pass_rate=>=1.0,latency_p95=+10%" -v

Drop the resulting report.html into a CI artifact upload step. It is a single self-contained file (no JS, no external CSS) that renders a side-by-side diff per case, scorer rollups, and baseline-vs-candidate p50/p95 latency. --budget takes comma-separated metric=<op><value> clauses; supported metrics are latency_p50, latency_p95, latency_mean, score, and pass_rate. The +N% / -N% operators compare candidate against baseline; >=, <=, and == compare against an absolute threshold. The CLI exits 2 on any violation.
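
To make that grammar concrete, here is a minimal sketch of a parser for the spec format. This is not tracewright's _budget.py, just an illustration of the metric=<op><value> shape; parse_budget is a hypothetical name:

import re

def parse_budget(spec: str) -> list[tuple[str, str, float]]:
    # "pass_rate=>=1.0,latency_p95=+10%" -> [("pass_rate", ">=", 1.0), ("latency_p95", "+%", 10.0)]
    constraints = []
    for clause in spec.split(","):
        metric, _, rest = clause.partition("=")  # split at the first "=" only
        m = re.fullmatch(r"(>=|<=|==|\+|-)([\d.]+)(%?)", rest)
        if m is None:
            raise ValueError(f"bad budget clause: {clause!r}")
        op, value, pct = m.groups()
        constraints.append((metric, op + pct, float(value)))
    return constraints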

Why

Agent evals are run-once snapshots today (Liu et al. 2023, "AgentBench", arXiv:2308.03688; Jimenez et al. 2024, "SWE-bench", arXiv:2310.06770). There's no standard way to hold the input distribution fixed and swap the model. Trivedi et al. 2024 ("Toolformer revisited" arXiv:2403.04746) names trace-replay as a missing primitive.

pydantic-evals already owns the dataset-evaluator-report shape — built-in LLMJudge, EqualsExpected, Contains, IsInstance, MaxDuration, Python, an Evaluator base class for custom scorers, async retries, markdown reports. Tracewright deliberately does not reimplement any of that. It's a 50-line bridge: JSONL traces in, pydantic_evals.Dataset out. f3dx is the only Rust runtime emitting logfire-shaped JSONL natively, and pydantic-ai with logfire enabled writes the same gen_ai.* span shape — both feed cleanly through this adapter.
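
Conceptually the whole bridge is no more than this (a sketch, not the actual source; Case and Dataset are pydantic-evals' public types, while trace_id is an assumed metadata field):

import json

from pydantic_evals import Case, Dataset

def jsonl_to_dataset(path: str, evaluators=()) -> Dataset:
    cases = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if not row.get("prompt") or not row.get("output"):
                continue  # not replayable without both fields
            cases.append(Case(
                name=row.get("trace_id"),          # assumed metadata field
                inputs=row["prompt"],
                expected_output=row["output"],     # baseline output becomes the target
                metadata={"system_prompt": row.get("system_prompt")},
            ))
    return Dataset(cases=cases, evaluators=list(evaluators))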

Architecture

tracewright/
  src/tracewright/
    _models.py        TraceRow, Message, ReplayCase, ReplayResult, ScoreResult
    _parse.py         parse_jsonl + filter_replayable for f3dx-shaped rows
    _pydantic_ai.py   parse_pydantic_ai_jsonl for OTel logfire spans
    _score.py         Scorer Protocol + ExactMatchScorer + PydanticEquivalenceScorer
    _replay.py        ReplayEngine (parse -> case -> candidate_fn -> score)
    _pydantic_evals.py to_pydantic_evals_dataset adapter (the canonical path)
    _report.py        Report aggregation + LatencyStats + self-contained HTML render
    _budget.py        --budget parser + enforcer (latency_p50/p95/mean, score, pass_rate)
    cli.py            tracewright replay <trace.jsonl> --candidate <import:fn>
                      [--report html=PATH | json=PATH] [--budget SPEC]
  tests/
    fixtures/enriched_trace.jsonl       4-row f3dx-shaped fixture
    fixtures/pydantic_ai_spans.jsonl    3-row pydantic-ai/logfire-shaped fixture
    test_replay.py                       engine + parser core
    test_pydantic_ai_adapter.py          logfire span parser tests
    test_pydantic_equivalence.py         schema-validate-then-compare scorer tests

Required trace shape

The replay engine needs three fields beyond the metadata schema: prompt (str), system_prompt (str | None), output (str); a sample row follows the list below. Two sources work out of the box today:

  • f3dx: f3dx.configure_traces(path, capture_messages=True) opts in to writing the enriched fields. Off by default for PII-safety. parse_jsonl(path) reads them.
  • pydantic-ai with logfire: emits OTel spans with gen_ai.input.messages + gen_ai.output.messages attributes (JSON-encoded list[ChatMessage]). parse_pydantic_ai_jsonl(path) flattens those into TraceRow records the engine consumes the same way.
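
For concreteness, an enriched f3dx-shaped row might look like the line below. Only prompt, system_prompt, and output matter to the engine; every other key here is illustrative metadata, not a documented schema:

{"trace_id": "t-0042", "model": "claude-sonnet-4", "prompt": "Summarize the incident report.", "system_prompt": "You are a concise SRE assistant.", "output": "Three services degraded for 12 minutes.", "latency_ms": 412}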

Layout

src/tracewright/      core library
tests/                pytest suite + fixtures
examples/             reference candidate fns for CLI smoke + docs
pyproject.toml        hatch build, optional [pydantic-ai] extra
.github/workflows/ci.yml   ubuntu/macos/windows + py3.10/3.12 + ruff + mypy + pytest + CLI smoke

What's not here yet

  • Tool-call divergence reporting
  • Direct ingestion of pydantic-ai Agent.iter() runs (today: post-run logfire JSONL only)

For embedding-cosine, LLM-judge, semantic-similarity, or any custom scoring: write a pydantic_evals.evaluators.Evaluator subclass and pass it via the evaluators= kwarg. Pydantic-evals owns that surface; tracewright deliberately doesn't compete with it.
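
A minimal custom evaluator in that style, for reference. The subclass pattern and the EvaluatorContext fields are pydantic-evals' documented API; the scorer itself (PrefixMatch) is a toy:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class PrefixMatch(Evaluator):
    # toy scorer: 1.0 when the candidate output starts like the baseline output
    n_chars: int = 20

    def evaluate(self, ctx: EvaluatorContext) -> float:
        expected = str(ctx.expected_output or "")[: self.n_chars]
        return 1.0 if str(ctx.output).startswith(expected) else 0.0

dataset = to_pydantic_evals_dataset("traces.jsonl", evaluators=(PrefixMatch(),))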

License

MIT.
