tracewright

Replay-driven eval for f3dx and pydantic-ai traces. Take a JSONL trace, swap the model, get a per-case diff.

Trace-replay adapter for pydantic-evals. Take a JSONL trace, get a Dataset you can run any pydantic-evals evaluator against (LLMJudge, EqualsExpected, custom embedding-cosine — pydantic-evals owns the eval shape). The artifact your runtime already emits becomes the regression suite.

pip install tracewright
pip install "tracewright[pydantic-evals]"  # extras group needed for the adapter below

import asyncio
from pydantic_evals.evaluators import EqualsExpected, LLMJudge
from tracewright import to_pydantic_evals_dataset

dataset = to_pydantic_evals_dataset(
    "traces.jsonl",
    name="prod-regression-cw43",
    evaluators=(EqualsExpected(), LLMJudge(rubric="answer is factually correct")),
)

async def my_candidate(prompt: str) -> str:
    # run_my_agent is a placeholder for your own agent call
    return await run_my_agent(prompt)

report = asyncio.run(dataset.evaluate(my_candidate))
report.print()  # markdown summary, per-case pass/fail, scorer rollups

The same path reads pydantic-ai's native logfire-shaped JSONL spans:

dataset = to_pydantic_evals_dataset(
    "logfire_export.jsonl",
    pydantic_ai_logfire=True,
    evaluators=(LLMJudge(rubric="answer is factually correct"),),
)

Lightweight in-process scorers (no pydantic-evals dep) still ship for users who prefer them:

from tracewright import ReplayEngine, parse_jsonl, ExactMatchScorer, PydanticEquivalenceScorer
from tracewright._parse import filter_replayable

rows = filter_replayable(parse_jsonl("traces.jsonl"))
engine = ReplayEngine(candidate_fn=my_candidate, candidate_model="claude-haiku-4")
for result in engine.replay_many(rows):
    if not result.all_passed:
        print(f"divergence: {result.case.prompt[:60]} -> {result.candidate_output[:60]}")

The same replay runs from the command line:

tracewright replay traces.jsonl --candidate myapp.replay:my_candidate \
    --candidate-model claude-haiku-4 \
    --report html=report.html \
    --budget "pass_rate=>=1.0,latency_p95=+10%" -v

Drop the resulting report.html into a CI artifact upload step. It is a single self-contained file (no JS, no external CSS) that renders a side-by-side diff per case, scorer rollups, and baseline-vs-candidate p50/p95 latency. --budget takes comma-separated metric=<op><value> clauses; supported metrics are latency_p50, latency_p95, latency_mean, score, and pass_rate. The +N% / -N% operators compare candidate against baseline; >=, <=, and == compare against an absolute threshold. The CLI exits 2 on any violation.
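
To make that grammar concrete, here is a minimal sketch of a parser for the spec format. This is not tracewright's _budget.py, just an illustration of the metric=<op><value> shape; parse_budget is a hypothetical name:

import re

def parse_budget(spec: str) -> list[tuple[str, str, float]]:
    # "pass_rate=>=1.0,latency_p95=+10%" -> [("pass_rate", ">=", 1.0), ("latency_p95", "+%", 10.0)]
    constraints = []
    for clause in spec.split(","):
        metric, _, rest = clause.partition("=")  # split at the first "=" only
        m = re.fullmatch(r"(>=|<=|==|\+|-)([\d.]+)(%?)", rest)
        if m is None:
            raise ValueError(f"bad budget clause: {clause!r}")
        op, value, pct = m.groups()
        constraints.append((metric, op + pct, float(value)))
    return constraints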

Why

Agent evals are run-once snapshots today (Liu et al. 2023, "AgentBench", arXiv:2308.03688; Jimenez et al. 2024, "SWE-bench", arXiv:2310.06770). There's no standard way to hold the input distribution fixed and swap the model. Trivedi et al. 2024 ("Toolformer revisited" arXiv:2403.04746) names trace-replay as a missing primitive.

pydantic-evals already owns the dataset-evaluator-report shape — built-in LLMJudge, EqualsExpected, Contains, IsInstance, MaxDuration, Python, an Evaluator base class for custom scorers, async retries, markdown reports. Tracewright deliberately does not reimplement any of that. It's a 50-line bridge: JSONL traces in, pydantic_evals.Dataset out. f3dx is the only Rust runtime emitting logfire-shaped JSONL natively, and pydantic-ai with logfire enabled writes the same gen_ai.* span shape — both feed cleanly through this adapter.
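
Conceptually the whole bridge is no more than this (a sketch, not the actual source; Case and Dataset are pydantic-evals' public types, while trace_id is an assumed metadata field):

import json

from pydantic_evals import Case, Dataset

def jsonl_to_dataset(path: str, evaluators=()) -> Dataset:
    cases = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if not row.get("prompt") or not row.get("output"):
                continue  # not replayable without both fields
            cases.append(Case(
                name=row.get("trace_id"),          # assumed metadata field
                inputs=row["prompt"],
                expected_output=row["output"],     # baseline output becomes the target
                metadata={"system_prompt": row.get("system_prompt")},
            ))
    return Dataset(cases=cases, evaluators=list(evaluators))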

Architecture

tracewright/
  src/tracewright/
    _models.py        TraceRow, Message, ReplayCase, ReplayResult, ScoreResult
    _parse.py         parse_jsonl + filter_replayable for f3dx-shaped rows
    _pydantic_ai.py   parse_pydantic_ai_jsonl for OTel logfire spans
    _score.py         Scorer Protocol + ExactMatchScorer + PydanticEquivalenceScorer
    _replay.py        ReplayEngine (parse -> case -> candidate_fn -> score)
    _pydantic_evals.py to_pydantic_evals_dataset adapter (the canonical path)
    _report.py        Report aggregation + LatencyStats + self-contained HTML render
    _budget.py        --budget parser + enforcer (latency_p50/p95/mean, score, pass_rate)
    cli.py            tracewright replay <trace.jsonl> --candidate <import:fn>
                      [--report html=PATH | json=PATH] [--budget SPEC]
  tests/
    fixtures/enriched_trace.jsonl       4-row f3dx-shaped fixture
    fixtures/pydantic_ai_spans.jsonl    3-row pydantic-ai/logfire-shaped fixture
    test_replay.py                       engine + parser core
    test_pydantic_ai_adapter.py          logfire span parser tests
    test_pydantic_equivalence.py         schema-validate-then-compare scorer tests

Required trace shape

The replay engine needs three fields beyond the metadata schema: prompt (str), system_prompt (str | None), output (str); a sample row follows the list below. Two sources work out of the box today:

  • f3dx: f3dx.configure_traces(path, capture_messages=True) opts in to writing the enriched fields. Off by default for PII-safety. parse_jsonl(path) reads them.
  • pydantic-ai with logfire: emits OTel spans with gen_ai.input.messages + gen_ai.output.messages attributes (JSON-encoded list[ChatMessage]). parse_pydantic_ai_jsonl(path) flattens those into TraceRow records the engine consumes the same way.
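
For concreteness, an enriched f3dx-shaped row might look like the line below. Only prompt, system_prompt, and output matter to the engine; every other key here is illustrative metadata, not a documented schema:

{"trace_id": "t-0042", "model": "claude-sonnet-4", "prompt": "Summarize the incident report.", "system_prompt": "You are a concise SRE assistant.", "output": "Three services degraded for 12 minutes.", "latency_ms": 412}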

Layout

src/tracewright/      core library
tests/                pytest suite + fixtures
examples/             reference candidate fns for CLI smoke + docs
pyproject.toml        hatch build, optional [pydantic-ai] extra
.github/workflows/ci.yml   ubuntu/macos/windows + py3.10/3.12 + ruff + mypy + pytest + CLI smoke

What's not here yet

  • Tool-call divergence reporting
  • Direct ingestion of pydantic-ai Agent.iter() runs (today: post-run logfire JSONL only)

For embedding-cosine, LLM-judge, semantic-similarity, or any custom scoring: write a pydantic_evals.evaluators.Evaluator subclass and pass it via the evaluators= kwarg. Pydantic-evals owns that surface; tracewright deliberately doesn't compete with it.
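
A minimal custom evaluator in that style, for reference. The subclass pattern and the EvaluatorContext fields are pydantic-evals' documented API; the scorer itself (PrefixMatch) is a toy:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class PrefixMatch(Evaluator):
    # toy scorer: 1.0 when the candidate output starts like the baseline output
    n_chars: int = 20

    def evaluate(self, ctx: EvaluatorContext) -> float:
        expected = str(ctx.expected_output or "")[: self.n_chars]
        return 1.0 if str(ctx.output).startswith(expected) else 0.0

dataset = to_pydantic_evals_dataset("traces.jsonl", evaluators=(PrefixMatch(),))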

License

MIT.
