tracewright

Replay-driven eval for f3dx and pydantic-ai traces. Take a JSONL trace, swap the model, get a per-case diff.

Trace-replay adapter for pydantic-evals. Take a JSONL trace, get a Dataset you can run any pydantic-evals evaluator against (LLMJudge, EqualsExpected, or a custom embedding-cosine scorer; pydantic-evals owns the eval shape). The artifact your runtime already emits becomes the regression suite.

pip install tracewright
# pip install tracewright[pydantic-evals]

import asyncio
from pydantic_evals.evaluators import EqualsExpected, LLMJudge
from tracewright import to_pydantic_evals_dataset

dataset = to_pydantic_evals_dataset(
    "traces.jsonl",
    name="prod-regression-cw43",
    evaluators=(EqualsExpected(), LLMJudge(rubric="answer is factually correct")),
)

async def my_candidate(prompt: str) -> str:
    return await run_my_agent(prompt)

report = asyncio.run(dataset.evaluate(my_candidate))
report.print()  # markdown summary, per-case pass/fail, scorer rollups

The same path reads pydantic-ai's native logfire-shaped JSONL spans:

dataset = to_pydantic_evals_dataset(
    "logfire_export.jsonl",
    pydantic_ai_logfire=True,
    evaluators=(LLMJudge(rubric="answer is factually correct"),),
)

Lightweight in-process scorers (no pydantic-evals dep) still ship for users who prefer them:

from tracewright import ReplayEngine, parse_jsonl, ExactMatchScorer, PydanticEquivalenceScorer
from tracewright._parse import filter_replayable

rows = filter_replayable(parse_jsonl("traces.jsonl"))
engine = ReplayEngine(candidate_fn=my_candidate, candidate_model="claude-haiku-4")
for result in engine.replay_many(rows):
    if not result.all_passed:
        print(f"divergence: {result.case.prompt[:60]} -> {result.candidate_output[:60]}")

The CLI runs the same replay from the shell:

tracewright replay traces.jsonl --candidate myapp.replay:my_candidate \
    --candidate-model claude-haiku-4 \
    --report html=report.html \
    --budget "pass_rate=>=1.0,latency_p95=+10%" -v

Drop the resulting report.html into a CI artifact upload step: it is a single self-contained file (no JS, no external CSS) that renders a side-by-side diff per case, scorer rollups, and baseline-vs-candidate p50/p95 latency. --budget takes a comma-separated list of metric=op-value pairs; supported metrics are latency_p50, latency_p95, latency_mean, score, and pass_rate. The +% / -% operators compare candidate against baseline; >=, <=, and == compare against an absolute threshold. The CLI exits with status 2 on any violation.
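
The budget grammar is small enough to sketch. Below is a minimal parser and checker for specs like the one above; the helper names are illustrative, not tracewright's internals, and "+10%" is interpreted as "candidate at most 10% above baseline":

```python
# Minimal sketch of the --budget grammar described above: comma-separated
# metric=op-value pairs. Names and semantics here are an illustration,
# not tracewright's actual implementation.

def parse_budget(spec: str) -> dict[str, str]:
    """Split 'pass_rate=>=1.0,latency_p95=+10%' into {metric: op_value}."""
    out = {}
    for part in spec.split(","):
        metric, op_value = part.split("=", 1)  # split on the first '=' only
        out[metric.strip()] = op_value.strip()
    return out

def check(op_value: str, candidate: float, baseline: float) -> bool:
    """True if the candidate value satisfies one budget rule."""
    if op_value.endswith("%"):
        # +N% / -N%: relative to baseline ('+10%' -> at most 10% worse)
        pct = float(op_value[:-1]) / 100.0
        return candidate <= baseline * (1.0 + pct)
    for op in (">=", "<=", "=="):
        if op_value.startswith(op):
            threshold = float(op_value[len(op):])
            return {">=": candidate >= threshold,
                    "<=": candidate <= threshold,
                    "==": candidate == threshold}[op]
    raise ValueError(f"unsupported budget op: {op_value!r}")
```

A CLI wrapper would collect the failed checks and sys.exit(2) if any rule is violated, matching the behavior described above.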

Why

Agent evals are run-once snapshots today (Liu et al. 2023, "AgentBench", arXiv:2308.03688; Jimenez et al. 2024, "SWE-bench", arXiv:2310.06770). There's no standard way to hold the input distribution fixed and swap the model. Trivedi et al. 2024 ("Toolformer revisited", arXiv:2403.04746) names trace-replay as a missing primitive.

pydantic-evals already owns the eval-evaluator-report shape: built-in LLMJudge, EqualsExpected, Contains, IsInstance, MaxDuration, Python, an Evaluator Protocol for custom scorers, async retries, and markdown reports. Tracewright deliberately reimplements none of that. It's a 50-line bridge: JSONL traces in, pydantic_evals.Dataset out. f3dx is the only Rust runtime emitting Logfire-shaped JSONL natively, and pydantic-ai with logfire enabled writes the same gen_ai.* span shape; both feed cleanly through this adapter.

Architecture

tracewright/
  src/tracewright/
    _models.py        TraceRow, Message, ReplayCase, ReplayResult, ScoreResult
    _parse.py         parse_jsonl + filter_replayable for f3dx-shaped rows
    _pydantic_ai.py   parse_pydantic_ai_jsonl for OTel logfire spans
    _score.py         Scorer Protocol + ExactMatchScorer + PydanticEquivalenceScorer
    _replay.py        ReplayEngine (parse -> case -> candidate_fn -> score)
    _pydantic_evals.py to_pydantic_evals_dataset adapter (the canonical path)
    _report.py        Report aggregation + LatencyStats + self-contained HTML render
    _budget.py        --budget parser + enforcer (latency_p50/p95/mean, score, pass_rate)
    cli.py            tracewright replay <trace.jsonl> --candidate <import:fn>
                      [--report html=PATH | json=PATH] [--budget SPEC]
  tests/
    fixtures/enriched_trace.jsonl       4-row f3dx-shaped fixture
    fixtures/pydantic_ai_spans.jsonl    3-row pydantic-ai/logfire-shaped fixture
    test_replay.py                       engine + parser core
    test_pydantic_ai_adapter.py          logfire span parser tests
    test_pydantic_equivalence.py         schema-validate-then-compare scorer tests
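
The _replay.py flow (parse -> case -> candidate_fn -> score) compresses to a few lines. A behavioral sketch, not the actual engine (the real one also tracks latency for the p50/p95 report and accepts async candidates):

```python
# Behavioral sketch of the replay loop: run the candidate on each recorded
# prompt and score it against the recorded output. Illustrative only.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ReplayOutcome:
    prompt: str
    expected: str
    candidate_output: str
    passed: bool

def replay_many(rows: list[dict], candidate_fn: Callable[[str], str],
                scorer: Callable[[str, str], bool] = str.__eq__):
    for row in rows:
        got = candidate_fn(row["prompt"])  # the swapped-in model under test
        yield ReplayOutcome(row["prompt"], row["output"], got,
                            bool(scorer(got, row["output"])))

results = list(replay_many(
    [{"prompt": "ping", "output": "pong"}],
    candidate_fn=lambda p: "pong",
))
```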

Required trace shape

The replay engine needs three fields beyond the metadata schema: prompt (str), system_prompt (str | None), output (str). Two sources work today out of the box:

  • f3dx: f3dx.configure_traces(path, capture_messages=True) opts in to writing the enriched fields. Off by default for PII-safety. parse_jsonl(path) reads them.
  • pydantic-ai with logfire: emits OTel spans with gen_ai.input.messages + gen_ai.output.messages attributes (JSON-encoded list[ChatMessage]). parse_pydantic_ai_jsonl(path) flattens those into TraceRow records the engine consumes the same way.
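
For orientation, an f3dx-shaped enriched row might look like the first line below (fields beyond the three required ones are illustrative); a filter in the spirit of filter_replayable keeps only rows carrying what the engine needs:

```python
import io
import json

# One trace row per JSONL line; only prompt/system_prompt/output are
# required by the engine. The extra "model" field is illustrative.
sample = io.StringIO(
    '{"prompt": "What is 2+2?", "system_prompt": null, "output": "4", "model": "gpt-4o"}\n'
    '{"model": "gpt-4o"}\n'  # enriched fields absent (capture off) -> skipped
)

def iter_replayable(fp):
    """Yield parsed rows that carry the fields the engine needs."""
    for line in fp:
        row = json.loads(line)
        if "prompt" in row and "output" in row:  # system_prompt may be null
            yield row

rows = list(iter_replayable(sample))
```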

Layout

src/tracewright/      core library
tests/                pytest suite + fixtures
examples/             reference candidate fns for CLI smoke + docs
pyproject.toml        hatch build, optional [pydantic-ai] extra
.github/workflows/ci.yml   ubuntu/macos/windows + py3.10/3.12 + ruff + mypy + pytest + CLI smoke

What's not here yet

  • Direct ingestion of pydantic-ai Agent.iter() runs (today: post-run logfire JSONL only)

For embedding-cosine, LLM-judge, semantic-similarity, or any custom scoring: write a pydantic_evals.evaluators.Evaluator subclass and pass it via the evaluators= kwarg. Pydantic-evals owns that surface; tracewright deliberately doesn't compete with it.
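
As a sketch of what such a custom evaluator's core might compute, here is plain cosine similarity over pre-computed embedding vectors. Wiring it into an Evaluator subclass, and choosing an embedding model, is left to the reader; the threshold and function names are hypothetical:

```python
import math

# Core of a hypothetical embedding-cosine scorer: similarity between the
# candidate's and the baseline's embedding vectors. The embedding lookup
# itself (model choice, API call) is out of scope here.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def passes(candidate_vec: list[float], expected_vec: list[float],
           threshold: float = 0.9) -> bool:
    """Score one case: 'similar enough' counts as a pass."""
    return cosine(candidate_vec, expected_vec) >= threshold
```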

Sibling projects

The f3d1 ecosystem:

  • f3dx - Rust runtime your Python imports. Drop-in for openai + anthropic SDKs with native SSE streaming, agent loop with concurrent tool dispatch, OTel emission. pip install f3dx.
  • f3dx-cache - Content-addressable LLM response cache + replay. redb + RFC 8785 JCS + BLAKE3. pip install f3dx-cache.
  • pydantic-cal - Calibration metrics for pydantic-evals: ECE, MCE, ACE, Brier, reliability diagrams, Fisher-Rao geometry kernel. pip install pydantic-cal.
  • f3dx-router - In-process Rust router for LLM providers. Hedged-parallel + 429/5xx hot-swap. pip install f3dx-router.
  • f3dx-bench - Public real-prod-traffic LLM benchmark dashboard. CF Worker + R2 + duckdb-wasm. Live.
  • llmkit - Hosted API gateway with budget enforcement, session tracking, cost dashboards, MCP server. llmkit.sh.
  • keyguard - Security linter for open source projects. Finds and fixes what others only report.

License

MIT.

Project details

Download files

tracewright-0.0.7.tar.gz (source distribution, 24.5 kB)
  SHA256: 828a4c8df35bfbef71ecf359e0b7fb738d3a15f5b99b12cb716372b3e1850a16
  MD5: 25f2f0c77f40a26624a6921032e52272
  BLAKE2b-256: ea79f8a292e582cf2938d8c51f4720bc291833210611c5035ae6286188c8556e

tracewright-0.0.7-py3-none-any.whl (Python 3 wheel, 24.3 kB)
  SHA256: a778170b047097b7e3a17cf0babbcdc61c6bbfb7f861baa7f9898bc952e25dcc
  MD5: b1aa9ee0272e6605f7eba601049a7154
  BLAKE2b-256: 31c61762600a9ddef9c6c5d99e47543ad3f054c607d21b0020381c38380f5766

Both files were uploaded via Trusted Publishing (twine/6.1.0 on CPython/3.13.7); provenance attestation bundles were published by release.yml on smigolsmigol/tracewright.
