tracewright
Replay-driven eval for f3dx and pydantic-ai traces. Take a JSONL trace, swap the model, get a per-case diff.
Trace-replay adapter for pydantic-evals. Take a JSONL trace, get a Dataset you can run any pydantic-evals evaluator against (LLMJudge, EqualsExpected, custom embedding-cosine - pydantic-evals owns the eval shape). The artifact your runtime already emits becomes the regression suite.
pip install tracewright
# pip install tracewright[pydantic-evals]
import asyncio

from pydantic_evals.evaluators import EqualsExpected, LLMJudge
from tracewright import to_pydantic_evals_dataset

dataset = to_pydantic_evals_dataset(
    "traces.jsonl",
    name="prod-regression-cw43",
    evaluators=(EqualsExpected(), LLMJudge(rubric="answer is factually correct")),
)

async def my_candidate(prompt: str) -> str:
    return await run_my_agent(prompt)

report = asyncio.run(dataset.evaluate(my_candidate))
report.print()  # markdown summary, per-case pass/fail, scorer rollups
Same path reads pydantic-ai's native logfire-shaped JSONL spans:
dataset = to_pydantic_evals_dataset(
    "logfire_export.jsonl",
    pydantic_ai_logfire=True,
    evaluators=(LLMJudge(rubric="answer is factually correct"),),
)
Lightweight in-process scorers (no pydantic-evals dep) still ship for users who prefer them:
from tracewright import ReplayEngine, parse_jsonl, ExactMatchScorer, PydanticEquivalenceScorer
from tracewright._parse import filter_replayable

rows = filter_replayable(parse_jsonl("traces.jsonl"))
engine = ReplayEngine(candidate_fn=my_candidate, candidate_model="claude-haiku-4")

for result in engine.replay_many(rows):
    if not result.all_passed:
        print(f"divergence: {result.case.prompt[:60]} -> {result.candidate_output[:60]}")
tracewright replay traces.jsonl --candidate myapp.replay:my_candidate \
--candidate-model claude-haiku-4 \
--report html=report.html \
--budget "pass_rate=>=1.0,latency_p95=+10%" -v
Drop the resulting report.html into a CI artifact upload step. It is a single self-contained file (no JS, no external CSS) that renders a side-by-side diff per case, scorer rollups, and p50/p95 latency baseline-vs-candidate. --budget takes comma-separated metric=<op><value> clauses; supported metrics are latency_p50, latency_p95, latency_mean, score, pass_rate. The +% / -% operators compare candidate against baseline; >=, <=, == compare against an absolute threshold. The CLI exits 2 on any violation.
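The budget grammar is small enough to sketch. The following is a hypothetical parser mirroring the semantics described above (metric, operator, threshold); it is not tracewright's actual _budget.py implementation.

```python
# Hypothetical sketch of the --budget grammar: comma-separated
# metric=<op><value>, where +%/-% are relative to baseline and
# >=/<=/== are absolute. Not tracewright's actual parser.
METRICS = {"latency_p50", "latency_p95", "latency_mean", "score", "pass_rate"}

def parse_budget(spec: str) -> list[tuple[str, str, float]]:
    rules = []
    for clause in spec.split(","):
        metric, _, rhs = clause.strip().partition("=")
        if metric not in METRICS:
            raise ValueError(f"unknown metric: {metric!r}")
        for op in (">=", "<=", "=="):
            if rhs.startswith(op):
                # absolute threshold, e.g. pass_rate=>=1.0
                rules.append((metric, op, float(rhs[len(op):])))
                break
        else:
            if rhs.startswith(("+", "-")) and rhs.endswith("%"):
                # relative-to-baseline threshold, e.g. latency_p95=+10%
                rules.append((metric, rhs[0] + "%", float(rhs[1:-1])))
            else:
                raise ValueError(f"bad clause: {clause!r}")
    return rules

print(parse_budget("pass_rate=>=1.0,latency_p95=+10%"))
# [('pass_rate', '>=', 1.0), ('latency_p95', '+%', 10.0)]
```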
Why
Agent evals are run-once snapshots today (Liu et al. 2024, "AgentBench", arXiv:2308.03688; Jimenez et al. 2024, "SWE-bench", arXiv:2310.06770). There's no standard way to hold the input distribution fixed and swap the model. Trivedi et al. 2024 ("Toolformer revisited", arXiv:2403.04746) names trace-replay as a missing primitive.
pydantic-evals already owns the case-evaluator-report shape: built-in LLMJudge, EqualsExpected, Contains, IsInstance, MaxDuration, and Python evaluators, the Evaluator protocol for custom scorers, async retries, markdown reports. Tracewright deliberately reimplements none of that. It's a 50-line bridge: JSONL traces in, pydantic_evals.Dataset out. f3dx is the only Rust runtime emitting Logfire-shaped JSONL natively, and pydantic-ai with logfire enabled writes the same gen_ai.* span shape; both feed cleanly through this adapter.
Architecture
tracewright/
src/tracewright/
_models.py TraceRow, Message, ReplayCase, ReplayResult, ScoreResult
_parse.py parse_jsonl + filter_replayable for f3dx-shaped rows
_pydantic_ai.py parse_pydantic_ai_jsonl for OTel logfire spans
_score.py Scorer Protocol + ExactMatchScorer + PydanticEquivalenceScorer
_replay.py ReplayEngine (parse -> case -> candidate_fn -> score)
_pydantic_evals.py to_pydantic_evals_dataset adapter (the canonical path)
_report.py Report aggregation + LatencyStats + self-contained HTML render
_budget.py --budget parser + enforcer (latency_p50/p95/mean, score, pass_rate)
cli.py tracewright replay <trace.jsonl> --candidate <import:fn>
[--report html=PATH | json=PATH] [--budget SPEC]
tests/
fixtures/enriched_trace.jsonl 4-row f3dx-shaped fixture
fixtures/pydantic_ai_spans.jsonl 3-row pydantic-ai/logfire-shaped fixture
test_replay.py engine + parser core
test_pydantic_ai_adapter.py logfire span parser tests
test_pydantic_equivalence.py schema-validate-then-compare scorer tests
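The parse -> case -> candidate_fn -> score pipeline that _replay.py implements can be illustrated with a toy version. Everything below is a hypothetical stdlib-only sketch (exact-match scoring only); the real engine lives in tracewright._replay with richer models.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    expected: str

def replay(jsonl: str, candidate_fn: Callable[[str], str]) -> list[tuple[str, bool]]:
    """Toy parse -> case -> candidate -> score loop with exact-match scoring."""
    results = []
    for line in jsonl.splitlines():
        row = json.loads(line)                     # parse one JSONL row
        case = Case(prompt=row["prompt"], expected=row["output"])
        got = candidate_fn(case.prompt)            # run the candidate
        results.append((case.prompt, got == case.expected))  # score
    return results

trace = '{"prompt": "2+2?", "output": "4"}'
print(replay(trace, lambda p: "4"))  # [('2+2?', True)]
```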
Required trace shape
The replay engine needs three fields beyond the metadata schema: prompt (str), system_prompt (str | None), output (str). Two sources work today out of the box:
- f3dx: f3dx.configure_traces(path, capture_messages=True) opts in to writing the enriched fields (off by default for PII safety); parse_jsonl(path) reads them.
- pydantic-ai with logfire: emits OTel spans with gen_ai.input.messages + gen_ai.output.messages attributes (JSON-encoded list[ChatMessage]); parse_pydantic_ai_jsonl(path) flattens those into TraceRow records the engine consumes the same way.
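For orientation, here is roughly what one replayable row could look like in each shape. Only prompt, system_prompt, output, and the gen_ai.* attribute keys come from the text above; every other field name (model, latency_ms, the attributes envelope) is an illustrative guess, not a schema reference.

```python
import json

# An f3dx-shaped row: the three fields the replay engine requires.
# Extra keys (model, latency_ms) are illustrative, not a schema reference.
f3dx_row = {
    "prompt": "Summarize RFC 8785 in one sentence.",
    "system_prompt": "You are a terse assistant.",
    "output": "RFC 8785 defines the JSON Canonicalization Scheme (JCS).",
    "model": "claude-sonnet-4",
    "latency_ms": 812,
}

# A logfire-shaped span: gen_ai.* attributes carry JSON-encoded message lists.
logfire_span = {
    "attributes": {
        "gen_ai.input.messages": json.dumps(
            [{"role": "user", "content": "Summarize RFC 8785 in one sentence."}]
        ),
        "gen_ai.output.messages": json.dumps(
            [{"role": "assistant", "content": "RFC 8785 defines JCS."}]
        ),
    }
}

line = json.dumps(f3dx_row)  # one JSONL line per case
assert all(k in json.loads(line) for k in ("prompt", "system_prompt", "output"))
```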
Layout
src/tracewright/ core library
tests/ pytest suite + fixtures
examples/ reference candidate fns for CLI smoke + docs
pyproject.toml hatch build, optional [pydantic-ai] extra
.github/workflows/ci.yml ubuntu/macos/windows + py3.10/3.12 + ruff + mypy + pytest + CLI smoke
What's not here yet
- Direct ingestion of pydantic-ai Agent.iter() runs (today: post-run logfire JSONL only)
For embedding-cosine, LLM-judge, semantic-similarity, or any custom scoring: write a pydantic_evals.evaluators.Evaluator subclass and pass it via the evaluators= kwarg. Pydantic-evals owns that surface; tracewright deliberately doesn't compete with it.
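As a starting point for such a custom scorer, here is a stdlib-only sketch of an embedding-cosine scoring core, using bag-of-words counts as a stand-in for a real embedding model. The Evaluator wrapper is shown only as a comment because its exact signature should be checked against the pydantic-evals docs.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for real embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0

# Wiring this into pydantic-evals would look roughly like (verify the
# Evaluator / EvaluatorContext signatures against the pydantic-evals docs
# before relying on this):
#
#   @dataclass
#   class CosineSimilarity(Evaluator):
#       threshold: float = 0.8
#       def evaluate(self, ctx: EvaluatorContext) -> bool:
#           return cosine(ctx.output, ctx.expected_output) >= self.threshold
#
# then pass CosineSimilarity() via the evaluators= kwarg.

assert abs(cosine("the cat sat", "the cat sat") - 1.0) < 1e-9
assert cosine("alpha beta", "gamma delta") == 0.0
```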
Sibling projects
The f3d1 ecosystem:
- f3dx - Rust runtime your Python imports. Drop-in for openai + anthropic SDKs with native SSE streaming, agent loop with concurrent tool dispatch, OTel emission. pip install f3dx.
- f3dx-cache - Content-addressable LLM response cache + replay. redb + RFC 8785 JCS + BLAKE3. pip install f3dx-cache.
- pydantic-cal - Calibration metrics for pydantic-evals: ECE, MCE, ACE, Brier, reliability diagrams, Fisher-Rao geometry kernel. pip install pydantic-cal.
- f3dx-router - In-process Rust router for LLM providers. Hedged-parallel + 429/5xx hot-swap. pip install f3dx-router.
- f3dx-bench - Public real-prod-traffic LLM benchmark dashboard. CF Worker + R2 + duckdb-wasm. Live.
- llmkit - Hosted API gateway with budget enforcement, session tracking, cost dashboards, MCP server. llmkit.sh.
- keyguard - Security linter for open source projects. Finds and fixes what others only report.
License
MIT.