Agent runtime tracing and deterministic replay for LLM applications
TraceForge
Agent runtime tracing + LLM-mock replay for Python. Pip install. Async-first. Self-contained reports.
⚠️ Pre-release (v0.2). Tracer, replay, instrumentors, cost tracking, and the pytest plugin are implemented end-to-end and tested. APIs are stabilising — don't depend on this in production yet.
TraceForge records every LLM call, tool invocation, error, and state transition your agent makes into a typed span. The output is a replayable run.jsonl artifact plus a self-contained HTML report you can open in any browser — no server, no SaaS, no SDK lock-in. Replay mode re-executes the agent with cached LLM responses (or cached tool outputs) so you can verify the execution path without burning API calls.
pip install "traceforge-llm[anthropic]" # or [openai], [all]
traceforge init && python agent.py
Why TraceForge
| | TraceForge | LangSmith | Langfuse | OpenLLMetry | print() |
|---|---|---|---|---|---|
| Pip install, no account, no server | ✅ | — | partial (self-host) | ✅ | ✅ |
| Records LLM I/O + tool I/O + state per span | ✅ | ✅ | ✅ | partial | — |
| Replay with cached LLM responses (llm-mock) | ✅ | — | — | — | — |
| Dry-run replay with cached tool outputs | ✅ | — | — | — | — |
| Self-contained HTML report (no CDN, no server) | ✅ | — | — | — | — |
| Auto-cost tracking per-span + per-run | ✅ | ✅ | ✅ | partial | — |
| First-class pytest plugin with snapshot testing | ✅ | — | — | — | — |
| Auto-patches your SDK clients | opt-in | ✅ | ✅ | ✅ | n/a |
| Cloud storage / hosted dashboard | — | ✅ | ✅ | via vendor | — |
Where TraceForge fits: when you need a local, file-based, replayable record of what your agent did — for debugging, CI regression tests, or post-hoc analysis — without sending your traces to anyone else's database. Auto-patching frameworks like LangSmith give you a UI; OpenLLMetry gives you OTel pipes. TraceForge gives you a JSONL you can git diff, an HTML you can email, and a tracer.replay() you can run offline.
60-second quickstart
1. Install and scaffold.
pip install "traceforge-llm[anthropic]"
traceforge init
traceforge init writes traceforge.yaml, a working agent.py example, and a .gitignore entry.
2. Wrap your agent.
import asyncio
from anthropic import AsyncAnthropic
from traceforge import Tracer
from traceforge.integrations.anthropic import AnthropicInstrumentor
tracer = Tracer()
async def main():
    async with tracer.run() as run:
        client = AnthropicInstrumentor(run).instrument(AsyncAnthropic())
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": "What is 2 + 2?"}],
        )
        print(response.content[0].text)
    run.trace.print_summary()
asyncio.run(main())
3. Run.
export ANTHROPIC_API_KEY=sk-ant-...
python agent.py
You get a Rich-formatted summary on stdout, plus a directory at .traceforge/runs/<ulid>-<run-name>/ containing manifest.json, run.jsonl, and a self-contained report.html.
Library-only API (no instrumentor)
async with tracer.run() as run:
    run.record_llm_call(
        provider="anthropic",
        model="claude-haiku-4-5",
        messages=[...],
        response="...",
        input_tokens=12, output_tokens=4, latency_ms=180,
    )
    run.record_tool_call("search", tool_input={"q": "..."}, tool_output={"hits": 3})
    run.custom("phase.done", metadata={"step": 1})
run.trace.print_summary()
Manual recording is the lower-level API the instrumentors are built on. Useful when you don't want TraceForge anywhere near the SDK call.
Decorator sugar
@tracer.trace
async def my_agent(query, _run=None):
    _run.record_tool_call("search", {"q": query}, {"hits": 3})
    return "done"
await my_agent("hello") # auto-saves trace to .traceforge/runs/
trace = tracer.last()
Reports
tracer.run() writes three files per run to .traceforge/runs/<ulid>-<run-name>/:
.traceforge/runs/01KS8E...-true-elk/
├── report.html ← open this
├── run.jsonl ← replayable artifact (one span per line + manifest)
└── manifest.json ← aggregate counts + cost + token totals
Terminal: run.trace.print_summary() prints a Rich panel + span tree.
HTML: self-contained, dark theme, no CDN. Open report.html in any browser (or via traceforge open <run-name>):
- Stat strip at the top: duration, span count, LLM calls, tool calls, token totals, cost, errors
- Span cards with type-coded left border (indigo LLM, cyan tool, slate custom, red errors)
- Collapsible payloads for system prompts, message arrays, and tool I/O — kept folded by default so the page stays scannable
- Per-span cost rendered inline for every LLM call
A live example sits at docs/example-report.html — open it directly to see the layout.
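Because run.jsonl holds one span per line, it can be inspected with nothing but the standard library. The sketch below tallies spans by type from an in-memory sample. The `type` and `name` keys and the sample records are assumptions made for illustration; check a real run.jsonl for the actual schema before relying on them.

```python
import json
from collections import Counter

# Hypothetical sample: the real run.jsonl schema may use different keys.
SAMPLE = "\n".join([
    json.dumps({"type": "llm", "name": "messages.create"}),
    json.dumps({"type": "tool", "name": "search"}),
    json.dumps({"type": "llm", "name": "messages.create"}),
])


def count_span_types(jsonl_text: str) -> Counter:
    """Parse one JSON object per line and tally the assumed `type` field."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():  # tolerate blank lines
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts


counts = count_span_types(SAMPLE)
```

The same loop works on a file by passing `Path(...).read_text()` instead of the sample string.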
Replay
# llm-mock: LLM responses served from cache, tools execute live
result = await tracer.replay(trace, agent_fn, mode="llm-mock")
# dry-run: both LLM responses AND tool outputs served from cache, no network
result = await tracer.replay(trace, agent_fn, mode="dry-run")
result.print()
# Similarity: 100% · Status: ALIGNED
The replay engine builds two interceptors keyed by SHA-256 of the original messages / tool inputs, and hands them to your agent function. Your agent consults the interceptor before calling out:
async def my_agent(query, _run=None, _mock_llm=None, _mock_tool=None):
    cached = _mock_llm.get(messages) if _mock_llm else None
    if cached is not None:
        return cached
    return await client.messages.create(...)
The shipped instrumentors handle this for you — just pass mock_interceptor=_mock_llm when instrumenting:
client = AnthropicInstrumentor(_run, mock_interceptor=_mock_llm).instrument(AsyncAnthropic())
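To make the keying idea concrete, here is a minimal sketch of a cache keyed by the SHA-256 of a request payload. The `MockInterceptor` class, its `put` method, and the canonical-JSON encoding are assumptions for illustration and are not TraceForge's actual implementation; only the `get` call mirrors the usage shown above.

```python
import hashlib
import json


def cache_key(payload) -> str:
    """Hash a JSON-serialisable payload into a stable hex digest."""
    # sort_keys and fixed separators keep the encoding independent of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MockInterceptor:
    """Hypothetical stand-in for a replay interceptor; not TraceForge's class."""

    def __init__(self):
        self._cache = {}

    def put(self, request, response):
        self._cache[cache_key(request)] = response

    def get(self, request):
        # Returns None on a miss so the caller can fall through to a live call.
        return self._cache.get(cache_key(request))


messages = [{"role": "user", "content": "What is 2 + 2?"}]
interceptor = MockInterceptor()
interceptor.put(messages, "4")

# Same content with a different key order still hits the cache.
hit = interceptor.get([{"content": "What is 2 + 2?", "role": "user"}])
# Any change to the message text produces a different digest and a miss.
miss = interceptor.get([{"role": "user", "content": "What is 3 + 3?"}])
```

A consequence of content-based keys is that any change to the prompt, including whitespace, produces a cache miss, which is one plausible reason a replay might diverge.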
Similarity scoring. ReplayResult.similarity_score is the ratio of matching span types between original and replayed traces. Below 0.4 the replay is marked DIVERGED. See docs/replay-faq.md for why replays diverge and how to fix it.
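One way to picture the score is as a sequence comparison over span types. The sketch below uses `difflib.SequenceMatcher` as a stand-in; the exact formula TraceForge applies is not stated here, so treat the function names and the choice of matcher as assumptions. Only the 0.4 threshold comes from the text above.

```python
from difflib import SequenceMatcher

DIVERGED_BELOW = 0.4  # threshold quoted in the text above


def similarity_score(original_types, replayed_types) -> float:
    """Ratio of matching span types, computed here with difflib (an assumption)."""
    return SequenceMatcher(None, original_types, replayed_types).ratio()


def is_diverged(score: float) -> bool:
    return score < DIVERGED_BELOW


original = ["llm", "tool", "llm"]
same = similarity_score(original, ["llm", "tool", "llm"])
drifted = similarity_score(original, ["custom"])
```

An identical span-type sequence scores 1.0, while a replay that shares no span types with the original scores 0.0 and falls under the threshold.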
Cost tracking
Every LLM span gets a USD cost estimate attached automatically, looked up from a built-in pricing table for the major Anthropic and OpenAI models (with longest-prefix matching, so versioned IDs like claude-haiku-4-5-20251001 resolve correctly). Aggregates flow into manifest.total_cost_usd.
async with tracer.run() as run:
    await client.messages.create(...) # instrumentor records cost
print(run.trace.manifest.total_cost_usd) # → 0.0042
Override the table for negotiated contract pricing or private models:
from traceforge import Tracer
from traceforge.pricing import ModelPrice
tracer = Tracer(pricing={
    "my-internal-model": ModelPrice(input_per_million=0.5, output_per_million=1.5),
    "claude-opus-4-7": ModelPrice(input_per_million=12.0, output_per_million=60.0),
})
Unknown models cost 0 and emit a one-shot warning so the trace still saves.
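The lookup rules described in this section (longest-prefix matching, zero cost plus a single warning for unknown models) can be sketched as follows. The `Price` dataclass, the `lookup` and `cost_usd` helpers, the model names, and the prices are all placeholders invented for illustration; they are not TraceForge's pricing table or API, and warning once per model name is one possible reading of "one-shot".

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical price record in USD per million tokens."""
    input_per_million: float
    output_per_million: float


# Placeholder numbers for illustration only, not real vendor pricing.
TABLE = {
    "example-model": Price(1.0, 5.0),
    "example-model-large": Price(10.0, 50.0),
}
_warned = set()


def lookup(model: str):
    """Return the price for the longest table key that prefixes `model`."""
    matches = [key for key in TABLE if model.startswith(key)]
    return TABLE[max(matches, key=len)] if matches else None


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = lookup(model)
    if price is None:
        if model not in _warned:  # warn only once per unknown model
            _warned.add(model)
            warnings.warn(f"no price for {model!r}; cost recorded as 0")
        return 0.0
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000


# A versioned ID resolves to the longest matching key, "example-model-large".
versioned = cost_usd("example-model-large-20250101", 1_000_000, 200_000)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unknown = cost_usd("mystery-model", 100, 100)
    cost_usd("mystery-model", 100, 100)  # second call stays silent
```

Choosing the longest prefix matters when one model name is a prefix of another: a shortest-match rule would price the larger model at the smaller model's rate.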
Pytest plugin
pip install traceforge-llm auto-registers a pytest plugin (via pytest11 entry point). Three fixtures appear in any test suite:
import pytest
@pytest.mark.asyncio
async def test_agent_runs_under_budget(tracer, tf_assert, tf_snapshot):
    async with tracer.run() as run:
        await my_agent("hello", _run=run)
    tf_assert(
        run.trace,
        has_span="search",
        llm_calls=1,
        max_cost_usd=0.01,
        max_tokens=2000,
    )
    # Golden-trace snapshot: fails if the span-type sequence drifts.
    tf_snapshot.assert_match(run.trace, "agent_v1")
| Fixture | What it gives you |
|---|---|
| tracer | A non-auto-saving Tracer per test (no .traceforge/ cruft) |
| tf_assert(trace, ...) | One-line common assertions: has_span, has_span_type, no_errors, llm_calls, tool_calls, max_cost_usd, max_tokens, min_spans |
| tf_snapshot.assert_match(trace, name) | Records the trace on first run, then asserts span-type-sequence similarity ≥ 0.8 on every subsequent run |
Snapshots live in tests/__tf_snapshots__/<name>.jsonl by default — override with --tf-snapshot-dir. Commit them like you commit any other golden file. Refresh after intentional changes with:
pytest --tf-update-snapshots
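The record-then-compare behaviour of a golden-trace snapshot can be illustrated with a small standalone helper. This is a sketch under stated assumptions, not the plugin's code: the `assert_match` function below, its `update` flag, the single-file JSON storage, and the use of `difflib` for the similarity ratio are all hypothetical. Only the 0.8 threshold is taken from the fixture table.

```python
import json
import tempfile
from difflib import SequenceMatcher
from pathlib import Path

THRESHOLD = 0.8  # similarity floor quoted in the fixture table


def assert_match(span_types, name, snapshot_dir, update=False):
    """Record on first use, then compare against the stored sequence."""
    path = Path(snapshot_dir) / f"{name}.json"
    if update or not path.exists():
        path.write_text(json.dumps(span_types))
        return "recorded"
    golden = json.loads(path.read_text())
    score = SequenceMatcher(None, golden, span_types).ratio()
    assert score >= THRESHOLD, f"snapshot {name!r} drifted: {score:.2f}"
    return "matched"


with tempfile.TemporaryDirectory() as tmp:
    first = assert_match(["llm", "tool", "llm"], "agent_v1", tmp)
    second = assert_match(["llm", "tool", "llm"], "agent_v1", tmp)
    try:
        assert_match(["custom"], "agent_v1", tmp)
        drift_detected = False
    except AssertionError:
        drift_detected = True
```

Passing `update=True` in this sketch plays the role of an update flag: it overwrites the stored sequence instead of comparing against it.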
Instrumentors
| Provider | Class | Wraps |
|---|---|---|
| Anthropic | traceforge.integrations.anthropic.AnthropicInstrumentor | client.messages.create |
| OpenAI | traceforge.integrations.openai.OpenAIInstrumentor | client.chat.completions.create |
| LangChain | traceforge.integrations.langchain.LangChainInstrumentor | manual record_chain_step / record_llm_step |
LangChain is intentionally manual — auto-patching is fragile across versions, so TraceForge ships a bridge helper you call from your callback handler.
CLI
| Command | Purpose |
|---|---|
| traceforge init | Scaffold traceforge.yaml, agent.py example, .gitignore entry |
| traceforge list | Table of local runs (newest first) |
| traceforge open <id> | Open a run's HTML report in your browser |
| traceforge show <id> | Print a run summary to the terminal |
<id> accepts a ULID prefix or the human-readable run name (brave-salmon).
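Since run directories are named `<ulid>-<run-name>`, resolving an `<id>` amounts to matching either part of the directory name. The helper below is a hypothetical sketch of that lookup, not the CLI's actual code; the function name, the ambiguity handling, and the shortened placeholder ULIDs are assumptions.

```python
def resolve_run(query: str, run_dirs: list[str]) -> str:
    """Match `query` as a ULID prefix or as the exact run name."""
    matches = []
    for dirname in run_dirs:
        # ULIDs contain no hyphen, so the first "-" separates ULID and name.
        ulid, _, run_name = dirname.partition("-")
        if ulid.startswith(query) or run_name == query:
            matches.append(dirname)
    if len(matches) != 1:
        raise LookupError(f"{query!r} matched {len(matches)} runs")
    return matches[0]


# Shortened, made-up directory names for illustration.
runs = ["01KS8EAAAA-true-elk", "01KT2ZBBBB-brave-salmon"]
by_prefix = resolve_run("01KS8E", runs)
by_name = resolve_run("brave-salmon", runs)
```

Raising on zero or multiple matches keeps an ambiguous prefix such as `01K` from silently opening the wrong run.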
Non-goals
- No auto-patching by default. Instrumentors are opt-in. Your code stays explicit about what's being traced.
- No time-travel debugging. TraceForge records and replays; it does not pause your agent mid-flight.
- No cloud storage. Traces live in .traceforge/runs/. Bring your own object store if you want central retention.
- No built-in eval scoring. TraceForge captures the run; pair it with evalkit (or your own scorer) for grading.
Status
| Feature | Status |
|---|---|
| Async + sync context manager, @tracer.trace decorator | ✅ shipped |
| Anthropic / OpenAI instrumentors | ✅ shipped |
| LangChain bridge (manual) | ✅ shipped |
| File store: manifest.json + run.jsonl + report.html | ✅ shipped |
| Self-contained HTML report (no CDN) | ✅ shipped |
| LLM-mock replay | ✅ shipped |
| Dry-run replay (tool cache) | ✅ shipped |
| Cost tracking (per-span + manifest total) | ✅ shipped (v0.2) |
| Custom pricing tables | ✅ shipped (v0.2) |
| Pytest plugin (tracer, tf_assert) | ✅ shipped (v0.2) |
| Trace snapshot testing (tf_snapshot) | ✅ shipped (v0.2) |
| Streaming + tool-use in instrumentors | deferred |
| traceforge diff (span-level diff) | deferred |
| Slim mode (--slim) | deferred |
| LangGraph auto-instrumentation | manual only |
| Cloud storage backends | non-goal |
Track progress and propose features via GitHub Issues.
Docs
- Replay FAQ — why replays diverge, and how to fix it
- Example HTML report — live, self-contained
License