Agent runtime tracing and deterministic replay for LLM applications

These details have not been verified by PyPI

Project description

TraceForge

Agent runtime tracing + LLM-mock replay for Python. Pip install. Async-first. Self-contained reports.

[Screenshot: TraceForge HTML report]

⚠️ Pre-release (v0.2). Tracer, replay, instrumentors, cost tracking, and the pytest plugin are implemented end-to-end and tested. APIs are stabilising — don't depend on this in production yet.

TraceForge records every LLM call, tool invocation, error, and state transition your agent makes into a typed span. The output is a replayable run.jsonl artifact plus a self-contained HTML report you can open in any browser: no server, no SaaS, no SDK lock-in. Replay mode re-executes the agent with cached LLM responses (llm-mock), or with cached LLM responses and cached tool outputs (dry-run), so you can verify the execution path without burning LLM API calls.

pip install "traceforge-llm[anthropic]"   # or [openai], [all]
traceforge init && python agent.py

Why TraceForge

Capability	TraceForge	LangSmith	Langfuse	OpenLLMetry	`print()`
Pip install, no account, no server	✅	—	partial (self-host)	✅	✅
Records LLM I/O + tool I/O + state per span	✅	✅	✅	partial	—
Replay with cached LLM responses (`llm-mock`)	✅	—	—	—	—
Dry-run replay with cached tool outputs	✅	—	—	—	—
Self-contained HTML report (no CDN, no server)	✅	—	—	—	—
Auto-cost tracking per-span + per-run	✅	✅	✅	partial	—
First-class pytest plugin with snapshot testing	✅	—	—	—	—
Auto-patches your SDK clients	opt-in	✅	✅	✅	n/a
Cloud storage / hosted dashboard	—	✅	✅	via vendor	—

Where TraceForge fits: when you need a local, file-based, replayable record of what your agent did — for debugging, CI regression tests, or post-hoc analysis — without sending your traces to anyone else's database. Auto-patching frameworks like LangSmith give you a UI; OpenLLMetry gives you OTel pipes. TraceForge gives you a JSONL you can git diff, an HTML you can email, and a tracer.replay() you can run offline.

60-second quickstart

1. Install and scaffold.

pip install "traceforge-llm[anthropic]"
traceforge init

traceforge init writes traceforge.yaml, a working agent.py example, and a .gitignore entry.

2. Wrap your agent.

import asyncio
from anthropic import AsyncAnthropic
from traceforge import Tracer
from traceforge.integrations.anthropic import AnthropicInstrumentor

tracer = Tracer()

async def main():
    async with tracer.run() as run:
        client = AnthropicInstrumentor(run).instrument(AsyncAnthropic())
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": "What is 2 + 2?"}],
        )
        print(response.content[0].text)
    run.trace.print_summary()

asyncio.run(main())

3. Run.

export ANTHROPIC_API_KEY=sk-ant-...
python agent.py

You get a Rich-formatted summary on stdout, plus a directory at .traceforge/runs/<ulid>-<run-name>/ containing manifest.json, run.jsonl, and a self-contained report.html.

Library-only API (no instrumentor)

async with tracer.run() as run:
    run.record_llm_call(
        provider="anthropic",
        model="claude-haiku-4-5",
        messages=[...],
        response="...",
        input_tokens=12, output_tokens=4, latency_ms=180,
    )
    run.record_tool_call("search", tool_input={"q": "..."}, tool_output={"hits": 3})
    run.custom("phase.done", metadata={"step": 1})

run.trace.print_summary()

Manual recording is the lower-level API the instrumentors are built on. Useful when you don't want TraceForge anywhere near the SDK call.

Decorator sugar

@tracer.trace
async def my_agent(query, _run=None):
    _run.record_tool_call("search", {"q": query}, {"hits": 3})
    return "done"

await my_agent("hello")        # auto-saves trace to .traceforge/runs/
trace = tracer.last()

Reports

tracer.run() writes three files per run to .traceforge/runs/<ulid>-<run-name>/:

.traceforge/runs/01KS8E...-true-elk/
├── report.html      ← open this
├── run.jsonl        ← replayable artifact (one span per line + manifest)
└── manifest.json    ← aggregate counts + cost + token totals

Terminal: run.trace.print_summary() prints a Rich panel + span tree:

[Screenshot: terminal report]

HTML: self-contained, dark theme, no CDN. Open report.html in any browser (or via traceforge open <run-name>):

- Stat strip at the top: duration, span count, LLM calls, tool calls, token totals, cost, errors
- Span cards with type-coded left border (indigo LLM, cyan tool, slate custom, red errors)
- Collapsible payloads for system prompts, message arrays, and tool I/O, kept folded by default so the page stays scannable
- Per-span cost rendered inline for every LLM call

A live example sits at docs/example-report.html — open it directly to see the layout.
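
Because run.jsonl is plain JSON Lines, it can also be inspected with the standard library alone. The sketch below tallies records by kind. It assumes each line is a JSON object and that the span kind sits under a `type` key; that key name and the helper `count_span_types` are illustrative guesses, not TraceForge's documented schema.

```python
import json
from collections import Counter
from pathlib import Path


def count_span_types(jsonl_path: Path, type_key: str = "type") -> Counter:
    """Tally records in a JSON Lines file by the value stored under `type_key`.

    `type_key` is an assumption about the schema; adjust it to your run.jsonl.
    """
    counts: Counter = Counter()
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            # Records without the key (e.g. a manifest line) land in "unknown".
            counts[record.get(type_key, "unknown")] += 1
    return counts
```

Point it at the run.jsonl inside a run directory and print the resulting Counter to get a quick per-type breakdown.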

Replay

# llm-mock: LLM responses served from cache, tools execute live
result = await tracer.replay(trace, agent_fn, mode="llm-mock")

# dry-run: both LLM responses AND tool outputs served from cache, no network
result = await tracer.replay(trace, agent_fn, mode="dry-run")

result.print()
# Similarity: 100%  ·  Status: ALIGNED

The replay engine builds two interceptors keyed by SHA-256 of the original messages / tool inputs, and hands them to your agent function. Your agent consults the interceptor before calling out:

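async def my_agent(query, _run=None, _mock_llm=None, _mock_tool=None):
    messages = [{"role": "user", "content": query}]
    # During replay, serve the cached response; otherwise fall through to the live call.
    cached = _mock_llm.get(messages) if _mock_llm else None
    if cached is not None:
        return cached
    return await client.messages.create(...)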

The shipped instrumentors handle this for you — just pass mock_interceptor=_mock_llm when instrumenting:

client = AnthropicInstrumentor(_run, mock_interceptor=_mock_llm).instrument(AsyncAnthropic())
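
To make the hash-keyed lookup concrete, here is a minimal sketch of a cache keyed by SHA-256 of the request payload. `cache_key` and `MockInterceptor` are hypothetical names, and the canonical-JSON encoding is an assumption; the real interceptors may serialise messages differently.

```python
import hashlib
import json
from typing import Any, Optional


def cache_key(payload: Any) -> str:
    """Hash a JSON-serialisable payload into a stable SHA-256 hex digest."""
    # sort_keys + fixed separators make the encoding independent of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MockInterceptor:
    """Hypothetical stand-in for a replay interceptor: a dict keyed by payload hash."""

    def __init__(self) -> None:
        self._cache: dict[str, Any] = {}

    def put(self, payload: Any, response: Any) -> None:
        self._cache[cache_key(payload)] = response

    def get(self, payload: Any) -> Optional[Any]:
        # None signals a cache miss, so the caller falls through to the live call.
        return self._cache.get(cache_key(payload))
```

In this sketch any change to the messages, even a single character, produces a different key and therefore a miss.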

Similarity scoring. ReplayResult.similarity_score is the ratio of matching span types between original and replayed traces. Below 0.4 the replay is marked DIVERGED. See docs/replay-faq.md for why replays diverge and how to fix it.
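
One way to read "ratio of matching span types" is a position-by-position comparison over the longer of the two sequences. The sketch below takes that reading; the exact formula TraceForge uses is not stated here, so treat both the positional comparison and the two-state status as assumptions. Only the 0.4 threshold comes from the text above.

```python
from typing import Sequence

DIVERGED_BELOW = 0.4  # threshold quoted in the paragraph above


def similarity_score(original: Sequence[str], replayed: Sequence[str]) -> float:
    """Fraction of positions whose span types match, over the longer sequence."""
    longest = max(len(original), len(replayed))
    if longest == 0:
        return 1.0  # two empty traces are trivially identical
    matches = sum(a == b for a, b in zip(original, replayed))
    return matches / longest


def status(score: float) -> str:
    # Only two states are modelled here; the real engine may report more.
    return "DIVERGED" if score < DIVERGED_BELOW else "ALIGNED"
```

For example, an original of llm, tool, llm, tool replayed as llm, tool, error matches in two of four positions and scores 0.5, which this sketch still labels ALIGNED.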

Cost tracking

Every LLM span gets a USD cost estimate attached automatically, looked up from a built-in pricing table for the major Anthropic and OpenAI models (with longest-prefix matching, so versioned IDs like claude-haiku-4-5-20251001 resolve correctly). Aggregates flow into manifest.total_cost_usd.

async with tracer.run() as run:
    await client.messages.create(...)  # instrumentor records cost

print(run.trace.manifest.total_cost_usd)  # → 0.0042

Override the table for negotiated contract pricing or private models:

from traceforge import Tracer
from traceforge.pricing import ModelPrice

tracer = Tracer(pricing={
    "my-internal-model": ModelPrice(input_per_million=0.5, output_per_million=1.5),
    "claude-opus-4-7":   ModelPrice(input_per_million=12.0, output_per_million=60.0),
})

Unknown models are costed at 0 and emit a one-shot warning, so the trace still saves.
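
The lookup rules in this section (longest-prefix matching, zero cost plus a single warning for unknown models) can be sketched as follows. `Price`, `resolve_price`, and `estimate_cost` are local stand-ins written for illustration, not the library's internals, and the prices in the table are placeholders rather than real list prices.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """USD per million tokens; a local stand-in for illustration."""
    input_per_million: float
    output_per_million: float


# Placeholder numbers for illustration only, not real list prices.
PRICES = {
    "claude-haiku": Price(1.0, 5.0),
    "claude-haiku-4-5": Price(2.0, 10.0),
}

_warned: set[str] = set()


def resolve_price(model: str, table: dict[str, Price]) -> Optional[Price]:
    """Return the price whose key is the longest prefix of `model`, if any."""
    candidates = [key for key in table if model.startswith(key)]
    if not candidates:
        return None
    return table[max(candidates, key=len)]


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  table: dict[str, Price] = PRICES) -> float:
    """Estimate USD cost; unknown models cost 0 and warn once."""
    price = resolve_price(model, table)
    if price is None:
        if model not in _warned:  # warn only the first time a model is seen
            _warned.add(model)
            warnings.warn(f"no pricing entry for {model!r}; costing at 0")
        return 0.0
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000
```

With this table, a versioned ID such as claude-haiku-4-5-20251001 resolves to the claude-haiku-4-5 entry rather than the shorter claude-haiku one, because the longer key wins.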

Pytest plugin

Installing traceforge-llm auto-registers a pytest plugin (via the pytest11 entry point). Three fixtures become available in any test suite:

import pytest

@pytest.mark.asyncio
async def test_agent_runs_under_budget(tracer, tf_assert, tf_snapshot):
    async with tracer.run() as run:
        await my_agent("hello", _run=run)

    tf_assert(
        run.trace,
        has_span="search",
        llm_calls=1,
        max_cost_usd=0.01,
        max_tokens=2000,
    )

    # Golden-trace snapshot — fails if the span-type sequence drifts.
    tf_snapshot.assert_match(run.trace, "agent_v1")

Fixture	What it gives you
`tracer`	A non-auto-saving `Tracer` per test (no `.traceforge/` cruft)
`tf_assert(trace, ...)`	One-line common assertions: `has_span`, `has_span_type`, `no_errors`, `llm_calls`, `tool_calls`, `max_cost_usd`, `max_tokens`, `min_spans`
`tf_snapshot.assert_match(trace, name)`	Records the trace on first run, then asserts span-type-sequence similarity ≥ 0.8 on every subsequent run

Snapshots live in tests/__tf_snapshots__/<name>.jsonl by default — override with --tf-snapshot-dir. Commit them like you commit any other golden file. Refresh after intentional changes with:

pytest --tf-update-snapshots
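
The record-then-compare behaviour of snapshot testing can be illustrated with a small standalone function. This is a sketch under assumptions: the on-disk format (one `{"type": ...}` object per line), the positional similarity measure, and the `update` flag are all invented for illustration. Only the 0.8 threshold and the `<name>.jsonl` file name come from the description above.

```python
import json
from pathlib import Path
from typing import Sequence


def _similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Positional match ratio over the longer sequence (an assumed measure)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def assert_match(span_types: Sequence[str], name: str, snapshot_dir: Path,
                 update: bool = False, threshold: float = 0.8) -> None:
    """Record the sequence on first use, then compare later runs against it."""
    path = snapshot_dir / f"{name}.jsonl"
    if update or not path.exists():
        snapshot_dir.mkdir(parents=True, exist_ok=True)
        # One JSON object per line, mirroring the .jsonl extension.
        path.write_text(
            "".join(json.dumps({"type": t}) + "\n" for t in span_types),
            encoding="utf-8",
        )
        return
    golden = [json.loads(line)["type"]
              for line in path.read_text(encoding="utf-8").splitlines() if line]
    score = _similarity(golden, list(span_types))
    assert score >= threshold, f"snapshot {name!r} drifted: similarity {score:.2f}"
```

The first call writes the golden file and passes; later calls fail only when the sequence drifts below the threshold, and passing `update=True` plays the role of the refresh flag.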

Instrumentors

Provider	Class	Wraps
Anthropic	`traceforge.integrations.anthropic.AnthropicInstrumentor`	`client.messages.create`
OpenAI	`traceforge.integrations.openai.OpenAIInstrumentor`	`client.chat.completions.create`
LangChain	`traceforge.integrations.langchain.LangChainInstrumentor`	manual `record_chain_step` / `record_llm_step`

LangChain is intentionally manual — auto-patching is fragile across versions, so TraceForge ships a bridge helper you call from your callback handler.

CLI

Command	Purpose
`traceforge init`	Scaffold `traceforge.yaml`, `agent.py` example, `.gitignore` entry
`traceforge list`	Table of local runs (newest first)
`traceforge open <id>`	Open a run's HTML report in your browser
`traceforge show <id>`	Print a run summary to the terminal

<id> accepts a ULID prefix or the human-readable run name (e.g. brave-salmon).
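
Resolving an identifier against directory names of the form `<ulid>-<run-name>` might look like the sketch below. `resolve_run` is a hypothetical helper, and the case-insensitive prefix match and the error on ambiguous input are assumptions about behaviour the CLI does not document here.

```python
from pathlib import Path

ULID_LENGTH = 26  # a ULID is 26 Crockford base32 characters


def resolve_run(runs_dir: Path, ident: str) -> Path:
    """Find the single run directory matching a ULID prefix or a run name."""
    matches = []
    for entry in sorted(runs_dir.iterdir()):
        if not entry.is_dir():
            continue
        # Directory names look like "<ulid>-<run-name>"; split at the fixed ULID width.
        ulid, run_name = entry.name[:ULID_LENGTH], entry.name[ULID_LENGTH + 1:]
        if run_name == ident or ulid.startswith(ident.upper()):
            matches.append(entry)
    if len(matches) != 1:
        raise LookupError(f"{ident!r} matched {len(matches)} runs")
    return matches[0]
```

Raising on zero or multiple matches keeps a short, ambiguous prefix from silently opening the wrong run.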

Non-goals

- No auto-patching by default. Instrumentors are opt-in. Your code stays explicit about what's being traced.
- No time-travel debugging. TraceForge records and replays; it does not pause your agent mid-flight.
- No cloud storage. Traces live in .traceforge/runs/. Bring your own object store if you want central retention.
- No built-in eval scoring. TraceForge captures the run; pair it with evalkit (or your own scorer) for grading.

Status

Feature	Status
Async + sync context manager, `@tracer.trace` decorator	✅ shipped
Anthropic / OpenAI instrumentors	✅ shipped
LangChain bridge (manual)	✅ shipped
File store: `manifest.json` + `run.jsonl` + `report.html`	✅ shipped
Self-contained HTML report (no CDN)	✅ shipped
LLM-mock replay	✅ shipped
Dry-run replay (tool cache)	✅ shipped
Cost tracking (per-span + manifest total)	✅ shipped (v0.2)
Custom pricing tables	✅ shipped (v0.2)
Pytest plugin (`tracer`, `tf_assert`)	✅ shipped (v0.2)
Trace snapshot testing (`tf_snapshot`)	✅ shipped (v0.2)
Streaming + tool-use in instrumentors	deferred
`traceforge diff` (span-level diff)	deferred
Slim mode (`--slim`)	deferred
LangGraph auto-instrumentation	manual only
Cloud storage backends	non-goal

Track progress and propose features via GitHub Issues.

Docs

- Replay FAQ (docs/replay-faq.md): why replays diverge, and how to fix it
- Example HTML report (docs/example-report.html): live, self-contained

License

Apache 2.0.

Project details

This version

0.2.0

May 22, 2026

Download files

Source Distribution

traceforge_llm-0.2.0.tar.gz (40.6 kB)

Uploaded May 22, 2026 Source

Built Distribution

traceforge_llm-0.2.0-py3-none-any.whl (37.1 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file traceforge_llm-0.2.0.tar.gz.

File metadata

Download URL: traceforge_llm-0.2.0.tar.gz
Upload date: May 22, 2026
Size: 40.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for traceforge_llm-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`4f993861e482483c25f933d4470b71be8114ccd6a7c1297ba7570690032f9fb7`
MD5	`077640b298c529d8adae2015b2d37bac`
BLAKE2b-256	`7eb08d7eb9e3f29b408df88a432d77c729a6d510b64623fd56b9521b7f8657e0`

Provenance

The following attestation bundles were made for traceforge_llm-0.2.0.tar.gz:

Publisher: release.yml on Danultimate/traceforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: traceforge_llm-0.2.0.tar.gz
- Subject digest: 4f993861e482483c25f933d4470b71be8114ccd6a7c1297ba7570690032f9fb7
- Sigstore transparency entry: 1607137210
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: Danultimate/traceforge@9b5fbf1de03b53446672ae21cefde48748a1b4b0
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Danultimate
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9b5fbf1de03b53446672ae21cefde48748a1b4b0
- Trigger Event: push

File details

Details for the file traceforge_llm-0.2.0-py3-none-any.whl.

File metadata

Download URL: traceforge_llm-0.2.0-py3-none-any.whl
Upload date: May 22, 2026
Size: 37.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for traceforge_llm-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1aa0dfe1ddbfd0829a07dcdaab0e0b85eb5546551cc0878b2b12e0e79cec1adf`
MD5	`3d629ee27d74794eab55bf0944b24b11`
BLAKE2b-256	`5d7fd654909361eb73bc7d58a92545736b7fd9d6c4680dffa81ddf696187df3e`

Provenance

The following attestation bundles were made for traceforge_llm-0.2.0-py3-none-any.whl:

Publisher: release.yml on Danultimate/traceforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: traceforge_llm-0.2.0-py3-none-any.whl
- Subject digest: 1aa0dfe1ddbfd0829a07dcdaab0e0b85eb5546551cc0878b2b12e0e79cec1adf
- Sigstore transparency entry: 1607137353
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: Danultimate/traceforge@9b5fbf1de03b53446672ae21cefde48748a1b4b0
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Danultimate
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9b5fbf1de03b53446672ae21cefde48748a1b4b0
- Trigger Event: push
