Agent runtime tracing and deterministic replay for LLM applications

TraceForge

Agent runtime tracing + LLM-mock replay for Python. Pip install. Async-first. Self-contained reports.

⚠️ Pre-release (v0.2). Tracer, replay, instrumentors, cost tracking, and the pytest plugin are implemented end-to-end and tested. APIs are stabilising — don't depend on this in production yet.

TraceForge records every LLM call, tool invocation, error, and state transition your agent makes into a typed span. The output is a replayable run.jsonl artifact plus a self-contained HTML report you can open in any browser — no server, no SaaS, no SDK lock-in. Replay mode re-executes the agent with cached LLM responses (or cached tool outputs) so you can verify the execution path without burning API calls.

pip install "traceforge-llm[anthropic]"   # or [openai], [all]
traceforge init && python agent.py

Why TraceForge

TraceForge LangSmith Langfuse OpenLLMetry print()
Pip install, no account, no server partial (self-host)
Records LLM I/O + tool I/O + state per span partial
Replay with cached LLM responses (llm-mock)
Dry-run replay with cached tool outputs
Self-contained HTML report (no CDN, no server)
Auto-cost tracking per-span + per-run partial
First-class pytest plugin with snapshot testing
Auto-patches your SDK clients opt-in n/a
Cloud storage / hosted dashboard via vendor

Where TraceForge fits: when you need a local, file-based, replayable record of what your agent did — for debugging, CI regression tests, or post-hoc analysis — without sending your traces to anyone else's database. Auto-patching frameworks like LangSmith give you a UI; OpenLLMetry gives you OTel pipes. TraceForge gives you a JSONL you can git diff, an HTML you can email, and a tracer.replay() you can run offline.


60-second quickstart

1. Install and scaffold.

pip install "traceforge-llm[anthropic]"
traceforge init

traceforge init writes traceforge.yaml, a working agent.py example, and a .gitignore entry.

2. Wrap your agent.

import asyncio
from anthropic import AsyncAnthropic
from traceforge import Tracer
from traceforge.integrations.anthropic import AnthropicInstrumentor

tracer = Tracer()

async def main():
    async with tracer.run() as run:
        client = AnthropicInstrumentor(run).instrument(AsyncAnthropic())
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": "What is 2 + 2?"}],
        )
        print(response.content[0].text)
    run.trace.print_summary()

asyncio.run(main())

3. Run.

export ANTHROPIC_API_KEY=sk-ant-...
python agent.py

You get a Rich-formatted summary on stdout, plus a directory at .traceforge/runs/<ulid>-<run-name>/ containing manifest.json, run.jsonl, and a self-contained report.html.

Library-only API (no instrumentor)

async with tracer.run() as run:
    run.record_llm_call(
        provider="anthropic",
        model="claude-haiku-4-5",
        messages=[...],
        response="...",
        input_tokens=12, output_tokens=4, latency_ms=180,
    )
    run.record_tool_call("search", tool_input={"q": "..."}, tool_output={"hits": 3})
    run.custom("phase.done", metadata={"step": 1})

run.trace.print_summary()

Manual recording is the lower-level API the instrumentors are built on. Useful when you don't want TraceForge anywhere near the SDK call.

Decorator sugar

@tracer.trace
async def my_agent(query, _run=None):
    _run.record_tool_call("search", {"q": query}, {"hits": 3})
    return "done"

await my_agent("hello")        # auto-saves trace to .traceforge/runs/
trace = tracer.last()
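
To illustrate how a decorator of this kind could inject the _run argument, here is a minimal sketch. SketchTracer and SketchRun are hypothetical stand-ins written for this example, not TraceForge classes; the real decorator also saves the trace to disk and may differ in other ways.

```python
import asyncio
import functools


class SketchRun:
    """Hypothetical minimal run object that only collects tool calls."""

    def __init__(self):
        self.spans = []

    def record_tool_call(self, name, tool_input, tool_output):
        self.spans.append(("tool", name, tool_input, tool_output))


class SketchTracer:
    """Hypothetical tracer; shows the injection pattern, nothing more."""

    def __init__(self):
        self._last = None

    def trace(self, fn):
        @functools.wraps(fn)  # keep the wrapped function's name and docstring
        async def wrapper(*args, **kwargs):
            run = SketchRun()
            try:
                # Pass the fresh run to the agent through the _run keyword.
                return await fn(*args, _run=run, **kwargs)
            finally:
                self._last = run  # keep the run even if the agent raised

        return wrapper

    def last(self):
        return self._last


tracer = SketchTracer()


@tracer.trace
async def my_agent(query, _run=None):
    _run.record_tool_call("search", {"q": query}, {"hits": 3})
    return "done"


result = asyncio.run(my_agent("hello"))
```

The caller never passes _run; the wrapper creates it, which is why the agent signature defaults it to None.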

Reports

tracer.run() writes three files per run to .traceforge/runs/<ulid>-<run-name>/:

.traceforge/runs/01KS8E...-true-elk/
├── report.html      ← open this
├── run.jsonl        ← replayable artifact (one span per line + manifest)
└── manifest.json    ← aggregate counts + cost + token totals
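
Because run.jsonl holds one JSON object per line, it can be inspected with the standard library alone. The sketch below is an illustration only: the field names type and cost_usd and the sample lines are assumptions made for this example, and the real span schema may differ.

```python
import io
import json
from collections import Counter

# Made-up sample lines; the real run.jsonl schema may use different fields.
SAMPLE = io.StringIO(
    '{"type": "llm", "cost_usd": 0.002}\n'
    '{"type": "tool"}\n'
    '{"type": "llm", "cost_usd": 0.001}\n'
)


def summarise(lines):
    """Count spans per type and sum any per-span cost found."""
    counts = Counter()
    total_cost = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        span = json.loads(line)
        counts[span.get("type", "unknown")] += 1
        total_cost += span.get("cost_usd", 0.0)
    return counts, total_cost


counts, total_cost = summarise(SAMPLE)
```

To run the same function over a file on disk, pass an open file handle instead of the StringIO object.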

Terminal: run.trace.print_summary() prints a Rich panel followed by a span tree.

HTML: self-contained, dark theme, no CDN. Open report.html in any browser (or via traceforge open <run-name>):

  • Stat strip at the top: duration, span count, LLM calls, tool calls, token totals, cost, errors
  • Span cards with type-coded left border (indigo LLM, cyan tool, slate custom, red errors)
  • Collapsible payloads for system prompts, message arrays, and tool I/O — kept folded by default so the page stays scannable
  • Per-span cost rendered inline for every LLM call

A live example sits at docs/example-report.html — open it directly to see the layout.


Replay

# llm-mock: LLM responses served from cache, tools execute live
result = await tracer.replay(trace, agent_fn, mode="llm-mock")

# dry-run: both LLM responses AND tool outputs served from cache, no network
result = await tracer.replay(trace, agent_fn, mode="dry-run")

result.print()
# Similarity: 100%  ·  Status: ALIGNED

The replay engine builds two interceptors keyed by SHA-256 of the original messages / tool inputs, and hands them to your agent function. Your agent consults the interceptor before calling out:

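async def my_agent(query, _run=None, _mock_llm=None, _mock_tool=None):
    messages = [{"role": "user", "content": query}]
    # During replay, serve the recorded response instead of calling the API.
    cached = _mock_llm.get(messages) if _mock_llm else None
    if cached is not None:
        return cached
    # Cache miss, or a normal (non-replay) run: make the live call.
    return await client.messages.create(...)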

The shipped instrumentors handle this for you — just pass mock_interceptor=_mock_llm when instrumenting:

client = AnthropicInstrumentor(_run, mock_interceptor=_mock_llm).instrument(AsyncAnthropic())
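
The SHA-256 keying described above can be sketched with the standard library. This is a minimal illustration under stated assumptions, not TraceForge's implementation: HashKeyedInterceptor and cache_key are hypothetical names, and canonicalising the payload through sorted-key JSON is one plausible choice among several.

```python
import hashlib
import json


def cache_key(payload):
    """Return a stable SHA-256 hex digest for a JSON-serialisable payload."""
    # sort_keys and fixed separators make the digest independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class HashKeyedInterceptor:
    """Hypothetical stand-in for a replay interceptor."""

    def __init__(self, recorded):
        # recorded: iterable of (request_payload, cached_response) pairs.
        self._cache = {cache_key(req): resp for req, resp in recorded}

    def get(self, payload):
        # Returns None on a miss so the caller can fall through to a live call.
        return self._cache.get(cache_key(payload))


messages = [{"role": "user", "content": "What is 2 + 2?"}]
mock_llm = HashKeyedInterceptor([(messages, "4")])

# Same content with a different key order still hits the cache.
hit = mock_llm.get([{"content": "What is 2 + 2?", "role": "user"}])
# A different prompt misses.
miss = mock_llm.get([{"role": "user", "content": "What is 3 + 3?"}])
```

A consequence of content-based keys is that any change to the prompt, however small, produces a miss. That is one likely reason a replay falls through to live calls.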

Similarity scoring. ReplayResult.similarity_score is the ratio of matching span types between original and replayed traces. Below 0.4 the replay is marked DIVERGED. See docs/replay-faq.md for why replays diverge and how to fix it.
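
As a rough illustration of a span-type ratio, the sketch below compares two sequences position by position and divides by the longer length. The exact matching rule TraceForge uses is not spelled out here, so treat span_type_similarity as one possible definition; only the 0.4 threshold comes from the text above.

```python
DIVERGED_BELOW = 0.4  # threshold stated in the README


def span_type_similarity(original, replayed):
    """Fraction of positions whose span types match, over the longer sequence."""
    longest = max(len(original), len(replayed))
    if longest == 0:
        return 1.0  # two empty traces are trivially identical
    matches = sum(1 for a, b in zip(original, replayed) if a == b)
    return matches / longest


def is_diverged(score):
    return score < DIVERGED_BELOW


original = ["llm", "tool", "llm", "custom"]

identical = span_type_similarity(original, ["llm", "tool", "llm", "custom"])
truncated = span_type_similarity(original, ["llm", "tool", "llm"])
drifted = span_type_similarity(original, ["llm", "error", "error", "error", "error"])
```

Here identical scores 1.0, truncated scores 0.75, and drifted scores 0.2, so only the last would be flagged as diverged under this definition.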


Cost tracking

Every LLM span gets a USD cost estimate attached automatically, looked up from a built-in pricing table for the major Anthropic and OpenAI models (with longest-prefix matching, so versioned IDs like claude-haiku-4-5-20251001 resolve correctly). Aggregates flow into manifest.total_cost_usd.

async with tracer.run() as run:
    await client.messages.create(...)  # instrumentor records cost

print(run.trace.manifest.total_cost_usd)  # → 0.0042

Override the table for negotiated contract pricing or private models:

from traceforge import Tracer
from traceforge.pricing import ModelPrice

tracer = Tracer(pricing={
    "my-internal-model": ModelPrice(input_per_million=0.5, output_per_million=1.5),
    "claude-opus-4-7":   ModelPrice(input_per_million=12.0, output_per_million=60.0),
})

Unknown models cost 0 and emit a one-shot warning so the trace still saves.
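
The lookup behaviour described in this section can be sketched as follows. This is an illustration under assumptions, not the library's code: the ModelPrice class here is a local stand-in that reuses the field names shown above, the model names and prices are invented, and lookup_price and estimate_cost_usd are hypothetical helpers.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    # Field names follow the README; this class is a local stand-in.
    input_per_million: float
    output_per_million: float


# Illustrative entries only, not real vendor rates.
PRICES = {
    "example-model": ModelPrice(1.0, 2.0),
    "example-model-large": ModelPrice(10.0, 20.0),
}

_warned = set()


def lookup_price(model, table=PRICES):
    """Return the entry whose key is the longest prefix of `model`, or None."""
    candidates = [key for key in table if model.startswith(key)]
    if not candidates:
        return None
    return table[max(candidates, key=len)]


def estimate_cost_usd(model, input_tokens, output_tokens, table=PRICES):
    price = lookup_price(model, table)
    if price is None:
        if model not in _warned:  # warn only once per unknown model
            _warned.add(model)
            warnings.warn(f"no price for model {model!r}; cost recorded as 0")
        return 0.0
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000
```

Picking the longest matching prefix means a dated ID such as example-model-large-20250101 resolves to example-model-large and not to the shorter example-model entry.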


Pytest plugin

Installing traceforge-llm auto-registers a pytest plugin through the pytest11 entry point. Three fixtures become available in any test suite:

import pytest

@pytest.mark.asyncio
async def test_agent_runs_under_budget(tracer, tf_assert, tf_snapshot):
    async with tracer.run() as run:
        await my_agent("hello", _run=run)

    tf_assert(
        run.trace,
        has_span="search",
        llm_calls=1,
        max_cost_usd=0.01,
        max_tokens=2000,
    )

    # Golden-trace snapshot — fails if the span-type sequence drifts.
    tf_snapshot.assert_match(run.trace, "agent_v1")

| Fixture | What it gives you |
| --- | --- |
| tracer | A non-auto-saving Tracer per test (no .traceforge/ cruft) |
| tf_assert(trace, ...) | One-line common assertions: has_span, has_span_type, no_errors, llm_calls, tool_calls, max_cost_usd, max_tokens, min_spans |
| tf_snapshot.assert_match(trace, name) | Records the trace on first run, then asserts span-type-sequence similarity ≥ 0.8 on every subsequent run |

Snapshots live in tests/__tf_snapshots__/<name>.jsonl by default — override with --tf-snapshot-dir. Commit them like you commit any other golden file. Refresh after intentional changes with:

pytest --tf-update-snapshots
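
The record-then-compare flow of snapshot testing can be sketched in a few lines. Everything below is a hypothetical illustration: the assert_match function, the one-field JSONL layout, and the positional similarity measure are assumptions, and only the 0.8 threshold and the <name>.jsonl file naming come from the text above.

```python
import json
import tempfile
from pathlib import Path


def similarity(a, b):
    """Positional match ratio over the longer sequence (an assumed measure)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def assert_match(span_types, name, snapshot_dir, threshold=0.8, update=False):
    path = Path(snapshot_dir) / f"{name}.jsonl"
    if update or not path.exists():
        # First run (or explicit refresh): write the golden file.
        path.parent.mkdir(parents=True, exist_ok=True)
        lines = [json.dumps({"type": t}) for t in span_types]
        path.write_text("\n".join(lines) + "\n")
        return "recorded"
    golden = [json.loads(line)["type"] for line in path.read_text().splitlines() if line]
    score = similarity(golden, span_types)
    assert score >= threshold, f"snapshot {name!r} drifted: similarity {score:.2f}"
    return "matched"


with tempfile.TemporaryDirectory() as tmp:
    first = assert_match(["llm", "tool", "llm"], "agent_v1", tmp)
    second = assert_match(["llm", "tool", "llm"], "agent_v1", tmp)
    try:
        assert_match(["tool", "tool", "tool"], "agent_v1", tmp)
        drift_detected = False
    except AssertionError:
        drift_detected = True
```

The update flag plays the role that --tf-update-snapshots plays on the command line: it overwrites the golden file instead of comparing against it.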

Instrumentors

| Provider | Class | Wraps |
| --- | --- | --- |
| Anthropic | traceforge.integrations.anthropic.AnthropicInstrumentor | client.messages.create |
| OpenAI | traceforge.integrations.openai.OpenAIInstrumentor | client.chat.completions.create |
| LangChain | traceforge.integrations.langchain.LangChainInstrumentor | manual record_chain_step / record_llm_step |

LangChain is intentionally manual — auto-patching is fragile across versions, so TraceForge ships a bridge helper you call from your callback handler.


CLI

| Command | Purpose |
| --- | --- |
| traceforge init | Scaffold traceforge.yaml, agent.py example, .gitignore entry |
| traceforge list | Table of local runs (newest first) |
| traceforge open <id> | Open a run's HTML report in your browser |
| traceforge show <id> | Print a run summary to the terminal |

<id> accepts a ULID prefix or the human-readable run name (brave-salmon).
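
Given the <ulid>-<run-name> directory layout, resolving an <id> could look like the sketch below. resolve_run is a hypothetical helper, the directory names are invented, and raising on ambiguous or missing matches is an assumption about sensible behaviour, not a description of what the CLI does.

```python
def resolve_run(ident, run_dirs):
    """Match `ident` against '<ulid>-<run-name>' directory names."""
    if not ident:
        raise LookupError("empty run identifier")
    matches = []
    for name in run_dirs:
        # ULIDs contain no hyphens, so the first hyphen separates the two parts.
        ulid, _, run_name = name.partition("-")
        if ident == run_name or ulid.startswith(ident.upper()):
            matches.append(name)
    if len(matches) != 1:
        raise LookupError(f"{ident!r} matched {len(matches)} runs")
    return matches[0]


# Invented directory names for illustration.
RUNS = [
    "01HZX3K8Q2M4N6P8R0T2V4W6Y8-true-elk",
    "01HZY7A1B2C3D4E5F6G7H8J9K0-brave-salmon",
]

by_name = resolve_run("brave-salmon", RUNS)
by_prefix = resolve_run("01hzx", RUNS)
```

The shared prefix 01HZ matches both sample runs, so this sketch rejects it as ambiguous.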


Non-goals

  • No auto-patching by default. Instrumentors are opt-in. Your code stays explicit about what's being traced.
  • No time-travel debugging. TraceForge records and replays; it does not pause your agent mid-flight.
  • No cloud storage. Traces live in .traceforge/runs/. Bring your own object store if you want central retention.
  • No built-in eval scoring. TraceForge captures the run; pair it with evalkit (or your own scorer) for grading.

Status

| Feature | Status |
| --- | --- |
| Async + sync context manager, @tracer.trace decorator | ✅ shipped |
| Anthropic / OpenAI instrumentors | ✅ shipped |
| LangChain bridge (manual) | ✅ shipped |
| File store: manifest.json + run.jsonl + report.html | ✅ shipped |
| Self-contained HTML report (no CDN) | ✅ shipped |
| LLM-mock replay | ✅ shipped |
| Dry-run replay (tool cache) | ✅ shipped |
| Cost tracking (per-span + manifest total) | ✅ shipped (v0.2) |
| Custom pricing tables | ✅ shipped (v0.2) |
| Pytest plugin (tracer, tf_assert) | ✅ shipped (v0.2) |
| Trace snapshot testing (tf_snapshot) | ✅ shipped (v0.2) |
| Streaming + tool-use in instrumentors | deferred |
| traceforge diff (span-level diff) | deferred |
| Slim mode (--slim) | deferred |
| LangGraph auto-instrumentation | manual only |
| Cloud storage backends | non-goal |

Track progress and propose features via GitHub Issues.


License

Apache 2.0.
