Test harness for AI agents that call tools. Record, replay, fuzz, and debug.
Project description
TraceForge
A test harness for AI agents that call tools.
If you're building agents with tool-calling (on Ollama, local models, etc.) and you're tired of staring at logs trying to figure out why your agent called the wrong tool or returned garbage — this is for you.
What it does
You write a YAML file describing what your agent should do. TraceForge runs it, records everything, and then lets you analyze the recordings without re-running the model.
name: calculator_agent
agent:
model: qwen2.5:7b-instruct
system_prompt: "You are a calculator assistant."
tools:
- name: calculate
description: "Perform a math calculation"
parameters:
type: object
properties:
expression: { type: string }
required: [expression]
mock_responses: [{ result: 42 }]
steps:
- user_message: "What is 6 times 7?"
expectations:
- type: tool_called
tool: calculate
- type: response_contains
values: ["42"]
$ traceforge run ./scenarios/ --runs 10
╭───────────────────── TraceForge Report ──────────────────────╮
│ SCENARIO PASS FAIL RATE CONSIST AVG MS │
│ OK calculator_agent 10/10 0/10 100% 1.00 1,059 │
│ XX multi_step_math 0/10 10/10 0% 1.00 3,598 │
│ OK simple_chat 10/10 0/10 100% 1.00 898 │
│ OK weather_agent 10/10 0/10 100% 1.00 1,246 │
│ │
│ OVERALL: 75.0% pass rate │
╰──────────────────────────────────────────────────────────────╯
The idea
Running an LLM is expensive and slow. But once you have a recording of what it did, you can re-evaluate it instantly, fuzz it, minimize it, and analyze it — all offline.
TraceForge records every agent run as an immutable, content-addressed trace (SHA-256 hashed). Then it gives you tools to work with those traces:
- Replay — re-evaluate a trace with different expectations, no model needed
- Fuzz — mutate tool responses (nulls, type swaps, empty strings) and see what breaks your agent
- MinRepro — your agent runs 4 steps and fails; delta debugging finds the 1 step that actually matters
- Mine — automatically discover behavioral rules from passing traces ("calculate is always called at step 0", "expression is always non-empty")
- Attribute — when something fails, run counterfactual experiments to find out why ("the agent is sensitive to tool output values, not format")
Install
pip install traceforge
Or from source:
git clone https://github.com/AbhimanyuBhagwati/TraceForge.git
cd TraceForge
pip install -e ".[dev]"
You'll need Ollama running locally with a model pulled:
ollama pull qwen2.5:7b-instruct
Quick start
# Create example scenarios
traceforge init
# Run them
traceforge run ./examples/scenarios/ --runs 5
# See what you've got
traceforge traces
traceforge info
# Replay a trace offline (no model call)
traceforge replay <trace-id>
# Fuzz tool responses
traceforge fuzz ./examples/scenarios/
# Find minimal failing case
traceforge minrepro <failing-trace-id> --scenario ./examples/scenarios/
# Discover behavioral patterns
traceforge mine calculator_agent -v
# Find root cause of failure
traceforge attribute <failing-trace-id> --scenario ./examples/scenarios/
How it works
YAML scenario
|
v
traceforge run -> traces (content-addressed, stored locally)
|
v
traceforge replay -> re-evaluate offline
traceforge fuzz -> break tool responses, find fragility
traceforge minrepro -> shrink failing trace to minimal case
traceforge mine -> discover behavioral rules from traces
traceforge attribute -> counterfactual analysis of failures
|
v
CLI output / HTML report / JSON export
Everything after run works on stored traces. Run the model once, analyze as many times as you want.
Expectations
10 built-in expectation types you can use in your YAML:
| Type | What it checks |
|---|---|
tool_called |
Agent called this tool |
tool_not_called |
Agent didn't call this tool |
tool_args_contain |
Tool was called with these arguments |
response_contains |
Agent's response includes these strings |
response_not_contains |
Agent's response doesn't include this |
response_matches_regex |
Response matches a regex |
llm_judge |
Another LLM evaluates the response |
latency_under |
Step completed within N ms |
no_tool_errors |
No tool calls returned errors |
tool_call_count |
Tool was called exactly/at least/at most N times |
Invariant mining
Instead of writing expectations by hand, let TraceForge figure them out:
$ traceforge mine calculator_agent -v
╭────────────── Invariant Mining Report ───────────────╮
│ Traces analyzed: 15 (15 passing, 0 failing) │
│ Invariants discovered: 5 │
│ │
│ - 'calculate' is always called at step 0 │
│ - 'calculate' is called 1-5 times per run │
│ - 'calculate.expression' is always non-empty │
│ - Step 0 response length is 30-48 chars │
│ - Step 0 latency is under 3916ms │
╰──────────────────────────────────────────────────────╯
Run enough traces and the miner will find rules that hold in all passing traces but break in failing ones. Those are your bugs.
Causal attribution
When a trace fails, TraceForge can run counterfactual experiments — change one thing at a time, re-run the agent, and see what flips the outcome.
$ traceforge attribute <trace-id> --scenario ./scenarios/
╭────────────── Causal Attribution Report ─────────────╮
│ Failing step: 2 | Interventions: 23 | Flips: 7 │
│ │
│ CAUSAL FACTOR SENSITIVITY │
│ tool_output_value 40% │
│ tool_output_format 0% │
│ system_prompt_clause 0% │
╰──────────────────────────────────────────────────────╯
"40% of value changes flipped the outcome. Format and prompt don't matter." Now you know where to look.
Requirements
- Python 3.12+
- Ollama running locally
- A pulled model (tested with
qwen2.5:7b-instruct)
Tests
pytest tests/ -v
183 tests, runs in about a second.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file traceforge-0.2.0.tar.gz.
File metadata
- Download URL: traceforge-0.2.0.tar.gz
- Upload date:
- Size: 57.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
446e211643bc9866b423d2d023609d2e9119aaf4c009426397c4b040fabfc343
|
|
| MD5 |
317d23b226b3c05004ed11d468a8afb2
|
|
| BLAKE2b-256 |
d868ab1ee3af1f24508a41c738723b5d835e4d2b87868d60d5f3e5d7eae904ac
|
File details
Details for the file traceforge-0.2.0-py3-none-any.whl.
File metadata
- Download URL: traceforge-0.2.0-py3-none-any.whl
- Upload date:
- Size: 44.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c080dfb843bea76167deec13b6e0b2de2938ffa19a42f902e7d1a4c78b6b2dac
|
|
| MD5 |
f9936aa861ed8f3241a13a24807672e4
|
|
| BLAKE2b-256 |
40eeb1bdef96c57e1b3bd7d871503ecfa60746ece69b3d9bf1934492c6424cb4
|