Test harness for AI agents that call tools. Record, replay, fuzz, and debug.

TraceForge

A test harness for AI agents that call tools.

If you're building agents with tool-calling (on Ollama, local models, etc.) and you're tired of staring at logs trying to figure out why your agent called the wrong tool or returned garbage — this is for you.

What it does

You write a YAML file describing what your agent should do. TraceForge runs it, records everything, and then lets you analyze the recordings without re-running the model.

name: calculator_agent
agent:
  model: qwen2.5:7b-instruct
  system_prompt: "You are a calculator assistant."
  tools:
    - name: calculate
      description: "Perform a math calculation"
      parameters:
        type: object
        properties:
          expression: { type: string }
        required: [expression]
      mock_responses: [{ result: 42 }]

steps:
  - user_message: "What is 6 times 7?"
    expectations:
      - type: tool_called
        tool: calculate
      - type: response_contains
        values: ["42"]

$ traceforge run ./scenarios/ --runs 10

╭───────────────────── TraceForge Report ──────────────────────╮
│ SCENARIO            PASS  FAIL  RATE  CONSIST  AVG MS       │
│ OK calculator_agent  10/10 0/10 100%    1.00    1,059       │
│ XX multi_step_math    0/10 10/10  0%    1.00    3,598       │
│ OK simple_chat       10/10 0/10 100%    1.00      898       │
│ OK weather_agent     10/10 0/10 100%    1.00    1,246       │
│                                                              │
│ OVERALL: 75.0% pass rate                                     │
╰──────────────────────────────────────────────────────────────╯

The idea

Running an LLM is expensive and slow. But once you have a recording of what it did, you can re-evaluate it instantly, fuzz it, minimize it, and analyze it — all offline.

TraceForge records every agent run as an immutable, content-addressed trace (SHA-256 hashed). Then it gives you tools to work with those traces:

Replay — re-evaluate a trace with different expectations, no model needed
Fuzz — mutate tool responses (nulls, type swaps, empty strings) and see what breaks your agent
MinRepro — your agent runs 4 steps and fails; delta debugging finds the 1 step that actually matters
Mine — automatically discover behavioral rules from passing traces ("calculate is always called at step 0", "expression is always non-empty")
Attribute — when something fails, run counterfactual experiments to find out why ("the agent is sensitive to tool output values, not format")
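
The MinRepro idea can be illustrated with a greedy, one-step-at-a-time reduction in the spirit of delta debugging. This is a minimal sketch, not TraceForge's implementation and not the full ddmin algorithm: `minimize` and the `fails` predicate are hypothetical names, and in practice `fails` would replay or re-run the reduced scenario.

```python
from typing import Callable, List, Sequence


def minimize(steps: Sequence[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Drop one step at a time, keeping each drop that still reproduces the failure.

    Stops when no single remaining step can be removed without losing the failure.
    """
    current = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            # Never reduce to an empty scenario; keep the drop only if it still fails.
            if candidate and fails(candidate):
                current = candidate
                changed = True
                break
    return current


# Hypothetical failure condition: the run fails whenever this step is present.
steps = ["greet", "add 2 and 2", "divide by zero", "say goodbye"]
minimal = minimize(steps, lambda s: "divide by zero" in s)
print(minimal)
```

Each call to `fails` stands in for one evaluation of a reduced scenario, so the number of calls is what a real minimizer would try to keep small.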

Install

pip install traceforge

Or from source:

git clone https://github.com/AbhimanyuBhagwati/TraceForge.git
cd TraceForge
pip install -e ".[dev]"

You'll need Ollama running locally with a model pulled:

ollama pull qwen2.5:7b-instruct

Quick start

# Create example scenarios
traceforge init

# Run them
traceforge run ./examples/scenarios/ --runs 5

# See what you've got
traceforge traces
traceforge info

# Replay a trace offline (no model call)
traceforge replay <trace-id>

# Fuzz tool responses
traceforge fuzz ./examples/scenarios/

# Find minimal failing case
traceforge minrepro <failing-trace-id> --scenario ./examples/scenarios/

# Discover behavioral patterns
traceforge mine calculator_agent -v

# Find root cause of failure
traceforge attribute <failing-trace-id> --scenario ./examples/scenarios/
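
The fuzz step mutates tool responses with nulls, type swaps, and empty strings. A rough sketch of what such mutations could look like on a mock response follows; `mutate_response` and the label format are illustrative assumptions, not TraceForge's API.

```python
from typing import Any, Dict, Iterator, Tuple


def mutate_response(response: Dict[str, Any]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (label, mutated_copy) pairs, mutating one field at a time."""
    for key, value in response.items():
        # Null: replace the value with None.
        yield f"{key}:null", {**response, key: None}
        # Type swap: numbers become strings, anything else becomes the number 0.
        swapped = str(value) if isinstance(value, (int, float)) else 0
        yield f"{key}:type_swap", {**response, key: swapped}
        # Empty string.
        yield f"{key}:empty", {**response, key: ""}


# The mock response from the calculator scenario.
mutants = dict(mutate_response({"result": 42}))
for label, mutant in mutants.items():
    print(label, mutant)
```

Each mutant would then be fed to the agent in place of the original tool response to see whether the scenario's expectations still hold.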

How it works

YAML scenario
     |
     v
traceforge run         ->  traces (content-addressed, stored locally)
     |
     v
traceforge replay      ->  re-evaluate offline
traceforge fuzz        ->  break tool responses, find fragility
traceforge minrepro    ->  shrink failing trace to minimal case
traceforge mine        ->  discover behavioral rules from traces
traceforge attribute   ->  counterfactual analysis of failures
     |
     v
CLI output / HTML report / JSON export

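Everything after `run` starts from stored traces, so the model runs once and the recordings can be analyzed as often as needed. Replay needs no model at all; attribution re-runs the agent for its counterfactual experiments.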
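
To make "content-addressed" concrete: a trace's ID can be derived from its content, so identical recordings get the same ID. The sketch below assumes a trace is a JSON-serializable dict and hashes a canonical JSON encoding with SHA-256. The `trace_id` helper and the trace fields are hypothetical; TraceForge's actual serialization may differ.

```python
import hashlib
import json
from typing import Any, Dict


def trace_id(trace: Dict[str, Any]) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of the trace."""
    # Sorted keys and fixed separators make the encoding independent of key order.
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"scenario": "calculator_agent", "steps": [{"tool": "calculate", "result": 42}]}
b = {"steps": [{"result": 42, "tool": "calculate"}], "scenario": "calculator_agent"}
print(trace_id(a) == trace_id(b))  # same content, different key order
```

Because the ID is a function of the content, any change to a stored trace would produce a different ID, which is what makes immutability checkable.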

Expectations

10 built-in expectation types you can use in your YAML:

Type	What it checks
`tool_called`	Agent called this tool
`tool_not_called`	Agent didn't call this tool
`tool_args_contain`	Tool was called with these arguments
`response_contains`	Agent's response includes these strings
`response_not_contains`	Agent's response doesn't include this
`response_matches_regex`	Response matches a regex
`llm_judge`	Another LLM evaluates the response
`latency_under`	Step completed within N ms
`no_tool_errors`	No tool calls returned errors
`tool_call_count`	Tool was called exactly/at least/at most N times
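
As an illustration of how a few of these types could be checked against a recorded step without calling the model, here is a small sketch. The `RecordedStep` shape is hypothetical and is not TraceForge's trace schema. The `tool` and `values` keys come from the scenario YAML above, while `pattern`, `ms`, and the all-values-must-match rule are assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RecordedStep:
    """Hypothetical shape of one recorded step."""
    response: str
    tool_calls: List[str] = field(default_factory=list)
    latency_ms: float = 0.0


def check(step: RecordedStep, expectation: Dict[str, Any]) -> bool:
    """Evaluate one expectation against a recorded step."""
    kind = expectation["type"]
    if kind == "tool_called":
        return expectation["tool"] in step.tool_calls
    if kind == "tool_not_called":
        return expectation["tool"] not in step.tool_calls
    if kind == "response_contains":
        # Assumption: every listed value must appear in the response.
        return all(v in step.response for v in expectation["values"])
    if kind == "response_matches_regex":
        return re.search(expectation["pattern"], step.response) is not None
    if kind == "latency_under":
        return step.latency_ms < expectation["ms"]
    raise ValueError(f"unsupported expectation type: {kind}")


step = RecordedStep(response="6 times 7 is 42.", tool_calls=["calculate"], latency_ms=1059)
expectations = [
    {"type": "tool_called", "tool": "calculate"},
    {"type": "response_contains", "values": ["42"]},
    {"type": "latency_under", "ms": 2000},  # "ms" is an assumed key name
]
results = [check(step, e) for e in expectations]
print(results)
```

Since `check` only reads the recorded step, the same trace can be re-evaluated against a different list of expectations at no model cost.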

Invariant mining

Instead of writing expectations by hand, let TraceForge figure them out:

$ traceforge mine calculator_agent -v

╭────────────── Invariant Mining Report ───────────────╮
│ Traces analyzed: 15 (15 passing, 0 failing)          │
│ Invariants discovered: 5                             │
│                                                      │
│   - 'calculate' is always called at step 0           │
│   - 'calculate' is called 1-5 times per run          │
│   - 'calculate.expression' is always non-empty       │
│   - Step 0 response length is 30-48 chars            │
│   - Step 0 latency is under 3916ms                   │
╰──────────────────────────────────────────────────────╯

Run enough traces and the miner will find rules that hold in all passing traces but break in failing ones. Those are your bugs.
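
A toy version of the mining step might look like the following. It assumes a deliberately simplified trace shape (a list of steps, each a list of tool names) and derives only two kinds of rule. `mine_invariants` is a hypothetical name, and the real miner presumably covers more, such as argument, length, and latency rules.

```python
from typing import List

# Hypothetical trace shape: a list of steps, each step a list of tool names called.
Trace = List[List[str]]


def mine_invariants(passing: List[Trace]) -> List[str]:
    """Return simple rules that hold across every passing trace."""
    if not passing:
        return []
    rules: List[str] = []
    tools = {tool for trace in passing for step in trace for tool in step}
    for tool in sorted(tools):
        # Rule 1: the tool is called at step 0 in every trace.
        if all(trace and tool in trace[0] for trace in passing):
            rules.append(f"'{tool}' is always called at step 0")
        # Rule 2: observed range of calls per run.
        counts = [sum(step.count(tool) for step in trace) for trace in passing]
        rules.append(f"'{tool}' is called {min(counts)}-{max(counts)} times per run")
    return rules


passing_traces: List[Trace] = [
    [["calculate"]],
    [["calculate", "calculate"], ["calculate"]],
    [["calculate"], []],
]
for rule in mine_invariants(passing_traces):
    print(rule)
```

Rules mined this way are only as strong as the sample: a range like "1-3 times" reflects what was observed, not what the agent is guaranteed to do.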

Causal attribution

When a trace fails, TraceForge can run counterfactual experiments — change one thing at a time, re-run the agent, and see what flips the outcome.

$ traceforge attribute <trace-id> --scenario ./scenarios/

╭────────────── Causal Attribution Report ─────────────╮
│ Failing step: 2 | Interventions: 23 | Flips: 7      │
│                                                      │
│  CAUSAL FACTOR          SENSITIVITY                  │
│  tool_output_value         40%                       │
│  tool_output_format         0%                       │
│  system_prompt_clause       0%                       │
╰──────────────────────────────────────────────────────╯

"40% of value changes flipped the outcome. Format and prompt don't matter." Now you know where to look.

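Reading the report's sensitivity as "the share of interventions on a factor that flipped the outcome", the arithmetic can be sketched as below. This definition is inferred from the sentence above rather than from TraceForge's code, and the intervention log is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sensitivity(interventions: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each factor to the fraction of its interventions that flipped the outcome."""
    totals: Dict[str, int] = defaultdict(int)
    flips: Dict[str, int] = defaultdict(int)
    for factor, flipped in interventions:
        totals[factor] += 1
        flips[factor] += int(flipped)
    return {factor: flips[factor] / totals[factor] for factor in totals}


# Made-up intervention log: (factor, did the outcome flip?)
log = (
    [("tool_output_value", True)] * 2
    + [("tool_output_value", False)] * 3
    + [("tool_output_format", False)] * 4
)
scores = sensitivity(log)
print(scores)
```

A factor with a score near zero is one the outcome did not react to in these experiments, which narrows down where to look.
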
Requirements

Python 3.12+
Ollama running locally
A pulled model (tested with qwen2.5:7b-instruct)

Tests

pytest tests/ -v

183 tests; the suite runs in about a second.

License

MIT

Project details

This version

0.2.0

Feb 22, 2026

Download files

Source Distribution

traceforge-0.2.0.tar.gz (57.8 kB)

Uploaded Feb 22, 2026 Source

Built Distribution

traceforge-0.2.0-py3-none-any.whl (44.0 kB)

Uploaded Feb 22, 2026 Python 3

File details

Details for the file traceforge-0.2.0.tar.gz.

File metadata

Download URL: traceforge-0.2.0.tar.gz
Upload date: Feb 22, 2026
Size: 57.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for traceforge-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`446e211643bc9866b423d2d023609d2e9119aaf4c009426397c4b040fabfc343`
MD5	`317d23b226b3c05004ed11d468a8afb2`
BLAKE2b-256	`d868ab1ee3af1f24508a41c738723b5d835e4d2b87868d60d5f3e5d7eae904ac`

File details

Details for the file traceforge-0.2.0-py3-none-any.whl.

File metadata

Download URL: traceforge-0.2.0-py3-none-any.whl
Upload date: Feb 22, 2026
Size: 44.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for traceforge-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c080dfb843bea76167deec13b6e0b2de2938ffa19a42f902e7d1a4c78b6b2dac`
MD5	`f9936aa861ed8f3241a13a24807672e4`
BLAKE2b-256	`40eeb1bdef96c57e1b3bd7d871503ecfa60746ece69b3d9bf1934492c6424cb4`
