Skip to main content

Test harness for AI agents that call tools. Record, replay, fuzz, and debug.

Project description

TraceForge

A test harness for AI agents that call tools.

If you're building agents with tool-calling (on Ollama, local models, etc.) and you're tired of staring at logs trying to figure out why your agent called the wrong tool or returned garbage — this is for you.

What it does

You write a YAML file describing what your agent should do. TraceForge runs it, records everything, and then lets you analyze the recordings without re-running the model.

name: calculator_agent
agent:
  model: qwen2.5:7b-instruct
  system_prompt: "You are a calculator assistant."
  tools:
    - name: calculate
      description: "Perform a math calculation"
      parameters:
        type: object
        properties:
          expression: { type: string }
        required: [expression]
      mock_responses: [{ result: 42 }]

steps:
  - user_message: "What is 6 times 7?"
    expectations:
      - type: tool_called
        tool: calculate
      - type: response_contains
        values: ["42"]
$ traceforge run ./scenarios/ --runs 10

╭───────────────────── TraceForge Report ──────────────────────╮
│ SCENARIO            PASS  FAIL  RATE  CONSIST  AVG MS       │
│ OK calculator_agent  10/10 0/10 100%    1.00    1,059       │
│ XX multi_step_math    0/10 10/10  0%    1.00    3,598       │
│ OK simple_chat       10/10 0/10 100%    1.00      898       │
│ OK weather_agent     10/10 0/10 100%    1.00    1,246       │
│                                                              │
│ OVERALL: 75.0% pass rate                                     │
╰──────────────────────────────────────────────────────────────╯

The idea

Running an LLM is expensive and slow. But once you have a recording of what it did, you can re-evaluate it instantly, fuzz it, minimize it, and analyze it — all offline.

TraceForge records every agent run as an immutable, content-addressed trace (SHA-256 hashed). Then it gives you tools to work with those traces:

  • Replay — re-evaluate a trace with different expectations, no model needed
  • Fuzz — mutate tool responses (nulls, type swaps, empty strings) and see what breaks your agent
  • MinRepro — your agent runs 4 steps and fails; delta debugging finds the 1 step that actually matters
  • Mine — automatically discover behavioral rules from passing traces ("calculate is always called at step 0", "expression is always non-empty")
  • Attribute — when something fails, run counterfactual experiments to find out why ("the agent is sensitive to tool output values, not format")

Install

pip install traceforge

Or from source:

git clone https://github.com/AbhimanyuBhagwati/TraceForge.git
cd TraceForge
pip install -e ".[dev]"

You'll need Ollama running locally with a model pulled:

ollama pull qwen2.5:7b-instruct

Quick start

# Create example scenarios
traceforge init

# Run them
traceforge run ./examples/scenarios/ --runs 5

# See what you've got
traceforge traces
traceforge info

# Replay a trace offline (no model call)
traceforge replay <trace-id>

# Fuzz tool responses
traceforge fuzz ./examples/scenarios/

# Find minimal failing case
traceforge minrepro <failing-trace-id> --scenario ./examples/scenarios/

# Discover behavioral patterns
traceforge mine calculator_agent -v

# Find root cause of failure
traceforge attribute <failing-trace-id> --scenario ./examples/scenarios/

How it works

YAML scenario
     |
     v
traceforge run         ->  traces (content-addressed, stored locally)
     |
     v
traceforge replay      ->  re-evaluate offline
traceforge fuzz        ->  break tool responses, find fragility
traceforge minrepro    ->  shrink failing trace to minimal case
traceforge mine        ->  discover behavioral rules from traces
traceforge attribute   ->  counterfactual analysis of failures
     |
     v
CLI output / HTML report / JSON export

Everything after run works on stored traces. Run the model once, analyze as many times as you want.

Expectations

10 built-in expectation types you can use in your YAML:

Type What it checks
tool_called Agent called this tool
tool_not_called Agent didn't call this tool
tool_args_contain Tool was called with these arguments
response_contains Agent's response includes these strings
response_not_contains Agent's response doesn't include this
response_matches_regex Response matches a regex
llm_judge Another LLM evaluates the response
latency_under Step completed within N ms
no_tool_errors No tool calls returned errors
tool_call_count Tool was called exactly/at least/at most N times

Invariant mining

Instead of writing expectations by hand, let TraceForge figure them out:

$ traceforge mine calculator_agent -v

╭────────────── Invariant Mining Report ───────────────╮
│ Traces analyzed: 15 (15 passing, 0 failing)          │
│ Invariants discovered: 5                             │
│                                                      │
│   - 'calculate' is always called at step 0           │
│   - 'calculate' is called 1-5 times per run          │
│   - 'calculate.expression' is always non-empty       │
│   - Step 0 response length is 30-48 chars            │
│   - Step 0 latency is under 3916ms                   │
╰──────────────────────────────────────────────────────╯

Run enough traces and the miner will find rules that hold in all passing traces but break in failing ones. Those are your bugs.

Causal attribution

When a trace fails, TraceForge can run counterfactual experiments — change one thing at a time, re-run the agent, and see what flips the outcome.

$ traceforge attribute <trace-id> --scenario ./scenarios/

╭────────────── Causal Attribution Report ─────────────╮
│ Failing step: 2 | Interventions: 23 | Flips: 7      │
│                                                      │
│  CAUSAL FACTOR          SENSITIVITY                  │
│  tool_output_value         40%                       │
│  tool_output_format         0%                       │
│  system_prompt_clause       0%                       │
╰──────────────────────────────────────────────────────╯

"40% of value changes flipped the outcome. Format and prompt don't matter." Now you know where to look.

Requirements

  • Python 3.12+
  • Ollama running locally
  • A pulled model (tested with qwen2.5:7b-instruct)

Tests

pytest tests/ -v

183 tests, runs in about a second.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

traceforge-0.2.0.tar.gz (57.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

traceforge-0.2.0-py3-none-any.whl (44.0 kB view details)

Uploaded Python 3

File details

Details for the file traceforge-0.2.0.tar.gz.

File metadata

  • Download URL: traceforge-0.2.0.tar.gz
  • Upload date:
  • Size: 57.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for traceforge-0.2.0.tar.gz
Algorithm Hash digest
SHA256 446e211643bc9866b423d2d023609d2e9119aaf4c009426397c4b040fabfc343
MD5 317d23b226b3c05004ed11d468a8afb2
BLAKE2b-256 d868ab1ee3af1f24508a41c738723b5d835e4d2b87868d60d5f3e5d7eae904ac

See more details on using hashes here.

File details

Details for the file traceforge-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: traceforge-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 44.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for traceforge-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c080dfb843bea76167deec13b6e0b2de2938ffa19a42f902e7d1a4c78b6b2dac
MD5 f9936aa861ed8f3241a13a24807672e4
BLAKE2b-256 40eeb1bdef96c57e1b3bd7d871503ecfa60746ece69b3d9bf1934492c6424cb4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page