Trajectory-based CI testing for AI agents

traceix

traceix lets you declare which tools your agent should call — and in what order — as plain YAML, then run those assertions in CI the same way you'd run unit tests. No LLM-as-judge, no flaky eval pipelines: if the trajectory doesn't match, the build fails.

pip install traceix

Why traceix?

LLM-powered agents are non-deterministic. The same prompt might call search_flights then confirm_booking today, but skip straight to confirm_booking tomorrow. traceix makes that observable and enforceable:

  • Declare the expected trajectory in YAML — which tools, in what order, with what args.
  • Run it in CI — the agent still calls the real LLM; only the tool responses are mocked.
  • Get a clear pass/fail — no prompt-engineering an evaluator, no statistical thresholds.

Quick start

pip install traceix
traceix init          # detects your framework, scaffolds traceix.yaml + tests/example.yaml

Write a test (tests/book_flight.yaml):

name: book-flight-basic
input: "Book me the cheapest flight from NYC to SFO"

mocks:
  search_flights:
    return: { flights: [{ id: F1, price: 390 }, { id: F2, price: 420 }] }
  confirm_booking:
    return: { booking_id: BK-001, status: confirmed }

expected:
  trajectory:
    mode: contains        # these steps must appear in order (others allowed)
    steps:
      - tool: search_flights
        args: { origin: NYC, destination: SFO }
        arg_mode: partial  # only check the keys listed above
      - tool: confirm_booking
        arg_mode: ignore
  forbidden_tools: [cancel_booking]
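As a rough illustration of what the forbidden_tools line asserts, here is a minimal sketch in Python. The function name forbidden_calls is hypothetical and is not part of traceix; it only shows the kind of check the YAML describes, applied to a list of tool names the agent actually called.

```python
def forbidden_calls(actual: list[str], forbidden: list[str]) -> list[str]:
    """Return the forbidden tools that were actually called, in call order."""
    banned = set(forbidden)
    return [tool for tool in actual if tool in banned]


# A run that books and then cancels would be flagged.
violations = forbidden_calls(
    ["search_flights", "confirm_booking", "cancel_booking"],
    ["cancel_booking"],
)
# A clean run produces no violations.
clean = forbidden_calls(["search_flights", "confirm_booking"], ["cancel_booking"])
```

A test with a non-empty violations list would presumably fail regardless of whether the trajectory steps matched.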

Run it:

traceix run tests/ --handler mypackage.agent:run
  ✓  book-flight-basic   1/1   2 steps   142ms
  ──────────────────────────────────────────────
  1 passed · 0 failed

Integration

@traceix_tool decorator (LangChain / LangGraph — recommended)

Add @traceix_tool above your @tool decorators. Your agent handler stays completely unchanged — traceix patches mocks in during test runs without touching it:

# mypackage/tools.py
from langchain_core.tools import tool
from traceix import traceix_tool

@traceix_tool   # ← add this line; nothing else changes
@tool
def search_flights(origin: str, destination: str) -> dict:
    """Search available flights."""
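    ...  # real implementation

# mypackage/agent.py: no traceix imports, no changes needed
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent
from mypackage.tools import confirm_booking, search_flights  # confirm_booking is assumed to be defined like search_flights

def run(user_input: str) -> str:
    # `model` is your chat model instance, created elsewhere
    graph = create_react_agent(model, [search_flights, confirm_booking])
    result = graph.invoke({"messages": [HumanMessage(content=user_input)]})
    return result["messages"][-1].content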
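To make the idea of "patching mocks in" more concrete, here is a minimal sketch of how a decorator of this kind could work. It is not traceix's implementation: mockable_tool, ACTIVE_MOCKS and TRAJECTORY are hypothetical names, and the sketch wraps a plain function rather than a LangChain tool object so that it stays self-contained.

```python
import functools
from typing import Any, Callable

# Hypothetical test-time state; these names are illustrative, not traceix internals.
ACTIVE_MOCKS: dict[str, Any] = {}        # tool name -> canned return value
TRAJECTORY: list[tuple[str, dict]] = []  # recorded (tool name, kwargs) calls


def mockable_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record each call; return the canned value if a mock is active, else call the tool."""

    @functools.wraps(func)
    def wrapper(**kwargs: Any) -> Any:
        TRAJECTORY.append((func.__name__, kwargs))
        if func.__name__ in ACTIVE_MOCKS:
            return ACTIVE_MOCKS[func.__name__]
        return func(**kwargs)

    return wrapper


@mockable_tool
def search_flights(origin: str, destination: str) -> dict:
    """Stand-in for a real implementation that would hit a flight API."""
    raise RuntimeError("the real API should not be reached in tests")


# During a test run, the runner would activate the mock before invoking the agent.
ACTIVE_MOCKS["search_flights"] = {"flights": [{"id": "F1", "price": 390}]}
result = search_flights(origin="NYC", destination="SFO")
```

The point of this design is that the agent code never learns it is under test: the tool keeps its name and signature, and only its return value is swapped.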

tools= parameter (any framework)

An alternative for frameworks where @traceix_tool isn't available. You don't write any mock code — traceix builds callable tool objects from the mocks: section in your YAML and passes them as the tools list. Your handler just accepts and forwards them:

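# mypackage/agent.py
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent

def run(input: str, tools: list) -> str:  # ← accept the injected tools
    # `model` is your chat model instance, created elsewhere
    graph = create_react_agent(model, tools)  # ← forward them to the agent
    result = graph.invoke({"messages": [HumanMessage(content=input)]})
    return result["messages"][-1].content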

Given this YAML:

mocks:
  search_flights:
    return: { flights: [{ id: F1, price: 390 }] }
  confirm_booking:
    return: { booking_id: BK-001, status: confirmed }

traceix calls your handler as run(input="...", tools=[<mocked search_flights>, <mocked confirm_booking>]). The agent calls those tools, they return what the YAML says, and traceix records the trajectory.
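The following sketch shows one way callables like these could be built from a parsed mocks: mapping. It is an illustration under assumptions, not traceix's code: build_mock_tools is a hypothetical helper, the tools are plain Python functions rather than framework tool objects, and flight_id is an invented argument name.

```python
from typing import Any, Callable


def build_mock_tools(mocks: dict[str, dict], trajectory: list) -> list[Callable[..., Any]]:
    """Turn a parsed `mocks:` mapping into recording callables (illustrative helper)."""

    def make(name: str, value: Any) -> Callable[..., Any]:
        def tool(**kwargs: Any) -> Any:
            trajectory.append({"tool": name, "args": kwargs})
            return value  # always the canned value from the YAML

        tool.__name__ = name  # keep the tool's name visible to whatever calls it
        return tool

    return [make(name, spec["return"]) for name, spec in mocks.items()]


# The `mocks:` section above, as it would look after YAML parsing.
mocks = {
    "search_flights": {"return": {"flights": [{"id": "F1", "price": 390}]}},
    "confirm_booking": {"return": {"booking_id": "BK-001", "status": "confirmed"}},
}

trajectory: list[dict] = []
search, confirm = build_mock_tools(mocks, trajectory)
search(origin="NYC", destination="SFO")
booking = confirm(flight_id="F1")  # `flight_id` is a made-up argument name
```

After the two calls, trajectory holds one entry per tool call in call order, which is the raw material the expected: section is checked against.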

traceix auto-detects which mode you're using based on whether your handler has a tools parameter.
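Detection of this kind can be done with the standard library's inspect module. The sketch below is an assumption about the general approach, not a description of traceix internals; accepts_tools and the two handlers are hypothetical names.

```python
import inspect
from typing import Callable


def accepts_tools(handler: Callable) -> bool:
    """True if the handler declares a parameter named `tools`."""
    return "tools" in inspect.signature(handler).parameters


def run_with_decorator(user_input: str) -> str:
    return "decorator mode"


def run_with_injection(input: str, tools: list) -> str:
    return "injection mode"


injection = accepts_tools(run_with_injection)
decorator = accepts_tools(run_with_decorator)
```

Here injection is True and decorator is False, so a runner could choose between passing tools= and relying on decorated tools.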


CLI commands

Command                                               What it does
traceix init                                          Detect framework, scaffold traceix.yaml + example test
traceix run tests/                                    Run tests, exit 0 on pass / 1 on fail
traceix run tests/ --fixture record                   Save real tool responses to .traceix/fixtures/
traceix run tests/ --fixture replay                   Replay recorded responses in CI
traceix snapshot tests/                               Save golden trajectory baselines
traceix check tests/                                  Compare live run against saved baselines
traceix compare tests/ --a "model=X" --b "model=Y"    A/B test two model configs side by side

Trajectory modes

mode controls how the expected steps are matched against the agent's actual tool calls. Set it under expected.trajectory.mode.

contains — listed steps must appear in order, but the agent can call other tools in between. Most permissive and the most common choice:

# passes for: search → clarify → confirm  (extra "clarify" step is fine)
mode: contains
steps:
  - tool: search_flights
  - tool: confirm_booking
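In algorithmic terms, contains is an ordered-subsequence check. A minimal sketch over tool names only (arguments ignored); contains_in_order is a hypothetical name, not a traceix function:

```python
def contains_in_order(expected: list[str], actual: list[str]) -> bool:
    """True if `expected` is a subsequence of `actual` (order kept, gaps allowed)."""
    calls = iter(actual)
    # `step in calls` advances the shared iterator, so each match must
    # come after the previous one.
    return all(step in calls for step in expected)


steps = ["search_flights", "confirm_booking"]
with_extra = contains_in_order(steps, ["search_flights", "clarify", "confirm_booking"])
reordered = contains_in_order(steps, ["confirm_booking", "search_flights"])
```

with_extra is True because the extra clarify call is skipped over; reordered is False because confirm_booking comes before search_flights.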

strict — the agent must call exactly these tools, in exactly this order, nothing more:

# fails if the agent calls any extra tool or reorders steps
mode: strict
steps:
  - tool: search_flights
  - tool: confirm_booking

unordered — all listed tools must be called, but order doesn't matter:

# passes whether the agent searches flights before or after searching hotels
mode: unordered
steps:
  - tool: search_flights
  - tool: search_hotels
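A sketch of an order-insensitive check using collections.Counter. The text does not say whether unordered tolerates extra tool calls or how repeated tools are counted; this sketch assumes extras are allowed and that a tool listed twice must be called at least twice. called_in_any_order is a hypothetical name.

```python
from collections import Counter


def called_in_any_order(expected: list[str], actual: list[str]) -> bool:
    """True if every expected tool was called at least as often as it is listed."""
    # Counter subtraction keeps only positive counts, i.e. the calls still missing.
    missing = Counter(expected) - Counter(actual)
    return not missing


steps = ["search_flights", "search_hotels"]
hotels_first = called_in_any_order(steps, ["search_hotels", "search_flights"])
hotels_missing = called_in_any_order(steps, ["search_flights"])
```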

within — listed steps must appear as a contiguous block (no other tools in between), but can be preceded or followed by anything:

# passes for: login → search → confirm → logout
# fails if any tool appears between search and confirm
mode: within
steps:
  - tool: search_flights
  - tool: confirm_booking
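The within mode amounts to looking for the expected steps as a contiguous slice of the actual calls. A minimal sketch over tool names only; appears_contiguously is a hypothetical name:

```python
def appears_contiguously(expected: list[str], actual: list[str]) -> bool:
    """True if `expected` occurs in `actual` as one unbroken run."""
    n = len(expected)
    # Slide a window of length n over the actual calls and compare each slice.
    return any(actual[i:i + n] == expected for i in range(len(actual) - n + 1))


steps = ["search_flights", "confirm_booking"]
wrapped = appears_contiguously(
    steps, ["login", "search_flights", "confirm_booking", "logout"]
)
interrupted = appears_contiguously(
    steps, ["search_flights", "clarify", "confirm_booking"]
)
```

wrapped is True (the block is intact between login and logout) and interrupted is False (clarify splits the block).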

Arg modes

arg_mode controls how strictly the tool's arguments are checked. Set it per step under expected.trajectory.steps.

ignore — only assert the tool was called, don't check arguments at all:

- tool: complete_todo
  arg_mode: ignore

partial — assert only the keys you list; extra arguments the agent passes are fine:

- tool: add_todo
  args: { title: "Buy groceries" }
  arg_mode: partial   # passes even if agent also sent priority: "high"

exact — every argument must match and no extra keys are allowed. This is the default when arg_mode is omitted:

- tool: add_todo
  args: { title: "Buy groceries", priority: "medium" }
  arg_mode: exact     # fails if agent passes any other key

In practice: use ignore when you only care that a tool ran, partial when you want to pin one or two key arguments, and exact when the full payload matters (e.g. a payment or deletion).
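The three arg modes can be summarized as a small comparison function. This is a sketch of the semantics described above, not traceix's implementation; args_match is a hypothetical name, and how a step with no args key behaves under the default exact mode is not covered here because the text does not specify it.

```python
def args_match(expected: dict, actual: dict, arg_mode: str = "exact") -> bool:
    """Compare a step's expected args with the args the agent actually passed."""
    if arg_mode == "ignore":
        return True  # the call itself is all that matters
    if arg_mode == "partial":
        # Every listed key must be present with the same value; extras are fine.
        return all(key in actual and actual[key] == value
                   for key, value in expected.items())
    if arg_mode == "exact":
        return actual == expected  # same keys, same values, nothing extra
    raise ValueError(f"unknown arg_mode: {arg_mode!r}")


sent = {"title": "Buy groceries", "priority": "high"}
partial_ok = args_match({"title": "Buy groceries"}, sent, "partial")
exact_ok = args_match({"title": "Buy groceries"}, sent, "exact")
ignored = args_match({"title": "something else"}, sent, "ignore")
```

With the extra priority key, partial_ok is True while exact_ok is False, which is the practical difference between the two modes.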


Framework support

Framework        Integration
LangGraph        @traceix_tool decorator or tools= injection
CrewAI           tools= injection
Anthropic SDK    tools= injection
OpenAI SDK       tools= injection
Any other        tools= injection

Configuration

Set defaults in traceix.yaml (or [tool.traceix] in pyproject.toml):

handler: mypackage.agent:run
runs: 3          # runs per test case (increase in CI for confidence)
tolerance: 0.67  # fraction of runs that must pass
fixture_mode: replay
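It helps to work through how runs and tolerance interact. The sketch below assumes the pass rate is compared to tolerance with a plain "greater than or equal" and no rounding; the text does not state how traceix performs the comparison, so treat the result as something to verify against the tool. meets_tolerance is a hypothetical name.

```python
def meets_tolerance(passed: int, runs: int, tolerance: float) -> bool:
    """True if the fraction of passing runs reaches the tolerance (assumed >=)."""
    return passed / runs >= tolerance


two_of_three = meets_tolerance(2, 3, 0.67)    # 2/3 is about 0.6667
three_of_three = meets_tolerance(3, 3, 0.67)
two_of_three_lower = meets_tolerance(2, 3, 0.66)
```

Under this assumption, two passing runs out of three fall just short of 0.67, so a slightly lower value such as 0.66 would be needed if the intent is "at least two of three".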

License

MIT
