
           __
      (___()'`;
      /,    /`
      \\"--\\

AgentHound

Sniff out every bug in your agent workflow.

License: MIT | Python: 3.9+


(Screenshots: the AgentHound step-through debugger and the stats dashboard.)

AgentHound is the pytest-native testing framework for AI agent workflows. It records real agent sessions, replays them deterministically, and lets you assert on behavior and correctness — all without making a single API call.

pip install agenthound            # Core framework
pip install agenthound[ui]        # + debug UI

The Problem

AI agents are the fastest-growing category in software. But they're nearly untestable with existing methods:

  • Non-deterministic — the same prompt produces different outputs every time
  • Multi-step — errors cascade through tool calls and decision chains
  • Expensive — every test run burns tokens and money
  • Slow — round-trips to LLM APIs add seconds per assertion

The result: 38% of organizations are piloting agents, but only 11% have them in production. The gap is testing.

The Solution

AgentHound brings the workflow developers already know — write test, make it pass, ship — to AI agents:

from agenthound import replay, expect

@replay("tests/fixtures/refund_flow.json")
def test_refund_agent(session):
    result = my_agent.run("I want to return order ORD-123")

    expect(session).tools_called(["lookup_order", "process_refund"])
    expect(session).completed_successfully()
    expect(result).contains("refund")

$ pytest tests/ -v
tests/test_refund.py::test_refund_agent PASSED        [100%]

============================== 1 passed in 0.02s ==============================

Zero API calls. Runs in milliseconds.

Quickstart

1. Record

Run your agent once and capture every LLM call, tool invocation, and token count:

from agenthound import record_session

with record_session("tests/fixtures/refund_flow.json") as session:
    result = my_agent.run("I want to return order ORD-123")
    session.tag("happy_path", "refund")

This writes a JSON fixture file. API keys are automatically redacted. Commit it to git alongside your tests.
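
If you want to confirm the redaction, the fixture is plain JSON, so a quick sanity check works without knowing the exact schema (this snippet is illustrative, not part of the AgentHound API):

# Make sure no API keys leaked into the committed fixture
with open("tests/fixtures/refund_flow.json") as f:
    raw = f.read()

assert "sk-" not in raw  # OpenAI-style keys start with "sk-"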

2. Replay

The @replay decorator intercepts all HTTP calls and serves the recorded responses. Your agent thinks it's talking to the real API:

from agenthound import replay, expect

@replay("tests/fixtures/refund_flow.json")
def test_refund_agent(session):
    result = my_agent.run("I want to return order ORD-123")

    expect(session).tools_called(["lookup_order", "process_refund"])
    expect(result).contains("refund")

3. Ship

pytest tests/ -v

No API keys in CI. No network calls. No flaky tests. Just deterministic, sub-second assertions.

Features

Mock LLM Responses

Don't want to record first? Define responses inline:

from agenthound import mock_llm, expect

@mock_llm(responses=[
    {"tool_call": "search", "args": {"q": "weather in SF"}},
    "It's 65F and sunny in San Francisco today.",
])
def test_weather_agent(session):
    result = my_agent.run("What's the weather?")

    expect(session).tools_called(["search"])
    expect(session).has_llm_calls(2)
    expect(result).contains("sunny")

Works with both OpenAI and Anthropic:

@mock_llm(responses=["Hello from Claude!"], provider="anthropic")
def test_with_claude(session):
    ...

Failure Injection

Test how your agent handles the real world — timeouts, rate limits, broken tools:

from agenthound import replay, inject_failure, expect

@replay("tests/fixtures/refund_flow.json")
@inject_failure(tool="process_refund", error="TimeoutError", at_call=1)
def test_handles_refund_timeout(session):
    result = my_agent.run("I want to return order ORD-123")

    # The agent should retry or degrade gracefully despite the injected timeout
    expect(session).completed_successfully()

Token Budgets

Prevent runaway loops and token bloat with first-class assertions:

@replay("tests/fixtures/research_pipeline.json")
def test_research_stays_within_budget(session):
    result = research_agent.run("Analyze the competitive landscape")

    expect(session).total_tokens_under(50000)   # Token budget
    expect(session).max_turns(10)               # Prevent runaway loops

Graduated Assertion Engine

Four layers of assertions. Most tests never need anything beyond Layer 3:

Layer       | What it checks    | Example
Schema      | Structure, counts | has_llm_calls(3), no_errors(), all_calls_have_usage()
Constraints | Budgets, limits   | total_tokens_under(5000), latency_under(3000), max_turns(5)
Trace       | Tool behavior     | tools_called(["search", "respond"]), tool_called_with("search", {"q": "test"})
Content     | Response text     | final_response_contains("refund"), final_response_matches(r"REF-\d+")

Chain them for readable, comprehensive assertions:

(
    expect(session)
    .has_llm_calls(3)
    .tools_called(["lookup_order", "process_refund"])
    .model_used("gpt-4o-mini")
    .no_errors()
    .final_response_contains("refund")
    .final_response_matches(r"REF-\d+")
)

Framework Support

AgentHound intercepts at the httpx transport level, so it works with any LLM SDK built on httpx — no framework-specific adapters needed:

Framework              | How it works
OpenAI SDK             | Intercepts openai.chat.completions.create()
Anthropic SDK          | Intercepts anthropic.messages.create()
LangGraph / LangChain  | Intercepts underlying SDK calls
Pydantic AI            | Intercepts underlying SDK calls
CrewAI                 | Intercepts underlying SDK calls
Any httpx-based client | Intercepted automatically
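
In practice this means you can test code that calls an SDK directly, with no AgentHound-specific wiring. A sketch (the fixture path and assertion are assumptions for illustration):

from agenthound import replay, expect
from openai import OpenAI  # any httpx-based SDK is intercepted the same way

@replay("tests/fixtures/direct_sdk.json")  # hypothetical fixture
def test_direct_sdk_call(session):
    client = OpenAI(api_key="unused-during-replay")
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "hi"}],
    )
    expect(session).has_llm_calls(1)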

CI/CD

AgentHound is a standard pytest plugin. It works everywhere pytest works:

# .github/workflows/test.yml
name: Agent Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - run: pytest tests/ -v
        # No API keys needed — tests replay from fixtures

API Reference

Recording

from agenthound import record_session

# Record all LLM calls within the block and save to a fixture file
with record_session("path/to/fixture.json", metadata={"env": "dev"}) as session:
    result = my_agent.run("prompt")
    session.tag("happy_path", "v2")

Replay

from agenthound import replay

# Replay a recorded fixture — all HTTP calls return recorded responses
@replay("path/to/fixture.json", strict=True)
def test_my_agent(session):
    result = my_agent.run("prompt")

Mocking

from agenthound import mock_llm, mock_tool

# Mock LLM responses with a sequence of responses
@mock_llm(responses=["Hello!", {"tool_call": "search", "args": {"q": "test"}}], provider="openai")
def test_with_mock(session):
    ...

# Mock a tool function by import path
@mock_tool("search", target="myapp.tools.search_fn", returns={"results": []})
def test_with_mocked_tool():
    ...

Failure Injection

from agenthound import inject_failure

# Inject an error at the Nth call to a specific tool
@replay("fixtures/session.json")
@inject_failure(tool="process_refund", error="TimeoutError", at_call=1)
def test_failure_handling(session):
    ...

Auto-Record

import agenthound

# Global: capture all LLM calls, split sessions by 2s idle timeout
agenthound.auto_record("sessions/", tags=["dev"], metadata={"env": "local"})
# ... run your agent ...
agenthound.stop_auto_record()

# Per-function: each call becomes a fixture
@agenthound.recorded("sessions/", tags=["support"])
def handle_request(user_input):
    return agent.run(user_input)

OTEL Import

from agenthound.importers.otel import import_otel_trace

# Convert an OTEL trace export into an AgentHound fixture
import_otel_trace("trace.json", "fixtures/prod-session.json")

Or from the command line:

agenthound-import otel trace.json fixtures/prod-session.json

Session Assertions

from agenthound import expect

# Schema (Layer 1)
expect(session).has_llm_calls(3)
expect(session).has_llm_calls_between(1, 5)
expect(session).no_errors()
expect(session).all_calls_have_usage()

# Constraints (Layer 2)
expect(session).total_tokens_under(5000)
expect(session).latency_under(3000)
expect(session).max_turns(5)

# Trace (Layer 3)
expect(session).tools_called(["search", "respond"])
expect(session).tools_called_unordered({"search", "respond"})
expect(session).tool_called("search", times=2)
expect(session).tool_called_with("search", {"q": "test"})
expect(session).tool_sequence(["search", "respond"])
expect(session).no_tool_errors()
expect(session).model_used("gpt-4o-mini")
expect(session).completed_successfully()

# Content (Layer 4)
expect(session).final_response_contains("refund")
expect(session).any_response_contains("order")
expect(session).final_response_matches(r"REF-\d+")

Result Assertions

expect(result).contains("refund")
expect(result).matches(r"REF-\d+")
expect(result).equals("expected value")
expect(result).is_type(str)
expect(result).has_field("status", "success")

Session Properties

session.llm_calls          # List[LLMCall] — all LLM calls in order
session.tools_called       # List[str] — ordered tool names
session.tool_retries       # Dict[str, int] — call count per tool
session.total_tokens       # int — total tokens across all calls
session.total_duration_ms  # float — total wall-clock time
session.tags               # List[str] — tags from recording
session.metadata           # Dict — metadata from recording
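
These properties make custom checks straightforward when the fluent assertions don't cover a case. A sketch, reusing my_agent and the fixture from the earlier examples:

from agenthound import replay

@replay("tests/fixtures/refund_flow.json")
def test_custom_invariants(session):
    my_agent.run("I want to return order ORD-123")

    # Raw properties support any check the fluent API doesn't provide
    assert session.tools_called[0] == "lookup_order"  # lookup must happen first
    assert all(n <= 2 for n in session.tool_retries.values())  # no tool retried more than twice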

pytest CLI Options

pytest --agenthound-record          # Run in recording mode (real API calls)
pytest --agenthound-update          # Re-record existing fixtures
pytest --agenthound-fixtures-dir    # Set fixtures directory (default: tests/fixtures)

Beyond Tests: Auto-Record and Live Debug

AgentHound isn't just for test suites. You can record and debug agent sessions during development and in production.

Auto-Record

Automatically capture every agent interaction without changing your code:

import agenthound

# Enable global auto-recording — every LLM call is captured
agenthound.auto_record("sessions/")

# Your existing code runs unchanged
result = my_agent.run("Hello")          # -> sessions/2026-03-21T10-00-00_001.json
result = my_agent.run("Return ORD-123") # -> sessions/2026-03-21T10-00-05_002.json

# Disable when done
agenthound.stop_auto_record()

Sessions are split automatically: if no API call happens for 2+ seconds, the current session is flushed to a file and a new one starts.
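
The splitting heuristic is simple to picture. A minimal sketch of the idea (names and structure are illustrative, not AgentHound's internals):

import time

IDLE_TIMEOUT_S = 2.0

class SessionSplitter:
    """Groups intercepted calls into sessions, starting a new one after an idle gap."""

    def __init__(self):
        self.sessions = [[]]
        self.last_call_at = None

    def add(self, call):
        now = time.monotonic()
        if self.last_call_at is not None and now - self.last_call_at >= IDLE_TIMEOUT_S:
            self.sessions.append([])  # 2+ seconds idle: flush and start a new session
        self.sessions[-1].append(call)
        self.last_call_at = now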

@recorded Decorator

For more control, decorate specific functions so each invocation saves a fixture:

import agenthound

@agenthound.recorded("sessions/", tags=["support"])
def handle_support_request(user_input):
    return agent.run(user_input)

# Each call saves a separate fixture
handle_support_request("Return order ORD-100")
handle_support_request("Where is my order?")

Debug UI

A local web UI for stepping through recorded sessions:

pip install agenthound[ui]
agenthound-ui --fixtures-dir sessions/
# Open http://127.0.0.1:7600

Features:

  • Fixture browser — see all recorded sessions with tags, tokens, and step count
  • Step-through debugger — click through each LLM call and tool invocation
  • Step inspector — see model, tokens, tool arguments, and response text at each step
  • Stats dashboard — aggregate totals across all sessions: tokens, models, tags, providers
  • Keyboard navigation — arrow keys to step forward/back
  • Live mode — toggle live updates to see new sessions appear in real-time as your agent runs

Live Proxy Mode

The debug UI includes a built-in HTTP proxy that intercepts live LLM calls from your running application — no code changes required. Point your app's SDK at the proxy, and every API call is forwarded to the real provider, recorded as a fixture, and appears in the UI in real-time.

1. Start the UI (the proxy is included automatically):

agenthound-ui --fixtures-dir sessions/ --port 7600

2. Point your app at the proxy:

The proxy lives at http://127.0.0.1:7600/proxy. Set your SDK's base URL to route through it:

# Anthropic SDK
export ANTHROPIC_BASE_URL=http://127.0.0.1:7600/proxy

# OpenAI SDK
export OPENAI_BASE_URL=http://127.0.0.1:7600/proxy

Or set it in code:

from anthropic import Anthropic
client = Anthropic(base_url="http://127.0.0.1:7600/proxy")

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:7600/proxy")

3. Use your app normally. Every LLM call flows through the proxy to the real API. Responses are returned unchanged. Each group of calls (separated by 5 seconds of idle time) is saved as a fixture file and appears live in the UI.

Docker: If your app runs in Docker, use host.docker.internal to reach the proxy on the host:

ANTHROPIC_BASE_URL=http://host.docker.internal:7600/proxy

The proxy is transparent — your app behaves exactly as before, but you get full visibility into every prompt, response, and token count.
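
The forward-and-record pattern itself is small. A minimal sketch (illustrative only, not AgentHound's implementation; it assumes FastAPI and an OpenAI upstream):

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()
UPSTREAM = "https://api.openai.com"  # assumed upstream for this sketch

@app.api_route("/proxy/{path:path}", methods=["GET", "POST"])
async def forward(path: str, request: Request) -> Response:
    body = await request.body()
    headers = {k: v for k, v in request.headers.items()
               if k.lower() not in ("host", "content-length")}
    async with httpx.AsyncClient() as client:
        upstream = await client.request(request.method, f"{UPSTREAM}/{path}",
                                        content=body, headers=headers)
    # A recorder would buffer this (request, response) pair here and flush
    # the group to a fixture file after the idle timeout.
    return Response(content=upstream.content,
                    status_code=upstream.status_code,
                    media_type=upstream.headers.get("content-type"))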

Import Production Traces

Convert OpenTelemetry traces from Langfuse, Jaeger, or any OTEL-compatible tool into AgentHound fixtures, then debug them locally:

agenthound-import otel trace-export.json fixture.json
agenthound-ui --fixtures-dir .

Or use the Python API:

from agenthound.importers.otel import import_otel_trace

import_otel_trace("trace-export.json", "fixtures/prod-session.json")

How It Works

AgentHound operates at the httpx transport layer — the same HTTP client used internally by both the OpenAI and Anthropic Python SDKs.

Recording: A custom httpx.BaseTransport wraps the real transport. Every HTTP request and response passes through unchanged, but gets captured into a structured log. On exit, the log is serialized to a JSON fixture file with auth headers automatically redacted.

Replay: A different custom transport serves pre-recorded responses in sequence. The Nth HTTP call gets the Nth recorded response. Your agent's code runs exactly as it would in production — it has no idea it's talking to a replay.

Assertions: The fixture contains two layers of data. The raw HTTP layer (used by replay) and a semantic layer with parsed LLM calls, tool invocations, and token counts (used by assertions). This separation keeps replay faithful and assertions ergonomic.

Your Agent Code
      |
      v
  SDK (OpenAI / Anthropic)
      |
      v
  httpx.Client
      |
      v
  AgentHound Transport  <-- intercepts here
      |
      v
  Real API (recording) or Fixture (replay)
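
For intuition, here is a minimal sketch of both transports (illustrative only; AgentHound's real implementation and fixture schema will differ):

import httpx

class RecordingTransport(httpx.BaseTransport):
    """Wraps the real transport; every exchange passes through and is logged."""

    REDACT = ("authorization", "x-api-key")

    def __init__(self, inner: httpx.BaseTransport):
        self.inner = inner
        self.entries = []

    def handle_request(self, request: httpx.Request) -> httpx.Response:
        response = self.inner.handle_request(request)
        body = response.read()  # buffer the body so it can be stored and re-served
        self.entries.append({
            "url": str(request.url),
            "headers": {k: ("<redacted>" if k.lower() in self.REDACT else v)
                        for k, v in request.headers.items()},
            "status": response.status_code,
            "body": body.decode("utf-8", errors="replace"),
        })
        return httpx.Response(response.status_code,
                              headers={"content-type": response.headers.get("content-type", "application/json")},
                              content=body)

class ReplayTransport(httpx.BaseTransport):
    """Serves recorded responses in order: the Nth request gets the Nth response."""

    def __init__(self, entries):
        self._queue = iter(entries)

    def handle_request(self, request: httpx.Request) -> httpx.Response:
        entry = next(self._queue)  # a strict replay would error if the fixture runs out
        return httpx.Response(entry["status"],
                              headers={"content-type": "application/json"},
                              content=entry["body"].encode())

# recording = RecordingTransport(httpx.HTTPTransport())
# client = httpx.Client(transport=recording)  # SDKs accept http_client=... to use this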

Contributing

git clone https://github.com/martinwells/agenthound.git
cd agenthound
pip install -e ".[dev]"
pytest tests/ -v

License

MIT
