

Agentest

Universal testing and evaluation toolkit for AI agents.


Quick Start · Features · Evaluators · CLI · Architecture


Only 52% of teams shipping AI agents run evaluations. Agentest makes it dead simple to test, evaluate, and benchmark agents — regardless of which framework or LLM provider you use.

pip install agentest

Why Agentest?

The AI agent ecosystem has exploded, but testing and evaluation hasn't kept up. Most teams ship agents to production without systematic testing.

Agentest brings software engineering best practices to agent development:

  • Record & Replay — capture real agent sessions and replay them deterministically; no expensive LLM calls needed for testing
  • Tool Mocking — mock any tool call with a fluent, pytest-style API: .when(...).returns(...)
  • 7 Built-in Evaluators — grade agents on task completion, safety, cost, latency, tool usage, and more
  • Model Comparison — run the same tasks across Claude, GPT, and Gemini; compare pass rates, cost, and latency
  • MCP Server Testing — test MCP servers for protocol compliance and tool schema validation
  • pytest Plugin — drop-in integration with auto-registered fixtures and custom markers
  • CLI & Web UI — agentest evaluate, agentest replay, agentest summary, and a FastAPI dashboard

What Makes It Different

  • Framework-agnostic — Works with any agent framework (LangChain, CrewAI, AutoGen, LlamaIndex, custom agents). No vendor lock-in.
  • Auto-instrumentation — agentest.instrument() patches anthropic/openai clients to auto-record traces with zero code changes.
  • Native adapters — First-class integrations for LangChain, CrewAI, AutoGen, LlamaIndex, Claude Agent SDK, and OpenAI Agents SDK.
  • LLM-provider-agnostic — Built-in cost tracking for Anthropic, OpenAI, and Google models.
  • Offline-first — No network calls required for recording, replaying, or evaluating.
  • CI/CD ready — GitHub Action for running evaluations in your pipeline.
  • Minimal dependencies — 6 core runtime deps. Optional extras for web UI and frameworks.

Quick Start

1. Record and Evaluate

from agentest import Recorder, TaskCompletionEvaluator, SafetyEvaluator, CostEvaluator

# Record an agent interaction
recorder = Recorder(task="Summarize README.md")
recorder.record_message("user", "Please summarize README.md")
recorder.record_tool_call(
    name="read_file",
    arguments={"path": "README.md"},
    result="# My Project\nThis is a sample project.",
)
recorder.record_llm_response(
    model="claude-sonnet-4-6",
    content="This is a sample project.",
    input_tokens=100,
    output_tokens=20,
)
trace = recorder.finalize(success=True)

# Evaluate
for evaluator in [TaskCompletionEvaluator(), SafetyEvaluator(), CostEvaluator(max_cost=0.10)]:
    result = evaluator.evaluate(trace)
    print(f"{result.evaluator}: {'PASS' if result.passed else 'FAIL'} ({result.score:.2f})")

# Save for replay
recorder.save("traces/summarize.yaml")

2. Mock Tools for Deterministic Testing

from agentest import ToolMock, MockToolkit

toolkit = MockToolkit()

# Simple returns
toolkit.mock("read_file").returns("file contents")

# Conditional returns
toolkit.mock("search") \
    .when(query="python").returns(["python result"]) \
    .when(query="rust").returns(["rust result"]) \
    .otherwise().returns([])

# Sequential returns (pagination, retries)
toolkit.mock("get_page").returns_sequence(["page 1", "page 2", "page 3"])

# Custom logic
toolkit.mock("calculator").responds_with(lambda args: args["a"] + args["b"])

# Error simulation
toolkit.mock("flaky_api").raises(TimeoutError("service unavailable"))

# Use them
result = toolkit.execute("read_file", path="test.txt")  # "file contents"
result = toolkit.execute("search", query="python")       # ["python result"]

# Assertions
toolkit.mock("read_file").assert_called()
toolkit.mock("read_file").assert_called_with(path="test.txt")
toolkit.assert_all_called()

3. Replay Recorded Sessions

from agentest import Recorder, Replayer

# Load a previously recorded trace
trace = Recorder.load("traces/summarize.yaml")
replayer = Replayer(trace, strict=True)

# Replay — get the exact same responses
response = replayer.next_llm_response()
tool_result = replayer.next_tool_result("read_file")

# Generate mock functions from the trace
mocks = replayer.create_tool_mock()
result = mocks["read_file"]()  # Returns the recorded result

4. Benchmark Across Models

from agentest import BenchmarkRunner, ModelComparison
from agentest.benchmark.runner import BenchmarkTask

comparison = ModelComparison()

for model in ["claude-sonnet-4-6", "gpt-4o", "gpt-4o-mini"]:
    runner = BenchmarkRunner(name=f"bench_{model}", evaluators=[...])
    runner.add_task(BenchmarkTask(
        name="summarize",
        description="Summarize a document",
        task_fn=lambda: run_your_agent(model, "Summarize README.md"),
    ))
    comparison.add_result(model, runner.run())

# Compare
table = comparison.comparison_table()
best = comparison.best_model("avg_score")
diff = comparison.diff("claude-sonnet-4-6", "gpt-4o")

# Export
comparison.to_markdown("results.md")
comparison.to_csv("results.csv")

5. pytest Integration

# tests/test_my_agent.py — fixtures auto-registered via entry point

def test_agent_completes_task(agent_recorder, agent_eval_suite):
    agent_recorder.trace.task = "Summarize a document"
    agent_recorder.record_tool_call(name="read_file", arguments={"path": "doc.txt"}, result="...")
    agent_recorder.record_llm_response(model="claude-sonnet-4-6", content="Summary here.")
    trace = agent_recorder.finalize(success=True)

    results = agent_eval_suite.evaluate_all(trace)
    assert all(r.passed for r in results)

def test_mocked_tools(agent_toolkit):
    agent_toolkit.mock("search").when(query="python").returns(["result"])
    assert agent_toolkit.execute("search", query="python") == ["result"]

def test_safety():
    from agentest import Recorder, SafetyEvaluator
    rec = Recorder(task="test")
    rec.record_tool_call(name="bash", arguments={"command": "rm -rf /"}, result="")
    assert not SafetyEvaluator().evaluate(rec.finalize()).passed

Run with:

pytest tests/ --agentest-max-cost=0.50 --agentest-max-tokens=100000

6. Test MCP Servers

from agentest.mcp_testing import MCPServerTester, MCPAssertions

tester = MCPServerTester(command=["python", "-m", "my_mcp_server"])

# Run standard compliance tests
results = tester.run_standard_tests()
MCPAssertions(results).all_passed().has_tool("read_file").max_latency(5000)

# Test specific tools
result = tester.test_tool_call("read_file", arguments={"path": "/tmp/test.txt"})
assert result.passed

# Validate tool schemas
schema_results = tester.test_tool_schema_validation()
MCPAssertions(schema_results).all_passed()

7. Auto-Instrumentation (Zero Code Changes)

import agentest

# Monkey-patch anthropic and openai clients globally
agentest.instrument()

# All API calls are now automatically recorded
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)

# Get all recorded traces
traces = agentest.get_traces()
agentest.uninstrument()  # Remove patches when done

8. Framework Integrations

# LangChain
from agentest.integrations.langchain import AgentestCallbackHandler
handler = AgentestCallbackHandler(task="My chain")
result = chain.invoke({"input": "Hello"}, config={"callbacks": [handler]})
trace = handler.get_trace()

# CrewAI
from agentest.integrations.crewai import record_crew
result, trace = record_crew(crew, inputs={"topic": "testing"})

# AutoGen
from agentest.integrations.autogen import record_autogen_chat
result, trace = record_autogen_chat(user_proxy, assistant, "Hello")

# Claude Agent SDK
from agentest.integrations.claude_agent_sdk import AgentestTracer
tracer = AgentestTracer(task="My agent")
result, trace = tracer.record(agent.run, "What is 2+2?")

# OpenAI Agents SDK
from agentest.integrations.openai_agents import AgentestTracer
tracer = AgentestTracer(task="My agent")
result, trace = tracer.record(runner.run_sync, agent, "Hello")

Install framework extras: pip install agentest[langchain,crewai,autogen,llamaindex]

9. GitHub Action

# .github/workflows/agent-eval.yml
- uses: ColinHarker/agentest@v1
  with:
    traces-dir: traces/
    evaluators: task_completion,safety,tool_usage
    max-cost: "1.00"
    check-safety: "true"
    fail-on-error: "true"

Features

Record & Replay

Capture agent interactions as immutable trace snapshots (YAML or JSON). Replay them deterministically without making real LLM or tool calls — saving time and money during development.

# Record with context manager — auto-finalizes on exit
with Recorder(task="My task") as rec:
    rec.record_message("user", "Do something")
    rec.record_llm_response("claude-sonnet-4-6", "Done.", 100, 20)
# trace = rec.trace

Tool Mocking

Fluent builder API with conditional returns, sequences, regex matching, custom handlers, and full assertion support. Test agent logic without real integrations.
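
The custom-handler hook from the Quick Start also covers regex-style matching without any extra API surface. A minimal sketch using only the documented mock / responds_with / execute calls; the read_url tool name and the canned responses are illustrative:

import re

from agentest import MockToolkit

toolkit = MockToolkit()

# Dispatch on a regex over the tool arguments via the documented custom-handler hook.
# "read_url" and the returned strings are hypothetical examples.
toolkit.mock("read_url").responds_with(
    lambda args: "<html>docs page</html>"
    if re.match(r"https://docs\.example\.com/", args["url"])
    else "404 Not Found"
)

assert toolkit.execute("read_url", url="https://docs.example.com/intro") == "<html>docs page</html>"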

Safety Evaluation

Built-in detection for:

  • Unsafe commands — rm -rf /, DROP TABLE, sudo chmod 777, eval(, curl | sh, etc.
  • PII leakage — SSNs, credit card numbers, emails, API keys, AWS credentials
  • Custom patterns — supply your own regex rules
  • Blocked tools — prevent specific tools from being called (a configuration sketch follows this list)
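
Custom patterns and blocked tools are supplied when the evaluator is constructed. A rough sketch; the blocked_tools and custom_patterns keyword names are assumptions rather than the verified signature, so check them against your installed version:

from agentest import Recorder, SafetyEvaluator

rec = Recorder(task="clean up temp files")
rec.record_tool_call(name="bash", arguments={"command": "rm -rf /tmp/cache"}, result="")
trace = rec.finalize(success=True)

# NOTE: blocked_tools / custom_patterns are illustrative parameter names.
evaluator = SafetyEvaluator(
    blocked_tools=["bash"],
    custom_patterns=[r"rm\s+-rf"],
)
result = evaluator.evaluate(trace)
assert not result.passed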

Cost & Latency Tracking

Automatic cost estimation with built-in pricing for Claude (Opus, Sonnet, Haiku), GPT-4o, GPT-4o-mini, o3, and o4-mini. Set budgets and latency limits as evaluator constraints.
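
Budgets are expressed as evaluator constraints. CostEvaluator(max_cost=...) appears in the Quick Start; the max_duration_ms name below is an assumption about how the latency limit is spelled:

from agentest import CostEvaluator, LatencyEvaluator, Recorder

rec = Recorder(task="quick lookup")
rec.record_llm_response(model="claude-sonnet-4-6", content="Done.", input_tokens=500, output_tokens=50)
trace = rec.finalize(success=True)

budget_checks = [
    CostEvaluator(max_cost=0.05),               # documented constraint from the Quick Start
    LatencyEvaluator(max_duration_ms=30_000),   # parameter name is an assumption
]
for evaluator in budget_checks:
    result = evaluator.evaluate(trace)
    print(f"{result.evaluator}: {'PASS' if result.passed else 'FAIL'}")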

Model Comparison

Side-by-side benchmarking with export to CSV and Markdown. Compare pass rates, average scores, total cost, and latency across any number of models.

CLI

# Initialize Agentest in your project
agentest init

# Evaluate a recorded trace
agentest evaluate traces/my_trace.yaml --max-cost 0.50 --check-safety

# Replay a trace
agentest replay traces/my_trace.yaml

# Summarize all traces in a directory
agentest summary traces/ --format table

# Compare two traces side-by-side
agentest diff traces/v1.yaml traces/v2.yaml

# Watch a directory and re-evaluate on changes
agentest watch traces/ --check-safety --interval 5

# Launch the web UI dashboard
agentest serve --traces-dir traces/ --port 8000

Built-in Evaluators

  • TaskCompletionEvaluator — success status, errors, message count, required tools, failed tool calls; scoring: -0.25 per issue (min 0.0)
  • SafetyEvaluator — dangerous commands, PII leakage, blocked tools, custom regex patterns; scoring: -0.2 per violation (min 0.0)
  • CostEvaluator — total cost, token count, and LLM call count against budgets; scoring: binary pass/fail
  • LatencyEvaluator — total duration and per-call latency against limits; scoring: 1.0 pass, 0.5 fail
  • ToolUsageEvaluator — required/forbidden tools, retry limits, error rates; scoring: -0.2 per issue (min 0.0)
  • LLMJudgeEvaluator — LLM-graded evaluation against custom criteria (Anthropic or OpenAI); scoring: LLM-assigned score
  • CompositeEvaluator — combines multiple evaluators with AND/OR logic; scoring: average of all scores (see the sketch below)
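
A hedged sketch of gating a trace on several evaluators at once with CompositeEvaluator. The import path follows the architecture listing below, and the constructor is assumed to take a list of child evaluators with AND semantics by default; the exact arguments may differ in your installed version:

from agentest import CostEvaluator, Recorder, SafetyEvaluator, TaskCompletionEvaluator
from agentest.evaluators.base import CompositeEvaluator  # path per the architecture listing

# A tiny trace to grade.
rec = Recorder(task="Answer a question")
rec.record_llm_response(model="claude-sonnet-4-6", content="42", input_tokens=10, output_tokens=1)
trace = rec.finalize(success=True)

# Assumed shape: a list of evaluators combined with AND logic; the composite
# score is documented as the average of all child scores.
gate = CompositeEvaluator([
    TaskCompletionEvaluator(),
    SafetyEvaluator(),
    CostEvaluator(max_cost=0.25),
])
result = gate.evaluate(trace)
print(result.passed, result.score)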

Architecture

agentest/
├── core.py              # Data models: AgentTrace, ToolCall, LLMResponse, TraceSession
├── recorder/
│   ├── recorder.py      # Record agent sessions to YAML/JSON
│   └── replayer.py      # Replay sessions deterministically
├── mocking/
│   └── tool_mock.py     # ToolMock, MockToolkit — fluent builder + assertions
├── evaluators/
│   ├── base.py          # Evaluator ABC, EvalResult, CompositeEvaluator, LLMJudge
│   └── builtin.py       # TaskCompletion, Safety, Cost, Latency, ToolUsage
├── benchmark/
│   ├── runner.py        # BenchmarkRunner (sync + async), BenchmarkTask, BenchmarkResult
│   └── comparison.py    # ModelComparison, ModelScore — CSV/Markdown export
├── integrations/
│   ├── instrument.py    # Auto-instrumentation for anthropic/openai
│   ├── langchain.py     # LangChain callback handler adapter
│   ├── crewai.py        # CrewAI crew recorder
│   ├── autogen.py       # AutoGen conversation recorder
│   ├── llamaindex.py    # LlamaIndex callback handler
│   ├── claude_agent_sdk.py # Claude Agent SDK tracer
│   └── openai_agents.py # OpenAI Agents SDK tracer
├── mcp_testing/
│   ├── server_tester.py # MCPServerTester — subprocess-based JSON-RPC testing
│   └── assertions.py    # MCPAssertions — fluent assertion chains
├── reporters/
│   ├── console.py       # Rich console output
│   └── json_reporter.py # Machine-readable JSON reports
├── server/
│   └── app.py           # FastAPI web UI for trace exploration
├── pytest_plugin.py     # Auto-registered fixtures, markers, and trace collectors
└── cli.py               # Click CLI with 8 commands

Design Principles

  • Pydantic models for strict typing and automatic serialization
  • Builder pattern for fluent APIs (ToolMock, Recorder)
  • Strategy pattern for pluggable evaluators (a custom-evaluator sketch follows this list)
  • Composite pattern for evaluator aggregation
  • Zero framework coupling — works with any agent that produces traces
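
Under the strategy pattern, adding a check means writing one more evaluator. A sketch of what that could look like, assuming the Evaluator base class in evaluators/base.py expects an evaluate(trace) method returning an EvalResult, and that the trace exposes its tool calls as trace.tool_calls with a .name field; all of those names are inferred from the architecture listing and the results used above, not verified:

from agentest.evaluators.base import EvalResult, Evaluator  # path per the architecture listing


class NoNetworkToolsEvaluator(Evaluator):
    """Fail any trace that invoked a tool named 'http_request' (illustrative only)."""

    def evaluate(self, trace) -> EvalResult:
        # trace.tool_calls / call.name are assumed attribute names on the trace model.
        offenders = [call for call in trace.tool_calls if call.name == "http_request"]
        return EvalResult(
            evaluator="no_network_tools",   # field names mirror the results printed earlier
            passed=not offenders,
            score=0.0 if offenders else 1.0,
        )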

Comparison

Compared with LangSmith, LangFuse, and Braintrust:

  • Built in: record & replay, tool mocking, safety evaluation, MCP server testing, pytest integration, cost tracking, auto-instrumentation, framework adapters (7), and a GitHub Action
  • Framework-agnostic and open source (MIT)
  • Web UI — a basic FastAPI dashboard, versus the full web UIs of LangSmith, LangFuse, and Braintrust
  • No centralized backend — recording, replay, and evaluation run locally and offline

Best for: Local development, CI/CD pipelines, deterministic testing, safety compliance, multi-model benchmarking.

Installation

pip install agentest              # Core (recording, evaluation, CLI)
pip install agentest[web]         # Web UI dashboard
pip install agentest[langchain]   # LangChain adapter
pip install agentest[crewai]      # CrewAI adapter
pip install agentest[autogen]     # AutoGen adapter
pip install agentest[llamaindex]  # LlamaIndex adapter
pip install agentest[all]         # Everything

License

MIT

