Universal agent testing and evaluation toolkit - record, replay, mock, and benchmark AI agents
Quick Start • Features • Evaluators • CLI • Architecture
Only 52% of teams shipping AI agents run evaluations. Agentest makes it dead simple to test, evaluate, and benchmark agents — regardless of which framework or LLM provider you use.
pip install agentest
Why Agentest?
The AI agent ecosystem has exploded, but testing and evaluation haven't kept up. Most teams ship agents to production without systematic testing.
Agentest brings software engineering best practices to agent development:
| Capability | What You Get |
|---|---|
| Record & Replay | Capture real agent sessions, replay them deterministically — no expensive LLM calls needed for testing |
| Tool Mocking | Mock any tool call with a fluent, pytest-style API: .when(...).returns(...) |
| 7 Built-in Evaluators | Grade agents on task completion, safety, cost, latency, tool usage, and more |
| Model Comparison | Run the same tasks across Claude, GPT, Gemini — compare pass rates, cost, and latency |
| MCP Server Testing | Test MCP servers for protocol compliance and tool schema validation |
| pytest Plugin | Drop-in integration with auto-registered fixtures and custom markers |
| CLI & Web UI | agentest evaluate, agentest replay, agentest summary, and a FastAPI dashboard |
What Makes It Different
- Framework-agnostic — Works with any agent framework (LangChain, CrewAI, AutoGen, LlamaIndex, custom agents). No vendor lock-in.
- Auto-instrumentation — `agentest.instrument()` patches anthropic/openai clients to auto-record traces with zero code changes.
- Native adapters — First-class integrations for LangChain, CrewAI, AutoGen, LlamaIndex, Claude Agent SDK, and OpenAI Agents SDK.
- LLM-provider-agnostic — Built-in cost tracking for Anthropic, OpenAI, and Google models.
- Offline-first — No network calls required for recording, replaying, or evaluating.
- CI/CD ready — GitHub Action for running evaluations in your pipeline.
- Minimal dependencies — 6 core runtime deps. Optional extras for web UI and frameworks.
Quick Start
1. Record and Evaluate
from agentest import Recorder, TaskCompletionEvaluator, SafetyEvaluator, CostEvaluator
# Record an agent interaction
recorder = Recorder(task="Summarize README.md")
recorder.record_message("user", "Please summarize README.md")
recorder.record_tool_call(
    name="read_file",
    arguments={"path": "README.md"},
    result="# My Project\nThis is a sample project.",
)
recorder.record_llm_response(
    model="claude-sonnet-4-6",
    content="This is a sample project.",
    input_tokens=100,
    output_tokens=20,
)
trace = recorder.finalize(success=True)
# Evaluate
for evaluator in [TaskCompletionEvaluator(), SafetyEvaluator(), CostEvaluator(max_cost=0.10)]:
    result = evaluator.evaluate(trace)
    print(f"{result.evaluator}: {'PASS' if result.passed else 'FAIL'} ({result.score:.2f})")
# Save for replay
recorder.save("traces/summarize.yaml")
2. Mock Tools for Deterministic Testing
from agentest import ToolMock, MockToolkit
toolkit = MockToolkit()
# Simple returns
toolkit.mock("read_file").returns("file contents")
# Conditional returns
toolkit.mock("search") \
.when(query="python").returns(["python result"]) \
.when(query="rust").returns(["rust result"]) \
.otherwise().returns([])
# Sequential returns (pagination, retries)
toolkit.mock("get_page").returns_sequence(["page 1", "page 2", "page 3"])
# Custom logic
toolkit.mock("calculator").responds_with(lambda args: args["a"] + args["b"])
# Error simulation
toolkit.mock("flaky_api").raises(TimeoutError("service unavailable"))
# Use them
result = toolkit.execute("read_file", path="test.txt") # "file contents"
result = toolkit.execute("search", query="python") # ["python result"]
# Assertions
toolkit.mock("read_file").assert_called()
toolkit.mock("read_file").assert_called_with(path="test.txt")
toolkit.assert_all_called()
3. Replay Recorded Sessions
from agentest import Recorder, Replayer
# Load a previously recorded trace
trace = Recorder.load("traces/summarize.yaml")
replayer = Replayer(trace, strict=True)
# Replay — get the exact same responses
response = replayer.next_llm_response()
tool_result = replayer.next_tool_result("read_file")
# Generate mock functions from the trace
mocks = replayer.create_tool_mock()
result = mocks["read_file"]() # Returns the recorded result
4. Benchmark Across Models
from agentest import BenchmarkRunner, ModelComparison
from agentest.benchmark.runner import BenchmarkTask
comparison = ModelComparison()
for model in ["claude-sonnet-4-6", "gpt-4o", "gpt-4o-mini"]:
    runner = BenchmarkRunner(name=f"bench_{model}", evaluators=[...])
    runner.add_task(BenchmarkTask(
        name="summarize",
        description="Summarize a document",
        task_fn=lambda: run_your_agent(model, "Summarize README.md"),
    ))
    comparison.add_result(model, runner.run())
# Compare
table = comparison.comparison_table()
best = comparison.best_model("avg_score")
diff = comparison.diff("claude-sonnet-4-6", "gpt-4o")
# Export
comparison.to_markdown("results.md")
comparison.to_csv("results.csv")
5. pytest Integration
# tests/test_my_agent.py — fixtures auto-registered via entry point
def test_agent_completes_task(agent_recorder, agent_eval_suite):
    agent_recorder.trace.task = "Summarize a document"
    agent_recorder.record_tool_call(name="read_file", arguments={"path": "doc.txt"}, result="...")
    agent_recorder.record_llm_response(model="claude-sonnet-4-6", content="Summary here.")
    trace = agent_recorder.finalize(success=True)
    results = agent_eval_suite.evaluate_all(trace)
    assert all(r.passed for r in results)

def test_mocked_tools(agent_toolkit):
    agent_toolkit.mock("search").when(query="python").returns(["result"])
    assert agent_toolkit.execute("search", query="python") == ["result"]

def test_safety():
    from agentest import Recorder, SafetyEvaluator
    rec = Recorder(task="test")
    rec.record_tool_call(name="bash", arguments={"command": "rm -rf /"}, result="")
    assert not SafetyEvaluator().evaluate(rec.finalize()).passed
Run with:
pytest tests/ --agentest-max-cost=0.50 --agentest-max-tokens=100000
6. Test MCP Servers
from agentest.mcp_testing import MCPServerTester, MCPAssertions
tester = MCPServerTester(command=["python", "-m", "my_mcp_server"])
# Run standard compliance tests
results = tester.run_standard_tests()
MCPAssertions(results).all_passed().has_tool("read_file").max_latency(5000)
# Test specific tools
result = tester.test_tool_call("read_file", arguments={"path": "/tmp/test.txt"})
assert result.passed
# Validate tool schemas
schema_results = tester.test_tool_schema_validation()
MCPAssertions(schema_results).all_passed()
7. Auto-Instrumentation (Zero Code Changes)
import agentest
# Monkey-patch anthropic and openai clients globally
agentest.instrument()
# All API calls are now automatically recorded
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
# Get all recorded traces
traces = agentest.get_traces()
agentest.uninstrument() # Remove patches when done
8. Framework Integrations
# LangChain
from agentest.integrations.langchain import AgentestCallbackHandler
handler = AgentestCallbackHandler(task="My chain")
result = chain.invoke({"input": "Hello"}, config={"callbacks": [handler]})
trace = handler.get_trace()
# CrewAI
from agentest.integrations.crewai import record_crew
result, trace = record_crew(crew, inputs={"topic": "testing"})
# AutoGen
from agentest.integrations.autogen import record_autogen_chat
result, trace = record_autogen_chat(user_proxy, assistant, "Hello")
# Claude Agent SDK
from agentest.integrations.claude_agent_sdk import AgentestTracer
tracer = AgentestTracer(task="My agent")
result, trace = tracer.record(agent.run, "What is 2+2?")
# OpenAI Agents SDK
from agentest.integrations.openai_agents import AgentestTracer
tracer = AgentestTracer(task="My agent")
result, trace = tracer.record(runner.run_sync, agent, "Hello")
Install framework extras: pip install agentest[langchain,crewai,autogen,llamaindex]
9. GitHub Action
# .github/workflows/agent-eval.yml
- uses: ColinHarker/agentest@v1
  with:
    traces-dir: traces/
    evaluators: task_completion,safety,tool_usage
    max-cost: "1.00"
    check-safety: "true"
    fail-on-error: "true"
Features
Record & Replay
Capture agent interactions as immutable trace snapshots (YAML or JSON). Replay them deterministically without making real LLM or tool calls — saving time and money during development.
# Record with context manager — auto-finalizes on exit
with Recorder(task="My task") as rec:
    rec.record_message("user", "Do something")
    rec.record_llm_response("claude-sonnet-4-6", "Done.", 100, 20)
# trace = rec.trace
Tool Mocking
Fluent builder API with conditional returns, sequences, regex matching, custom handlers, and full assertion support. Test agent logic without real integrations.
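To show how a mocked toolkit replaces real integrations in a test, here is a small sketch. `run_my_agent` is a hypothetical stand-in for your own agent code and only assumes the agent accepts an injected tool-executor callable; the `MockToolkit` calls themselves mirror the Quick Start above.

```python
from agentest import MockToolkit

def run_my_agent(task: str, execute_tool) -> str:
    # Stand-in for your real agent: it reads a file through the injected
    # executor and returns the last line as a "summary". (task is unused here.)
    contents = execute_tool("read_file", path="README.md")
    return contents.splitlines()[-1]

def test_agent_summarizes_without_real_tools():
    toolkit = MockToolkit()
    toolkit.mock("read_file").returns("# My Project\nThis is a sample project.")

    summary = run_my_agent("Summarize README.md", execute_tool=toolkit.execute)

    # Assertions on the mocks verify the agent's tool usage without real I/O.
    toolkit.mock("read_file").assert_called_with(path="README.md")
    toolkit.assert_all_called()
    assert summary == "This is a sample project."
```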
Safety Evaluation
Built-in detection for:
- Unsafe commands — `rm -rf /`, `DROP TABLE`, `sudo chmod 777`, `eval(`, `curl | sh`, etc.
- PII leakage — SSNs, credit card numbers, emails, API keys, AWS credentials
- Custom patterns — supply your own regex rules (see the sketch after this list)
- Blocked tools — prevent specific tools from being called
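A minimal sketch of configuring these checks. The `evaluate` call and the default detection of `rm -rf /` match the pytest example earlier in this README; the `custom_patterns` and `blocked_tools` keyword names are hypothetical, so confirm the real parameter names in the `SafetyEvaluator` signature before use.

```python
from agentest import Recorder, SafetyEvaluator

rec = Recorder(task="cleanup helper")
rec.record_tool_call(name="bash", arguments={"command": "rm -rf /"}, result="")
trace = rec.finalize()

# Built-in rules flag the dangerous command, so the default evaluator fails this trace.
assert not SafetyEvaluator().evaluate(trace).passed

# Hypothetical keyword arguments for the custom rules listed above — verify the
# actual parameter names before relying on them.
strict = SafetyEvaluator(
    custom_patterns=[r"internal-[a-z]+\.corp"],  # your own regex rules
    blocked_tools=["drop_database"],             # tools that must never be called
)
result = strict.evaluate(trace)
print(result.passed, result.score)
```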
Cost & Latency Tracking
Automatic cost estimation with built-in pricing for Claude (Opus, Sonnet, Haiku), GPT-4o, GPT-4o-mini, o3, and o4-mini. Set budgets and latency limits as evaluator constraints.
Model Comparison
Side-by-side benchmarking with export to CSV and Markdown. Compare pass rates, average scores, total cost, and latency across any number of models.
CLI
# Initialize Agentest in your project
agentest init
# Evaluate a recorded trace
agentest evaluate traces/my_trace.yaml --max-cost 0.50 --check-safety
# Replay a trace
agentest replay traces/my_trace.yaml
# Summarize all traces in a directory
agentest summary traces/ --format table
# Compare two traces side-by-side
agentest diff traces/v1.yaml traces/v2.yaml
# Watch a directory and re-evaluate on changes
agentest watch traces/ --check-safety --interval 5
# Launch the web UI dashboard
agentest serve --traces-dir traces/ --port 8000
Built-in Evaluators
| Evaluator | What it checks | Scoring |
|---|---|---|
| TaskCompletionEvaluator | Success status, errors, message count, required tools, failed tool calls | -0.25 per issue (min 0.0) |
| SafetyEvaluator | Dangerous commands, PII leakage, blocked tools, custom regex patterns | -0.2 per violation (min 0.0) |
| CostEvaluator | Total cost, token count, LLM call count against budgets | Binary pass/fail |
| LatencyEvaluator | Total duration and per-call latency against limits | 1.0 pass, 0.5 fail |
| ToolUsageEvaluator | Required/forbidden tools, retry limits, error rates | -0.2 per issue (min 0.0) |
| LLMJudgeEvaluator | LLM-graded evaluation against custom criteria (Anthropic or OpenAI) | LLM-assigned score |
| CompositeEvaluator | Combines multiple evaluators with AND/OR logic | Average of all scores |
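The CompositeEvaluator row above describes aggregation; here is a hedged sketch of what that looks like in code. The import path and the `evaluators`/`mode` keyword names are assumptions (the class lives in `evaluators/base.py` per the Architecture section below), so check the actual signature before use.

```python
from agentest import Recorder, TaskCompletionEvaluator, SafetyEvaluator, CostEvaluator
from agentest.evaluators.base import CompositeEvaluator  # assumed import path

recorder = Recorder(task="Summarize README.md")
recorder.record_llm_response("claude-sonnet-4-6", "Done.", 100, 20)
trace = recorder.finalize(success=True)

# Hypothetical constructor arguments: a list of sub-evaluators plus an AND/OR mode.
suite = CompositeEvaluator(
    evaluators=[TaskCompletionEvaluator(), SafetyEvaluator(), CostEvaluator(max_cost=0.10)],
    mode="and",
)
result = suite.evaluate(trace)
print(result.passed, result.score)  # score is the average of the sub-scores
```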
Architecture
agentest/
├── core.py # Data models: AgentTrace, ToolCall, LLMResponse, TraceSession
├── recorder/
│ ├── recorder.py # Record agent sessions to YAML/JSON
│ └── replayer.py # Replay sessions deterministically
├── mocking/
│ └── tool_mock.py # ToolMock, MockToolkit — fluent builder + assertions
├── evaluators/
│ ├── base.py # Evaluator ABC, EvalResult, CompositeEvaluator, LLMJudge
│ └── builtin.py # TaskCompletion, Safety, Cost, Latency, ToolUsage
├── benchmark/
│ ├── runner.py # BenchmarkRunner (sync + async), BenchmarkTask, BenchmarkResult
│ └── comparison.py # ModelComparison, ModelScore — CSV/Markdown export
├── integrations/
│ ├── instrument.py # Auto-instrumentation for anthropic/openai
│ ├── langchain.py # LangChain callback handler adapter
│ ├── crewai.py # CrewAI crew recorder
│ ├── autogen.py # AutoGen conversation recorder
│ ├── llamaindex.py # LlamaIndex callback handler
│ ├── claude_agent_sdk.py # Claude Agent SDK tracer
│ └── openai_agents.py # OpenAI Agents SDK tracer
├── mcp_testing/
│ ├── server_tester.py # MCPServerTester — subprocess-based JSON-RPC testing
│ └── assertions.py # MCPAssertions — fluent assertion chains
├── reporters/
│ ├── console.py # Rich console output
│ └── json_reporter.py # Machine-readable JSON reports
├── server/
│ └── app.py # FastAPI web UI for trace exploration
├── pytest_plugin.py # Auto-registered fixtures, markers, and trace collectors
└── cli.py # Click CLI with 8 commands
Design Principles
- Pydantic models for strict typing and automatic serialization
- Builder pattern for fluent APIs (ToolMock, Recorder)
- Strategy pattern for pluggable evaluators (see the custom-evaluator sketch after this list)
- Composite pattern for evaluator aggregation
- Zero framework coupling — works with any agent that produces traces
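Because evaluators follow the Strategy pattern, adding your own is a matter of subclassing the Evaluator ABC and returning an EvalResult. The sketch below is hedged: the import path follows the Architecture tree, the `trace.tool_calls` attribute and the exact `EvalResult` fields are assumptions, and the ABC may require more than the single method shown; only `evaluate(trace)` and the `evaluator`/`passed`/`score` result fields appear elsewhere in this README.

```python
from agentest.evaluators.base import Evaluator, EvalResult  # assumed import path

class SingleAttemptEvaluator(Evaluator):
    """Illustrative custom evaluator: fail if any tool was called more than once."""

    def evaluate(self, trace) -> EvalResult:
        names = [call.name for call in trace.tool_calls]  # assumes the trace exposes tool_calls
        retried = len(names) != len(set(names))
        return EvalResult(
            evaluator="single_attempt",
            passed=not retried,
            score=0.0 if retried else 1.0,
        )
```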
Comparison
| Feature | Agentest | LangSmith | LangFuse | Braintrust |
|---|---|---|---|---|
| Record & Replay | ✅ | — | — | — |
| Tool Mocking | ✅ | Basic | — | — |
| Safety Evaluator | ✅ | — | — | — |
| MCP Server Testing | ✅ | — | — | — |
| pytest Integration | ✅ | — | — | — |
| Framework-Agnostic | ✅ | ❌ | ❌ | ❌ |
| Open Source | ✅ | ❌ | ❌ | ❌ |
| Cost Tracking | ✅ | ✅ | ✅ | ✅ |
| Web UI | Basic | Full | Full | Full |
| Centralized Backend | — | ✅ | ✅ | ✅ |
| Auto-Instrumentation | ✅ | ✅ | ✅ | ✅ |
| Framework Adapters | ✅ (7) | ❌ | Partial | — |
| GitHub Action | ✅ | — | — | — |
Best for: Local development, CI/CD pipelines, deterministic testing, safety compliance, multi-model benchmarking.
Installation
pip install agentest # Core (recording, evaluation, CLI)
pip install agentest[web] # Web UI dashboard
pip install agentest[langchain] # LangChain adapter
pip install agentest[crewai] # CrewAI adapter
pip install agentest[autogen] # AutoGen adapter
pip install agentest[llamaindex] # LlamaIndex adapter
pip install agentest[all] # Everything
License
MIT