evalcraft
The pytest for AI agents. Capture, replay, mock, and evaluate agent behavior — without burning API credits on every test run.
The problem
Agent testing is broken:
- Expensive. Running 200 tests against GPT-4 costs real money. Every commit.
- Non-deterministic. Tests fail randomly because LLMs aren't functions.
- No CI/CD story. You can't gate deploys on eval results if evals take 10 minutes and cost $5.
Evalcraft fixes this by recording agent runs as cassettes (like VCR for HTTP), then replaying them deterministically. Your test suite goes from 10 minutes + $5 to 200ms + $0.
How it works
```
  Your Agent
      │
      ▼
┌─────────────┐    record    ┌──────────────┐
│ CaptureCtx  │ ────────────►│   Cassette   │  (plain JSON, git-friendly)
│             │              │   (spans[])  │
└─────────────┘              └──────┬───────┘
                                    │
                   ┌────────────────┼────────────────┐
                   ▼                ▼                ▼
               replay()         MockLLM /        assert_*()
          (zero API calls)      MockTool()       (scorers)
                   │                │                │
                   └────────────────┼────────────────┘
                                    ▼
                            pytest / CI gate
                             (200ms, $0.00)
```
Install
```bash
pip install evalcraft

# With pytest plugin
pip install "evalcraft[pytest]"

# With framework adapters
pip install "evalcraft[openai]"     # OpenAI SDK adapter
pip install "evalcraft[langchain]"  # LangChain/LangGraph adapter

# Everything
pip install "evalcraft[all]"
```
5-minute quickstart
1. Capture an agent run
```python
from evalcraft import CaptureContext

with CaptureContext(
    name="weather_agent_test",
    agent_name="weather_agent",
    save_path="tests/cassettes/weather.json",
) as ctx:
    ctx.record_input("What's the weather in Paris?")

    # Run your agent — wrap tool/LLM calls with record_* methods
    ctx.record_tool_call(
        "get_weather",
        args={"city": "Paris"},
        result={"temp": 18, "condition": "cloudy"},
    )
    ctx.record_llm_call(
        model="gpt-4o",
        input="User asked about weather. Tool returned: cloudy 18°C",
        output="It's 18°C and cloudy in Paris right now.",
        prompt_tokens=120,
        completion_tokens=15,
        cost_usd=0.0008,
    )
    ctx.record_output("It's 18°C and cloudy in Paris right now.")

cassette = ctx.cassette
print(f"Captured {cassette.tool_call_count} tool calls, ${cassette.total_cost_usd:.4f}")
# Captured 1 tool calls, $0.0008
```
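The cassette also carries a fingerprint — a SHA-256 over span content (see the data model section below) — which makes whole-run regression pinning possible. A minimal sketch, assuming the fingerprint is exposed as a plain attribute:

```python
# Hypothetical regression pin — `fingerprint` is documented in the data
# model section below; exposing it as an attribute like this is an assumption.
GOLDEN_FINGERPRINT = "…"  # paste the value from a known-good capture
assert cassette.fingerprint == GOLDEN_FINGERPRINT, "agent behavior drifted"
```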
2. Replay without API calls
```python
from evalcraft import replay

# Loads the cassette and replays all spans — zero LLM calls
run = replay("tests/cassettes/weather.json")

assert run.replayed is True
assert run.cassette.output_text == "It's 18°C and cloudy in Paris right now."
```
3. Assert tool behavior
```python
from evalcraft import replay, assert_tool_called, assert_cost_under

run = replay("tests/cassettes/weather.json")

assert assert_tool_called(run, "get_weather").passed
assert assert_tool_called(run, "get_weather", with_args={"city": "Paris"}).passed
assert assert_cost_under(run, max_usd=0.05).passed
```
4. Mock LLM responses
```python
from evalcraft import MockLLM, MockTool, CaptureContext

llm = MockLLM()
llm.add_response("*", "It's sunny and 22°C.")  # wildcard match

search = MockTool("web_search")
search.returns({"results": [{"title": "Weather Paris", "snippet": "Sunny, 22°C"}]})

with CaptureContext(name="mocked_run", save_path="tests/cassettes/mocked.json") as ctx:
    ctx.record_input("Weather in Paris?")
    search_result = search.call(query="Paris weather today")
    response = llm.complete(f"Search result: {search_result}")
    ctx.record_output(response.content)

search.assert_called(times=1)
search.assert_called_with(query="Paris weather today")
llm.assert_called(times=1)
```
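MockLLM also supports multi-turn scripting — add_sequential_responses shows up in the anthropic-agent example below. A minimal sketch, assuming it takes a list of replies returned in order (the exact signature is not documented here):

```python
from evalcraft import MockLLM

# Hypothetical multi-turn mock: add_sequential_responses is named in the
# anthropic-agent example below; the list-based signature is an assumption.
llm = MockLLM()
llm.add_sequential_responses([
    "Let me check the forecast first.",
    "It's sunny and 22°C in Paris.",
])

assert llm.complete("turn 1").content == "Let me check the forecast first."
assert llm.complete("turn 2").content == "It's sunny and 22°C in Paris."
llm.assert_called(times=2)
```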
5. Use with pytest
```python
# tests/test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_called(run, "get_weather")
    assert result.passed, result.message

def test_agent_tool_sequence():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_order(run, ["get_weather"])
    assert result.passed, result.message

def test_agent_cost_budget():
    run = replay("tests/cassettes/weather.json")
    result = assert_cost_under(run, max_usd=0.01)
    assert result.passed, result.message

def test_agent_output():
    run = replay("tests/cassettes/weather.json")
    assert "Paris" in run.cassette.output_text or "cloudy" in run.cassette.output_text
```

```bash
pytest tests/ -v
# 200ms, $0.00
```
Examples
Four complete, self-contained example projects — each with pre-recorded cassettes, working test suites, and step-by-step READMEs.
| Example | Scenario | What it demonstrates |
|---|---|---|
| openai-agent/ | Customer support agent (ShopEasy) | OpenAIAdapter, tool call assertions, golden sets, MockLLM + MockTool unit tests |
| anthropic-agent/ | Code review bot (PRs via Claude) | AnthropicAdapter, multi-turn testing, security assertions, add_sequential_responses |
| langgraph-workflow/ | RAG policy Q&A pipeline | LangGraphAdapter, node-order assertions, SpanKind.AGENT_STEP inspection, citation validation |
| ci-pipeline/ | GitHub Actions CI gate | GitHub Actions workflow, standalone gate script, cassette refresh strategy |
Run any example in 60 seconds (no API key needed)
```bash
cd examples/openai-agent
pip install -r requirements.txt
pytest tests/ -v
# 15 tests pass in ~0.3s, $0.00
```
All cassettes are pre-recorded and committed to the repo. Tests replay them deterministically — no API key, no network calls, no cost.
Why Evalcraft?
| | Evalcraft | Braintrust | LangSmith | Promptfoo |
|---|---|---|---|---|
| Cassette-based replay | ✅ | ❌ | ❌ | ❌ |
| Zero-cost CI testing | ✅ | ❌ | ❌ | Partial |
| pytest-native | ✅ | ❌ | ❌ | ❌ |
| Mock LLM / Tools | ✅ | ❌ | ❌ | ❌ |
| Framework agnostic | ✅ | ✅ | ✅ | ✅ |
| Self-hostable | ✅ | ❌ | Partial | ✅ |
| Observability dashboard | ❌ | ✅ | ✅ | ❌ |
| Pricing | Free / OSS | Paid SaaS | Paid SaaS | Free / OSS |
Evalcraft is a testing tool, not an observability platform. Use Braintrust or LangSmith for production tracing; use Evalcraft to keep your test suite fast and free.
Features
| Feature | Description |
|---|---|
| Capture | Record every LLM call, tool use, and agent decision as a cassette |
| Replay | Re-run cassettes deterministically — no API calls, zero cost |
| Mock LLM | Substitute real LLMs with deterministic mocks (exact / pattern / wildcard) |
| Mock Tools | Mock any tool with static, dynamic, sequential, or error-simulating responses |
| Scorers | Built-in assertions for tool calls, output content, cost, latency, tokens |
| Diff | Compare two cassette runs to detect regressions |
| CLI | evalcraft replay, evalcraft diff, evalcraft inspect from your terminal |
| pytest plugin | Native fixtures and markers — cassette, mock_llm, @pytest.mark.evalcraft |
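The pytest plugin row above names a cassette fixture, a mock_llm fixture, and an @pytest.mark.evalcraft marker. How the cassette path binds to the fixture isn't shown in this README, so treat the following as a hypothetical sketch rather than the plugin's actual contract:

```python
import pytest

from evalcraft import assert_tool_called

# Hypothetical plugin usage — fixture and marker names come from the table
# above; the idea that the fixture yields a replayed run is an assumption.
@pytest.mark.evalcraft(cassette="tests/cassettes/weather.json")
def test_weather_tool(cassette):
    assert assert_tool_called(cassette, "get_weather").passed
```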
Supported frameworks
| Framework | Integration |
|---|---|
| OpenAI SDK | evalcraft.adapters.openai — auto-records all chat.completions.create calls |
| LangGraph | evalcraft.adapters.langgraph — callback handler for graphs and chains |
| Any agent | Manual record_tool_call / record_llm_call works with any framework |
OpenAI
```python
import openai

from evalcraft import CaptureContext
from evalcraft.adapters.openai import patch_openai

patch_openai(openai)  # all subsequent calls are auto-recorded

with CaptureContext(name="openai_run", save_path="tests/cassettes/openai_run.json") as ctx:
    ctx.record_input("Summarize the French Revolution")
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize the French Revolution"}],
    )
    ctx.record_output(response.choices[0].message.content)
```
LangGraph
```python
from evalcraft import CaptureContext
from evalcraft.adapters.langgraph import EvalcraftCallbackHandler

handler = EvalcraftCallbackHandler()

with CaptureContext(name="langgraph_run", save_path="tests/cassettes/lg_run.json") as ctx:
    ctx.record_input("Plan a trip to Tokyo")
    graph = build_travel_agent()  # your compiled LangGraph graph
    result = graph.invoke(
        {"messages": [{"role": "user", "content": "Plan a trip to Tokyo"}]},
        config={"callbacks": [handler]},
    )
    ctx.record_output(result["messages"][-1].content)
```
CLI reference
```bash
evalcraft [command] [options]
```

evalcraft replay
```bash
evalcraft replay tests/cassettes/weather.json
evalcraft replay tests/cassettes/weather.json --override get_weather='{"temp": 5, "condition": "snow"}'
```

evalcraft diff
```bash
evalcraft diff tests/cassettes/weather_v1.json tests/cassettes/weather_v2.json
# Tool sequence: ['get_weather'] → ['get_weather', 'send_alert']
# Output text changed
# Tokens: 135 → 210
```

evalcraft inspect
```bash
evalcraft inspect tests/cassettes/weather.json
evalcraft inspect tests/cassettes/weather.json --kind tool_call
```

evalcraft run
```bash
evalcraft run tests/cassettes/
# ✓ weather.json (3 spans, $0.0008, 450ms)
# ✓ search.json (7 spans, $0.0021, 1200ms)
# 2/2 passed
```
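The ci-pipeline example above wires this into GitHub Actions; stripped down to a shell script, a gate might look like this — assuming cassettes are committed to the repo and that evalcraft run exits non-zero when a replay fails:

```bash
#!/usr/bin/env bash
# Hypothetical CI gate — assumes cassettes are committed to the repo and
# that `evalcraft run` exits non-zero on failure (not confirmed above).
set -euo pipefail

pip install "evalcraft[pytest]"

pytest tests/ -v                  # replays cassettes: no API key, no cost
evalcraft run tests/cassettes/    # gate the deploy on the replay results
```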
Data model
```
Cassette
├── id, name, agent_name, framework
├── input_text, output_text
├── total_tokens, total_cost_usd, total_duration_ms
├── llm_call_count, tool_call_count
├── fingerprint (SHA-256 of span content — detects regressions)
└── spans[]
    ├── Span (llm_request / llm_response)
    │   ├── model, token_usage, cost_usd
    │   └── input, output
    └── Span (tool_call)
        ├── tool_name, tool_args, tool_result
        └── duration_ms, error
```
Cassettes are plain JSON — check them into git, diff them in PRs.
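Concretely, a cassette on disk might look roughly like this — a hand-written sketch assembled from the field names above, not the exact serialized schema (the span layout in particular is an assumption):

```json
{
  "id": "c1a2b3",
  "name": "weather_agent_test",
  "agent_name": "weather_agent",
  "framework": "manual",
  "input_text": "What's the weather in Paris?",
  "output_text": "It's 18°C and cloudy in Paris right now.",
  "total_tokens": 135,
  "total_cost_usd": 0.0008,
  "llm_call_count": 1,
  "tool_call_count": 1,
  "fingerprint": "…",
  "spans": [
    {
      "kind": "tool_call",
      "tool_name": "get_weather",
      "tool_args": {"city": "Paris"},
      "tool_result": {"temp": 18, "condition": "cloudy"},
      "duration_ms": 120,
      "error": null
    },
    {
      "kind": "llm_request",
      "model": "gpt-4o",
      "input": "User asked about weather. Tool returned: cloudy 18°C",
      "output": "It's 18°C and cloudy in Paris right now.",
      "token_usage": {"prompt": 120, "completion": 15},
      "cost_usd": 0.0008
    }
  ]
}
```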
Contributing
```bash
git clone https://github.com/beyhangl/evalcraft
cd evalcraft
pip install -e ".[dev]"
pytest
```

- Format: `ruff format .`
- Lint: `ruff check .`
- Type check: `mypy evalcraft/`
PRs welcome. Please open an issue first for significant changes. See CONTRIBUTING.md for details.
License
MIT © 2026 Beyhan Gül. See LICENSE.