
Project description

evalcraft

The pytest for AI agents. Capture, replay, mock, and evaluate agent behavior — without burning API credits on every test run.



The problem

Agent testing is broken:

  • Expensive. Running 200 tests against GPT-4 costs real money. Every commit.
  • Non-deterministic. Tests fail randomly because LLMs aren't functions.
  • No CI/CD story. You can't gate deploys on eval results if evals take 10 minutes and cost $5.

Evalcraft fixes this by recording agent runs as cassettes (like VCR for HTTP), then replaying them deterministically. Your test suite goes from 10 minutes + $5 to 200ms + $0.
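The record/replay idea can be sketched in a few lines of plain Python. This is an illustration of the concept only, not evalcraft's internals: the agent's LLM calls are captured to a JSON cassette once, then served back from the cassette on later runs.

```python
import json

def record(agent, prompt, path, llm):
    """Run the agent against the real LLM once, saving every call to a cassette."""
    spans = []

    def recording_llm(message):
        output = llm(message)  # the real (slow, paid) call
        spans.append({"input": message, "output": output})
        return output

    answer = agent(prompt, recording_llm)
    with open(path, "w") as f:
        json.dump({"input": prompt, "output": answer, "spans": spans}, f)
    return answer

def replay(agent, path):
    """Re-run the agent, feeding it the recorded responses: zero API calls."""
    with open(path) as f:
        cassette = json.load(f)
    responses = iter(span["output"] for span in cassette["spans"])
    return agent(cassette["input"], lambda message: next(responses))
```

Because the replayed run consumes the same responses in the same order, it is deterministic and free, which is what makes it safe to gate CI on.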


How it works

  Your Agent
      │
      ▼
┌─────────────┐    record     ┌──────────────┐
│  CaptureCtx │ ────────────► │   Cassette   │  (plain JSON, git-friendly)
│             │               │  (spans[])   │
└─────────────┘               └──────┬───────┘
                                     │
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
              replay()          MockLLM /        assert_*()
           (zero API calls)    MockTool()       (scorers)
                    │                                 │
                    └──────────────┬──────────────────┘
                                   ▼
                            pytest / CI gate
                           (200ms, $0.00)

Install

pip install evalcraft

# With pytest plugin
pip install "evalcraft[pytest]"

# With framework adapters
pip install "evalcraft[openai]"     # OpenAI SDK adapter
pip install "evalcraft[langchain]"  # LangChain/LangGraph adapter

# Everything
pip install "evalcraft[all]"

5-minute quickstart

1. Capture an agent run

from evalcraft import CaptureContext

with CaptureContext(
    name="weather_agent_test",
    agent_name="weather_agent",
    save_path="tests/cassettes/weather.json",
) as ctx:
    ctx.record_input("What's the weather in Paris?")

    # Run your agent — wrap tool/LLM calls with record_* methods
    ctx.record_tool_call("get_weather", args={"city": "Paris"}, result={"temp": 18, "condition": "cloudy"})
    ctx.record_llm_call(
        model="gpt-4o",
        input="User asked about weather. Tool returned: cloudy 18°C",
        output="It's 18°C and cloudy in Paris right now.",
        prompt_tokens=120,
        completion_tokens=15,
        cost_usd=0.0008,
    )

    ctx.record_output("It's 18°C and cloudy in Paris right now.")

cassette = ctx.cassette
print(f"Captured {cassette.tool_call_count} tool calls, ${cassette.total_cost_usd:.4f}")
# Captured 1 tool calls, $0.0008

2. Replay without API calls

from evalcraft import replay

# Loads the cassette and replays all spans — zero LLM calls
run = replay("tests/cassettes/weather.json")

assert run.replayed is True
assert run.cassette.output_text == "It's 18°C and cloudy in Paris right now."

3. Assert tool behavior

from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under

run = replay("tests/cassettes/weather.json")

assert assert_tool_called(run, "get_weather").passed
assert assert_tool_called(run, "get_weather", with_args={"city": "Paris"}).passed
assert assert_cost_under(run, max_usd=0.05).passed

4. Mock LLM responses

from evalcraft import MockLLM, MockTool, CaptureContext

llm = MockLLM()
llm.add_response("*", "It's sunny and 22°C.")  # wildcard match

search = MockTool("web_search")
search.returns({"results": [{"title": "Weather Paris", "snippet": "Sunny, 22°C"}]})

with CaptureContext(name="mocked_run", save_path="tests/cassettes/mocked.json") as ctx:
    ctx.record_input("Weather in Paris?")

    search_result = search.call(query="Paris weather today")
    response = llm.complete(f"Search result: {search_result}")

    ctx.record_output(response.content)

search.assert_called(times=1)
search.assert_called_with(query="Paris weather today")
llm.assert_called(times=1)

5. Use with pytest

# tests/test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_called(run, "get_weather")
    assert result.passed, result.message

def test_agent_tool_sequence():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_order(run, ["get_weather"])
    assert result.passed, result.message

def test_agent_cost_budget():
    run = replay("tests/cassettes/weather.json")
    result = assert_cost_under(run, max_usd=0.01)
    assert result.passed, result.message

def test_agent_output():
    run = replay("tests/cassettes/weather.json")
    assert "Paris" in run.cassette.output_text or "cloudy" in run.cassette.output_text
Run the suite:

pytest tests/ -v
# 200ms, $0.00

Examples

Four complete, self-contained example projects — each with pre-recorded cassettes, working test suites, and step-by-step READMEs.

  • openai-agent/ (customer support agent for "ShopEasy"): OpenAIAdapter, tool call assertions, golden sets, MockLLM + MockTool unit tests
  • anthropic-agent/ (code review bot for PRs via Claude): AnthropicAdapter, multi-turn testing, security assertions, add_sequential_responses
  • langgraph-workflow/ (RAG policy Q&A pipeline): LangGraphAdapter, node-order assertions, SpanKind.AGENT_STEP inspection, citation validation
  • ci-pipeline/ (GitHub Actions CI gate): GitHub Actions workflow, standalone gate script, cassette refresh strategy

Run any example in 60 seconds (no API key needed)

cd examples/openai-agent
pip install -r requirements.txt
pytest tests/ -v
# 15 tests pass in ~0.3s, $0.00

All cassettes are pre-recorded and committed to the repo. Tests replay them deterministically — no API key, no network calls, no cost.


Why Evalcraft?

Compared with Braintrust, LangSmith, and Promptfoo: Evalcraft offers cassette-based replay, zero-cost CI testing, a pytest-native workflow, mock LLMs and tools, and framework-agnostic capture, and it is free, open source, and self-hostable. Braintrust and LangSmith are paid SaaS products centered on observability dashboards; Promptfoo is also free/OSS.

Evalcraft is a testing tool, not an observability platform. Use Braintrust or LangSmith for production tracing; use Evalcraft to keep your test suite fast and free.


Features

  • Capture: record every LLM call, tool use, and agent decision as a cassette
  • Replay: re-run cassettes deterministically, with no API calls and zero cost
  • Mock LLM: substitute real LLMs with deterministic mocks (exact / pattern / wildcard matching)
  • Mock tools: mock any tool with static, dynamic, sequential, or error-simulating responses
  • Scorers: built-in assertions for tool calls, output content, cost, latency, and tokens
  • Diff: compare two cassette runs to detect regressions
  • CLI: evalcraft replay, evalcraft diff, and evalcraft inspect from your terminal
  • pytest plugin: native fixtures and markers (cassette, mock_llm, @pytest.mark.evalcraft)
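Exact / pattern / wildcard mock matching can be pictured with a small sketch. This is a hypothetical illustration in the spirit of MockLLM, using stdlib fnmatch, and is not the library's actual implementation:

```python
from fnmatch import fnmatch

class TinyMockLLM:
    """Illustrative mock: rules are checked in registration order,
    so register specific patterns before the "*" wildcard."""

    def __init__(self):
        self._rules = []  # list of (pattern, response) pairs

    def add_response(self, pattern, response):
        self._rules.append((pattern, response))

    def complete(self, prompt):
        for pattern, response in self._rules:
            # exact match, or glob-style pattern/wildcard match
            if prompt == pattern or fnmatch(prompt, pattern):
                return response
        raise KeyError(f"no mock response matches: {prompt!r}")
```

First-match-wins ordering is one reasonable design; the real MockLLM may resolve precedence differently.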

Supported frameworks

  • OpenAI SDK: evalcraft.adapters.openai auto-records all chat.completions.create calls
  • LangGraph: evalcraft.adapters.langgraph provides a callback handler for graphs and chains
  • Any agent: manual record_tool_call / record_llm_call works with any framework

OpenAI

from evalcraft.adapters.openai import patch_openai
from evalcraft import CaptureContext
import openai

patch_openai(openai)  # all subsequent calls are auto-recorded

with CaptureContext(name="openai_run", save_path="tests/cassettes/openai_run.json") as ctx:
    ctx.record_input("Summarize the French Revolution")

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize the French Revolution"}],
    )

    ctx.record_output(response.choices[0].message.content)

LangGraph

from evalcraft.adapters.langgraph import EvalcraftCallbackHandler
from evalcraft import CaptureContext

handler = EvalcraftCallbackHandler()

with CaptureContext(name="langgraph_run", save_path="tests/cassettes/lg_run.json") as ctx:
    ctx.record_input("Plan a trip to Tokyo")

    graph = build_travel_agent()
    result = graph.invoke(
        {"messages": [{"role": "user", "content": "Plan a trip to Tokyo"}]},
        config={"callbacks": [handler]},
    )

    ctx.record_output(result["messages"][-1].content)

CLI reference

evalcraft [command] [options]

evalcraft replay

evalcraft replay tests/cassettes/weather.json
evalcraft replay tests/cassettes/weather.json --override get_weather='{"temp": 5, "condition": "snow"}'

evalcraft diff

evalcraft diff tests/cassettes/weather_v1.json tests/cassettes/weather_v2.json
# Tool sequence: ['get_weather'] → ['get_weather', 'send_alert']
# Output text changed
# Tokens: 135 → 210
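Conceptually, a cassette diff reduces to comparing a few recorded fields. A minimal sketch of that comparison follows; the span field names ("kind", "tool_name") are assumed from the data model section, and the CLI's real logic may differ:

```python
def diff_cassettes(old, new):
    """Compare two cassette dicts; return a list of human-readable changes."""
    changes = []
    # Ordered tool sequences, taken from the tool_call spans
    old_tools = [s["tool_name"] for s in old["spans"] if s.get("kind") == "tool_call"]
    new_tools = [s["tool_name"] for s in new["spans"] if s.get("kind") == "tool_call"]
    if old_tools != new_tools:
        changes.append(f"Tool sequence: {old_tools} -> {new_tools}")
    if old["output_text"] != new["output_text"]:
        changes.append("Output text changed")
    if old["total_tokens"] != new["total_tokens"]:
        changes.append(f"Tokens: {old['total_tokens']} -> {new['total_tokens']}")
    return changes
```

An empty result means the two runs agree on tool order, output, and token count, which is the "no regression" case a CI gate cares about.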

evalcraft inspect

evalcraft inspect tests/cassettes/weather.json
evalcraft inspect tests/cassettes/weather.json --kind tool_call

evalcraft run

evalcraft run tests/cassettes/
# ✓ weather.json   (3 spans, $0.0008, 450ms)
# ✓ search.json    (7 spans, $0.0021, 1200ms)
# 2/2 passed

Data model

Cassette
├── id, name, agent_name, framework
├── input_text, output_text
├── total_tokens, total_cost_usd, total_duration_ms
├── llm_call_count, tool_call_count
├── fingerprint  (SHA-256 of span content — detects regressions)
└── spans[]
    ├── Span (llm_request / llm_response)
    │   ├── model, token_usage, cost_usd
    │   └── input, output
    └── Span (tool_call)
        ├── tool_name, tool_args, tool_result
        └── duration_ms, error

Cassettes are plain JSON — check them into git, diff them in PRs.
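The README describes the fingerprint as a SHA-256 of span content. One way such a fingerprint could be computed is sketched below; the canonicalization step (sorted keys, compact separators) is an assumption, not evalcraft's documented behavior:

```python
import hashlib
import json

def fingerprint(spans):
    """Hash the span list into a stable hex digest.

    Serializing with sort_keys=True makes the digest independent of dict
    key order, so only a change in actual span content changes the hash.
    """
    canonical = json.dumps(spans, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the digest is deterministic, comparing fingerprints across two runs is a cheap first-pass regression check before a full diff.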


Contributing

git clone https://github.com/beyhangl/evalcraft
cd evalcraft
pip install -e ".[dev]"
pytest
  • Format: ruff format .
  • Lint: ruff check .
  • Type check: mypy evalcraft/

PRs welcome. Please open an issue first for significant changes. See CONTRIBUTING.md for details.


License

MIT © 2026 Beyhan Gül. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

evalcraft-0.1.0.tar.gz (312.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

evalcraft-0.1.0-py3-none-any.whl (99.3 kB)

Uploaded Python 3

File details

Details for the file evalcraft-0.1.0.tar.gz.

File metadata

  • Download URL: evalcraft-0.1.0.tar.gz
  • Size: 312.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for evalcraft-0.1.0.tar.gz
Algorithm Hash digest
SHA256 28c9bb6a244023646da3fa0e54240616710959633e362ef03b5af24e15ae4a4e
MD5 49bb9ef693f847ae5e2b598ed7259215
BLAKE2b-256 c656cc35928d8f2f15e21ac0b1c42ec4f6870a4e00c6d7cebca71e700c1922d0

See more details on using hashes here.

Provenance

The following attestation bundles were made for evalcraft-0.1.0.tar.gz:

Publisher: publish.yml on beyhangl/evalcraft

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file evalcraft-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: evalcraft-0.1.0-py3-none-any.whl
  • Size: 99.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for evalcraft-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 73fd636cbf9e2ecfe53bda98007ccab1e4bf10e261960f869df3f33a757b7371
MD5 53fe4d4ed122466631ae54680b0dfdaa
BLAKE2b-256 8ac2f0ffca6d7b795cf817b727f417d689a0b4c953fa7690c34a172cfffea133

See more details on using hashes here.

Provenance

The following attestation bundles were made for evalcraft-0.1.0-py3-none-any.whl:

Publisher: publish.yml on beyhangl/evalcraft

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
