VCR for AI agents — record agent runs as cassettes and replay them deterministically in CI for $0.

Evalcraft

Deterministic tests for AI agents — generated from one real run.

Capture an agent run and evalcraft writes a pytest that locks its tool calls, output shape, and cost — then replays it in CI for $0. Like VCR for HTTP, but it writes the agent tests for you.

Get Started in 60 Seconds

pip install evalcraft
evalcraft init                # scaffolds tests/cassettes/ and a sample test
pytest --evalcraft            # run with recording

That's it. Your first cassette is recorded. Commit it to git and it replays for free on every future run. See the 5-minute quickstart for the full walkthrough.


The problem

Agent testing is broken:

  • Expensive. Running 200 tests against GPT-4.1 costs real money. Every commit.
  • Non-deterministic. Tests fail randomly because LLMs aren't functions.
  • No CI/CD story. You can't gate deploys on eval results if evals take 10 minutes and cost $5.

Evalcraft records agent runs as cassettes (like VCR for HTTP) and replays them deterministically — so the tests that exercise your agent's plumbing (tool wiring, control flow, output shape, cost/latency budgets) drop from 10 minutes + $5 to 200ms + $0. For the questions that genuinely need a live model — quality, drift, LLM-judge, RAG — run live-eval on a schedule.


How it works

  Your Agent
      |
      v
+-------------+    record     +--------------+
|  CaptureCtx | ------------> |   Cassette   |  (plain JSON, git-friendly)
|             |               |  (spans[])   |
+-------------+               +------+-------+
                                     |
                    +----------------+----------------+
                    v                v                v
              replay()          MockLLM /        assert_*()
           (zero API calls)    MockTool()       (scorers)
                    |                                 |
                    +----------------+----------------+
                                     v
                            pytest / CI gate
                           (200ms, $0.00)
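
To make the record/replay flow in the diagram concrete, here is a minimal standard-library sketch of the idea. It is not Evalcraft's implementation: the helpers `record` and `replay_result` and the `kind` field are hypothetical, and only the span field names `tool_name`, `tool_args`, and `tool_result` are borrowed from the data model described later in this README.

```python
import json
import tempfile
from pathlib import Path


def record(path, fn, *args):
    """Call fn for real and save its arguments and result as a JSON 'cassette'."""
    result = fn(*args)
    span = {
        "kind": "tool_call",          # hypothetical field name
        "tool_name": fn.__name__,
        "tool_args": list(args),
        "tool_result": result,
    }
    Path(path).write_text(json.dumps({"spans": [span]}, indent=2))
    return result


def replay_result(path, tool_name):
    """Return the recorded result for tool_name without calling the tool."""
    cassette = json.loads(Path(path).read_text())
    for span in cassette["spans"]:
        if span["tool_name"] == tool_name:
            return span["tool_result"]
    raise KeyError(tool_name)


calls = []


def get_weather(city):
    calls.append(city)  # stands in for a slow, paid API call
    return {"city": city, "temp": 18}


cassette_path = Path(tempfile.mkdtemp()) / "weather.json"
live = record(cassette_path, get_weather, "Paris")          # one real call
replayed = replay_result(cassette_path, "get_weather")      # no call at all
```

The replayed value comes straight from the JSON file, which is why a replayed test needs no network access or API key.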

Install

pip install evalcraft

# With pytest plugin
pip install "evalcraft[pytest]"

# With framework adapters
pip install "evalcraft[openai]"       # OpenAI SDK adapter
pip install "evalcraft[anthropic]"    # Anthropic SDK adapter
pip install "evalcraft[gemini]"       # Google Gemini adapter
pip install "evalcraft[pydantic-ai]"  # Pydantic AI adapter
pip install "evalcraft[langchain]"    # LangChain/LangGraph adapter

# Everything
pip install "evalcraft[all]"

5-minute quickstart

1. Capture an agent run

from evalcraft import CaptureContext

with CaptureContext(
    name="weather_agent_test",
    agent_name="weather_agent",
    save_path="tests/cassettes/weather.json",
) as ctx:
    ctx.record_input("What's the weather in Paris?")

    # Run your agent — wrap tool/LLM calls with record_* methods
    ctx.record_tool_call("get_weather", args={"city": "Paris"}, result={"temp": 18, "condition": "cloudy"})
    ctx.record_llm_call(
        model="gpt-4.1-mini",
        input="User asked about weather. Tool returned: cloudy 18C",
        output="It's 18C and cloudy in Paris right now.",
        prompt_tokens=120,
        completion_tokens=15,
        cost_usd=0.0003,
    )

    ctx.record_output("It's 18C and cloudy in Paris right now.")

cassette = ctx.cassette
print(f"Captured {cassette.tool_call_count} tool calls, ${cassette.total_cost_usd:.4f}")
# Captured 1 tool calls, $0.0003

2. Replay without API calls

from evalcraft import replay

# Loads the cassette and replays all spans — zero LLM calls
run = replay("tests/cassettes/weather.json")

assert run.replayed is True
assert run.cassette.output_text == "It's 18C and cloudy in Paris right now."

3. Assert tool behavior

from evalcraft import replay, assert_tool_called, assert_cost_under

run = replay("tests/cassettes/weather.json")

assert assert_tool_called(run, "get_weather").passed
assert assert_tool_called(run, "get_weather", with_args={"city": "Paris"}).passed
assert assert_cost_under(run, max_usd=0.05).passed
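
For intuition, checks like the ones above can be thought of as simple scans over the recorded spans. The sketch below is an assumption about how such checks could work, not Evalcraft's source: `CheckResult`, `tool_called`, `cost_under`, and the `kind` field are hypothetical names, and `with_args` is treated as a subset match.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    message: str


def tool_called(spans, name, with_args=None):
    """Pass if some tool_call span matches name and contains every expected argument."""
    for span in spans:
        if span.get("kind") != "tool_call" or span.get("tool_name") != name:
            continue
        args = span.get("tool_args", {})
        if with_args is None or all(k in args and args[k] == v for k, v in with_args.items()):
            return CheckResult(True, f"{name} was called")
    return CheckResult(False, f"{name} was not called with {with_args}")


def cost_under(spans, max_usd):
    """Pass if the summed cost of all spans stays within the budget."""
    total = sum(span.get("cost_usd", 0.0) for span in spans)
    return CheckResult(total <= max_usd, f"total cost ${total:.4f}, budget ${max_usd:.4f}")


spans = [
    {"kind": "tool_call", "tool_name": "get_weather", "tool_args": {"city": "Paris"}},
    {"kind": "llm_response", "model": "gpt-4.1-mini", "cost_usd": 0.0003},
]
```

Returning a result object with a message, rather than raising, is what makes the `assert result.passed, result.message` pattern in the pytest examples possible.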

4. LLM-as-Judge evaluation

⚠️ These are live scorers. Unlike replay + the structural scorers (which are offline, deterministic, and $0), the LLM-as-Judge / RAG / pairwise scorers call a real model at test time — they cost money, need an API key, and are non-deterministic (use eval_n + confidence intervals). See Offline vs. live scorers.

from evalcraft import replay, assert_output_semantic, assert_factual_consistency

run = replay("tests/cassettes/weather.json")

# Semantic evaluation — uses an LLM to judge output quality
result = assert_output_semantic(run, criteria="Mentions temperature and city name")
assert result.passed

# Factual consistency check
result = assert_factual_consistency(run, ground_truth="Paris is 18C and cloudy")
assert result.passed

5. RAG evaluation metrics

from evalcraft import replay, assert_faithfulness, assert_answer_relevance

run = replay("tests/cassettes/rag_agent.json")
contexts = ["Paris has a population of 2.1 million...", "The Eiffel Tower..."]

# Does the output stay faithful to retrieved context?
assert assert_faithfulness(run, contexts=contexts).passed

# Does the answer address the original question?
assert assert_answer_relevance(run, query="Tell me about Paris").passed

6. Use with pytest

# tests/test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_called(run, "get_weather")
    assert result.passed, result.message

def test_agent_cost_budget():
    run = replay("tests/cassettes/weather.json")
    result = assert_cost_under(run, max_usd=0.01)
    assert result.passed, result.message

pytest tests/ -v
# 200ms, $0.00

7. Pairwise A/B comparison

from evalcraft import pairwise_compare, pairwise_rank

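# Compare two agent outputs; an LLM judge picks the winner.
# cassette_a and cassette_b are assumed to be two cassettes you have already recorded and loaded.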
result = pairwise_compare(cassette_a, cassette_b, criteria="Which is more helpful?")
print(result.winner)      # "A", "B", or "tie"
print(result.confidence)  # 0.0-1.0

# Rank multiple agents via round-robin tournament
rankings = pairwise_rank([agent_a, agent_b, agent_c], criteria="Accuracy and helpfulness")
for entry in rankings:
    print(f"{entry.name}: {entry.wins}W/{entry.losses}L (score {entry.score:.2f})")

Position bias is mitigated by randomizing presentation order.
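
As an illustration of that mitigation, the sketch below shuffles which output the judge sees first and then maps the verdict back to the original labels. It is a generic technique sketch, not Evalcraft's code: `judge_pair` and the stub judge `longer_wins` are hypothetical, and a real judge would be an LLM call.

```python
import random


def judge_pair(output_a, output_b, judge, rng):
    """Ask judge to compare two outputs, presenting them in a random order.

    judge(first, second) returns "first", "second", or "tie".
    The verdict is mapped back to the original labels "A" / "B".
    """
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    verdict = judge(first, second)
    if verdict == "tie":
        return "tie"
    picked_first = verdict == "first"
    if swapped:
        return "B" if picked_first else "A"
    return "A" if picked_first else "B"


def longer_wins(first, second):
    """Stub judge standing in for an LLM: prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


rng = random.Random(0)
winners = {judge_pair("short", "a much longer answer", longer_wins, rng) for _ in range(20)}
```

With an unbiased judge the winner is the same whichever order is shown. A judge that always favors the first slot would instead split its verdicts between A and B, which makes the bias visible rather than silently rewarding one side.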

8. Statistical evaluation with confidence intervals

from evalcraft import replay, eval_n, assert_output_semantic

run = replay("tests/cassettes/weather.json")

# Run the scorer 5 times. LLM judges are non-deterministic, so a single run tells you little.
result = eval_n(run, assert_output_semantic, n=5, criteria="Mentions the city name")
assert result.pass_rate >= 0.8

print(f"Pass rate: {result.pass_rate:.0%} ({result.passes}/{result.n})")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
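
The feature list describes these bounds as Wilson score intervals. For reference, here is a standalone sketch of the standard Wilson formula. The function name `wilson_interval` and the default `z=1.96` (a 95% interval) are choices made for this example, not necessarily what Evalcraft uses internally.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate of passes/n at the given z value."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - margin), min(1.0, center + margin)


lower, upper = wilson_interval(passes=4, n=5)
print(f"4/5 passes -> 95% CI [{lower:.2f}, {upper:.2f}]")
```

For 4 passes out of 5 the interval is roughly 0.38 to 0.96, which shows how little five runs pin down the true pass rate and why a threshold on the point estimate alone can mislead.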

9. Auto-generate tests from cassettes

evalcraft generate-tests tests/cassettes/weather.json -o tests/test_weather.py
# Generates a complete pytest file with tool, output, cost, token, and latency assertions

10. Diagnose your setup

evalcraft doctor
#   ✓ Python 3.11.5
#   ✓ evalcraft 0.1.0
#   ✓ openai 2.30.0
#   ! anthropic not installed
#   ✓ OPENAI_API_KEY configured
#   ✓ Cassette directory: tests/cassettes/ (3 cassettes)
#   ! 1 stale cassette (>30 days old)
#   ✓ pytest plugin registered

Examples

Four complete, self-contained example projects — each with pre-recorded cassettes, working test suites, and step-by-step READMEs.

| Example | Scenario | What it demonstrates |
|---|---|---|
| openai-agent/ | Customer support agent (ShopEasy) | OpenAIAdapter, tool call assertions, golden sets, MockLLM + MockTool unit tests |
| anthropic-agent/ | Code review bot (PRs via Claude) | AnthropicAdapter, multi-turn testing, security assertions, add_sequential_responses |
| langgraph-workflow/ | RAG policy Q&A pipeline | LangGraphAdapter, node-order assertions, SpanKind.AGENT_STEP inspection, citation validation |
| ci-pipeline/ | GitHub Actions CI gate | GitHub Actions workflow, standalone gate script, cassette refresh strategy |

Run any example in 60 seconds (no API key needed)

cd examples/openai-agent
pip install -r requirements.txt
pytest tests/ -v
# 15 tests pass in ~0.3s, $0.00

All cassettes are pre-recorded and committed to the repo. Tests replay them deterministically — no API key, no network calls, no cost.


How Evalcraft compares

An honest comparison against the closest tools. ✅ first-class · ⚠️ partial / via integration · ❌ no · — not applicable.

Evalcraft DeepEval Promptfoo LangSmith Braintrust Ragas
Git-committed cassette replay
Zero-cost CI re-runs ✅ replay ✅ cache ✅ cache ⚠️
pytest-native ❌ CLI/YAML ⚠️ library
First-class Mock LLM / Tools
LLM-as-Judge scoring
RAG metrics ⚠️ ⚠️ ✅ reference
Pairwise A/B ⚠️
Statistical eval w/ confidence intervals ✅ Wilson ⚠️ ⚠️ repeat ⚠️ ⚠️
Auto-generate tests from runs
OSS / self-hostable ⚠️ enterprise ❌ enterprise
Primary focus CI / glue testing LLM eval framework eval + red-team tracing + eval eval + observability RAG metrics
Pricing Free / OSS Free / OSS (+cloud) Free / OSS Paid SaaS (free tier) Paid SaaS (free tier) Free / OSS

What's genuinely distinctive (vs. the table-stakes everyone has): git-committed, PR-diffable cassettes capturing full agent traces (LLM + tool + steps); auto-generating a pytest file from a recorded run; first-class MockLLM / MockTool; and a packaged Wilson-interval statistical helper.

Honest caveats:

  • Zero-cost CI is not unique — Promptfoo (disk cache, on by default) and DeepEval (-c) already make re-runs free. Evalcraft's angle is deterministic replay of a committed artifact, not a lower bill per se.
  • Replay only re-checks a recorded run. It does not re-execute the live model, so on its own it can't catch model/prompt/retrieval drift — see what replay does and doesn't test. For drift, re-record or run a live eval.
  • The LLM-as-Judge, RAG, and pairwise scorers make real, paid model calls at test time — they are not part of the $0 deterministic path.
  • Other strong OSS/self-hostable options not shown: Langfuse, Arize Phoenix, Inspect AI.

Evalcraft is a testing tool for your agent's deterministic glue + budgets — not an observability platform. Use Braintrust / LangSmith / Langfuse for production tracing; use Evalcraft to keep that layer of your suite fast and committed to git.

Sources for the contested rows: Promptfoo caching · DeepEval CI/CD + cache · LangSmith pairwise


Features

| Feature | Description |
|---|---|
| Capture | Record every LLM call, tool use, and agent decision as a cassette |
| Replay | Re-run cassettes deterministically, with no API calls and zero cost |
| Mock LLM | Substitute real LLMs with deterministic mocks (exact / pattern / wildcard) |
| Mock Tools | Mock any tool with static, dynamic, sequential, or error-simulating responses |
| Scorers | 19 built-in assertions: tool calls, output, cost, latency, tokens, LLM-as-Judge, RAG metrics |
| LLM-as-Judge | Semantic evaluation, factual consistency, tone, custom criteria (via OpenAI or Anthropic) |
| RAG Metrics | Faithfulness, context relevance, answer relevance, context recall |
| Pairwise A/B | Arena-style comparison: an LLM judge picks the winner, with position-bias mitigation |
| Statistical Eval | Run scorers N times, get pass rate with Wilson score confidence intervals |
| Diff | Compare two cassette runs to detect regressions |
| Golden Sets | Version baselines and detect regressions automatically |
| Auto-generate | evalcraft generate-tests creates pytest files from cassettes |
| CLI | 14 commands: replay, diff, eval, generate-tests, doctor, golden, regression, sanitize, ... |
| pytest plugin | Native fixtures and markers: cassette, mock_llm, @pytest.mark.evalcraft |
| CI Gate | GitHub Action with PR comments, score thresholds, regression detection |
| JS/TS SDK | TypeScript SDK (pre-release, source-only): capture/replay, mocks, 16 scorers, OpenAI/Gemini/Vercel AI adapters |
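
To illustrate the exact / pattern / wildcard matching mentioned for Mock LLM, here is a small self-contained sketch. `TinyMockLLM` and its methods are hypothetical and written only for this example. Evalcraft's real MockLLM API may look quite different.

```python
import re


class TinyMockLLM:
    """Deterministic stand-in for an LLM: exact match, then regex, then fallback."""

    def __init__(self):
        self._exact = {}
        self._patterns = []
        self._default = None

    def add_exact(self, prompt, response):
        self._exact[prompt] = response

    def add_pattern(self, regex, response):
        self._patterns.append((re.compile(regex), response))

    def set_default(self, response):
        """Wildcard: used when nothing else matches."""
        self._default = response

    def complete(self, prompt):
        if prompt in self._exact:
            return self._exact[prompt]
        for pattern, response in self._patterns:
            if pattern.search(prompt):
                return response
        if self._default is None:
            raise LookupError(f"no mock response for {prompt!r}")
        return self._default


llm = TinyMockLLM()
llm.add_exact("ping", "pong")
llm.add_pattern(r"weather in (\w+)", "It is 18C and cloudy.")
llm.set_default("I don't know.")
```

Checking the most specific rule first keeps tests predictable: an exact prompt always wins over a pattern, and the wildcard only catches what is left.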

Supported frameworks

| Framework | Adapter | Install |
|---|---|---|
| OpenAI SDK | OpenAIAdapter: auto-records chat.completions.create (sync + async) | evalcraft[openai] |
| Anthropic SDK | AnthropicAdapter: auto-records messages.create (sync + async) | evalcraft[anthropic] |
| Google Gemini | GeminiAdapter: auto-records generate_content (sync + async) | evalcraft[gemini] |
| Pydantic AI | PydanticAIAdapter: auto-records agent.run / agent.run_sync | evalcraft[pydantic-ai] |
| LangGraph | LangGraphAdapter: callback handler for graphs and chains | evalcraft[langchain] |
| CrewAI | CrewAIAdapter: instruments Crew.kickoff() | evalcraft[crewai] |
| AutoGen | AutoGenAdapter: captures multi-agent conversations | evalcraft[autogen] |
| LlamaIndex | LlamaIndexAdapter: hooks into query/retrieval pipeline | evalcraft[llamaindex] |
| Any agent | Manual record_tool_call / record_llm_call works with any framework | |

OpenAI

from evalcraft.adapters import OpenAIAdapter
from evalcraft import CaptureContext
import openai

client = openai.OpenAI()

with CaptureContext(name="openai_run", save_path="tests/cassettes/openai_run.json") as ctx:
    with OpenAIAdapter():  # auto-records all LLM + tool calls
        ctx.record_input("Summarize the French Revolution")

        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": "Summarize the French Revolution"}],
        )

        ctx.record_output(response.choices[0].message.content)

Gemini

from evalcraft.adapters import GeminiAdapter
from evalcraft import CaptureContext
import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-2.0-flash")

with CaptureContext(name="gemini_run", save_path="tests/cassettes/gemini_run.json") as ctx:
    with GeminiAdapter():
        ctx.record_input("What is quantum computing?")
        response = model.generate_content("What is quantum computing?")
        ctx.record_output(response.text)

Pydantic AI

from evalcraft.adapters import PydanticAIAdapter
from evalcraft import CaptureContext
from pydantic_ai import Agent

agent = Agent("openai:gpt-4.1-mini", system_prompt="You are helpful.")

with CaptureContext(name="pydantic_run", save_path="tests/cassettes/pydantic_run.json") as ctx:
    with PydanticAIAdapter():
        ctx.record_input("What's the weather?")
        result = agent.run_sync("What's the weather?")
        ctx.record_output(result.data)  # newer pydantic-ai releases expose this as result.output

CI/CD integration

GitHub Action

# .github/workflows/evalcraft.yml
- uses: beyhangl/evalcraft@v1
  with:
    test-path: tests/
    cassette-dir: tests/cassettes
    max-cost: '0.50'
    max-regression: '10'
    post-comment: 'true'

The action runs your agent tests, checks cost/regression thresholds, and posts a results table as a PR comment. See examples/ci-pipeline/ for a complete workflow.


Catching drift: live-eval

Replay is deterministic and free because it doesn't run your model — which is exactly why it can't catch model/prompt/retrieval drift. Live-eval is the complementary layer: it runs your real agent over a golden set of inputs, scores the live output, and gates CI when quality regresses against a baseline.

from evalcraft.eval.live import LiveEvalCase, LiveEvalResult, run_live_eval, compare_to_baseline
from evalcraft import assert_output_contains

cases = [LiveEvalCase(name="paris", input="Weather in Paris?",
                      scorers=[lambda c: assert_output_contains(c, "Paris")])]

def runner(case):
    return my_agent.run(case.input)   # your REAL agent — paid, non-deterministic

result = run_live_eval(cases, runner)
comparison = compare_to_baseline(
    result, LiveEvalResult.load("live-baseline.json"), max_score_drop=0.1
)
assert comparison.passed, comparison.summary()

Run it nightly or as a release gate (not on every commit). See Live Eval.
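
The gating idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the behavior of `compare_to_baseline`: the helper `compare_scores` is hypothetical, scores are assumed to lie in [0, 1], and the per-case comparison is one possible reading of `max_score_drop`.

```python
def compare_scores(current, baseline, max_score_drop=0.1):
    """Compare per-case scores against a baseline.

    current / baseline map a case name to a score in [0, 1].
    Returns (passed, regressions), where regressions maps each case
    that dropped by more than max_score_drop to the size of its drop.
    A case missing from current counts as a score of 0.
    """
    regressions = {}
    for name, base_score in baseline.items():
        drop = base_score - current.get(name, 0.0)
        if drop > max_score_drop:
            regressions[name] = round(drop, 4)
    return not regressions, regressions


baseline = {"paris": 1.0, "tokyo": 0.9}
current = {"paris": 0.95, "tokyo": 0.6}
passed, regressions = compare_scores(current, baseline)
```

Here the small dip on "paris" stays within tolerance while "tokyo" fails the gate, so a tolerance absorbs ordinary run-to-run noise without hiding a real regression.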


CLI reference

evalcraft [command] [options]

| Command | Description |
|---|---|
| evalcraft init | Scaffold a test project for your framework |
| evalcraft capture <script> | Run a script with capture enabled |
| evalcraft replay <cassette> | Replay a cassette (zero API calls) |
| evalcraft diff <old> <new> | Compare two cassettes |
| evalcraft eval <cassette> | Run assertions with thresholds |
| evalcraft info <cassette> | Inspect cassette metadata |
| evalcraft generate-tests <cassette> | Auto-generate a pytest file |
| evalcraft mock <cassette> | Generate MockLLM fixtures from a cassette |
| evalcraft golden save <cassette> | Save a golden-set baseline |
| evalcraft golden compare <cassette> | Compare against a baseline |
| evalcraft regression <cassette> | Detect regressions |
| evalcraft sanitize <cassette> | Redact PII and secrets |
| evalcraft doctor | Diagnose setup issues (deps, API keys, cassettes) |
| evalcraft live-eval <current> --baseline <b> | Gate a live-eval run vs a baseline (catch drift) |
| evalcraft check-stale <cassettes> --models <set> | Fail CI when a cassette's recorded model was retired or swapped |
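
As a rough picture of what a staleness check like the last command involves, the sketch below scans a cassette's spans for model names outside an allowed set. The helper `stale_models` is hypothetical, and the JSON layout (a top-level `spans` list whose LLM spans carry a `model` field) is assumed from the data model in this README rather than taken from the actual file format.

```python
import json


def stale_models(cassette, allowed_models):
    """Return sorted model names recorded in the cassette that are not allowed."""
    recorded = {
        span["model"]
        for span in cassette.get("spans", [])
        if span.get("model")
    }
    return sorted(recorded - set(allowed_models))


# Inline JSON standing in for a cassette file on disk.
cassette = json.loads(
    '{"spans": ['
    '{"kind": "llm_response", "model": "gpt-4.1-mini"},'
    '{"kind": "tool_call", "tool_name": "get_weather"}'
    ']}'
)
```

A CI script could exit non-zero whenever this list is non-empty, prompting a re-record against the model the agent actually uses now.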

Data model

Cassette
+-- id, name, agent_name, framework
+-- input_text, output_text
+-- total_tokens, total_cost_usd, total_duration_ms
+-- llm_call_count, tool_call_count
+-- fingerprint  (SHA-256 of span content -- changes when the recording changes)
+-- spans[]
    +-- Span (llm_request / llm_response)
    |   +-- model, token_usage, cost_usd
    |   +-- input, output
    +-- Span (tool_call)
        +-- tool_name, tool_args, tool_result
        +-- duration_ms, error

Cassettes are plain JSON — check them into git, diff them in PRs.
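
The fingerprint field can be understood through a short sketch: hash a canonical serialization of the spans so that any change in recorded content changes the digest. This is an assumption about the approach, not Evalcraft's exact algorithm. The function `fingerprint`, the sorted-key JSON canonicalization, and the `kind` field are choices made for this example.

```python
import hashlib
import json


def fingerprint(spans):
    """SHA-256 over a canonical JSON form of the spans (sorted keys, compact separators)."""
    canonical = json.dumps(spans, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spans = [{"kind": "tool_call", "tool_name": "get_weather", "tool_args": {"city": "Paris"}}]
# Same content, different key order: the digest should not change.
reordered = [{"tool_args": {"city": "Paris"}, "tool_name": "get_weather", "kind": "tool_call"}]
# Different recorded argument: the digest should change.
changed = [{"kind": "tool_call", "tool_name": "get_weather", "tool_args": {"city": "Rome"}}]
```

Sorting keys before hashing means cosmetic differences in key order do not alter the fingerprint, while a genuinely different recording does.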


TypeScript / JavaScript SDK

Status: pre-release (source-only). The JS/TS SDK is not yet published to npm. Until it is, build it from source from this repo:

git clone https://github.com/beyhangl/evalcraft
cd evalcraft/packages/evalcraft-js
npm install && npm run build   # emits dist/ (CJS + ESM + type defs)

import {
  CaptureContext, replay, assertToolCalled, assertCostUnder,  // Core
  assertOutputSemantic, assertTone, assertCustomCriteria,     // LLM-as-Judge
  assertFaithfulness, assertContextRelevance,                 // RAG metrics
} from 'evalcraft';
import { wrapOpenAI } from 'evalcraft/adapters/openai';
import { wrapGemini } from 'evalcraft/adapters/gemini';

The JS/TS SDK covers the core workflow — capture, replay, MockLLM/MockTool, and 16 scorers (8 core + 4 LLM-as-Judge + 4 RAG) — with OpenAI, Gemini, and Vercel AI adapters. It is not yet at full parity with the Python SDK.

Python vs JS/TS parity

Capability Python JS/TS
Capture / replay / cassettes
MockLLM / MockTool
Core scorers (tool / output / cost / latency / tokens) ✅ (8) ✅ (8)
LLM-as-Judge scorers ✅ (4) ✅ (4)
RAG metrics ✅ (4) ✅ (4)
Pairwise A/B
Statistical eval (eval_n)
Multi-judge jury / consensus
Hallucination detection
Golden sets / regression / trend
CLI + pytest plugin
Framework adapters 8 (OpenAI, Anthropic, Gemini, Pydantic AI, LangGraph, CrewAI, AutoGen, LlamaIndex) 3 (OpenAI, Gemini, Vercel AI)

Contributing

git clone https://github.com/beyhangl/evalcraft
cd evalcraft
pip install -e ".[dev]"
pytest

  • Format: ruff format .
  • Lint: ruff check .
  • Type check: mypy evalcraft/

PRs welcome. Please open an issue first for significant changes. See CONTRIBUTING.md for details.


Design Partners

We're looking for design partners. evalcraft is early (v0.1.0), and we'd like a few teams to help shape it. Partners get:

  • Hands-on setup help — we'll pair with you to get evalcraft into your CI pipeline
  • Direct access to the maintainer — not a support queue
  • Influence the roadmap — your use cases drive what we build next

Interested? Open an issue and say hi.


License

MIT © 2026 Beyhan Gul. See LICENSE.
