promptecho
Record & replay for LLM API calls. Like vcrpy / nock, but built for the way LLM traffic actually behaves.
Your LLM tests have three problems: they're flaky (non-deterministic outputs), slow (real network round-trips), and expensive (burning tokens in CI on every run). promptecho records each real API call once to a cassette file, then replays it forever — deterministically, instantly, for free.
import promptecho
from anthropic import Anthropic

@promptecho.use_cassette("cassettes/summarize.yaml")
def test_summarize():
    client = Anthropic()
    msg = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=100,
        messages=[{"role": "user", "content": "Summarize: the cat sat on the mat."}],
    )
    assert "cat" in msg.content[0].text.lower()
First run: one real call, recorded to cassettes/summarize.yaml — this needs the provider SDK installed (pip install anthropic) and a real ANTHROPIC_API_KEY in the environment.
Every run after: replayed from disk. No network, no tokens, no API key, no flake.
Proof, not marketing. The end-to-end test that gates every release records against a local server, shuts the server down, then replays. Same response, zero network. If the response can come back with the upstream gone, the cassette is genuinely doing the work, not acting as a partial proxy. See tests/test_record_replay.py.
Why not just use vcrpy?
You can — at the HTTP layer, vcrpy works on LLM calls today. promptecho exists because LLM traffic breaks vcrpy's assumptions in five specific ways:
- Matching. vcrpy matches on raw request bytes. LLM bodies carry volatile fields (client-injected IDs, reordered tools, whitespace) that change the bytes without changing the meaning, so byte-matching misses on replay. promptecho matches on a normalized fingerprint of the fields that determine the response, and canonicalizes across providers: it knows content: "hi" equals content: [{"type":"text","text":"hi"}], an Anthropic top-level system equals an OpenAI system-role message, and an Anthropic input_schema tool def equals an OpenAI function.parameters. A raw-bytes VCR can't.
- Streaming. Most LLM calls are SSE streams. promptecho records the event stream and faithfully re-emits it on replay, so stream=True and token-by-token iteration work identically against a cassette, including reasoning deltas.
- Binary / multimodal responses. vcrpy's text-based cassettes silently corrupt raw image/* / audio/* / octet-stream bodies. promptecho detects them by Content-Type and base64-encodes them in the cassette, so image-out and audio-out responses round-trip byte-exact.
- Debuggable CI failures. When a vcrpy cassette miss happens, you get "no match". promptecho prints the exact path that changed: messages[1].content: recorded "summarize the cat" / incoming "summarize the dog". Test failures are actionable, not detective work.
- Secrets. API keys live in headers on every call. promptecho redacts them by default, so a cassette is safe to commit.
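To make the matching and diffing ideas above concrete, here is a minimal sketch in plain Python. It is not promptecho's implementation: the function names (canonical_content, canonicalize, fingerprint, diff_paths), the SHA-256 hash, and the 16-character key length are all assumptions chosen for illustration. It shows only one canonicalization (string content versus a list of text blocks) and a path-wise diff that produces messages in the style quoted above.

```python
import hashlib
import json


def canonical_content(content):
    """Collapse a list of text blocks into the equivalent plain string."""
    if isinstance(content, list) and content and all(
        isinstance(b, dict) and b.get("type") == "text" for b in content
    ):
        return "".join(b["text"] for b in content)
    return content


def canonicalize(body, match_on):
    """Keep only the matched fields, with message content in one shape."""
    out = {}
    for field in match_on:
        if field not in body:
            continue
        value = body[field]
        if field == "messages":
            value = [
                {**m, "content": canonical_content(m.get("content"))}
                for m in value
            ]
        out[field] = value
    return out


def fingerprint(body, match_on=("model", "messages")):
    """Hash a sorted-key, whitespace-free JSON dump of the canonical body."""
    canonical = json.dumps(
        canonicalize(body, match_on), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def diff_paths(recorded, incoming, path=""):
    """Yield one 'path: recorded X / incoming Y' line per differing leaf."""
    if isinstance(recorded, dict) and isinstance(incoming, dict):
        for key in sorted(set(recorded) | set(incoming)):
            sub = f"{path}.{key}" if path else key
            yield from diff_paths(recorded.get(key), incoming.get(key), sub)
    elif isinstance(recorded, list) and isinstance(incoming, list):
        for i in range(max(len(recorded), len(incoming))):
            r = recorded[i] if i < len(recorded) else None
            n = incoming[i] if i < len(incoming) else None
            yield from diff_paths(r, n, f"{path}[{i}]")
    elif recorded != incoming:
        yield (
            f"{path}: recorded {json.dumps(recorded)}"
            f" / incoming {json.dumps(incoming)}"
        )
```

In this sketch, two requests that differ only in key order, an unmatched volatile field, or the string-versus-block content shape hash to the same key, while a changed prompt yields a different key plus a diff line naming the exact path.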
What promptecho is not
- Not a cache. Replay matching is exact/normalized and deterministic, on purpose. It does not semantically match "different prompt, close enough" — that would put non-determinism back into the harness you're using to remove it. (A separate opt-in fuzzy mode is on the roadmap as a dev-loop convenience; it will never be the default and never used in CI.)
- Not an eval. It freezes a response so your surrounding code is testable. Judging whether the response is good is a different tool (see roadmap: toMatchLLMSnapshot()).
What it covers
promptecho intercepts at the httpx transport layer. If the SDK uses httpx, promptecho sees the call — which is almost everything modern.
| You're calling | Covered? |
|---|---|
| Anthropic, OpenAI, Mistral, Cohere, google-genai SDKs | ✅ |
| OpenAI SDK with custom base_url → OpenRouter, Together, Fireworks, Cerebras, Groq, DeepInfra, Perplexity | ✅ |
| Self-hosted vLLM / TGI / SGLang / LM Studio / Ollama (OpenAI-compatible mode) | ✅ |
| Your own fine-tune behind any of the above | ✅ |
| Reasoning models: o1/o3, Claude extended thinking, DeepSeek-R1 | ✅ (incl. reasoning_effort / thinking in default match-on) |
| Multimodal: base64-in-JSON (vision, Claude image-out, GPT-4o) and raw binary (image/*, audio/*) | ✅ (byte-exact round-trip) |
| Bedrock via boto3, HF InferenceClient, in-process transformers | ❌ (see workarounds in SUPPORT.md) |
Full matrix with caveats and workarounds: SUPPORT.md. For practical recipes by scenario (startup / enterprise / research), see TUTORIAL.md.
Hosted open-source via the OpenAI SDK
This is the dominant pattern for non-Anthropic/non-OpenAI usage, and it Just Works:
import promptecho
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

@promptecho.use_cassette("cassettes/openrouter.yaml")
def test_via_openrouter():
    r = client.chat.completions.create(
        model="meta-llama/llama-3.1-70b-instruct",
        messages=[{"role": "user", "content": "hi"}],
    )
    assert r.choices[0].message.content
Detection falls back to body shape when the host is unknown, so localhost gateways, in-house proxies, and self-hosted vLLM/TGI behave the same way as the brand-name hosts.
Install
pip install promptecho
Requires Python ≥ 3.9 and httpx ≥ 0.24. To work on promptecho itself:
git clone https://github.com/shwetank/promptecho && cd promptecho
pip install -e ".[dev]" && pytest
Usage
Decorator
@promptecho.use_cassette("cassettes/foo.yaml")
def test_foo(): ...
Context manager
with promptecho.use_cassette("cassettes/foo.yaml"):
    client.messages.create(...)
pytest fixture (auto-named per test)
def test_bar(promptecho_cassette):  # records to cassettes/test_bar.yaml
    client.messages.create(...)
The fixture defaults to mode="once" locally and mode="none" when CI=true — so a forgotten recording fails the build instead of making a live call. Configure it per test with the marker:
@pytest.mark.promptecho(match_on=["model", "messages", "temperature"], mode="new_episodes")
def test_bar(promptecho_cassette): ...
Record modes
Borrowed from vcrpy, so the mental model is free:
| mode | absent cassette | present cassette | use for |
|---|---|---|---|
| once (default) | record | replay | normal dev |
| none | error | replay | CI: guarantees no live calls |
| new_episodes | record | replay + record new | evolving tests |
| all | record | re-record everything | refreshing fixtures |
@promptecho.use_cassette("cassettes/foo.yaml", mode="none")
Prompts changed and a pile of cassettes went stale? Re-record the whole suite without touching code — the env var overrides every cassette's mode:
PROMPTECHO_MODE=all pytest
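The mode rules above can be read as a small decision function. The sketch below is an illustration, not promptecho's code: resolve_mode and decide are hypothetical names, the precedence (env var, then explicit mode, then the CI-aware default described for the pytest fixture) is inferred from the text, and treating an unmatched request against an existing cassette as an error in once mode follows vcrpy's behaviour and is an assumption here.

```python
import os

VALID_MODES = {"once", "none", "new_episodes", "all"}


def resolve_mode(explicit=None, env=None):
    """Pick the effective record mode for one cassette."""
    env = os.environ if env is None else env
    override = env.get("PROMPTECHO_MODE")
    if override:
        mode = override  # suite-wide override wins over everything
    elif explicit is not None:
        mode = explicit  # per-cassette or per-test setting
    else:
        # Fixture-style default: never record in CI.
        mode = "none" if env.get("CI") == "true" else "once"
    if mode not in VALID_MODES:
        raise ValueError(f"unknown record mode: {mode!r}")
    return mode


def decide(mode, cassette_exists, has_match):
    """Return 'record', 'replay', or 'error' for one incoming request."""
    if mode == "all":
        return "record"  # re-record everything
    if mode == "new_episodes":
        return "replay" if has_match else "record"
    if mode == "once" and not cassette_exists:
        return "record"  # first run: create the cassette
    # "none", or "once" with an existing cassette: no live calls allowed.
    return "replay" if (cassette_exists and has_match) else "error"
```

Passing env as a plain dict keeps the sketch testable without touching the real process environment.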
Choosing what to match on
Defaults to ["model", "messages", "system", "tools", "tool_choice", "reasoning_effort", "reasoning", "thinking"] — everything that determines the response for a chat-shaped call, including reasoning-model knobs.
@promptecho.use_cassette(
    "cassettes/foo.yaml",
    match_on=["model", "messages", "system", "temperature"],  # add temperature
)
For non-chat shapes (raw TGI /generate, embeddings) you'll want to override, e.g. match_on=["model", "input"] for an embeddings endpoint. See SUPPORT.md → Request shapes.
Async
Works identically with httpx.AsyncClient and the async surfaces of Anthropic / OpenAI / Mistral SDKs — the async transport is patched the same way as sync.
Cassette format
Human-readable YAML, designed to diff cleanly in PRs:
version: 2
match_on: [model, messages, system, tools, tool_choice, reasoning_effort, reasoning, thinking]
interactions:
  - request:
      method: POST
      url: https://api.anthropic.com/v1/messages
      match_key: 7d206bed48a0bc0c  # fingerprint of method + URL path + matched fields
      matched_on: [model, messages, system, tools, tool_choice]
      body:  # canonical (provider-normalized) body
        model: claude-opus-4-8
        messages:
          - {role: user, content: "Summarize: the cat sat on the mat."}
    response:
      status: 200
      headers: {content-type: application/json}
      streaming: false
      body:
        content: [{type: text, text: "A cat sat on a mat."}]
        usage: {input_tokens: 14, output_tokens: 8}
- Streamed responses store the ordered SSE events under response.events with streaming: true; replay re-emits them in order.
- Binary responses (image/audio/octet-stream) get binary: true and the body is base64-encoded; replay decodes and returns the original bytes.
- The stored body is the canonical, provider-normalized shape, not the raw provider JSON. That makes cassettes provider-agnostic and easier to skim in code review.
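As a rough illustration of the streaming and binary rules above, the sketch below stores a response in a cassette-shaped dict and reads it back. The keys streaming, events, binary, and body come from the format described here; everything else is an assumption: encode_response and decode_response are hypothetical helpers, events are kept as raw SSE text, and the plain-text branch stores the body as a string rather than the canonical parsed shape a real cassette uses.

```python
import base64

# Content-Type prefixes treated as binary in this sketch.
BINARY_PREFIXES = ("image/", "audio/", "application/octet-stream")


def encode_response(content_type, raw, sse_events=None):
    """Build the stored form of one response."""
    if sse_events is not None:
        # Keep events in arrival order so replay can re-emit them as-is.
        return {"streaming": True, "events": list(sse_events)}
    if content_type.startswith(BINARY_PREFIXES):
        return {
            "streaming": False,
            "binary": True,
            "body": base64.b64encode(raw).decode("ascii"),
        }
    return {"streaming": False, "body": raw.decode("utf-8")}


def decode_response(stored):
    """Yield the byte chunks a replaying transport would hand back."""
    if stored.get("streaming"):
        for event in stored["events"]:
            yield event.encode("utf-8")
    elif stored.get("binary"):
        yield base64.b64decode(stored["body"])
    else:
        yield stored["body"].encode("utf-8")
```

Base64 matters because arbitrary bytes (a PNG header, for example) are not valid UTF-8 and would not survive being written into a text cassette directly.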
Auto-redacted on record: the authorization, x-api-key, openai-organization, and set-cookie headers, plus every URL query-string value (query-param auth like ?key=… never reaches disk). Configurable. Secrets inside prompt text are not auto-detected — don't put credentials in prompts.
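For a sense of what this redaction involves, here is a minimal standard-library sketch. The header names are the ones listed above; the helper names (redact_headers, redact_url) and the REDACTED placeholder are assumptions, and promptecho's actual placeholder and configuration hooks may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

REDACTED_HEADERS = {"authorization", "x-api-key", "openai-organization", "set-cookie"}
PLACEHOLDER = "REDACTED"  # hypothetical placeholder value


def redact_headers(headers):
    """Replace sensitive header values; header names match case-insensitively."""
    return {
        name: (PLACEHOLDER if name.lower() in REDACTED_HEADERS else value)
        for name, value in headers.items()
    }


def redact_url(url):
    """Replace every query-string value, keeping the parameter names."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode([(name, PLACEHOLDER) for name, _ in pairs])
    return urlunsplit(parts._replace(query=query))
```

Redacting every query value, rather than a list of known key names, means an unfamiliar provider's ?key=... style auth is covered without extra configuration.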
See examples/cassettes/example.yaml for a real one.
Status
Pre-1.0, working core — on PyPI, CI-tested on Python 3.9–3.13 (see badge for the current state; CHANGELOG for what's changed).
Records and replays real httpx traffic — sync, async, SSE streaming, binary responses, cross-provider request shapes — verified end-to-end against a local server that gets shut down between record and replay. Pre-1.0 means the API can still change; breaking changes are flagged in the changelog.
Roadmap (build-in-public)
Done:
- httpx sync + async transport interception
- SSE streaming record/replay
- pytest plugin + auto-naming
- Per-provider request normalizers (Anthropic / OpenAI / generic)
- Reasoning-model match defaults (reasoning_effort, thinking, reasoning)
- Binary response round-trip (image/audio/octet-stream, base64 in cassette)
- Field-level diff on cassette miss (CI mode=none errors pinpoint the changed path, not just the field name)
- on_record_error policy (warn/raise/record): prevents silently baking transient 4xx/5xx into cassettes
- Cassette format v2: method + URL path in the match key; non-JSON bodies keyed by raw-byte hash (no silent collisions)
- Secret-safe cassettes: header and URL query-string redaction
- PROMPTECHO_MODE=all pytest suite-wide re-record; @pytest.mark.promptecho fixture config
Next:
- requests/urllib3 interception backend: unlocks boto3-Bedrock and HF InferenceClient
- promptecho lint: find un-recorded calls in a test suite
- toMatchLLMSnapshot() sibling: semantic snapshot assertions on top of recorded calls
FAQ
"If you replay a frozen response, aren't you testing nothing? The model is the risky part."
You're testing everything except the model — which is most of your code: response parsing, tool-call dispatch, streaming UI rendering, retry/fallback logic, prompt construction (a changed prompt is a cassette miss, so drift gets caught, not masked). That layer is deterministic and belongs in fast, free CI. Judging whether the model's output is good is an eval — a genuinely different job, run on a different cadence with a different budget (see deepeval, promptfoo, braintrust). You need both; promptecho is deliberately only the first. The roadmap toMatchLLMSnapshot() is the bridge between them.
Why does CassetteMiss inherit from BaseException?
Because the OpenAI / Anthropic / Mistral SDKs all wrap any Exception raised inside their transport into their own connection-error type (openai.APIConnectionError("Connection error.")), which would bury the field-level diff — the most useful thing promptecho produces — under a generic message at the top of your pytest failure. Inheriting from BaseException (the same trick pytest.fail's internal exception uses) lets the diagnostic pass through except Exception: blocks intact. The trade-off is deliberate: your own except Exception: won't catch it either — but a test-fixture failure should never be silently swallowed. except CassetteMiss: and pytest.raises(CassetteMiss) both still work. Full rationale in DESIGN.md.
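The effect is easy to reproduce without any SDK. In the sketch below every name is a stand-in: CassetteMissSketch plays the role of CassetteMiss, FakeConnectionError the SDK's connection-error type, and sdk_call an SDK that wraps whatever Exception its transport raises.

```python
class CassetteMissSketch(BaseException):
    """Stand-in for a diagnostic that must survive 'except Exception'."""


class FakeConnectionError(Exception):
    """Stand-in for an SDK's generic connection error."""


def sdk_call(transport):
    """Mimic an SDK that wraps any Exception raised by its transport."""
    try:
        return transport()
    except Exception as exc:
        raise FakeConnectionError("Connection error.") from exc


def failing_with_exception():
    raise ValueError("messages[1].content changed")


def failing_with_miss():
    raise CassetteMissSketch("messages[1].content changed")


def outcome(transport):
    """Report what the caller of the SDK actually sees."""
    try:
        sdk_call(transport)
    except FakeConnectionError as exc:
        return f"wrapped: {exc}"
    except CassetteMissSketch as exc:
        return f"passed through: {exc}"
```

An ordinary exception comes out as the generic "Connection error." message, while the BaseException subclass reaches the caller with its diagnostic text intact.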
Can I run cassettes concurrently?
One cassette at a time per process — promptecho patches httpx process-wide, and a nested or concurrent use_cassette raises RuntimeError immediately rather than interleaving recordings. pytest-xdist is fine (workers are separate processes). Note that while a cassette is active it intercepts all httpx traffic in the process, not just LLM calls.
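One common way to enforce a one-at-a-time rule like this is a process-wide guard around the patching step. The sketch below shows that general pattern only; exclusive_cassette and the module-level state are hypothetical and say nothing about how promptecho implements its check.

```python
import threading
from contextlib import contextmanager

_active_lock = threading.Lock()
_active = None  # path of the cassette currently in use, if any


@contextmanager
def exclusive_cassette(path):
    """Allow a single active cassette per process; fail fast otherwise."""
    global _active
    with _active_lock:
        if _active is not None:
            raise RuntimeError(
                f"cassette {_active!r} is already active; cannot open {path!r}"
            )
        _active = path
    try:
        yield path
    finally:
        # Release the slot even if the body raised.
        with _active_lock:
            _active = None
```

Failing on entry, instead of queueing the second cassette, keeps two recordings from ever interleaving in one file.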
Design
For the why-not-the-other-way decisions — fingerprint vs raw bytes, why semantic matching is fenced off, how SSE re-emission works, how cross-provider normalization is structured — see DESIGN.md.
License
MIT