
Replay LLM API calls in tests. Zero cost. Zero flakes. Like vcr.py but for LLM SDKs.

cuesheet

Record once. Replay forever. Test LLM-calling code without burning the API.


The problem

If you've ever tried to write tests for code that calls an LLM, you know the drill.

  • Hitting Anthropic or OpenAI in CI is slow (several seconds per call), flaky (rate limits, transient errors, sampling drift), and expensive (every PR run bleeds tokens).
  • Hand-rolled mocks rot the moment the SDK ships a breaking change, and they never quite match what the real API returns.
  • Existing HTTP fixtures (vcr.py, RESPX, pytest-vcr) work, but they don't understand LLM payloads, don't replay streamed responses faithfully, and don't scrub API keys for you.

So most teams settle for one of three bad options: skip the tests, mark them slow and skip them in CI, or write a brittle mock and pray.

What cuesheet does

cuesheet is a test-fixture library for any Python LLM SDK that uses httpx under the hood (Anthropic, OpenAI, Mistral, Gemini, Cohere, Groq, DeepSeek, Together, LiteLLM, and anything else built on the standard transport).

You wrap your test in @cuesheet.cassette(...). The first time it runs, cuesheet hits the real API and saves the request/response pair to a YAML file you commit to your repo. Every run after that replays from the file. Same response, byte-for-byte. No network calls. No flakes. No cost.

import cuesheet

@cuesheet.cassette("tests/cassettes/test_summarizer.yaml")
def test_summarizer():
    from anthropic import Anthropic
    client = Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": "Summarize: ..."}],
    )

    assert "key point" in response.content[0].text

That's the whole API. One decorator. One YAML file per test. Drop it in.
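
To make the record-once, replay-afterwards flow concrete, here is a minimal standard-library sketch. It is not cuesheet's implementation: `ToyCassette`, `fake_llm_call`, and the JSON file format are hypothetical stand-ins (cuesheet itself writes YAML and intercepts at the httpx layer), chosen so the example runs without any dependencies.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class ToyCassette:
    """Hypothetical stand-in for a cassette: one JSON file of request/response pairs."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load earlier interactions if the file exists; start empty otherwise.
        self.interactions: list[dict] = (
            json.loads(path.read_text()) if path.exists() else []
        )

    def play(self, request: dict, send: Callable[[dict], dict]) -> dict:
        # Replay: return the stored response for an identical request.
        for item in self.interactions:
            if item["request"] == request:
                return item["response"]
        # Record: call the backend once, then persist the pair to disk.
        response = send(request)
        self.interactions.append({"request": request, "response": response})
        self.path.write_text(json.dumps(self.interactions, indent=2))
        return response


network_calls: list[dict] = []  # every request that reaches the "network"


def fake_llm_call(request: dict) -> dict:
    """Hypothetical backend standing in for a real provider API."""
    network_calls.append(request)
    return {"text": f"summary of {request['prompt']!r}"}


cassette_file = Path(tempfile.mkdtemp()) / "test_summarizer.json"
request = {"model": "demo-model", "prompt": "Summarize: ..."}

first = ToyCassette(cassette_file).play(request, fake_llm_call)   # records
second = ToyCassette(cassette_file).play(request, fake_llm_call)  # replays from disk
```

The second `play` call builds a fresh `ToyCassette` from the file, so it behaves like a later test run: the response comes from disk and `network_calls` still holds a single entry.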

Features

  • Sync and async clients, both supported.
  • Streaming responses recorded as raw SSE chunks and replayed in order at configurable speed.
  • One cassette can hold multiple interactions across multiple providers.
  • YAML format chosen for git-friendly diffs during code review.
  • Auto-scrubs Anthropic and OpenAI keys, JWTs, bearer tokens, and email addresses before writing to disk.
  • Composable matchers (method, URL, model, messages, tools, temperature, ...) overridable per-cassette or globally.
  • pytest plugin: zero-config fixture auto-discovers tests/cassettes/<test_name>.yaml.
  • Local web UI with live updates as tests record (FastAPI, HTMX, SSE).
  • Strictly typed: mypy --strict clean on the public surface.
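
The streaming bullet above (raw SSE chunks replayed in order at configurable speed) can be pictured with a small generator. This is a sketch under assumptions, not cuesheet's code: the `(gap, chunk)` recording shape, the `replay_chunks` name, and the `speed` parameter are hypothetical, and the SSE payloads are illustrative rather than any provider's real event format.

```python
import time
from typing import Callable, Iterable, Iterator

# Hypothetical recording: (seconds since the previous chunk, raw SSE bytes).
RECORDED: list[tuple[float, bytes]] = [
    (0.00, b'data: {"delta": "Hel"}\n\n'),
    (0.30, b'data: {"delta": "lo"}\n\n'),
    (0.20, b'data: {"delta": "!"}\n\n'),
    (0.10, b"data: [DONE]\n\n"),
]


def replay_chunks(
    recorded: Iterable[tuple[float, bytes]],
    speed: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield chunks in recorded order, scaling each recorded gap by 1/speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    for gap, chunk in recorded:
        if gap > 0:
            sleep(gap / speed)
        yield chunk


# Replay at 10x speed, capturing the waits instead of actually sleeping.
waits: list[float] = []
chunks = list(replay_chunks(RECORDED, speed=10.0, sleep=waits.append))
```

Passing the sleep function in as a parameter keeps the sketch testable: a test can assert on the requested waits without slowing the suite down.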

Install

pip install cuesheet               # SDK + CLI
pip install "cuesheet[web]"        # also installs the web UI
pip install "cuesheet[all]"        # everything

Python 3.10+.

Common patterns

Decorator (simplest)

@cuesheet.cassette("test_x.yaml")
def test_x():
    ...

Context manager

with cuesheet.cassette("my_run.yaml"):
    response = client.messages.create(...)

pytest fixture (zero-config)

def test_my_agent(cuesheet_cassette):
    # auto-uses tests/cassettes/test_my_agent.yaml
    ...

CI: forbid recording, fail on missing cassettes

@cuesheet.cassette("test_x.yaml", mode="replay_only")
def test_x():
    ...

Or globally:

CUESHEET_DEFAULT_MODE=replay_only pytest

This is the mode you want in CI. It guarantees no test ever silently records a new cassette against the real API on the build server.
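
If you would rather not rely on every CI job exporting the variable, one option is to set it from `conftest.py`. The sketch below is an assumption-laden example: `force_replay_only_in_ci` is a hypothetical helper, it relies on the CI system setting `CI=true` (GitHub Actions and GitLab CI do), and it assumes cuesheet reads `CUESHEET_DEFAULT_MODE` after `conftest.py` has been imported.

```python
import os
from collections.abc import MutableMapping


def force_replay_only_in_ci(env: MutableMapping[str, str]) -> None:
    """Default CUESHEET_DEFAULT_MODE to replay_only when a CI marker is present."""
    # Many CI systems export CI=true (or CI=1) for every job.
    if env.get("CI", "").lower() in {"1", "true"}:
        # setdefault keeps any mode that was already chosen explicitly.
        env.setdefault("CUESHEET_DEFAULT_MODE", "replay_only")


# In conftest.py this runs at import time, before tests are collected.
force_replay_only_in_ci(os.environ)
```

Because the helper uses `setdefault`, a developer can still override the mode for a single CI run by exporting `CUESHEET_DEFAULT_MODE` explicitly.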

Recording modes

| Mode | Behavior | When to use |
|---|---|---|
| record_new (default) | Replay if cassette exists; record and save if missing | Local dev |
| record_once | Record only if file empty; never re-record | First-run fixtures |
| record_always | Always hit the real API; overwrite the cassette | Refresh after API changes |
| replay_only | Never call the network; fail if cassette missing | CI |
| bypass | Ignore cassette entirely | Disable in one place |
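
One way to read the modes above is as a per-request decision. The sketch below is a hypothetical model, not cuesheet's internals: `Action`, `CassetteMissError`, and `decide` are invented names, the per-request granularity is an assumption, and the behavior of `record_once` for an unmatched request against a non-empty cassette (raise an error) is an assumption modeled on vcr.py's `once` mode.

```python
from enum import Enum


class Action(Enum):
    REPLAY = "replay"            # serve the stored response
    RECORD = "record"            # call the API and save the result
    PASSTHROUGH = "passthrough"  # call the API, leave the cassette alone


class CassetteMissError(RuntimeError):
    """Hypothetical error: the request may be neither replayed nor recorded."""


def decide(mode: str, cassette_empty: bool, has_match: bool) -> Action:
    if mode == "bypass":
        return Action.PASSTHROUGH
    if mode == "record_always":
        return Action.RECORD
    if mode == "replay_only":
        if has_match:
            return Action.REPLAY
        raise CassetteMissError("replay_only: no recorded interaction matches")
    if mode == "record_new":
        return Action.REPLAY if has_match else Action.RECORD
    if mode == "record_once":
        if cassette_empty:
            return Action.RECORD
        if has_match:
            return Action.REPLAY
        # Assumption: never extend a cassette that has already been recorded.
        raise CassetteMissError("record_once: cassette already recorded")
    raise ValueError(f"unknown mode: {mode!r}")


# What each mode does when the cassette already holds a matching interaction.
modes = ["record_new", "record_once", "record_always", "replay_only", "bypass"]
plan = {m: decide(m, cassette_empty=False, has_match=True).value for m in modes}
```

Under these assumptions, only `record_always` and `bypass` touch the network once a matching interaction exists, which is why `replay_only` is the safe choice for CI.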

Matchers

Two requests match if they're identical on:

  • HTTP method and URL
  • Model
  • Messages list (semantic, order-preserving)
  • Tools schema
  • Temperature, max_tokens, and other generation params

Override per cassette:

@cuesheet.cassette("x.yaml", match_on=["method", "url", "model", "messages"])
def test_x():
    ...

Or write a custom matcher:

@cuesheet.matcher
def ignore_user_id(req_a, req_b):
    # Compare request bodies with the per-user "user" field removed from both.
    a, b = req_a.body.copy(), req_b.body.copy()
    a.pop("user", None)
    b.pop("user", None)
    return a == b
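
To illustrate how a `match_on` list narrows the comparison, here is a self-contained sketch of composable matchers. It makes assumptions throughout: the plain-dict request shape, the `MATCHERS` registry, and the `field_matcher`, `body_matcher`, and `requests_match` helpers are hypothetical and are not part of cuesheet's API.

```python
from typing import Any, Callable

Request = dict[str, Any]  # hypothetical shape: {"method", "url", "body"}
Matcher = Callable[[Request, Request], bool]


def field_matcher(name: str) -> Matcher:
    """Compare one top-level request field (e.g. method, url)."""
    return lambda a, b: a.get(name) == b.get(name)


def body_matcher(key: str) -> Matcher:
    """Compare one key of the JSON body (e.g. model, messages)."""
    return lambda a, b: a["body"].get(key) == b["body"].get(key)


MATCHERS: dict[str, Matcher] = {
    "method": field_matcher("method"),
    "url": field_matcher("url"),
    "model": body_matcher("model"),
    "messages": body_matcher("messages"),
    "temperature": body_matcher("temperature"),
}


def requests_match(a: Request, b: Request, match_on: list[str]) -> bool:
    """Two requests match only if every selected matcher agrees."""
    return all(MATCHERS[name](a, b) for name in match_on)


body = {"model": "demo-model", "messages": [{"role": "user", "content": "hi"}]}
recorded = {"method": "POST", "url": "https://api.example.com/v1/messages",
            "body": {**body, "temperature": 0.0}}
incoming = {"method": "POST", "url": "https://api.example.com/v1/messages",
            "body": {**body, "temperature": 0.7}}

strict = requests_match(recorded, incoming,
                        ["method", "url", "model", "messages", "temperature"])
relaxed = requests_match(recorded, incoming,
                         ["method", "url", "model", "messages"])
```

Here the two requests differ only in temperature, so the strict comparison rejects the pair while the relaxed `match_on` list accepts it.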

Secret scrubbing

Cassettes get committed to your repo. Anything you didn't redact will end up on GitHub. cuesheet strips API keys, JWTs, and emails before write. Built-in patterns:

  • Anthropic keys (sk-ant-...)
  • OpenAI keys (sk-..., sk-proj-...)
  • Generic bearer tokens
  • JWTs (eyJ... triplets)
  • Email addresses (common formats)

Add your own:

cuesheet.add_scrubber(r"INTERNAL-[A-Z0-9]{16}")

If you find a secret pattern that should be in the default set, please open a PR.
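
For intuition about what pattern-based scrubbing involves, here is a minimal sketch using only `re`. The regular expressions, placeholder strings, and the `scrub` and `add_toy_scrubber` helpers are illustrative assumptions, not cuesheet's actual built-in patterns; every credential in the sample text is fake.

```python
import re

# Order matters: the specific Anthropic pattern runs before the broader
# OpenAI one, and JWTs run before generic bearer tokens.
SCRUBBERS: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"sk-ant-[A-Za-z0-9_-]+"), "<ANTHROPIC_KEY>"),
    (re.compile(r"sk-(?:proj-)?[A-Za-z0-9_-]{20,}"), "<OPENAI_KEY>"),
    (re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "<JWT>"),
    (re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer <TOKEN>"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
]


def add_toy_scrubber(pattern: str, replacement: str = "<REDACTED>") -> None:
    """Register an extra pattern (hypothetical analogue of cuesheet.add_scrubber)."""
    SCRUBBERS.append((re.compile(pattern), replacement))


def scrub(text: str) -> str:
    """Apply every registered pattern in order and return the redacted text."""
    for pattern, replacement in SCRUBBERS:
        text = pattern.sub(replacement, text)
    return text


add_toy_scrubber(r"INTERNAL-[A-Z0-9]{16}")

raw = "\n".join([
    "x-api-key: sk-ant-api03-abcDEF123456_xyz",
    "Authorization: Bearer sk-proj-" + "a" * 24,
    "Proxy-Authorization: Bearer opaque-token-123",
    "cookie: eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.c2lnbmF0dXJl",
    "user: alice@example.com",
    "ref: INTERNAL-ABCDEFGH12345678",
])
clean = scrub(raw)
```

Regex scrubbing of this kind is best-effort: it catches known shapes, so it is still worth reviewing a new cassette before committing it.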

CLI

cuesheet list                              # all cassettes in cwd
cuesheet inspect tests/cassettes/x.yaml    # pretty-print one cassette
cuesheet stats                             # interaction + size totals
cuesheet scrub tests/cassettes/            # re-apply scrubbers in place
cuesheet web                               # open the local web UI

Web UI

cuesheet web                               # opens http://127.0.0.1:8095

Dark theme with ochre accents, mobile-responsive, no authentication. The dashboard watches the filesystem and pushes change events over SSE, so the index and cassette detail pages update in real time as your tests run in another terminal. The pulsing live pill in the header confirms the watcher is connected. No daemon, no persistence; it just renders the files on disk.

Maturity

cuesheet is at v0.1.0. The public API (cuesheet.cassette, cuesheet.matcher, cuesheet.add_scrubber) is stable; internals may shift between 0.x minors. The interception logic hooks at the httpx transport layer, not at the SDK layer, so it's provider-agnostic by construction. Each SDK has quirks though, so if yours misbehaves, please file an issue with a minimal repro.

Supported providers

Any Python SDK that calls an LLM provider over httpx works. The providers below are explicitly detected and tagged in the web UI:

  • Anthropic
  • OpenAI
  • Google (Gemini)
  • Mistral
  • Cohere
  • Groq
  • DeepSeek
  • Together
  • LiteLLM (passes through to the underlying provider URL)

If your provider isn't in the list, cuesheet still records and replays it; you just won't get the coloured provider pill in the UI.

Comparison

vcr.py pytest-vcr RESPX cuesheet
HTTP-level
LLM-payload aware
Streaming response replay ⚠️ partial ⚠️ partial
Provider-agnostic
Auto API-key scrubbing ⚠️ manual ⚠️ manual
pytest plugin ⚠️ manual
Web UI with live updates

License

MIT. Built by George Moustakas in Greece.
