Deterministic semantic assertions for LLM tests. Freeze a judge's verdict once; replay it forever.
fakellm-assert
Deterministic semantic assertions for LLM tests. Part of the fakellm family.
Mocking the transport is the easy part of testing LLM apps — fakellm already does that. The hard part is asserting on output that's fuzzy by nature ("does the response apologize?", "did it call the right tool?"). fakellm-assert gives you a matcher API for exactly that, and keeps your test suite deterministic and offline by freezing any judge verdict once and replaying it forever.
from fakellm_assert import expect
expect(resp).contains("refund")
expect(resp).called_tool("issue_refund")
expect(resp).json_path("$.status").equals("ok")
expect(resp).satisfies("apologizes for the delay and offers a solution")
resp can be a raw string, an OpenAI ChatCompletion, or an Anthropic Message — they're normalized automatically, with no SDK dependency.
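One way such SDK-free normalization could work is duck typing: look at the shape of the object rather than its class. The sketch below is illustrative only and is not fakellm-assert's actual code; `normalize_text` and `_get` are hypothetical names, and it assumes the OpenAI shape `choices[0].message.content` and the Anthropic shape of a `content` list of typed blocks.

```python
from typing import Any


def _get(obj: Any, key: str, default: Any = None) -> Any:
    """Read a field from either a dict or an attribute-style SDK object."""
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)


def normalize_text(resp: Any) -> str:
    """Hypothetical helper: extract plain text from a supported response shape."""
    if isinstance(resp, str):
        return resp

    # OpenAI ChatCompletion shape: choices[0].message.content
    choices = _get(resp, "choices")
    if choices:
        message = _get(choices[0], "message")
        return _get(message, "content") or ""

    # Anthropic Message shape: content is a list of blocks; keep only text blocks
    content = _get(resp, "content")
    if isinstance(content, list):
        return "".join(
            _get(block, "text", "")
            for block in content
            if _get(block, "type") == "text"
        )

    raise TypeError(f"unsupported response type: {type(resp).__name__}")
```

Because nothing here imports `openai` or `anthropic`, the same function accepts real SDK objects, plain dicts loaded from fixtures, or simple stand-ins built in a test.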
A note on determinism
fakellm-assert freezes verdicts about a specific response. That only buys you a deterministic test suite if the response itself is deterministic — otherwise every run produces new text, the fingerprint never matches, and replay mode hard-errors on a permanent cache miss.
So the response under test must be stable across runs. That's the job of the rest of the family: point your system-under-test at fakellm so each call replays a fixed response, and fakellm-assert freezes a verdict about that. Mock the transport with fakellm, freeze the judgment with fakellm-assert — together they make a fuzzy LLM pipeline reproducible end to end.
If your SUT calls a live model, expect misses. --fakellm-update will still let you judge and freeze a one-off verdict, but it'll go stale the next time the model's output drifts — which is the tool working as intended, not a bug.
The matcher cascade
Climb only as high as you need. Lower rungs are cheaper and more deterministic; most assertions resolve on the bottom one.
Tier 1 — deterministic matchers. Pure functions over the response. contains, not_contains, contains_all, matches (regex), equals, has_length, is_valid_json, json_path(...).equals(...), called_tool, tool_args. Free, instant, 100% deterministic, zero snapshot machinery. Use these by default — a surprising amount of "semantic" checking is really structural.
Tier 3 — frozen judgment. satisfies("natural-language criterion") is the escape hatch for genuinely fuzzy assertions. The verdict comes from a judge model, but only once, during an explicit update run — then it's frozen to disk and replayed deterministically. Every .satisfies() is a snapshot someone maintains, so reach for it sparingly.
(Tier 2, embedding similarity, is intentionally omitted from v0: between them, Tiers 1 and 3 cover most of the middle ground without the extra dependency and determinism caveats.)
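To make "pure functions over the response" concrete, here is a minimal sketch of what a Tier 1 style check amounts to. It is not the library's implementation: `json_path_equals` is a hypothetical helper, and it handles only plain dotted keys such as `$.status`, not full JSONPath.

```python
import json
from typing import Any


def json_path_equals(text: str, path: str, expected: Any) -> bool:
    """Hypothetical pure check: parse JSON, walk a simple "$.a.b" path, compare."""
    try:
        node = json.loads(text)
    except json.JSONDecodeError:
        return False

    # Only plain dotted keys are handled; real JSONPath has far more syntax.
    for key in path.removeprefix("$").strip(".").split("."):
        if key == "":
            continue  # path was just "$": compare the document root
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return node == expected
```

The result depends only on the inputs, so there is nothing to cache or freeze: the same response always yields the same answer.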
How freezing works
Each .satisfies() assertion has a fingerprint — a hash of the response text, the criterion, the judge model, and the prompt template. The verdict is stored under that fingerprint in .fakellm/judgments/. Change the response and the fingerprint changes, the old verdict no longer applies, and the test fails until a human re-judges. That failure is the feature — verdicts stay valid only for the exact output they were made about.
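The fingerprint idea can be sketched in a few lines. This is an assumption about the general approach, not the library's real scheme: the function name, the choice of SHA-256, and the JSON encoding of the four inputs are all illustrative.

```python
import hashlib
import json


def fingerprint(
    response_text: str, criterion: str, judge_model: str, prompt_template: str
) -> str:
    """Hypothetical fingerprint: SHA-256 over a canonical encoding of the four inputs."""
    # Encoding the inputs as a JSON list keeps field boundaries unambiguous,
    # so ("ab", "c", ...) and ("a", "bc", ...) cannot collide by concatenation.
    payload = json.dumps(
        [response_text, criterion, judge_model, prompt_template],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Changing any one of the four inputs, even by a single character, yields a different key, which is why an edited response or a reworded criterion no longer matches the stored verdict.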
Three run modes:
| Mode | Cache hit | Cache miss | Network |
|---|---|---|---|
| replay (default) | replay verdict | hard error → run update | never |
| update | replay verdict | judge live, freeze, assert | judge only |
| strict | replay verdict | hard error, live judging impossible | never, by construction |
A miss in replay never silently calls a model. That's what guarantees CI is deterministic and offline.
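The table above boils down to a small decision function. The following is a sketch under stated assumptions rather than the plugin's actual code: `resolve_verdict`, `MissingVerdict`, and the in-memory `store` dict (standing in for the files under .fakellm/judgments/) are hypothetical.

```python
from typing import Callable, Dict, Optional


class MissingVerdict(Exception):
    """Raised when no frozen verdict exists and live judging is not allowed."""


def resolve_verdict(
    mode: str,
    key: str,
    store: Dict[str, bool],
    judge: Optional[Callable[[], bool]] = None,
) -> bool:
    """Return the verdict for a fingerprint according to the run mode."""
    # Cache hit: every mode replays the frozen verdict and never calls the judge.
    if key in store:
        return store[key]

    # Cache miss in update mode: judge live once, freeze, then return.
    if mode == "update":
        if judge is None:
            raise MissingVerdict(f"no judge configured for {key}")
        verdict = judge()
        store[key] = verdict
        return verdict

    # Cache miss in replay or strict mode: hard error, the judge is never called.
    if mode in ("replay", "strict"):
        raise MissingVerdict(f"no frozen verdict for {key}; run in update mode")

    raise ValueError(f"unknown mode: {mode}")
```

Note that the judge is only reachable from the update branch, so a replay or strict run cannot touch the network through this path regardless of what is configured.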
pytest usage
The plugin auto-activates on install.
pytest # replay: frozen verdicts only
pytest --fakellm-update # judge & freeze any missing verdicts (review the diff!)
pytest --fakellm-strict # belt-and-suspenders: fail rather than ever judge live
Wire up a judge once (in conftest.py). It's just a callable returning the model's raw text — bring your own SDK, or point it at fakellm:
from fakellm_assert import configure, CallableJudge
from openai import OpenAI

client = OpenAI()

def run_judge(prompt: str) -> str:
    # Send the judge prompt as a single user message and return the raw reply text.
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

configure(judge=CallableJudge(run_judge, model_name="gpt-4o-mini"))
What this is (and isn't)
This gives you deterministic regression detection: it freezes a human-approved verdict and alerts you when output drifts away from it. It does not tell you whether your LLM is correct — a frozen wrong verdict is still wrong, just consistently so. The judge's reasoning is stored in every snapshot so the git diff tells you why a verdict is what it is; read those diffs.
License
MIT