Deterministic semantic assertions for LLM tests. Freeze a judge's verdict once; replay it forever.

fakellm-assert

Deterministic semantic assertions for LLM tests. Part of the fakellm family.

Mocking the transport is the easy part of testing LLM apps — fakellm already does that. The hard part is asserting on output that's fuzzy by nature ("does the response apologize?", "did it call the right tool?"). fakellm-assert gives you a matcher API for exactly that, and keeps your test suite deterministic and offline by freezing any judge verdict once and replaying it forever.

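from fakellm_assert import expect

expect(resp).contains("refund")                   # Tier 1: substring check
expect(resp).called_tool("issue_refund")          # Tier 1: tool-call check
expect(resp).json_path("$.status").equals("ok")   # Tier 1: structural JSON check
expect(resp).satisfies("apologizes for the delay and offers a solution")  # Tier 3: frozen judge verdict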

resp can be a raw string, an OpenAI ChatCompletion, or an Anthropic Message — they're normalized automatically, with no SDK dependency.
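
One way such SDK-free normalization could work is duck typing over the shapes the two SDKs return. The sketch below is an assumption about the approach, not fakellm-assert's actual code: `normalize_text` and `_get` are hypothetical helper names, and the field layouts follow the publicly documented OpenAI ChatCompletion and Anthropic Message shapes.

```python
from typing import Any


def _get(obj: Any, key: str, default: Any = None) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)


def normalize_text(resp: Any) -> str:
    """Hypothetical helper: extract assistant text from common response shapes."""
    if isinstance(resp, str):
        return resp

    # OpenAI ChatCompletion shape: choices[0].message.content
    choices = _get(resp, "choices")
    if choices:
        message = _get(choices[0], "message")
        return _get(message, "content") or ""

    # Anthropic Message shape: content is a list of blocks; keep only text blocks
    content = _get(resp, "content")
    if isinstance(content, list):
        return "".join(
            (_get(block, "text") or "")
            for block in content
            if _get(block, "type") == "text"
        )

    raise TypeError(f"unsupported response type: {type(resp).__name__}")
```

Because the helper only reads fields by name, it accepts both SDK objects and plain dicts (for example, parsed JSON fixtures) without importing either SDK.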

A note on determinism

fakellm-assert freezes verdicts about a specific response. That only buys you a deterministic test suite if the response itself is deterministic — otherwise every run produces new text, the fingerprint never matches, and replay mode hard-errors on a permanent cache miss.

So the response under test must be stable across runs. That's the job of the rest of the family: point your system-under-test at fakellm so each call replays a fixed response, and fakellm-assert freezes a verdict about that. Mock the transport with fakellm, freeze the judgment with fakellm-assert — together they make a fuzzy LLM pipeline reproducible end to end.

If your SUT calls a live model, expect misses. --fakellm-update will still let you judge and freeze a one-off verdict, but it'll go stale the next time the model's output drifts — which is the tool working as intended, not a bug.

The matcher cascade

Climb only as high as you need. Lower rungs are cheaper and more deterministic; most assertions resolve on the bottom one.

Tier 1 — deterministic matchers. Pure functions over the response. contains, not_contains, contains_all, matches (regex), equals, has_length, is_valid_json, json_path(...).equals(...), called_tool, tool_args. Free, instant, 100% deterministic, zero snapshot machinery. Use these by default — a surprising amount of "semantic" checking is really structural.
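
To illustrate why Tier 1 needs no snapshot machinery, here is a minimal sketch of a few such checks written as plain pure functions. These are standalone illustrations, not the library's implementation: the function signatures are assumptions, and `called_tool` here assumes an OpenAI-style response already parsed into a dict.

```python
import json
import re
from typing import Any


def contains_all(text: str, *needles: str) -> bool:
    """True when every needle occurs somewhere in the text."""
    return all(needle in text for needle in needles)


def matches(text: str, pattern: str) -> bool:
    """True when the regex matches anywhere in the text."""
    return re.search(pattern, text) is not None


def is_valid_json(text: str) -> bool:
    """True when the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def called_tool(response: dict[str, Any], name: str) -> bool:
    """Check an OpenAI-style dict for a tool call with the given function name."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return any(call["function"]["name"] == name for call in calls)
```

Each function depends only on its arguments, so the same response always yields the same result, with no network and no stored state.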

Tier 3 — frozen judgment. satisfies("natural-language criterion") is the escape hatch for genuinely fuzzy assertions. The verdict comes from a judge model, but only once, during an explicit update run — then it's frozen to disk and replayed deterministically. Every .satisfies() is a snapshot someone maintains, so reach for it sparingly.

(Tier 2, embedding similarity, is intentionally omitted from v0: between them, Tiers 1 and 3 cover that middle ground without the extra dependency or the determinism caveats that embeddings bring.)

How freezing works

Each .satisfies() assertion has a fingerprint — a hash of the response text, the criterion, the judge model, and the prompt template. The verdict is stored under that fingerprint in .fakellm/judgments/. Change the response and the fingerprint changes, the old verdict no longer applies, and the test fails until a human re-judges. That failure is the feature — verdicts stay valid only for the exact output they were made about.
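
A fingerprint of this kind could be computed as in the sketch below. The choice of SHA-256, the JSON serialization, and the `fingerprint` function name are all assumptions for illustration; the library's actual hashing scheme may differ.

```python
import hashlib
import json


def fingerprint(
    response_text: str,
    criterion: str,
    judge_model: str,
    prompt_template: str,
) -> str:
    """Hash the four inputs that determine a verdict (assumed scheme)."""
    # Serialize as JSON with sorted keys so field boundaries are unambiguous
    # and the byte representation is stable across runs.
    payload = json.dumps(
        {
            "response": response_text,
            "criterion": criterion,
            "judge_model": judge_model,
            "prompt_template": prompt_template,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Changing any one of the four inputs yields a different digest, which is what invalidates the stored verdict when the response text changes.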

Three run modes:

Mode	Cache hit	Cache miss	Network
replay (default)	replay verdict	hard error → run update	never
update	replay verdict	judge live, freeze, assert	judge only
strict	replay verdict	hard error, live judging impossible	never, by construction

A miss in replay never silently calls a model. That's what guarantees CI is deterministic and offline.
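
The table above can be read as a small decision function. The following sketch models it with an in-memory dict standing in for the on-disk judgments; `resolve_verdict`, `CacheMiss`, and the dict-based cache are hypothetical names chosen for illustration, not the plugin's real internals.

```python
from typing import Callable, Dict, Optional

MODES = ("replay", "update", "strict")


class CacheMiss(AssertionError):
    """Raised when no frozen verdict exists and live judging is not allowed."""


def resolve_verdict(
    mode: str,
    key: str,
    cache: Dict[str, bool],
    judge: Optional[Callable[[], bool]] = None,
) -> bool:
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Cache hit: every mode replays the frozen verdict.
    if key in cache:
        return cache[key]
    # Cache miss: only update mode may call the judge.
    if mode == "update":
        if judge is None:
            raise RuntimeError("update mode needs a configured judge")
        verdict = judge()
        cache[key] = verdict  # freeze for later replay
        return verdict
    # replay and strict both fail hard instead of judging live.
    raise CacheMiss(f"no frozen verdict for {key!r}; rerun in update mode")
```

In this model, strict mode can be called with no judge at all, which mirrors the "never, by construction" guarantee in the table.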

pytest usage

The plugin auto-activates on install.

pytest                    # replay: frozen verdicts only
pytest --fakellm-update   # judge & freeze any missing verdicts (review the diff!)
pytest --fakellm-strict   # belt-and-suspenders: fail rather than ever judge live

Wire up a judge once (in conftest.py). It's just a callable returning the model's raw text — bring your own SDK, or point it at fakellm:

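from fakellm_assert import configure, CallableJudge
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_judge(prompt: str) -> str:
    # Send the judge prompt as a single user message and return the raw reply text.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content is Optional in the OpenAI SDK; fall back to an empty string.
    return response.choices[0].message.content or ""

configure(judge=CallableJudge(run_judge, model_name="gpt-4o-mini"))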

What this is (and isn't)

This gives you deterministic regression detection: it freezes a human-approved verdict and alerts you when output drifts away from it. It does not tell you whether your LLM is correct — a frozen wrong verdict is still wrong, just consistently so. The judge's reasoning is stored in every snapshot so the git diff tells you why a verdict is what it is; read those diffs.
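
For the git diff to be readable, a snapshot needs a stable, line-oriented serialization. The sketch below shows one way to do that with sorted, indented JSON. The record fields, the one-file-per-fingerprint layout, and the `freeze_verdict` and `load_verdict` names are assumptions for illustration; only the `.fakellm/judgments/` directory comes from the description above.

```python
import json
from pathlib import Path


def freeze_verdict(
    root: Path,
    fingerprint: str,
    criterion: str,
    passed: bool,
    reasoning: str,
) -> Path:
    """Write one verdict record as diff-friendly JSON (assumed layout)."""
    record = {"criterion": criterion, "passed": passed, "reasoning": reasoning}
    path = root / f"{fingerprint}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and fixed indentation keep the file byte-stable, so a diff
    # shows only the fields that actually changed.
    path.write_text(
        json.dumps(record, indent=2, sort_keys=True) + "\n", encoding="utf-8"
    )
    return path


def load_verdict(root: Path, fingerprint: str) -> dict:
    """Read a previously frozen verdict record back from disk."""
    path = root / f"{fingerprint}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

A caller might pass `Path(".fakellm/judgments")` as `root`; keeping the reasoning next to the verdict means a reviewer sees both in the same diff hunk.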

License

MIT

Version 0.1.1, released May 24, 2026

Download files

Source Distribution

fakellm_assert-0.1.1.tar.gz (12.7 kB)

Uploaded May 24, 2026 Source

Built Distribution

fakellm_assert-0.1.1-py3-none-any.whl (14.3 kB)

Uploaded May 24, 2026 Python 3

File details

Details for the file fakellm_assert-0.1.1.tar.gz.

File metadata

Download URL: fakellm_assert-0.1.1.tar.gz
Upload date: May 24, 2026
Size: 12.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for fakellm_assert-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`a898776c40b0e61a76a2f21e34f000f71cc05af0d0f99278fc38409ff9be846a`
MD5	`e3b0410c84a40eac265774820ff92857`
BLAKE2b-256	`e2b305d4f1ed8205c57925f0c2a0ff350ada7d524337779ea71609567dc1c3ff`

File details

Details for the file fakellm_assert-0.1.1-py3-none-any.whl.

File metadata

Download URL: fakellm_assert-0.1.1-py3-none-any.whl
Upload date: May 24, 2026
Size: 14.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for fakellm_assert-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`58b621ead8bec25f88523ca1c479f99b8dca042a03f03cb77a6f6708ce55b59d`
MD5	`d5251826ee75a24614cd0a6e92583599`
BLAKE2b-256	`f14dd2613c11cd3930eb4ec419199ce5cab1d171566c2d25c3bafa2c5dee3e97`
