Deterministic semantic assertions for LLM tests. Freeze a judge's verdict once; replay it forever.

fakellm-assert

Part of the fakellm family.

Mocking the transport is the easy part of testing LLM apps — fakellm already does that. The hard part is asserting on output that's fuzzy by nature ("does the response apologize?", "did it call the right tool?"). fakellm-assert gives you a matcher API for exactly that, and keeps your test suite deterministic and offline by freezing any judge verdict once and replaying it forever.

from fakellm_assert import expect

expect(resp).contains("refund")
expect(resp).called_tool("issue_refund")
expect(resp).json_path("$.status").equals("ok")
expect(resp).satisfies("apologizes for the delay and offers a solution")

resp can be a raw string, an OpenAI ChatCompletion, or an Anthropic Message — they're normalized automatically, with no SDK dependency.
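As a rough illustration of what SDK-free normalization can look like, the sketch below duck-types the three shapes mentioned above. It is not fakellm-assert's actual code: `normalize_text` and `_get` are hypothetical names, and the field layouts are assumed from the public OpenAI and Anthropic response formats.

```python
from typing import Any


def _get(obj: Any, name: str) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if obj is None:
        return None
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def normalize_text(resp: Any) -> str:
    """Hypothetical helper: extract assistant text from several response shapes."""
    # Raw strings pass through unchanged.
    if isinstance(resp, str):
        return resp

    # OpenAI ChatCompletion shape: choices[0].message.content
    choices = _get(resp, "choices")
    if choices:
        message = _get(choices[0], "message")
        return _get(message, "content") or ""

    # Anthropic Message shape: content is a list of blocks; keep only text blocks.
    content = _get(resp, "content")
    if isinstance(content, list):
        return "".join(
            _get(block, "text") or ""
            for block in content
            if _get(block, "type") == "text"
        )

    raise TypeError(f"unsupported response type: {type(resp).__name__}")
```

Because the helper only inspects attribute and key names, it works on SDK objects and on plain dicts alike, which is one way to avoid importing either SDK.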

A note on determinism

fakellm-assert freezes verdicts about a specific response. That only buys you a deterministic test suite if the response itself is deterministic — otherwise every run produces new text, the fingerprint never matches, and replay mode hard-errors on a permanent cache miss.

So the response under test must be stable across runs. That's the job of the rest of the family: point your system-under-test at fakellm so each call replays a fixed response, and fakellm-assert freezes a verdict about that. Mock the transport with fakellm, freeze the judgment with fakellm-assert — together they make a fuzzy LLM pipeline reproducible end to end.

If your SUT calls a live model, expect misses. --fakellm-update will still let you judge and freeze a one-off verdict, but it'll go stale the next time the model's output drifts — which is the tool working as intended, not a bug.

The matcher cascade

Climb only as high as you need. Lower rungs are cheaper and more deterministic; most assertions resolve on the bottom one.

Tier 1 — deterministic matchers. Pure functions over the response. contains, not_contains, contains_all, matches (regex), equals, has_length, is_valid_json, json_path(...).equals(...), called_tool, tool_args. Free, instant, 100% deterministic, zero snapshot machinery. Use these by default — a surprising amount of "semantic" checking is really structural.
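To see why Tier 1 needs no snapshot machinery, note that each check reduces to an ordinary pure function of the response text. The sketch below shows plain-Python equivalents of a few of them. The function names are hypothetical stand-ins, not the library's internals, and `json_field_equals` handles only top-level keys rather than full JSONPath.

```python
import json
import re
from typing import Any


def contains_all(text: str, *needles: str) -> bool:
    """True when every needle occurs somewhere in the text."""
    return all(needle in text for needle in needles)


def matches(text: str, pattern: str) -> bool:
    """True when the regex matches anywhere in the text."""
    return re.search(pattern, text) is not None


def is_valid_json(text: str) -> bool:
    """True when the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def json_field_equals(text: str, key: str, expected: Any) -> bool:
    """Simplified stand-in for json_path("$.key").equals(expected)."""
    data = json.loads(text)
    return isinstance(data, dict) and data.get(key) == expected
```

The same input always yields the same result, so there is nothing to freeze and nothing to review.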

Tier 3 — frozen judgment. satisfies("natural-language criterion") is the escape hatch for genuinely fuzzy assertions. The verdict comes from a judge model, but only once, during an explicit update run — then it's frozen to disk and replayed deterministically. Every .satisfies() is a snapshot someone maintains, so reach for it sparingly.

(Tier 2, embedding similarity, is intentionally omitted from v0: Tiers 1 and 3 cover most of the middle ground without the extra dependency and determinism caveats.)

How freezing works

Each .satisfies() assertion has a fingerprint — a hash of the response text, the criterion, the judge model, and the prompt template. The verdict is stored under that fingerprint in .fakellm/judgments/. Change the response and the fingerprint changes, the old verdict no longer applies, and the test fails until a human re-judges. That failure is the feature — verdicts stay valid only for the exact output they were made about.
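The fingerprint idea can be sketched in a few lines. This is a minimal illustration, assuming SHA-256 over a canonical JSON encoding of the four inputs. The real hash function, field names, and encoding used by fakellm-assert are not stated here, so treat `fingerprint` as a hypothetical helper.

```python
import hashlib
import json


def fingerprint(
    response_text: str,
    criterion: str,
    judge_model: str,
    prompt_template: str,
) -> str:
    """Hash the four inputs that determine whether a frozen verdict still applies."""
    # Serialize with sorted keys so the same inputs always produce the same bytes.
    payload = json.dumps(
        {
            "response": response_text,
            "criterion": criterion,
            "judge_model": judge_model,
            "prompt_template": prompt_template,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Changing any one input, even by a single character in the response, yields a different digest, which is what forces a re-judgment.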

Three run modes:

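Mode             | Cache hit      | Cache miss                          | Network
-----------------|----------------|-------------------------------------|-----------------------
replay (default) | replay verdict | hard error → run update             | never
update           | replay verdict | judge live, freeze, assert          | judge only
strict           | replay verdict | hard error, live judging impossible | never, by construction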

A miss in replay never silently calls a model. That's what guarantees CI is deterministic and offline.
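The mode table can be read as a small decision procedure. The sketch below is one way to express it, under stated assumptions: verdicts are stored as one JSON file per fingerprint, and the names `resolve_verdict` and `MissingVerdictError` are hypothetical, not the plugin's API.

```python
import json
from pathlib import Path
from typing import Callable, Optional

MODES = ("replay", "update", "strict")


class MissingVerdictError(AssertionError):
    """Raised when no frozen verdict exists and live judging is not allowed."""


def resolve_verdict(
    fingerprint: str,
    mode: str,
    store_dir: Path,
    judge: Optional[Callable[[], dict]] = None,
) -> dict:
    """Return the frozen verdict, judging live only in update mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    path = store_dir / f"{fingerprint}.json"

    # Cache hit: every mode replays the stored verdict without touching the network.
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    # Cache miss in replay or strict: fail loudly instead of calling a model.
    if mode != "update":
        raise MissingVerdictError(
            f"no frozen verdict for {fingerprint}; run with --fakellm-update"
        )

    # Cache miss in update: judge once, freeze to disk, then return the verdict.
    if judge is None:
        raise MissingVerdictError("update mode requires a configured judge")
    verdict = judge()
    store_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(verdict, indent=2, sort_keys=True), encoding="utf-8")
    return verdict
```

In this sketch replay and strict behave identically on a miss. The distinction described above is that strict makes live judging impossible by construction, for example by never wiring up a judge at all.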

pytest usage

The plugin auto-activates on install.

pytest                    # replay: frozen verdicts only
pytest --fakellm-update   # judge & freeze any missing verdicts (review the diff!)
pytest --fakellm-strict   # belt-and-suspenders: fail rather than ever judge live

Wire up a judge once (in conftest.py). It's just a callable returning the model's raw text — bring your own SDK, or point it at fakellm:

from fakellm_assert import configure, CallableJudge
from openai import OpenAI

client = OpenAI()

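def run_judge(prompt: str) -> str:
    # Send the judge prompt as a single user message.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content is Optional in the OpenAI SDK; fall back to an empty string.
    return response.choices[0].message.content or ""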

configure(judge=CallableJudge(run_judge, model_name="gpt-4o-mini"))

What this is (and isn't)

This gives you deterministic regression detection: it freezes a human-approved verdict and alerts you when output drifts away from it. It does not tell you whether your LLM is correct — a frozen wrong verdict is still wrong, just consistently so. The judge's reasoning is stored in every snapshot so the git diff tells you why a verdict is what it is; read those diffs.

License

MIT
