Behavioral assertion testing for LLM applications. The pytest of LLM testing.
LLMAssert
Created by Bradley R. Kinnard
Quick Start · What Is This · Features · Installation · Usage · Benchmarks · Docs
Quick Start
pip install "llm-assert[anthropic]"
from llm_assert import LLMAssert
from llm_assert.providers.anthropic import AnthropicProvider
provider = AnthropicProvider(model="claude-sonnet-4-20250514")
v = LLMAssert(provider)
result = (
    v.assert_that("Return a JSON object with keys: title, summary, tags")
    .is_valid_json()
    .contains_keys(["title", "summary", "tags"])
    .length_between(50, 2000)
    .semantic_intent_matches("a structured summary with metadata")
    .does_not_contain("I'm sorry")
    .run()
)
assert result.passed
Works with any provider. Swap AnthropicProvider for OpenAIProvider, OllamaProvider, or any other adapter and the assertions stay the same.
What Is This
LLMAssert is a composable assertion library for verifying LLM output. It drops into your existing pytest suite and gives you a clean pass/fail on whether your AI system behaves correctly.
It is not a tracing platform, an observability tool, or a dashboard. Those tools monitor what happened. LLMAssert defines what should happen and fails your build if it does not.
Structural assertions (JSON validity, schema compliance, key presence, length bounds, regex) are deterministic, zero cost, and require no LLM calls. Semantic assertions (intent matching, topic avoidance, factual consistency) run locally via sentence-transformers with no API key and no external calls. Behavioral assertions run the model N times and assess the distribution. Regression assertions detect semantic drift and format shifts across model versions.
Features
- Structural assertions verifying JSON, schema, keys, length, regex, and string patterns. Deterministic, no LLM calls.
- Semantic assertions using local embeddings (22MB model, CPU, no API key) for intent matching, topic avoidance, factual consistency, and reading level
- Behavioral assertions running N samples with Wilson confidence intervals for pass rate, refusal rate, and consistency checks
- Regression and drift detection comparing against versioned JSON baselines to catch silent model updates and format shifts
- Composite logic chaining assertions with AND, OR, NOT, and satisfies() for arbitrary assertion instances
- Provider-agnostic with adapters for OpenAI, Anthropic, Google, Mistral, Ollama, LiteLLM, and a MockProvider for zero-cost testing
- pytest plugin registering automatically with fixtures, marks, CLI flags, and JSON report hooks
- YAML assertion suites defined as configuration and runnable from CLI with non-zero exit on failure
- GitHub Action (moonrunnerkc/llm-assert@v1) annotating PRs inline with assertion details on failure
- Deterministic semantic scoring via local embeddings, producing identical scores across runs (zero flakiness)
- No telemetry, no analytics, no background network traffic. LLMAssert makes exactly the LLM calls you ask for.
Installation
pip install llm-assert # structural assertions only (~3s)
pip install "llm-assert[anthropic]" # add Anthropic provider (~5s)
pip install "llm-assert[anthropic,semantic]" # add local semantic scoring (~85s, includes PyTorch)
The base install covers structural assertions with any provider. Adding [semantic] pulls in sentence-transformers and PyTorch, which is a heavy install but eliminates all runtime API costs for semantic scoring.
Available extras: openai, anthropic, google, mistral, ollama, litellm, semantic.
Usage
Assertion Types
Structural: verify form (deterministic, zero cost)
result = (
    v.assert_that(prompt)
    .is_valid_json()
    .matches_schema(my_schema)
    .contains_keys(["title", "summary"])
    .length_between(50, 2000)
    .starts_with("{")
    .ends_with("}")
    .matches_pattern(r'"title"\s*:')
    .does_not_contain("```")
    .run()
)
All 8 structural assertions verified against Claude Sonnet 4 with a single API call.
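To make "deterministic, zero cost" concrete, here is a minimal standard-library sketch of what a few of these structural checks amount to. It is not LLMAssert's implementation; the function name check_structure and its return shape are assumptions made for illustration.

```python
import json
import re


def check_structure(text, required_keys, min_chars, max_chars, pattern):
    """Return a dict of check name -> bool for one model response (illustrative only)."""
    results = {}
    try:
        parsed = json.loads(text)
        results["is_valid_json"] = True
    except json.JSONDecodeError:
        parsed = None
        results["is_valid_json"] = False

    # Key presence only makes sense when the response is a JSON object.
    results["contains_keys"] = isinstance(parsed, dict) and all(
        key in parsed for key in required_keys
    )
    results["length_between"] = min_chars <= len(text) <= max_chars
    results["matches_pattern"] = re.search(pattern, text) is not None
    return results


response = '{"title": "Q3 report", "summary": "Revenue grew.", "tags": ["finance"]}'
checks = check_structure(response, ["title", "summary"], 50, 2000, r'"title"\s*:')
print(checks)
```

Every check is a pure function of the response text, so repeated runs on the same output give the same verdict without any further model call.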
Semantic: verify meaning (local embeddings, no API key)
Uses embedding similarity via sentence-transformers (22MB model, runs locally on CPU).
result = (
    v.assert_that(prompt)
    .semantic_intent_matches("a helpful product recommendation", threshold=0.75)
    .does_not_discuss("competitor products", threshold=0.6)
    .is_factually_consistent_with(reference_doc, threshold=0.80)
    .uses_language_at_grade_level(8, tolerance=2)
    .run()
)
Tested against live Claude output: semantic_intent_matches scored 0.77, does_not_discuss correctly scored 0.10 (well below 0.6 rejection threshold), is_factually_consistent_with scored 0.81, and uses_language_at_grade_level correctly measured Flesch-Kincaid grade 5.0 for a simple-language prompt.
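For readers unfamiliar with the Flesch-Kincaid grade mentioned above, the following is a rough sketch of the standard formula, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The syllable counter is a crude vowel-group heuristic, and the symmetric tolerance check in within_grade is an assumption; LLMAssert's own scorer may count syllables and apply tolerance differently.

```python
import re


def count_syllables(word):
    """Crude heuristic (assumption): count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def within_grade(text, target, tolerance):
    """Hypothetical check: pass when the grade is within +/- tolerance of target."""
    return abs(flesch_kincaid_grade(text) - target) <= tolerance


simple = "The cat sat on the mat. The dog ran to the park."
dense = "Quantitative methodologies necessitate comprehensive statistical interpretation."
print(flesch_kincaid_grade(simple), flesch_kincaid_grade(dense))
```

Short sentences built from one-syllable words land at a very low grade, while long polysyllabic sentences score far higher, which is the signal a grade-level assertion relies on.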
Behavioral: verify patterns across multiple outputs
Runs the model N times and assesses the distribution with Wilson confidence intervals.
from llm_assert.assertions.structural import IsValidJson
from llm_assert.sampling.strategies import FixedSetSampler
sampler = FixedSetSampler(["prompt one", "prompt two", "prompt three"])
result = (
    v.assert_that(prompt)
    .passes_rate(IsValidJson(), min_rate=0.95, n_samples=20, sampler=sampler)
    .run()
)
Tested with FixedSetSampler and TemplateSampler against Claude: 5/5 pass rate on structural checks, 4/5 refusal rate on adversarial inputs (meeting 0.80 threshold), and 1.000 consistency score across repeated calls.
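The Wilson confidence interval referred to above can be sketched with the standard formula. This is a generic implementation for illustration, assuming a 95% interval (z = 1.96); how LLMAssert turns the interval into a pass/fail decision is not specified here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(5, 5)
print(round(low, 4), round(high, 4))
```

With only 5 samples, a perfect 5/5 still has a lower bound of about 0.57, which is why small sample counts cannot establish a high minimum pass rate on their own.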
Regression: detect drift and format shifts
Compare against recorded baselines. Detect semantic drift, format shifts, and silent model updates. Baselines live in your repo as versioned JSON.
from llm_assert.snapshots.manager import SnapshotManager
snap_mgr = SnapshotManager(snapshot_dir="llm_assert_snapshots/")
result = (
    v.assert_that(prompt)
    .matches_baseline("my_endpoint", snap_mgr, semantic_threshold=0.85)
    .run()
)
Also available: .semantic_drift_is_below() for drift-only checks, and .format_matches_baseline() for structural-only comparison. Drift detection verified: a marine biology response against a Python baseline correctly triggered failure with 0.89 semantic drift.
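As a rough picture of a structural-only baseline comparison, the sketch below records the "shape" of a JSON response (key names and value types) to a file and compares later responses against it. The helper names and the baseline file layout are hypothetical; LLMAssert's snapshot format may differ.

```python
import json
import tempfile
from pathlib import Path


def shape_of(value):
    """Reduce a parsed JSON value to its structure: key names and type names."""
    if isinstance(value, dict):
        return {key: shape_of(item) for key, item in sorted(value.items())}
    if isinstance(value, list):
        # Assumption: the first element is representative of the list.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


def record_baseline(path, response_text):
    """Write the response's shape to a JSON file that can be committed."""
    path.write_text(json.dumps({"shape": shape_of(json.loads(response_text))}, indent=2))


def format_matches(path, response_text):
    """Compare a new response's shape with the recorded baseline."""
    baseline = json.loads(path.read_text())
    return baseline["shape"] == shape_of(json.loads(response_text))


with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "my_endpoint.json"
    record_baseline(baseline_path, '{"title": "a", "tags": ["x"]}')
    same = format_matches(baseline_path, '{"title": "b", "tags": ["y", "z"]}')
    shifted = format_matches(baseline_path, '{"title": "b", "tags": "y"}')
print(same, shifted)
```

Different values with the same structure still match, while a field that changes from a list to a string counts as a format shift.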
Composite: chain assertions with boolean logic
from llm_assert.assertions.structural import IsValidJson, StartsWith
# Implicit AND: every chained assertion must pass
result = v.assert_that(prompt).is_valid_json().contains_keys(["name"]).run()
# OR: accept either format
result = v.assert_that(prompt).or_(StartsWith("{"), StartsWith("[")).run()
# NOT: invert any assertion
result = v.assert_that(prompt).not_(IsValidJson()).run()
# satisfies: pass any BaseAssertion instance
result = v.assert_that(prompt).satisfies(IsValidJson()).run()
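The boolean composition above can be pictured as plain predicate combinators. This is a conceptual sketch using functions from text to bool; all names here are hypothetical and unrelated to LLMAssert's BaseAssertion classes.

```python
import json


def is_valid_json(text):
    """True when the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def starts_with(prefix):
    return lambda text: text.startswith(prefix)


def all_of(*checks):
    """AND: every check must pass."""
    return lambda text: all(check(text) for check in checks)


def any_of(*checks):
    """OR: at least one check must pass."""
    return lambda text: any(check(text) for check in checks)


def negate(check):
    """NOT: invert a check."""
    return lambda text: not check(text)


object_or_array = all_of(is_valid_json, any_of(starts_with("{"), starts_with("[")))
print(object_or_array("[1, 2]"), object_or_array('"just a string"'))
```

Because each combinator returns another predicate, compositions nest to any depth, which is the property a chained assertion API depends on.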
Providers
LLMAssert works with any LLM provider. The provider layer is a thin adapter; assertions are provider-agnostic.
pip install "llm-assert[openai]" # OpenAI
pip install "llm-assert[anthropic]" # Anthropic
pip install "llm-assert[ollama]" # Ollama (local)
pip install "llm-assert[google]" # Google Generative AI
pip install "llm-assert[mistral]" # Mistral
pip install "llm-assert[litellm]" # Any provider via LiteLLM
from llm_assert.providers.mock import MockProvider
# Test assertions without API calls or cost
mock = MockProvider(response_fn=lambda prompt, msgs=None: '{"title": "Test", "summary": "ok"}')
v = LLMAssert(mock)
Every provider returns a NormalizedResponse with consistent fields: content, model (exact identifier, not the alias), provider, latency_ms, prompt_tokens, completion_tokens, finish_reason, request_id, and raw (original provider response). Verified against live Anthropic output.
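One way to picture the normalized fields is as a frozen dataclass that each adapter fills in. The class name, field types, and the payload layout below are assumptions for illustration; only the field names come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class NormalizedResponseSketch:
    content: str
    model: str                        # exact identifier, not the alias
    provider: str
    latency_ms: float
    prompt_tokens: Optional[int]
    completion_tokens: Optional[int]
    finish_reason: Optional[str]
    request_id: Optional[str]
    raw: Any                          # original provider response


def normalize(provider: str, payload: dict, latency_ms: float) -> NormalizedResponseSketch:
    """Map a made-up payload layout onto the common fields (layout is hypothetical)."""
    usage = payload.get("usage", {})
    return NormalizedResponseSketch(
        content=payload["text"],
        model=payload["model"],
        provider=provider,
        latency_ms=latency_ms,
        prompt_tokens=usage.get("input"),
        completion_tokens=usage.get("output"),
        finish_reason=payload.get("stop"),
        request_id=payload.get("id"),
        raw=payload,
    )


payload = {"text": "{}", "model": "example-model-2025-01-01", "usage": {"input": 12, "output": 3}}
response = normalize("example", payload, latency_ms=41.5)
print(response.model, response.prompt_tokens, response.finish_reason)
```

Keeping the original payload in raw means assertions can rely on the common fields while provider-specific details remain available for debugging.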
pytest Integration
LLMAssert registers as a pytest plugin. No new CLI or workflow to learn.
def test_summarizer(llm_assert_runner):
    result = (
        llm_assert_runner
        .assert_that("Summarize the Q3 earnings report")
        .is_valid_json()
        .contains_keys(["summary", "highlights"])
        .semantic_intent_matches("financial summary with key highlights")
        .run()
    )
    assert result.passed
LLM_ASSERT_PROVIDER=anthropic pytest tests/ -v
| Flag | Effect |
|---|---|
| --llm-assert-report json --llm-assert-report-path report.json | Save JSON report |
| --llm-assert-skip-behavioral | Skip expensive multi-sample tests |
| --llm-assert-strict | Borderline passes become failures |
The --llm-assert-skip-behavioral flag skips any test marked with @pytest.mark.llm_assert_behavioral, keeping commit-level runs fast while full behavioral suites run on a longer schedule.
YAML Suites
Define assertion suites as configuration, committed alongside your model config:
version: "1.0"
name: "summarizer_suite"
cases:
  - name: "valid_json_output"
    prompt: "Return a JSON summary of the document"
    assertions:
      - type: is_valid_json
      - type: contains_keys
        params:
          keys: ["title", "summary"]
      - type: length_between
        params:
          min_chars: 50
          max_chars: 2000
  - name: "semantic_check"
    prompt: "Explain why Python is popular for data science"
    assertions:
      - type: semantic_intent_matches
        params:
          reference_intent: "Python is widely used in data science"
          threshold: 0.55
      - type: does_not_discuss
        params:
          topic: "JavaScript web frameworks"
          threshold: 0.6
LLM_ASSERT_PROVIDER=anthropic llm-assert run suite.yml
The command exits with a non-zero status on any failure and supports --format json for CI report ingestion. A 7-case, 30-assertion YAML suite ran against Claude Sonnet 4 in 47 seconds with all cases passing.
CLI
llm-assert check # verify provider connectivity
llm-assert run suite.yml --provider anthropic # execute a YAML assertion suite
llm-assert snapshot create my_key "prompt text" # record a baseline snapshot
llm-assert snapshot diff my_key --prompt "prompt" # compare current vs baseline
llm-assert snapshot update my_key "prompt text" # update an existing baseline
llm-assert snapshot delete my_key # remove a baseline
llm-assert providers # list installed providers
llm-assert report result.json # pretty-print a saved report
GitHub Actions
- uses: moonrunnerkc/llm-assert@v1
  with:
    suite: tests/llm_assert_suite.yml
    llm-assert-extras: anthropic,semantic
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Failures annotate the PR inline with assertion details: which assertion failed, the actual score, the threshold, and the model version. Step summaries appear in the workflow run UI.
Failure Output
When an assertion fails, the message tells you what went wrong, not just that something failed:
SemanticAssertion failed: score 0.68 below threshold 0.75
  using all-MiniLM-L6-v2, input 340 chars,
  provider claude-sonnet-4-20250514.
  Check embedding model version or lower threshold
  if intent ambiguity is acceptable.

Response JSON has 2 schema violations: 'email' is a required property

SemanticDrift failed for 'python_summary': drift 0.8863 exceeds max 0.1000
  (similarity 0.1137). Baseline model: claude-sonnet-4-20250514.
  Review the prompt or lower max_drift if the change is intentional.
Benchmarks
Every number below is produced by a runnable script in llm-assert-benchmark/ and backed by a JSON result file. Full methodology and reproduction instructions: llm-assert-benchmark/README.md.
Comparison with existing tools
| Metric | LLMAssert | DeepEval | Promptfoo | LangSmith | Braintrust |
|---|---|---|---|---|---|
| Account required | No | Partial | No | Yes | Yes |
| Native drift detection | Yes | No | No | No | No |
| Semantic scoring method | Local embeddings | LLM-as-judge | LLM-as-judge | LLM-as-judge | LLM-as-judge |
| Provider API calls per run | 7 | 28 | 17 | 21 | 21 |
| Monthly CI cost (30 runs/day) | $26.78 | $31.31 | $28.94 | $29.80 | $29.80 |
| Lines of code for drift test | 9 | 25 | 6 (YAML) | 27 | -- |
Source: llm-assert-benchmark/results/cost.json, llm-assert-benchmark/results/loc.json
API calls and telemetry (measured via request interception)
Actual HTTPS requests intercepted during a 3-case suite run via urllib3 and httpx patching:
| Metric | LLMAssert | DeepEval |
|---|---|---|
| Provider API calls | 3 | 12 (3 model + 9 judge) |
| Telemetry calls | 0 | 4 (3 PostHog + 1 ipify) |
LLMAssert makes exactly the LLM calls you ask for. No analytics, no IP lookups, no background network traffic.
Flakiness (identical input, repeated runs)
LLMAssert's embedding-based scoring is deterministic. LLM-as-judge scoring is not.
| Metric | LLMAssert (100 runs) | DeepEval (20 runs) |
|---|---|---|
| Score stdev | 0.0 | 0.0160 |
| Score range | 0.0 | 0.0714 (0.929 to 1.0) |
| Scoring method | Local embeddings | LLM-as-judge |
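The two flakiness metrics in the table can be computed as follows. This is a sketch of the measurement idea only: deterministic_score is a hypothetical stand-in for a scorer that is a pure function of its input, and the use of population standard deviation is an assumption about the benchmark.

```python
import statistics


def flakiness(score_fn, text, runs):
    """Score the same input repeatedly and summarize the spread of the scores."""
    scores = [score_fn(text) for _ in range(runs)]
    return {"stdev": statistics.pstdev(scores), "range": max(scores) - min(scores)}


def deterministic_score(text):
    """Stand-in scorer (hypothetical): ratio of unique words to total words."""
    words = text.lower().split()
    return len(set(words)) / max(len(words), 1)


summary = flakiness(deterministic_score, "the quick brown fox jumps over the lazy dog", 100)
print(summary)
```

Any scorer that depends only on its input yields a spread of exactly zero, whereas a scorer that samples from a model on every call generally will not.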
Drift detection across real model versions
Drift measured as 1 - cosine_similarity between recorded baselines and live model responses using sentence-transformers/all-MiniLM-L6-v2. Threshold: 0.15.
GPT-4o version drift (gpt-4o-2024-05-13 vs gpt-4o-2024-11-20):
| Prompt | Cosine Drift | Detected |
|---|---|---|
| structured_output | 0.012 | No |
| semantic_intent | 0.066 | No |
| format_compliance | 0.269 | Yes |
| code_generation | 0.162 | Yes |
| numeric_reasoning | 0.028 | No |
| instruction_following | 0.037 | No |
| chain_of_thought | 0.090 | No |
Anthropic model migration (claude-3-haiku-20240307 vs claude-sonnet-4-20250514):
| Prompt | Cosine Drift | Detected |
|---|---|---|
| structured_output | 0.100 | No |
| semantic_intent | 0.260 | Yes |
| format_compliance | 0.464 | Yes |
| code_generation | 0.124 | No |
| numeric_reasoning | 0.057 | No |
| instruction_following | 0.121 | No |
| chain_of_thought | 0.261 | Yes |
The OpenAI pair catches silent version drift within the same model family. The Anthropic pair quantifies behavioral change during a model tier migration. Both are real API endpoints any developer can call today.
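The detection rule described in this section (drift = 1 - cosine similarity, flagged when it exceeds 0.15) can be sketched in a few lines. The drift values are copied from the GPT-4o table above; the toy two-dimensional vectors stand in for real embeddings, which the actual benchmark obtains from all-MiniLM-L6-v2.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero-length vector")
    return dot / norm


def drift(a, b):
    """Drift as defined above: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


THRESHOLD = 0.15

# Drift values from the GPT-4o version comparison table.
gpt4o_drift = {
    "structured_output": 0.012,
    "semantic_intent": 0.066,
    "format_compliance": 0.269,
    "code_generation": 0.162,
    "numeric_reasoning": 0.028,
    "instruction_following": 0.037,
    "chain_of_thought": 0.090,
}
detected = sorted(name for name, value in gpt4o_drift.items() if value > THRESHOLD)
print(detected)
print(drift([1.0, 0.0], [0.0, 1.0]))  # orthogonal toy vectors
```

Applying the threshold to the tabulated values flags format_compliance and code_generation, matching the Detected column.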
Setup time
LLMAssert's install footprint depends on which extras you need. Measured in fresh venvs with warm pip cache:
| Configuration | Install | First Test | Total |
|---|---|---|---|
| pip install llm-assert | 3.1s | 102ms | 3.2s |
| pip install "llm-assert[anthropic]" | 5.4s | 105ms | 5.5s |
| pip install "llm-assert[anthropic,semantic]" | 85.3s | 113ms | 85.4s |
| pip install deepeval openai | 12.4s | 1,491ms | 13.8s |
The base install (structural assertions, any provider) is 3.1 seconds. Adding [semantic] pulls in sentence-transformers and PyTorch, which is heavy (85s) but eliminates all runtime API costs for semantic scoring. DeepEval's lighter install shifts that cost to runtime: every semantic assertion makes an additional LLM API call, which is why its first test takes 14x longer (1,491ms vs 105ms).
CI exit codes
| Interface | Exit Code on Failure |
|---|---|
| llm-assert run (CLI) | 1 |
| pytest (plugin) | 2 |
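For pipelines that drive the tools from a script rather than a shell step, gating on the exit code might look like the sketch below. The run_gate helper is hypothetical, and a child Python process exiting with status 1 stands in for a failing suite so the example runs anywhere; in a real pipeline the command would be something like ["llm-assert", "run", "suite.yml"].

```python
import subprocess
import sys


def run_gate(command):
    """Run a command; return (passed, exit_code), where passed means exit code 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.returncode


# Stand-ins for a failing and a passing suite run.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]

failed_ok, failed_code = run_gate(failing)
passed_ok, passed_code = run_gate(passing)
print(failed_ok, failed_code, passed_ok, passed_code)
```

Treating any non-zero code as a failure keeps the gate correct for both interfaces in the table, since they report failure with different codes.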
Test results (458 unit tests + live integration)
Every assertion type has been verified against live Claude Sonnet 4 (claude-sonnet-4-20250514) via the Anthropic API:
| Category | Assertions Tested | Result |
|---|---|---|
| Structural | 8 (is_valid_json, matches_schema, contains_keys, length_between, matches_pattern, does_not_contain, starts_with, ends_with) | All passed |
| Semantic | 4 (semantic_intent_matches, does_not_discuss, is_factually_consistent_with, uses_language_at_grade_level) | All passed |
| Behavioral | 4 (passes_rate with FixedSetSampler, passes_rate with TemplateSampler, refusal_rate_is_above, is_consistent_across_samples) | All passed |
| Regression | 4 (matches_baseline, semantic_drift_is_below, format_matches_baseline, intentional drift detection) | All passed |
| Composite | 5 (chained AND, or_, not_, satisfies, mixed structural+semantic) | All passed |
| YAML Suite via CLI | 7 cases, 30 assertions | All passed |
| pytest Plugin | 3 live tests + 1 correctly skipped behavioral | All passed |
| Error handling | 6 (MockProvider, NormalizedResponse fields, failure messages, schema violations, connectivity, result metadata) | All passed |
| Unit test suite | 458 tests | 450 passed, 8 skipped |
Status
Actively maintained. 458 unit tests plus live integration tests against Claude Sonnet 4. Development is ongoing.
Documentation
- Getting Started -- install, first test, first failure in under 5 minutes
- Assertion Types -- full reference for all assertion types
- Provider Guide -- configure each provider, required env vars
- pytest Guide -- fixtures, marks, CLI flags, report hooks
- YAML Suite Format -- complete YAML suite specification
- Scoring Guide -- how each scorer works, tuning thresholds
- Sampling Guide -- input sampling for behavioral assertions
- Regression Guide -- snapshot workflow, baseline management
- CI Guide -- GitHub Actions, GitLab CI, CircleCI recipes
- FAQ -- cost, flakiness, LLM-as-judge tradeoffs
- Contributing -- contributor guide
- Architecture -- design decisions and rationale
Contributing
Contributions are welcome. See the contributor guide for setup, standards, and process.
pip install -e ".[dev,semantic]"
pytest
License
Copyright 2025-2026 Bradley R. Kinnard. Licensed under Apache 2.0.
File details
Details for the file llm_assert-0.1.0.tar.gz.
File metadata
- Download URL: llm_assert-0.1.0.tar.gz
- Upload date:
- Size: 185.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b0042c56613628d178da810e9f5475167bee7be15c2b99655650abfd40d8dcd |
| MD5 | f4e61d7f452d7d253536f30bbf2b9528 |
| BLAKE2b-256 | 4816f34f2668c575e6a83cb2e21c4490a4f41efd1b1f049154ca79385d8d6475 |
File details
Details for the file llm_assert-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llm_assert-0.1.0-py3-none-any.whl
- Upload date:
- Size: 97.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 55846c8bd86500ea096d15c2afc3baceee638eb17b6a997853901d0920857b9b |
| MD5 | 60e9d6bb75a0929fb2668c7cfdddbcfa |
| BLAKE2b-256 | b26d5243cab58cbc79484e5eba44cd128079cb25e6198ba898d6fc847f49ffec |