LLMAssert

Behavioral assertion testing for LLM applications. The pytest of LLM testing.

Created by Bradley R. Kinnard




Quick Start · What Is This · Features · Installation · Usage · Benchmarks · Docs




Quick Start

pip install "llm-assert[anthropic]"

from llm_assert import LLMAssert
from llm_assert.providers.anthropic import AnthropicProvider

provider = AnthropicProvider(model="claude-sonnet-4-20250514")
v = LLMAssert(provider)

result = (
    v.assert_that("Return a JSON object with keys: title, summary, tags")
    .is_valid_json()
    .contains_keys(["title", "summary", "tags"])
    .length_between(50, 2000)
    .semantic_intent_matches("a structured summary with metadata")
    .does_not_contain("I'm sorry")
    .run()
)

assert result.passed

Works with any provider. Swap AnthropicProvider for OpenAIProvider, OllamaProvider, or any other adapter and the assertions stay the same.
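
For example, pointing the same chain at a local Ollama model. This is a sketch: the OllamaProvider import path and constructor arguments are assumptions by analogy with AnthropicProvider above.

from llm_assert import LLMAssert
from llm_assert.providers.ollama import OllamaProvider  # import path assumed by analogy

v = LLMAssert(OllamaProvider(model="llama3.1"))  # the assertion chain above runs unchanged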




What Is This

LLMAssert is a composable assertion library for verifying LLM output. It drops into your existing pytest suite and gives you a clean pass/fail on whether your AI system behaves correctly.

It is not a tracing platform, an observability tool, or a dashboard. Those tools monitor what happened. LLMAssert defines what should happen and fails your build if it does not.

Structural assertions (JSON validity, schema compliance, key presence, length bounds, regex) are deterministic, zero cost, and require no LLM calls. Semantic assertions (intent matching, topic avoidance, factual consistency) run locally via sentence-transformers with no API key and no external calls. Behavioral assertions run the model N times and assess the distribution. Regression assertions detect semantic drift and format shifts across model versions.


Features

  • Structural assertions verifying JSON, schema, keys, length, regex, and string patterns (deterministic, no LLM calls)
  • Semantic assertions using local embeddings (22MB model, CPU, no API key) for intent matching, topic avoidance, factual consistency, and reading level
  • Behavioral assertions running N samples with Wilson confidence intervals for pass rate, refusal rate, and consistency checks
  • Regression and drift detection comparing against versioned JSON baselines to catch silent model updates and format shifts
  • Composite logic chaining assertions with AND, OR, NOT, and satisfies() for arbitrary assertion instances
  • Provider-agnostic with adapters for OpenAI, Anthropic, Google, Mistral, Ollama, LiteLLM, and a MockProvider for zero-cost testing
  • pytest plugin registering automatically with fixtures, marks, CLI flags, and JSON report hooks
  • YAML assertion suites defined as configuration and runnable from CLI with non-zero exit on failure
  • GitHub Action (moonrunnerkc/llm-assert@v1) annotating PRs inline with assertion details on failure
  • Deterministic semantic scoring via local embeddings, producing identical scores across runs (zero flakiness)
  • No telemetry, no analytics, no background network traffic. LLMAssert makes exactly the LLM calls you ask for.



Installation

pip install llm-assert                          # structural assertions only (3s install)
pip install "llm-assert[anthropic]"             # add Anthropic provider (~5s)
pip install "llm-assert[anthropic,semantic]"    # add local semantic scoring (~85s, includes PyTorch)

The base install covers structural assertions with any provider. Adding [semantic] pulls in sentence-transformers and PyTorch, which is a heavy install but eliminates all runtime API costs for semantic scoring.

Available extras: openai, anthropic, google, mistral, ollama, litellm, semantic.




Usage

Assertion Types


Structural: verify form (deterministic, zero cost)
result = (
    v.assert_that(prompt)
    .is_valid_json()
    .matches_schema(my_schema)
    .contains_keys(["title", "summary"])
    .length_between(50, 2000)
    .starts_with("{")
    .ends_with("}")
    .matches_pattern(r'"title"\s*:')
    .does_not_contain("```")
    .run()
)

All 8 structural assertions verified against Claude Sonnet 4 with a single API call.

Semantic: verify meaning (local embeddings, no API key)

Uses embedding similarity via sentence-transformers (22MB model, runs locally on CPU).

result = (
    v.assert_that(prompt)
    .semantic_intent_matches("a helpful product recommendation", threshold=0.75)
    .does_not_discuss("competitor products", threshold=0.6)
    .is_factually_consistent_with(reference_doc, threshold=0.80)
    .uses_language_at_grade_level(8, tolerance=2)
    .run()
)

Tested against live Claude output: semantic_intent_matches scored 0.77, does_not_discuss correctly scored 0.10 (well below 0.6 rejection threshold), is_factually_consistent_with scored 0.81, and uses_language_at_grade_level correctly measured Flesch-Kincaid grade 5.0 for a simple-language prompt.

Behavioral: verify patterns across multiple outputs

Runs the model N times and assesses the distribution with Wilson confidence intervals.

from llm_assert.assertions.structural import IsValidJson
from llm_assert.sampling.strategies import FixedSetSampler

sampler = FixedSetSampler(["prompt one", "prompt two", "prompt three"])

result = (
    v.assert_that(prompt)
    .passes_rate(IsValidJson(), min_rate=0.95, n_samples=20, sampler=sampler)
    .run()
)

Tested with FixedSetSampler and TemplateSampler against Claude: 5/5 pass rate on structural checks, 4/5 refusal rate on adversarial inputs (meeting 0.80 threshold), and 1.000 consistency score across repeated calls.
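
The refusal and consistency checks chain the same way. A minimal sketch, reusing sampler from above; only the method names are confirmed on this page, so the keyword arguments are assumed by analogy with passes_rate:

# Sketch: argument names beyond the method names are assumptions.
result = (
    v.assert_that(adversarial_prompt)
    .refusal_rate_is_above(0.80, n_samples=5, sampler=sampler)
    .run()
)

result = (
    v.assert_that(prompt)
    .is_consistent_across_samples(n_samples=5)
    .run()
)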

Regression: detect drift and format shifts

Compare against recorded baselines. Detect semantic drift, format shifts, and silent model updates. Baselines live in your repo as versioned JSON.

from llm_assert.snapshots.manager import SnapshotManager

snap_mgr = SnapshotManager(snapshot_dir="llm_assert_snapshots/")

result = (
    v.assert_that(prompt)
    .matches_baseline("my_endpoint", snap_mgr, semantic_threshold=0.85)
    .run()
)

Also available: .semantic_drift_is_below() for drift-only checks, and .format_matches_baseline() for structural-only comparison. Drift detection verified: a marine biology response against a Python baseline correctly triggered failure with 0.89 semantic drift.
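
A minimal sketch of the drift-only variant, reusing snap_mgr from above. The argument names are assumptions; only the method name and the max_drift concept (see the failure output below) appear on this page:

# Sketch: argument names assumed; max_drift mirrors the failure message shown later.
result = (
    v.assert_that(prompt)
    .semantic_drift_is_below("my_endpoint", snap_mgr, max_drift=0.15)
    .run()
)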

Composite: chain assertions with boolean logic
from llm_assert.assertions.structural import IsValidJson, StartsWith

# Implicit AND: every chained assertion must pass
result = v.assert_that(prompt).is_valid_json().contains_keys(["name"]).run()

# OR: accept either format
result = v.assert_that(prompt).or_(StartsWith("{"), StartsWith("[")).run()

# NOT: invert any assertion
result = v.assert_that(prompt).not_(IsValidJson()).run()

# satisfies: pass any BaseAssertion instance
result = v.assert_that(prompt).satisfies(IsValidJson()).run()

Providers

LLMAssert works with any LLM provider. The provider layer is a thin adapter; assertions are provider-agnostic.

pip install "llm-assert[openai]"       # OpenAI
pip install "llm-assert[anthropic]"    # Anthropic
pip install "llm-assert[ollama]"       # Ollama (local)
pip install "llm-assert[google]"       # Google Generative AI
pip install "llm-assert[mistral]"      # Mistral
pip install "llm-assert[litellm]"      # Any provider via LiteLLM

from llm_assert.providers.mock import MockProvider

# Test assertions without API calls or cost
mock = MockProvider(response_fn=lambda prompt, msgs=None: '{"title": "Test", "summary": "ok"}')
v = LLMAssert(mock)

Every provider returns a NormalizedResponse with consistent fields: content, model (exact identifier, not the alias), provider, latency_ms, prompt_tokens, completion_tokens, finish_reason, request_id, and raw (original provider response). Verified against live Anthropic output.
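
For illustration, those fields are plain attributes on the response object. This sketch assumes the provider exposes a generate(prompt) method returning a NormalizedResponse, which this page does not document:

resp = provider.generate("Say hello")           # method name is an assumption
print(resp.content, resp.model, resp.provider)
print(resp.latency_ms, resp.prompt_tokens, resp.completion_tokens)
print(resp.finish_reason, resp.request_id)      # resp.raw keeps the original provider payload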


pytest Integration

LLMAssert registers as a pytest plugin. No new CLI or workflow to learn.

def test_summarizer(llm_assert_runner):
    result = (
        llm_assert_runner
        .assert_that("Summarize the Q3 earnings report")
        .is_valid_json()
        .contains_keys(["summary", "highlights"])
        .semantic_intent_matches("financial summary with key highlights")
        .run()
    )
    assert result.passed
LLM_ASSERT_PROVIDER=anthropic pytest tests/ -v

Flag                                                             Effect
--llm-assert-report json --llm-assert-report-path report.json   Save JSON report
--llm-assert-skip-behavioral                                     Skip expensive multi-sample tests
--llm-assert-strict                                              Borderline passes become failures

The --llm-assert-skip-behavioral flag skips any test marked with @pytest.mark.llm_assert_behavioral, keeping commit-level runs fast while full behavioral suites run on a longer schedule.
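
For example, a multi-sample test carrying that mark, composed entirely from pieces shown above:

import pytest

from llm_assert.assertions.structural import IsValidJson
from llm_assert.sampling.strategies import FixedSetSampler

@pytest.mark.llm_assert_behavioral
def test_json_pass_rate(llm_assert_runner):
    # Skipped entirely when pytest runs with --llm-assert-skip-behavioral.
    sampler = FixedSetSampler(["prompt one", "prompt two", "prompt three"])
    result = (
        llm_assert_runner
        .assert_that("Return a JSON object")
        .passes_rate(IsValidJson(), min_rate=0.95, n_samples=20, sampler=sampler)
        .run()
    )
    assert result.passed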


YAML Suites

Define assertion suites as configuration, committed alongside your model config:

version: "1.0"
name: "summarizer_suite"
cases:
  - name: "valid_json_output"
    prompt: "Return a JSON summary of the document"
    assertions:
      - type: is_valid_json
      - type: contains_keys
        params:
          keys: ["title", "summary"]
      - type: length_between
        params:
          min_chars: 50
          max_chars: 2000
  - name: "semantic_check"
    prompt: "Explain why Python is popular for data science"
    assertions:
      - type: semantic_intent_matches
        params:
          reference_intent: "Python is widely used in data science"
          threshold: 0.55
      - type: does_not_discuss
        params:
          topic: "JavaScript web frameworks"
          threshold: 0.6

LLM_ASSERT_PROVIDER=anthropic llm-assert run suite.yml

Exits with non-zero on any failure. Supports --format json for CI report ingestion. A 7-case, 30-assertion YAML suite ran against Claude Sonnet 4 in 47 seconds with all cases passing.
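
For example, capturing a machine-readable report in CI. Redirecting stdout is one way to keep the file; whether the flag can also write to a path directly is not documented here:

LLM_ASSERT_PROVIDER=anthropic llm-assert run suite.yml --format json > report.json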


CLI

llm-assert check                                    # verify provider connectivity
llm-assert run suite.yml --provider anthropic       # execute a YAML assertion suite
llm-assert snapshot create my_key "prompt text"     # record a baseline snapshot
llm-assert snapshot diff my_key --prompt "prompt"   # compare current vs baseline
llm-assert snapshot update my_key "prompt text"     # update an existing baseline
llm-assert snapshot delete my_key                   # remove a baseline
llm-assert providers                                # list installed providers
llm-assert report result.json                       # pretty-print a saved report

GitHub Actions

- uses: moonrunnerkc/llm-assert@v1
  with:
    suite: tests/llm_assert_suite.yml
    llm-assert-extras: anthropic,semantic
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Failures annotate the PR inline with assertion details: which assertion failed, the actual score, the threshold, and the model version. Step summaries appear in the workflow run UI.


Failure Output

When an assertion fails, the message tells you what went wrong, not just that something failed:

SemanticAssertion failed: score 0.68 below threshold 0.75
  using all-MiniLM-L6-v2, input 340 chars,
  provider claude-sonnet-4-20250514.
  Check embedding model version or lower threshold
  if intent ambiguity is acceptable.

Response JSON has 2 schema violations: 'email' is a required property

SemanticDrift failed for 'python_summary': drift 0.8863 exceeds max 0.1000
  (similarity 0.1137). Baseline model: claude-sonnet-4-20250514.
  Review the prompt or lower max_drift if the change is intentional.



Benchmarks

Every number below is produced by a runnable script in llm-assert-benchmark/ and backed by a JSON result file. Full methodology and reproduction instructions: llm-assert-benchmark/README.md.


Comparison with existing tools
Metric                          LLMAssert          DeepEval       Promptfoo      LangSmith      Braintrust
Account required                No                 Partial        No             Yes            Yes
Native drift detection          Yes                No             No             No             No
Semantic scoring method         Local embeddings   LLM-as-judge   LLM-as-judge   LLM-as-judge   LLM-as-judge
Provider API calls per run      7                  28             17             21             21
Monthly CI cost (30 runs/day)   $26.78             $31.31         $28.94         $29.80         $29.80
Lines of code for drift test    9                  25             6 (YAML)       27             --

Source: llm-assert-benchmark/results/cost.json, llm-assert-benchmark/results/loc.json

API calls and telemetry (measured via request interception)

Actual HTTPS requests intercepted during a 3-case suite run via urllib3 and httpx patching:

Metric               LLMAssert   DeepEval
Provider API calls   3           12 (3 model + 9 judge)
Telemetry calls      0           4 (3 PostHog + 1 ipify)

LLMAssert makes exactly the LLM calls you ask for. No analytics, no IP lookups, no background network traffic.

Source: llm-assert-benchmark/results/api_call_counts.json

Flakiness (identical input, repeated runs)

LLMAssert's embedding-based scoring is deterministic. LLM-as-judge scoring is not.

Metric           LLMAssert (100 runs)   DeepEval (20 runs)
Score stdev      0.0                    0.0160
Score range      0.0                    0.0714 (0.929 to 1.0)
Scoring method   Local embeddings       LLM-as-judge

Source: llm-assert-benchmark/results/flakiness.json

Drift detection across real model versions

Drift measured as 1 - cosine_similarity between recorded baselines and live model responses using sentence-transformers/all-MiniLM-L6-v2. Threshold: 0.15.
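
The drift metric is easy to reproduce independently of LLMAssert's internals. A minimal sketch with sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def drift(baseline_text: str, current_text: str) -> float:
    # Drift = 1 - cosine similarity between the two response embeddings.
    a, b = model.encode([baseline_text, current_text], convert_to_tensor=True)
    return 1.0 - util.cos_sim(a, b).item()

# Identical text yields ~0 drift; values above the 0.15 threshold flag a shift.
assert drift("same text", "same text") < 1e-6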

GPT-4o version drift (gpt-4o-2024-05-13 vs gpt-4o-2024-11-20):

Prompt                  Cosine Drift   Detected
structured_output       0.012          No
semantic_intent         0.066          No
format_compliance       0.269          Yes
code_generation         0.162          Yes
numeric_reasoning       0.028          No
instruction_following   0.037          No
chain_of_thought        0.090          No

Anthropic model migration (claude-3-haiku-20240307 vs claude-sonnet-4-20250514):

Prompt                  Cosine Drift   Detected
structured_output       0.100          No
semantic_intent         0.260          Yes
format_compliance       0.464          Yes
code_generation         0.124          No
numeric_reasoning       0.057          No
instruction_following   0.121          No
chain_of_thought        0.261          Yes

The OpenAI pair catches silent version drift within the same model family. The Anthropic pair quantifies behavioral change during a model tier migration. Both are real API endpoints any developer can call today.

Source: llm-assert-benchmark/results/drift_detection.json

Setup time

LLMAssert's install footprint depends on which extras you need. Measured in fresh venvs with warm pip cache:

Configuration                                  Install   First Test   Total
pip install llm-assert                         3.1s      102ms        3.2s
pip install "llm-assert[anthropic]"            5.4s      105ms        5.5s
pip install "llm-assert[anthropic,semantic]"   85.3s     113ms        85.4s
pip install deepeval openai                    12.4s     1,491ms      13.8s

The base install (structural assertions, any provider) is 3.1 seconds. Adding [semantic] pulls in sentence-transformers and PyTorch, which is heavy (85s) but eliminates all runtime API costs for semantic scoring. DeepEval's lighter install shifts that cost to runtime: every semantic assertion makes an additional LLM API call, which is why its first test takes 14x longer (1,491ms vs 105ms).

Source: llm-assert-benchmark/results/setup_time.json

CI exit codes
Interface              Exit Code on Failure
llm-assert run (CLI)   1
pytest (plugin)        2

Source: llm-assert-benchmark/results/exit_codes.json

Test results (458 unit tests + live integration)

Every assertion type has been verified against live Claude Sonnet 4 (claude-sonnet-4-20250514) via the Anthropic API:

Category             Assertions Tested                                               Result
Structural           8 (is_valid_json, matches_schema, contains_keys,
                     length_between, matches_pattern, does_not_contain,
                     starts_with, ends_with)                                         All passed
Semantic             4 (semantic_intent_matches, does_not_discuss,
                     is_factually_consistent_with, uses_language_at_grade_level)     All passed
Behavioral           4 (passes_rate with FixedSetSampler, passes_rate with
                     TemplateSampler, refusal_rate_is_above,
                     is_consistent_across_samples)                                   All passed
Regression           4 (matches_baseline, semantic_drift_is_below,
                     format_matches_baseline, intentional drift detection)           All passed
Composite            5 (chained AND, or_, not_, satisfies,
                     mixed structural+semantic)                                      All passed
YAML Suite via CLI   7 cases, 30 assertions                                          All passed
pytest Plugin        3 live tests + 1 correctly skipped behavioral                   All passed
Error handling       6 (MockProvider, NormalizedResponse fields, failure
                     messages, schema violations, connectivity, result metadata)     All passed
Unit test suite      458 tests                                                       450 passed, 8 skipped



Status

Actively maintained, with 458 unit tests plus live integration tests against Claude Sonnet 4.




Documentation




Contributing

Contributions are welcome. See the contributor guide for setup, standards, and process.

pip install -e ".[dev,semantic]"
pytest



License

Copyright 2025-2026 Bradley R. Kinnard. Licensed under Apache 2.0.

