# checkllm

Test LLM-powered applications with the same rigor as traditional software.
checkllm is a pytest plugin and CLI that lets you write assertions for LLM outputs using deterministic checks, LLM-as-judge evaluation, and statistical regression detection.
## Why checkllm?
- Works with pytest - no new test runner to learn, just add a `check` fixture
- Free deterministic checks run instantly with zero API calls
- LLM-as-judge for subjective quality (hallucination, relevance, toxicity, custom rubrics)
- Statistical regression detection using Welch's t-test, not just "did it change?"
- Multiple judge backends - OpenAI and Anthropic, or bring your own
- One command to snapshot, report, or diff your test results
## Installation

```shell
pip install checkllm
```

For Anthropic Claude support:

```shell
pip install checkllm[anthropic]
```
## Quick Start

### 1. Write a test

```python
# tests/test_my_agent.py
def test_output_quality(check):
    output = my_agent("What is Python?")

    # Deterministic checks (free, instant)
    check.contains(output, "programming language")
    check.not_contains(output, "JavaScript")
    check.max_tokens(output, limit=200)

    # LLM-as-judge checks (requires OPENAI_API_KEY)
    check.hallucination(output, context="Python is a high-level programming language.")
    check.relevance(output, query="What is Python?")
    check.toxicity(output)
```
### 2. Run it

```shell
export OPENAI_API_KEY=sk-...
pytest tests/test_my_agent.py -v

# Or use the CLI
checkllm run tests/test_my_agent.py
```
### 3. Track regressions

```shell
checkllm snapshot tests/ --output .checkllm/snapshots/baseline.json

# After changes, compare
checkllm snapshot tests/ --output .checkllm/snapshots/current.json
checkllm diff --baseline .checkllm/snapshots/baseline.json \
              --current .checkllm/snapshots/current.json
```
## Deterministic Checks
Zero-cost, zero-latency checks that run locally:
```python
def test_deterministic(check):
    output = my_agent("...")

    check.contains(output, "expected substring")
    check.not_contains(output, "forbidden text")
    check.exact_match(output, "exact expected output")
    check.exact_match(output, "EXPECTED", ignore_case=True)
    check.starts_with(output, "Python")
    check.ends_with(output, "language.")
    check.regex(output, pattern=r"\d{3}-\d{4}")
    check.max_tokens(output, limit=500)
    check.latency(response_time_ms, max_ms=2000)
    check.cost(api_cost_usd, max_usd=0.05)

    # Validate JSON structure
    from pydantic import BaseModel

    class Response(BaseModel):
        answer: str
        confidence: float

    check.json_schema(output, schema=Response)
```
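Note that `check.latency` and `check.cost` take numbers you supply yourself. One minimal way to capture the latency figure is plain Python timing, nothing checkllm-specific (`timed_call` is a hypothetical helper, and the lambda stands in for your agent):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Stand-in for a real my_agent(...) call:
output, response_time_ms = timed_call(lambda q: q.upper(), "what is python?")
```

The resulting `response_time_ms` is then what you pass to `check.latency(response_time_ms, max_ms=2000)`.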
## LLM-as-Judge Metrics
Use GPT-4o (or Claude) as an automated judge:
```python
def test_llm_quality(check):
    output = my_agent("Summarize this article about climate change.")
    article = "..."

    check.hallucination(output, context=article)
    check.relevance(output, query="Summarize the article")
    check.toxicity(output)
    check.rubric(output, criteria="concise, under 3 sentences, mentions key findings")
```
Each check records a score (0.0-1.0), pass/fail status, reasoning, cost, and latency.
### Custom Thresholds

```python
check.hallucination(output, context=ctx, threshold=0.9)  # stricter
check.relevance(output, query=q, threshold=0.6)          # more lenient
```
### Multiple Runs

```python
check.hallucination(output, context=ctx, runs=5)
```
Or set globally:
```toml
[tool.checkllm]
runs_per_test = 3
```
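Multiple runs help because judge scores are noisy: averaging n independent runs shrinks the spread of the final score by roughly √n. A quick illustration with simulated scores (pure Python, independent of checkllm):

```python
import random
import statistics

random.seed(0)

def judge_score():
    # Simulated noisy judge: true quality 0.85 plus Gaussian noise.
    return 0.85 + random.gauss(0, 0.05)

single = [judge_score() for _ in range(1000)]
averaged = [statistics.mean(judge_score() for _ in range(5)) for _ in range(1000)]

print(round(statistics.stdev(single), 3))    # spread of a single run
print(round(statistics.stdev(averaged), 3))  # roughly 1/sqrt(5) of that
```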
## Dataset-Driven Testing
```yaml
# tests/fixtures/cases.yaml
- input: "What is Python?"
  expected: "Python is a programming language"
  query: "Explain Python"
  context: "Python was created by Guido van Rossum in 1991."
  criteria: "accurate, mentions creator"
- input: "What is 2+2?"
  expected: "4"
  criteria: "correct, concise"
```
```python
from checkllm import dataset

@dataset("tests/fixtures/cases.yaml")
def test_across_cases(check, case):
    output = my_agent(case.input)
    check.contains(output, case.expected)
    if case.context:
        check.hallucination(output, context=case.context)
```
Or use a Python generator:
```python
from checkllm import Case, dataset

def my_cases():
    yield Case(input="Hello", expected="greeting", criteria="friendly")
    yield Case(input="Goodbye", expected="farewell", criteria="polite")

@dataset(my_cases)
def test_generated(check, case):
    output = my_agent(case.input)
    check.rubric(output, criteria=case.criteria)
```
## Custom Metrics
```python
import checkllm
from checkllm import CheckResult

@checkllm.metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
    word_count = len(output.split())
    return CheckResult(
        passed=word_count <= max_words,
        score=min(1.0, max_words / max(word_count, 1)),
        reasoning=f"{word_count} words (limit: {max_words})",
        cost=0.0,
        latency_ms=0,
        metric_name="brevity",
    )

def test_brevity(check):
    output = my_agent("Explain quantum physics")
    check.run_metric("brevity", output=output, max_words=100)
```
## Async Tests
```python
import pytest

@pytest.mark.asyncio
async def test_async_quality(check):
    output = await my_async_agent("What is Python?")

    await check.ahallucination(output, context="...")
    await check.arelevance(output, query="What is Python?")
    await check.atoxicity(output)
    await check.arubric(output, criteria="concise and accurate")

    # Deterministic checks are always sync (instant, no I/O)
    check.contains(output, "Python")
```
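Because the `a*` variants are coroutines, independent checks can in principle be awaited concurrently with standard `asyncio.gather` rather than one at a time. A sketch of the pattern with stand-in coroutines (whether concurrent judge calls fit your provider's rate limits is up to you):

```python
import asyncio

async def fake_judge_check(name: str) -> str:
    # Stand-in for an LLM-judge call such as check.ahallucination(...)
    await asyncio.sleep(0.01)
    return name

async def main() -> list[str]:
    # All three "checks" wait on I/O at the same time.
    return await asyncio.gather(
        fake_judge_check("hallucination"),
        fake_judge_check("relevance"),
        fake_judge_check("toxicity"),
    )

results = asyncio.run(main())
print(results)
```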
## Separating Fast and Slow Tests
Mark LLM tests so you can skip them in fast CI runs:
```python
import pytest

@pytest.mark.llm
def test_with_llm(check):
    check.hallucination(output, context=ctx)

def test_fast(check):
    check.contains(output, "Python")
```
```shell
# Run only fast deterministic tests
pytest -m "not llm"

# Run only LLM tests
pytest -m llm
```
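If running with `-m llm` produces a `PytestUnknownMarkWarning`, the marker may need to be registered. This is standard pytest configuration rather than a checkllm setting, and is only needed if the plugin does not already register the marker for you:

```toml
[tool.pytest.ini_options]
markers = [
    "llm: tests that call an LLM judge",
]
```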
If `OPENAI_API_KEY` is not set, LLM checks are skipped automatically instead of crashing.
## Regression Detection
checkllm uses Welch's t-test to detect statistically significant score regressions.
```shell
checkllm snapshot tests/ --output .checkllm/snapshots/v1.json
# ... make changes ...
checkllm snapshot tests/ --output .checkllm/snapshots/v2.json

checkllm diff -b .checkllm/snapshots/v1.json -c .checkllm/snapshots/v2.json

# Fail CI on regression
checkllm diff -b v1.json -c v2.json --fail-on-regression
```
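The statistic behind the diff can be sketched in a few lines. This is an illustrative reimplementation of Welch's t-test, not checkllm's internal code; the score lists are made-up examples:

```python
import math

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

baseline = [0.91, 0.88, 0.90, 0.92, 0.89]  # e.g. scores from runs_per_test > 1
current = [0.78, 0.75, 0.80, 0.77, 0.79]
t, df = welch_t(baseline, current)
print(round(t, 2), round(df, 1))
```

A large |t| at the given degrees of freedom yields a small p-value; checkllm compares that p-value against `p_value_threshold` to decide whether a score drop is a genuine regression rather than run-to-run noise.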
## Reporting

```shell
# HTML report
checkllm report tests/ --output report.html

# JUnit XML for CI/CD
checkllm run tests/ --junit-xml results.xml

# pytest flags work directly
pytest tests/ --checkllm-snapshot=snap.json --checkllm-report=report.html
```
## CLI Reference

| Command | Description |
|---|---|
| `checkllm run <path>` | Run tests with `--snapshot`, `--html-report`, `--junit-xml`, `--compare`, `--fail-on-regression` |
| `checkllm snapshot <path>` | Save test results as a baseline (`--output PATH`) |
| `checkllm report <path>` | Generate an HTML report (`--output PATH`, `--junit-xml PATH`) |
| `checkllm diff` | Compare snapshots (`--baseline`, `--current`, `--fail-on-regression`) |
| `checkllm eval` | Evaluate a prompt template (`--prompt`, `--dataset`, `--metric`, `--threshold`) |
| `checkllm init [path]` | Scaffold a new project |
| `checkllm list-metrics` | List available metrics |
| `checkllm --version` | Show version |
## Configuration

```toml
[tool.checkllm]
judge_backend = "openai"      # "openai" or "anthropic"
judge_model = "gpt-4o"        # Model for LLM-as-judge
default_threshold = 0.8       # Pass/fail threshold (0.0-1.0)
runs_per_test = 1             # Repeat LLM checks N times
snapshot_dir = ".checkllm/snapshots"
confidence_level = 0.95
p_value_threshold = 0.05
```
Environment variable overrides: `CHECKLLM_JUDGE_BACKEND`, `CHECKLLM_JUDGE_MODEL`, `CHECKLLM_DEFAULT_THRESHOLD`, `CHECKLLM_RUNS_PER_TEST`.
## Custom Judge Backends

### Anthropic Claude

```toml
[tool.checkllm]
judge_backend = "anthropic"
judge_model = "claude-sonnet-4-6"
```
### Your Own Backend

Implement the `JudgeBackend` protocol:
```python
from checkllm import JudgeBackend, JudgeResponse
from checkllm.check import CheckCollector
from checkllm.config import CheckllmConfig

class MyJudge:
    async def evaluate(self, prompt: str, system_prompt: str | None = None) -> JudgeResponse:
        return JudgeResponse(score=0.9, reasoning="Looks good", cost=0.0)

config = CheckllmConfig()
collector = CheckCollector(config=config, judge=MyJudge())
```
### Configuring the Judge in `conftest.py`
To use a cheaper model or a custom backend for all tests:
```python
# tests/conftest.py
import pytest

from checkllm.check import CheckCollector
from checkllm.config import load_config
from checkllm.judge import OpenAIJudge
from checkllm.pytest_plugin import _CHECKLLM_KEY

@pytest.fixture
def check(request):
    config = load_config()
    judge = OpenAIJudge(model="gpt-4o-mini")  # cheaper model for dev
    collector = CheckCollector(config=config, judge=judge)
    request.node.stash[_CHECKLLM_KEY] = collector
    return collector
```
## Project Setup

```shell
checkllm init
```
Creates `pyproject.toml`, `tests/conftest.py`, a sample test file, a sample dataset, and a `.checkllm/snapshots/` directory.
## Examples

See the `examples/` directory for working code:

- `test_basic.py` - deterministic checks (no API key needed)
- `test_dataset_driven.py` - YAML and generator datasets
- `test_custom_metrics.py` - register domain-specific metrics
- `test_llm_judge.py` - LLM-as-judge evaluation
- `test_regression_workflow.py` - snapshot and regression detection
## License
MIT