
checkllm

Test LLM-powered applications with the same rigor as traditional software.

checkllm is a pytest plugin and CLI that lets you write assertions for LLM outputs using deterministic checks, LLM-as-judge evaluation, and statistical regression detection.

Why checkllm?

  • Works with pytest - no new test runner to learn, just add a check fixture
  • Free deterministic checks run instantly with zero API calls
  • LLM-as-judge for subjective quality (hallucination, relevance, toxicity, custom rubrics)
  • Statistical regression detection using Welch's t-test, not just "did it change?"
  • Multiple judge backends - OpenAI and Anthropic, or bring your own
  • One command to snapshot, report, or diff your test results

Installation

pip install checkllm

For Anthropic Claude support:

pip install "checkllm[anthropic]"

Quick Start

1. Write a test

# tests/test_my_agent.py

def test_output_quality(check):
    output = my_agent("What is Python?")

    # Deterministic checks (free, instant)
    check.contains(output, "programming language")
    check.not_contains(output, "JavaScript")
    check.max_tokens(output, limit=200)

    # LLM-as-judge checks (requires OPENAI_API_KEY)
    check.hallucination(output, context="Python is a high-level programming language.")
    check.relevance(output, query="What is Python?")
    check.toxicity(output)

2. Run it

export OPENAI_API_KEY=sk-...

pytest tests/test_my_agent.py -v

# Or use the CLI
checkllm run tests/test_my_agent.py

3. Track regressions

checkllm snapshot tests/ --output .checkllm/snapshots/baseline.json

# After changes, compare
checkllm snapshot tests/ --output .checkllm/snapshots/current.json
checkllm diff --baseline .checkllm/snapshots/baseline.json \
              --current .checkllm/snapshots/current.json

Deterministic Checks

Zero-cost, zero-latency checks that run locally:

def test_deterministic(check):
    output = my_agent("...")

    check.contains(output, "expected substring")
    check.not_contains(output, "forbidden text")
    check.exact_match(output, "exact expected output")
    check.exact_match(output, "EXPECTED", ignore_case=True)
    check.starts_with(output, "Python")
    check.ends_with(output, "language.")
    check.regex(output, pattern=r"\d{3}-\d{4}")
    check.max_tokens(output, limit=500)
    check.latency(response_time_ms, max_ms=2000)
    check.cost(api_cost_usd, max_usd=0.05)

    # Validate JSON structure
    from pydantic import BaseModel

    class Response(BaseModel):
        answer: str
        confidence: float

    check.json_schema(output, schema=Response)
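As a rough sketch of what a structural check like json_schema amounts to (an illustration only — checkllm delegates real validation to pydantic), a JSON check boils down to parsing the output and verifying field types:

```python
import json

def passes_json_check(output: str, required: dict) -> bool:
    """Illustrative only: parse model output and verify that each
    required field is present with the expected type. checkllm's real
    pydantic-based validation is richer than this sketch."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # non-JSON output fails the check outright
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```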

LLM-as-Judge Metrics

Use GPT-4o (or Claude) as an automated judge:

def test_llm_quality(check):
    output = my_agent("Summarize this article about climate change.")
    article = "..."

    check.hallucination(output, context=article)
    check.relevance(output, query="Summarize the article")
    check.toxicity(output)
    check.rubric(output, criteria="concise, under 3 sentences, mentions key findings")

Each check records a score (0.0-1.0), pass/fail status, reasoning, cost, and latency.
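The score-to-verdict mapping can be pictured with a small stand-in (the field names mirror the CheckResult shown under Custom Metrics below, but this class is hypothetical, not checkllm's own):

```python
from dataclasses import dataclass

@dataclass
class JudgeRecord:
    """Hypothetical stand-in mirroring the fields checkllm records per check."""
    score: float      # judge score in [0.0, 1.0]
    passed: bool      # whether the score cleared the threshold
    reasoning: str    # the judge's explanation
    cost: float       # API cost in USD
    latency_ms: int   # wall-clock time of the judge call

def record(score: float, reasoning: str, cost: float, latency_ms: int,
           threshold: float = 0.8) -> JudgeRecord:
    # A check passes when the judge's score clears the configured threshold
    # (0.8 matching the documented default_threshold).
    return JudgeRecord(score=score, passed=score >= threshold,
                       reasoning=reasoning, cost=cost, latency_ms=latency_ms)
```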

Custom Thresholds

check.hallucination(output, context=ctx, threshold=0.9)  # stricter
check.relevance(output, query=q, threshold=0.6)          # more lenient

Multiple Runs

check.hallucination(output, context=ctx, runs=5)

Or set globally:

[tool.checkllm]
runs_per_test = 3

Dataset-Driven Testing

# tests/fixtures/cases.yaml
- input: "What is Python?"
  expected: "Python is a programming language"
  query: "Explain Python"
  context: "Python was created by Guido van Rossum in 1991."
  criteria: "accurate, mentions creator"

- input: "What is 2+2?"
  expected: "4"
  criteria: "correct, concise"

Load the cases with the @dataset decorator:

from checkllm import dataset

@dataset("tests/fixtures/cases.yaml")
def test_across_cases(check, case):
    output = my_agent(case.input)
    check.contains(output, case.expected)
    if case.context:
        check.hallucination(output, context=case.context)

Or use a Python generator:

from checkllm import Case, dataset

def my_cases():
    yield Case(input="Hello", expected="greeting", criteria="friendly")
    yield Case(input="Goodbye", expected="farewell", criteria="polite")

@dataset(my_cases)
def test_generated(check, case):
    output = my_agent(case.input)
    check.rubric(output, criteria=case.criteria)

Custom Metrics

import checkllm
from checkllm import CheckResult

@checkllm.metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
    word_count = len(output.split())
    return CheckResult(
        passed=word_count <= max_words,
        score=min(1.0, max_words / max(word_count, 1)),
        reasoning=f"{word_count} words (limit: {max_words})",
        cost=0.0,
        latency_ms=0,
        metric_name="brevity",
    )

def test_brevity(check):
    output = my_agent("Explain quantum physics")
    check.run_metric("brevity", output=output, max_words=100)

Async Tests

import pytest

@pytest.mark.asyncio
async def test_async_quality(check):
    output = await my_async_agent("What is Python?")

    await check.ahallucination(output, context="...")
    await check.arelevance(output, query="What is Python?")
    await check.atoxicity(output)
    await check.arubric(output, criteria="concise and accurate")

    # Deterministic checks are always sync (instant, no I/O)
    check.contains(output, "Python")

Separating Fast and Slow Tests

Mark LLM tests so you can skip them in fast CI runs:

import pytest

@pytest.mark.llm
def test_with_llm(check):
    check.hallucination(output, context=ctx)

def test_fast(check):
    check.contains(output, "Python")

# Run only fast deterministic tests
pytest -m "not llm"

# Run only LLM tests
pytest -m llm

If OPENAI_API_KEY is not set, LLM checks automatically skip instead of crashing.
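pytest warns about unknown markers unless they are registered. If the plugin does not register the llm marker for you (worth verifying in your setup), you can declare it yourself in pyproject.toml:

```toml
[tool.pytest.ini_options]
markers = [
    "llm: tests that call an LLM judge (skip with -m 'not llm')",
]
```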

Regression Detection

checkllm uses Welch's t-test to detect statistically significant score regressions.
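To make the statistics concrete, here is a stdlib-only sketch of the Welch's t statistic underlying this comparison (illustrative; checkllm's actual implementation may differ):

```python
from statistics import mean, variance

def welch_t(baseline: list[float], current: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for
    two independent score samples with possibly unequal variances."""
    m1, m2 = mean(baseline), mean(current)
    v1, v2 = variance(baseline), variance(current)  # sample variances (n-1)
    n1, n2 = len(baseline), len(current)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (m1 - m2) / se2 ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```

The p-value then comes from the t distribution with df degrees of freedom: a genuine score drop yields a large positive t and a p-value below the configured p_value_threshold, while run-to-run noise does not.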

checkllm snapshot tests/ --output .checkllm/snapshots/v1.json
# ... make changes ...
checkllm snapshot tests/ --output .checkllm/snapshots/v2.json
checkllm diff -b .checkllm/snapshots/v1.json -c .checkllm/snapshots/v2.json

# Fail CI on regression
checkllm diff -b v1.json -c v2.json --fail-on-regression

Reporting

# HTML report
checkllm report tests/ --output report.html

# JUnit XML for CI/CD
checkllm run tests/ --junit-xml results.xml

# pytest flags work directly
pytest tests/ --checkllm-snapshot=snap.json --checkllm-report=report.html

CLI Reference

| Command | Description |
| --- | --- |
| checkllm run <path> | Run tests with --snapshot, --html-report, --junit-xml, --compare, --fail-on-regression |
| checkllm snapshot <path> | Save test results as a baseline (--output PATH) |
| checkllm report <path> | Generate an HTML report (--output PATH, --junit-xml PATH) |
| checkllm diff | Compare snapshots (--baseline, --current, --fail-on-regression) |
| checkllm eval | Evaluate a prompt template (--prompt, --dataset, --metric, --threshold) |
| checkllm init [path] | Scaffold a new project |
| checkllm list-metrics | List available metrics |
| checkllm --version | Show version |

Configuration

[tool.checkllm]
judge_backend = "openai"           # "openai" or "anthropic"
judge_model = "gpt-4o"             # Model for LLM-as-judge
default_threshold = 0.8            # Pass/fail threshold (0.0-1.0)
runs_per_test = 1                  # Repeat LLM checks N times
snapshot_dir = ".checkllm/snapshots"
confidence_level = 0.95
p_value_threshold = 0.05

Environment variable overrides: CHECKLLM_JUDGE_BACKEND, CHECKLLM_JUDGE_MODEL, CHECKLLM_DEFAULT_THRESHOLD, CHECKLLM_RUNS_PER_TEST.
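Assuming the usual precedence of environment variables over file configuration (an assumption about checkllm's internals, not documented behavior), resolution looks roughly like:

```python
import os

def resolve_setting(name: str, file_value, default):
    """Illustrative precedence: CHECKLLM_* environment variable first,
    then the pyproject.toml value, then the built-in default."""
    env_value = os.environ.get(f"CHECKLLM_{name.upper()}")
    if env_value is not None:
        return env_value
    return file_value if file_value is not None else default
```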

Custom Judge Backends

Anthropic Claude

[tool.checkllm]
judge_backend = "anthropic"
judge_model = "claude-sonnet-4-6"

Your Own Backend

Implement the JudgeBackend protocol:

from checkllm import JudgeBackend, JudgeResponse
from checkllm.check import CheckCollector
from checkllm.config import CheckllmConfig

class MyJudge:
    async def evaluate(self, prompt: str, system_prompt: str | None = None) -> JudgeResponse:
        return JudgeResponse(score=0.9, reasoning="Looks good", cost=0.0)

config = CheckllmConfig()
collector = CheckCollector(config=config, judge=MyJudge())

Configuring the Judge in conftest.py

To use a cheaper model or a custom backend for all tests:

# tests/conftest.py
import pytest
from checkllm.check import CheckCollector
from checkllm.config import load_config
from checkllm.judge import OpenAIJudge
from checkllm.pytest_plugin import _CHECKLLM_KEY

@pytest.fixture
def check(request):
    config = load_config()
    judge = OpenAIJudge(model="gpt-4o-mini")  # cheaper model for dev
    collector = CheckCollector(config=config, judge=judge)
    request.node.stash[_CHECKLLM_KEY] = collector
    return collector

Project Setup

checkllm init

Creates pyproject.toml, tests/conftest.py, sample test file, sample dataset, and .checkllm/snapshots/ directory.

Examples

See the examples/ directory in the repository for working code.

License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

checkllm-0.1.0.tar.gz (40.8 kB)

Built Distribution

checkllm-0.1.0-py3-none-any.whl (34.4 kB)

File details

Details for the file checkllm-0.1.0.tar.gz.

File metadata

  • Download URL: checkllm-0.1.0.tar.gz
  • Size: 40.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | e54b61739dadfc5d76add184d1335bd2997aec2b46b9a1c97cb28436a2df3f5a |
| MD5 | bbd9ae8a669c2099eb8aae5b9f9e24ca |
| BLAKE2b-256 | fa84cd12f84ccf6548710f0f965a6dff4e0faa057a94bd6fc984cf213f8b7bd4 |

Provenance

The following attestation bundles were made for checkllm-0.1.0.tar.gz:

Publisher: publish.yml on javierdejesusda/checkllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file checkllm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: checkllm-0.1.0-py3-none-any.whl
  • Size: 34.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3d8d1fb190986c149eb613c3a6eb72f2a1fc647a3a9f6d868912dc7406bf5d27 |
| MD5 | 8c7c077ff512c2a53ca88f6b2422f093 |
| BLAKE2b-256 | 6c895687aff0a5ba0633bcb8a54de61cacbcda5d20c7f7d9406456905f46774a |

Provenance

The following attestation bundles were made for checkllm-0.1.0-py3-none-any.whl:

Publisher: publish.yml on javierdejesusda/checkllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
