
checkllm

Test LLM-powered applications with the same rigor as traditional software.

checkllm is a pytest plugin and CLI that lets you write assertions for LLM outputs using deterministic checks, LLM-as-judge evaluation, and statistical regression detection.

Why checkllm?

  • Works with pytest - no new test runner to learn, just add a check fixture
  • Free deterministic checks run instantly with zero API calls
  • LLM-as-judge for subjective quality (hallucination, relevance, toxicity, custom rubrics)
  • Statistical regression detection using Welch's t-test, not just "did it change?"
  • Multiple judge backends - OpenAI and Anthropic, or bring your own
  • One command to snapshot, report, or diff your test results

Installation

pip install checkllm

For Anthropic Claude support:

pip install "checkllm[anthropic]"

Quick Start

1. Write a test

# tests/test_my_agent.py

def test_output_quality(check):
    output = my_agent("What is Python?")

    # Deterministic checks (free, instant)
    check.contains(output, "programming language")
    check.not_contains(output, "JavaScript")
    check.max_tokens(output, limit=200)

    # LLM-as-judge checks (requires OPENAI_API_KEY)
    check.hallucination(output, context="Python is a high-level programming language.")
    check.relevance(output, query="What is Python?")
    check.toxicity(output)

2. Run it

export OPENAI_API_KEY=sk-...

pytest tests/test_my_agent.py -v

# Or use the CLI
checkllm run tests/test_my_agent.py

3. Track regressions

checkllm snapshot tests/ --output .checkllm/snapshots/baseline.json

# After changes, compare
checkllm snapshot tests/ --output .checkllm/snapshots/current.json
checkllm diff --baseline .checkllm/snapshots/baseline.json \
              --current .checkllm/snapshots/current.json

Deterministic Checks

Zero-cost, zero-latency checks that run locally:

def test_deterministic(check):
    output = my_agent("...")

    check.contains(output, "expected substring")
    check.not_contains(output, "forbidden text")
    check.exact_match(output, "exact expected output")
    check.exact_match(output, "EXPECTED", ignore_case=True)
    check.starts_with(output, "Python")
    check.ends_with(output, "language.")
    check.regex(output, pattern=r"\d{3}-\d{4}")
    check.max_tokens(output, limit=500)
    check.latency(response_time_ms, max_ms=2000)
    check.cost(api_cost_usd, max_usd=0.05)

    # Validate JSON structure
    from pydantic import BaseModel

    class Response(BaseModel):
        answer: str
        confidence: float

    check.json_schema(output, schema=Response)
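As a rough sketch of what a structural check like json_schema amounts to (an illustration only — checkllm delegates real validation to pydantic), a JSON check boils down to parsing the output and verifying field types:

```python
import json

def passes_json_check(output: str, required: dict) -> bool:
    """Illustrative only: parse model output and verify that each
    required field is present with the expected type. checkllm's real
    pydantic-based validation is richer than this sketch."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # non-JSON output fails the check outright
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```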

LLM-as-Judge Metrics

Use GPT-4o (or Claude) as an automated judge:

def test_llm_quality(check):
    output = my_agent("Summarize this article about climate change.")
    article = "..."

    check.hallucination(output, context=article)
    check.relevance(output, query="Summarize the article")
    check.toxicity(output)
    check.rubric(output, criteria="concise, under 3 sentences, mentions key findings")

Each check records a score (0.0-1.0), pass/fail status, reasoning, cost, and latency.
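The score-to-verdict mapping can be pictured with a small stand-in (the field names mirror the CheckResult shown under Custom Metrics below, but this class is hypothetical, not checkllm's own):

```python
from dataclasses import dataclass

@dataclass
class JudgeRecord:
    """Hypothetical stand-in mirroring the fields checkllm records per check."""
    score: float      # judge score in [0.0, 1.0]
    passed: bool      # whether the score cleared the threshold
    reasoning: str    # the judge's explanation
    cost: float       # API cost in USD
    latency_ms: int   # wall-clock time of the judge call

def record(score: float, reasoning: str, cost: float, latency_ms: int,
           threshold: float = 0.8) -> JudgeRecord:
    # A check passes when the judge's score clears the configured threshold
    # (0.8 matching the documented default_threshold).
    return JudgeRecord(score=score, passed=score >= threshold,
                       reasoning=reasoning, cost=cost, latency_ms=latency_ms)
```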

Custom Thresholds

check.hallucination(output, context=ctx, threshold=0.9)  # stricter
check.relevance(output, query=q, threshold=0.6)          # more lenient

Multiple Runs

check.hallucination(output, context=ctx, runs=5)

Or set globally:

[tool.checkllm]
runs_per_test = 3

Dataset-Driven Testing

# tests/fixtures/cases.yaml
- input: "What is Python?"
  expected: "Python is a programming language"
  query: "Explain Python"
  context: "Python was created by Guido van Rossum in 1991."
  criteria: "accurate, mentions creator"

- input: "What is 2+2?"
  expected: "4"
  criteria: "correct, concise"

Load the cases with the @dataset decorator:

from checkllm import dataset

@dataset("tests/fixtures/cases.yaml")
def test_across_cases(check, case):
    output = my_agent(case.input)
    check.contains(output, case.expected)
    if case.context:
        check.hallucination(output, context=case.context)

Or use a Python generator:

from checkllm import Case, dataset

def my_cases():
    yield Case(input="Hello", expected="greeting", criteria="friendly")
    yield Case(input="Goodbye", expected="farewell", criteria="polite")

@dataset(my_cases)
def test_generated(check, case):
    output = my_agent(case.input)
    check.rubric(output, criteria=case.criteria)

Custom Metrics

import checkllm
from checkllm import CheckResult

@checkllm.metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
    word_count = len(output.split())
    return CheckResult(
        passed=word_count <= max_words,
        score=min(1.0, max_words / max(word_count, 1)),
        reasoning=f"{word_count} words (limit: {max_words})",
        cost=0.0,
        latency_ms=0,
        metric_name="brevity",
    )

def test_brevity(check):
    output = my_agent("Explain quantum physics")
    check.run_metric("brevity", output=output, max_words=100)

Async Tests

import pytest

@pytest.mark.asyncio
async def test_async_quality(check):
    output = await my_async_agent("What is Python?")

    await check.ahallucination(output, context="...")
    await check.arelevance(output, query="What is Python?")
    await check.atoxicity(output)
    await check.arubric(output, criteria="concise and accurate")

    # Deterministic checks are always sync (instant, no I/O)
    check.contains(output, "Python")

Separating Fast and Slow Tests

Mark LLM tests so you can skip them in fast CI runs:

import pytest

@pytest.mark.llm
def test_with_llm(check):
    check.hallucination(output, context=ctx)

def test_fast(check):
    check.contains(output, "Python")

# Run only fast deterministic tests
pytest -m "not llm"

# Run only LLM tests
pytest -m llm

If OPENAI_API_KEY is not set, LLM checks automatically skip instead of crashing.
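pytest warns about unknown markers unless they are registered. If the plugin does not register the llm marker for you (worth verifying in your setup), you can declare it yourself in pyproject.toml:

```toml
[tool.pytest.ini_options]
markers = [
    "llm: tests that call an LLM judge (skip with -m 'not llm')",
]
```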

Regression Detection

checkllm uses Welch's t-test to detect statistically significant score regressions.
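To make the statistics concrete, here is a stdlib-only sketch of the Welch's t statistic underlying this comparison (illustrative; checkllm's actual implementation may differ):

```python
from statistics import mean, variance

def welch_t(baseline: list[float], current: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for
    two independent score samples with possibly unequal variances."""
    m1, m2 = mean(baseline), mean(current)
    v1, v2 = variance(baseline), variance(current)  # sample variances (n-1)
    n1, n2 = len(baseline), len(current)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (m1 - m2) / se2 ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```

The p-value then comes from the t distribution with df degrees of freedom: a genuine score drop yields a large positive t and a p-value below the configured p_value_threshold, while run-to-run noise does not.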

checkllm snapshot tests/ --output .checkllm/snapshots/v1.json
# ... make changes ...
checkllm snapshot tests/ --output .checkllm/snapshots/v2.json
checkllm diff -b .checkllm/snapshots/v1.json -c .checkllm/snapshots/v2.json

# Fail CI on regression
checkllm diff -b v1.json -c v2.json --fail-on-regression

Reporting

# HTML report
checkllm report tests/ --output report.html

# JUnit XML for CI/CD
checkllm run tests/ --junit-xml results.xml

# pytest flags work directly
pytest tests/ --checkllm-snapshot=snap.json --checkllm-report=report.html

CLI Reference

| Command | Description |
| --- | --- |
| checkllm run <path> | Run tests with --snapshot, --html-report, --junit-xml, --compare, --fail-on-regression |
| checkllm snapshot <path> | Save test results as a baseline (--output PATH) |
| checkllm report <path> | Generate an HTML report (--output PATH, --junit-xml PATH) |
| checkllm diff | Compare snapshots (--baseline, --current, --fail-on-regression) |
| checkllm eval | Evaluate a prompt template (--prompt, --dataset, --metric, --threshold) |
| checkllm init [path] | Scaffold a new project |
| checkllm list-metrics | List available metrics |
| checkllm --version | Show version |

Configuration

[tool.checkllm]
judge_backend = "openai"           # "openai" or "anthropic"
judge_model = "gpt-4o"             # Model for LLM-as-judge
default_threshold = 0.8            # Pass/fail threshold (0.0-1.0)
runs_per_test = 1                  # Repeat LLM checks N times
snapshot_dir = ".checkllm/snapshots"
confidence_level = 0.95
p_value_threshold = 0.05

Environment variable overrides: CHECKLLM_JUDGE_BACKEND, CHECKLLM_JUDGE_MODEL, CHECKLLM_DEFAULT_THRESHOLD, CHECKLLM_RUNS_PER_TEST.
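Assuming the usual precedence of environment variables over file configuration (an assumption about checkllm's internals, not documented behavior), resolution looks roughly like:

```python
import os

def resolve_setting(name: str, file_value, default):
    """Illustrative precedence: CHECKLLM_* environment variable first,
    then the pyproject.toml value, then the built-in default."""
    env_value = os.environ.get(f"CHECKLLM_{name.upper()}")
    if env_value is not None:
        return env_value
    return file_value if file_value is not None else default
```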

Custom Judge Backends

Anthropic Claude

[tool.checkllm]
judge_backend = "anthropic"
judge_model = "claude-sonnet-4-6"

Your Own Backend

Implement the JudgeBackend protocol:

from checkllm import JudgeBackend, JudgeResponse
from checkllm.check import CheckCollector
from checkllm.config import CheckllmConfig

class MyJudge:
    async def evaluate(self, prompt: str, system_prompt: str | None = None) -> JudgeResponse:
        return JudgeResponse(score=0.9, reasoning="Looks good", cost=0.0)

config = CheckllmConfig()
collector = CheckCollector(config=config, judge=MyJudge())

Configuring the Judge in conftest.py

To use a cheaper model or a custom backend for all tests:

# tests/conftest.py
import pytest
from checkllm.check import CheckCollector
from checkllm.config import load_config
from checkllm.judge import OpenAIJudge
from checkllm.pytest_plugin import _CHECKLLM_KEY

@pytest.fixture
def check(request):
    config = load_config()
    judge = OpenAIJudge(model="gpt-4o-mini")  # cheaper model for dev
    collector = CheckCollector(config=config, judge=judge)
    request.node.stash[_CHECKLLM_KEY] = collector
    return collector

Project Setup

checkllm init

Creates pyproject.toml, tests/conftest.py, sample test file, sample dataset, and .checkllm/snapshots/ directory.

Examples

See the examples/ directory in the repository for working code.

License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

checkllm-0.1.0.tar.gz (40.8 kB)

Built Distribution

checkllm-0.1.0-py3-none-any.whl (34.4 kB)

File details

Details for the file checkllm-0.1.0.tar.gz.

File metadata

  • Download URL: checkllm-0.1.0.tar.gz
  • Size: 40.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | e54b61739dadfc5d76add184d1335bd2997aec2b46b9a1c97cb28436a2df3f5a |
| MD5 | bbd9ae8a669c2099eb8aae5b9f9e24ca |
| BLAKE2b-256 | fa84cd12f84ccf6548710f0f965a6dff4e0faa057a94bd6fc984cf213f8b7bd4 |

Provenance

The following attestation bundles were made for checkllm-0.1.0.tar.gz:

Publisher: publish.yml on javierdejesusda/checkllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file checkllm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: checkllm-0.1.0-py3-none-any.whl
  • Size: 34.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3d8d1fb190986c149eb613c3a6eb72f2a1fc647a3a9f6d868912dc7406bf5d27 |
| MD5 | 8c7c077ff512c2a53ca88f6b2422f093 |
| BLAKE2b-256 | 6c895687aff0a5ba0633bcb8a54de61cacbcda5d20c7f7d9406456905f46774a |

Provenance

The following attestation bundles were made for checkllm-0.1.0-py3-none-any.whl:

Publisher: publish.yml on javierdejesusda/checkllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
