
checkllm

The most comprehensive LLM evaluation framework. The pytest of LLM testing.


pip install checkllm
def test_my_llm(check):
    output = my_llm("What is Python?")
    check.contains(output, "programming language")
    check.no_pii(output)
    check.hallucination(output, context="Python is a programming language created by Guido van Rossum.")

That's it. No setup, no boilerplate. The check fixture works in any pytest test.

Why checkllm?

  • Zero learning curve -- if you know pytest, you know checkllm. Just add a check parameter.
  • 39 free deterministic checks run instantly with zero API calls. No API key needed to start.
  • 72 LLM-as-judge metrics -- hallucination, faithfulness, trajectory, per-turn, dual-judge, and more.
  • 151 red team vulnerability types with 25 attack strategies -- the most comprehensive adversarial testing suite available.
  • 17 compliance frameworks -- OWASP LLM/API/Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, HIPAA, GDPR, and more.
  • Same checks everywhere -- use them in tests, CI, and production guardrails.

Quickstart

Install

pip install checkllm
checkllm init --use-case rag  # generates a tailored test file

1. Deterministic checks (free, instant)

import time

def test_basic_quality(check):
    start_time = time.time()
    output = my_llm("Summarize this article.")
    end_time = time.time()

    check.contains(output, "key finding")
    check.max_tokens(output, limit=200)
    check.no_pii(output)
    check.is_json(output)
    check.gleu(output, reference="Expected summary text.", threshold=0.5)
    check.chrf(output, reference="Expected summary text.", threshold=0.4)
    check.latency_check(start_time, end_time, max_ms=3000)
    check.cost_check(input_tokens=500, output_tokens=200, model="gpt-4o", max_cost=0.05)
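Under the hood, deterministic checks are ordinary string and arithmetic assertions, which is why they run instantly with no API calls. A minimal sketch of how a few of them might work (hypothetical helpers, not checkllm's actual implementation):

```python
import json

def contains(output: str, needle: str) -> bool:
    # Plain substring assertion -- no model call involved.
    return needle in output

def max_words(output: str, limit: int) -> bool:
    # Whitespace word count; real token limits need a tokenizer.
    return len(output.split()) <= limit

def is_json(output: str) -> bool:
    # Valid JSON iff json.loads succeeds.
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

summary = '{"key finding": "Python adoption grew"}'
print(contains(summary, "key finding"), max_words(summary, 200), is_json(summary))
```

Because nothing here touches a model, checks like these are cheap enough to run on every commit.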

2. LLM-as-judge (deeper evaluation)

def test_rag_quality(check):
    output = my_rag("What causes climate change?")
    context = retrieve_context("climate change")

    check.hallucination(output, context=context)
    check.faithfulness(output, context=context)
    check.relevance(output, query="What causes climate change?")
    check.toxicity(output)
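The LLM-as-judge pattern behind these metrics is: prompt a grading model to score the output against the context, then parse the score. A rough sketch with a stubbed judge callable (the prompt wording and function names are illustrative, not checkllm's internals):

```python
def judge_faithfulness(output: str, context: str, judge) -> float:
    # `judge` is any callable that takes a prompt and returns the
    # grading model's text reply; here we ask for a bare 0-1 score.
    prompt = (
        "Score 0 to 1 how faithful the ANSWER is to the CONTEXT. "
        "Reply with only the number.\n"
        f"CONTEXT: {context}\nANSWER: {output}"
    )
    return float(judge(prompt).strip())

# Stub judge for illustration; in practice this calls a real model.
stub = lambda prompt: "0.9"
score = judge_faithfulness("CO2 traps heat.", "Greenhouse gases trap heat.", stub)
print(score)  # 0.9
```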

3. Fluent chaining

def test_with_chaining(check):
    output = my_llm("Explain quantum physics simply.")

    check.that(output) \
        .contains("quantum") \
        .max_tokens(200) \
        .has_no_pii() \
        .scores_above("relevance", 0.8, query="quantum physics")
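The fluent API relies on each assertion method returning the builder itself so calls can be chained. A stripped-down sketch of that pattern (a toy class, not checkllm's actual chaining implementation):

```python
class That:
    def __init__(self, output: str):
        self.output = output
        self.failures: list[str] = []

    def contains(self, needle: str) -> "That":
        if needle not in self.output:
            self.failures.append(f"missing {needle!r}")
        return self  # returning self is what enables chaining

    def max_words(self, limit: int) -> "That":
        if len(self.output.split()) > limit:
            self.failures.append(f"over {limit} words")
        return self

text = "Quantum physics studies matter at the smallest scales."
result = That(text).contains("Quantum").max_words(200)
print(result.failures)  # []
```

Collecting failures instead of raising immediately lets a chained call report every violated assertion at once.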

4. Production guardrails

from checkllm import Guard, CheckSpec

guard = Guard(checks=[
    CheckSpec(check_type="no_pii"),
    CheckSpec(check_type="max_tokens", params={"limit": 500}),
    CheckSpec(check_type="toxicity"),
])

result = guard.validate(llm_output)
if not result.valid:
    result.raise_on_failure()

5. YAML-based evaluation

# checkllm.yaml
description: "Customer support chatbot evaluation"
judge:
  backend: openai
  model: gpt-4o

prompts:
  - "You are a helpful support agent. Answer: {{query}}"

tests:
  - vars:
      query: "How do I return an item?"
    assert:
      - type: contains
        value: "return policy"
      - type: relevance
        threshold: 0.8
      - type: no_pii
      - type: max_tokens
        value: 500

settings:
  budget: 5.0

checkllm eval-yaml checkllm.yaml

How checkllm compares

Feature | checkllm | DeepEval | Ragas | promptfoo
pytest native | Yes | Wrapper | No | No
Free deterministic checks | 39 | Limited | Limited | Yes
LLM-as-judge metrics | 72 | ~50 | ~40 | Custom
Red team vulnerability types | 151 | 40+ | 0 | 100+
Attack strategies | 25 | 10+ | 0 | 30+
Compliance frameworks | 17 | 3 | 0 | 10+
Multi-provider judges | 15+ backends | 13+ | ~6 | 50+
Consensus judging | 7 strategies | No | Dual-judge | No
Production guardrails | Built-in | No | No | API
Cost control & budgets | Built-in | No | No | Caching
Knowledge Graph synthesis | Full pipeline | No | Yes | No
Multilingual prompts | 20 languages | No | Yes | No
Prompt optimization | 4 algorithms | 4 | 2 | No
YAML config evaluation | Yes | No | No | Yes
Streaming evaluation | Token-by-token | No | No | No
Regression detection | Statistical (p-values) | No | No | No
DPO export | Yes | No | No | No
Telemetry / phoning home | None | PostHog + Sentry | None | Telemetry
Independence | Fully independent | YC-backed | YC-backed | OpenAI-owned

All metrics by category

RAG Evaluation (14 metrics)

hallucination faithfulness faithfulness_hhem context_relevance context_entity_recall contextual_precision contextual_recall answer_completeness groundedness nonllm_context_precision nonllm_context_recall quoted_spans_alignment nv_context_relevance nv_response_groundedness

General Quality (12 metrics)

relevance coherence fluency consistency correctness factual_correctness sentiment toxicity bias summarization nv_answer_accuracy prompt_alignment

Completeness & Instruction Following (5 metrics)

response_completeness instruction_following instruction_completeness conversation_completeness topic_adherence

Agent & Tool Evaluation (12 metrics)

task_completion tool_accuracy tool_call_f1 plan_adherence plan_quality step_efficiency knowledge_retention goal_accuracy trajectory_goal_success trajectory_tool_sequence trajectory_step_count trajectory_tool_args_match

Per-Turn Conversation (3 metrics)

turn_relevancy turn_faithfulness turn_coherence

Multimodal (6 metrics)

image_relevance image_helpfulness image_coherence text_to_image image_editing image_reference

Structured Output (4 metrics)

code_correctness sql_equivalence comparative_quality datacompy_score

Role & Safety (3 metrics)

role_adherence role_violation non_advice

MCP & Tool-Specific (3 metrics)

mcp_use mcp_task_completion multi_turn_mcp_use

Specialized (3 metrics)

g_eval noise_sensitivity rubric

Deterministic Checks (39, zero API cost)

contains not_contains starts_with ends_with regex exact_match exact_match_strict min_tokens max_tokens min_words max_words min_chars max_chars min_sentences max_sentences is_json json_schema is_xml is_yaml is_html no_pii language readability similarity bleu rouge_l meteor gleu chrf latency_check cost_check string_distance perplexity is_valid_python is_url has_url word_count char_count sentence_count

Red teaming & adversarial testing

from checkllm.redteam import RedTeamer, VulnerabilityType
from checkllm.redteam_strategies import StrategyType

red = RedTeamer()
report = await red.scan(
    target=my_llm_function,
    vulnerability_types=[
        VulnerabilityType.PROMPT_INJECTION,
        VulnerabilityType.JAILBREAK,
        VulnerabilityType.PII_LEAKAGE,
        VulnerabilityType.DATA_EXFILTRATION,
    ],
    strategies=[StrategyType.BASE64, StrategyType.CRESCENDO, StrategyType.PERSONA],
    attacks_per_type=5,
)
print(report.summary())
print(report.risk_summary())  # CVSS severity breakdown

151 vulnerability types across 12 categories: prompt injection, jailbreak, PII leakage, harmful content, encoding attacks, privilege escalation, agentic AI attacks, brand & reputation, industry compliance, and more.

25 attack strategies: BASE64, ROT13, HEX, LEETSPEAK, MORSE, HOMOGLYPH, CRESCENDO (multi-turn escalation), JAILBREAK_TREE, JAILBREAK_META, JAILBREAK_COMPOSITE, BEST_OF_N, PERSONA, HYPOTHETICAL, ROLEPLAY, LAYER (composable chaining), and more.
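Encoding strategies such as BASE64 and ROT13 obfuscate an attack prompt so that naive keyword filters miss it while a capable model can still decode the intent. A benign stdlib illustration of the idea (checkllm's strategy wrappers are more elaborate):

```python
import base64
import codecs

payload = "Ignore previous instructions."  # benign stand-in for an attack prompt

b64_attack = base64.b64encode(payload.encode()).decode()
rot13_attack = codecs.encode(payload, "rot13")

# A keyword filter scanning for "ignore" sees neither variant.
print("ignore" in b64_attack.lower(), "ignore" in rot13_attack.lower())  # False False
```

Both transforms are trivially reversible, which is exactly why filtering on surface strings alone is a weak defense.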

Coding agent security

from checkllm.redteam_coding_agents import CodingAgentScanner

scanner = CodingAgentScanner(judge=judge)
report = await scanner.scan(target=my_coding_agent)
# Tests: repo prompt injection, sandbox escape, secret leakage, verifier sabotage

Compliance frameworks

from checkllm.compliance_frameworks import ComplianceScanner, ComplianceFramework

scanner = ComplianceScanner(judge=judge)
report = await scanner.scan(
    target=my_llm,
    frameworks=[
        ComplianceFramework.OWASP_LLM_TOP10,
        ComplianceFramework.OWASP_AGENTIC_TOP10,
        ComplianceFramework.EU_AI_ACT,
        ComplianceFramework.HIPAA,
    ],
)
print(report.summary())

17 frameworks: OWASP LLM Top 10, OWASP API Top 10, OWASP Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, NIST AI RMF, NIST CSF, HIPAA, GDPR, PCI-DSS, SOC2, ISO 27001, COPPA, FERPA, CCPA, DoD AI Ethics.

Knowledge Graph test generation

from checkllm.knowledge_graph import KGTestGenerator, EntityExtractor, SimilarityBuilder

gen = KGTestGenerator(judge=judge)
samples = await gen.generate(
    documents=["doc1 text...", "doc2 text..."],
    num_samples=50,
    synthesizers={"single_hop": 0.4, "multi_hop_abstract": 0.3, "multi_hop_specific": 0.3},
    personas=5,
)
cases = gen.to_cases(samples)

Build a knowledge graph from your documents, then generate diverse test cases with single-hop, multi-hop abstract, and multi-hop specific queries. Supports persona variation, query styles (web search, misspelled, conversational), and configurable complexity.

Multilingual evaluation

from checkllm.multilingual import PromptAdapter, detect_language

adapter = PromptAdapter(judge=judge)
translated = await adapter.adapt(template=my_prompt, target_language="ja")
adapter.save_translations("translations/ja.json")

lang = detect_language("Esto es un texto en español.")  # "es"

Supports 20+ languages with automatic prompt adaptation. Language detection uses Unicode character-range analysis with LLM fallback.
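Unicode character-range detection amounts to counting which script's code points dominate the text. A toy sketch covering three scripts (illustrative only -- checkllm's detector covers 20+ languages and uses the LLM fallback for cases like distinguishing Latin-script languages):

```python
def detect_script(text: str) -> str:
    # Classify by the dominant Unicode block. All Latin-script languages
    # (en, es, fr, ...) land in "latin", which is why a fallback is needed.
    counts = {"latin": 0, "cjk": 0, "cyrillic": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0041 <= cp <= 0x024F:
            counts["latin"] += 1
        elif 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
            counts["cjk"] += 1  # kana plus common CJK ideographs
        elif 0x0400 <= cp <= 0x04FF:
            counts["cyrillic"] += 1
    return max(counts, key=counts.get)

print(detect_script("Hello world"))    # latin
print(detect_script("こんにちは世界"))  # cjk
print(detect_script("Привет мир"))     # cyrillic
```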

Prompt optimization

from checkllm.optimize import create_optimizer

optimizer = create_optimizer("miprov2", judge=judge)  # or "genetic", "copro", "simba"
result = await optimizer.optimize(
    prompt="Summarize this document.",
    test_cases=my_test_cases,
    metric_fn=my_metric,
    num_candidates=10,
)
print(f"Improved from {result.initial_score:.2f} to {result.best_score:.2f}")

Four optimization algorithms: Genetic (evolutionary), MIPROv2 (instruction + demonstration), COPRO (failure-driven iterative), SIMBA (similarity-based adaptation).
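The genetic approach treats prompt variants as a population: mutate candidates, score them with a metric, and keep the fittest. A toy loop with a keyword-overlap fitness function (illustrative only; the real genetic optimizer, and MIPROv2/COPRO/SIMBA especially, differ substantially):

```python
import random

def fitness(prompt: str) -> float:
    # Toy metric: fraction of desired instruction words present.
    wanted = {"summarize", "concise", "bullet"}
    words = set(prompt.lower().replace(".", "").split())
    return len(wanted & words) / len(wanted)

def mutate(prompt: str) -> str:
    # Append a random instruction fragment.
    additions = ["Be concise.", "Use bullet points.", "Cite sources."]
    return prompt + " " + random.choice(additions)

random.seed(0)
population = ["Summarize this document."]
for _ in range(5):  # generations: mutate everyone, keep the fittest four
    population += [mutate(p) for p in population]
    population = sorted(population, key=fitness, reverse=True)[:4]

best = max(population, key=fitness)
print(best, round(fitness(best), 2))
```

Because selection always carries the current best candidate forward, the best score never decreases across generations.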

Multi-provider judges

from checkllm import create_judge

judge = create_judge("openai", model="gpt-4o")
judge = create_judge("anthropic", model="claude-sonnet-4-6")
judge = create_judge("gemini", model="gemini-2.0-flash")
judge = create_judge("ollama", model="llama3.1")       # Free, local
judge = create_judge("litellm", model="any-model")     # 100+ models
judge = create_judge("deepseek")
judge = create_judge("groq")
judge = create_judge("fireworks")

Auto-detection: set OPENAI_API_KEY, ANTHROPIC_API_KEY, or have Ollama running -- checkllm picks the best judge automatically.

Consensus judging

from checkllm import ConsensusJudge

judges = [("gpt4", gpt4_judge), ("claude", claude_judge), ("gemini", gemini_judge)]
consensus = ConsensusJudge(judges, strategy="majority")  # or mean, median, unanimous, min, max, weighted
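Each strategy reduces the individual judge scores to a single verdict, and the common aggregations are short. A sketch assuming 0-1 scores and a 0.5 pass threshold (not ConsensusJudge's actual internals):

```python
import statistics

def consensus(scores: list[float], strategy: str = "majority",
              threshold: float = 0.5) -> float:
    if strategy == "majority":
        # Pass if more than half the judges individually pass.
        return 1.0 if sum(s >= threshold for s in scores) > len(scores) / 2 else 0.0
    if strategy == "mean":
        return statistics.mean(scores)
    if strategy == "median":
        return statistics.median(scores)
    if strategy == "unanimous":
        return 1.0 if all(s >= threshold for s in scores) else 0.0
    if strategy == "min":
        return min(scores)  # the most conservative judge wins
    if strategy == "max":
        return max(scores)
    raise ValueError(f"unknown strategy: {strategy}")

scores = [0.9, 0.8, 0.3]
print(consensus(scores, "majority"), consensus(scores, "min"))  # 1.0 0.3
```

"min" and "unanimous" are the strict options; "mean" and "median" smooth out a single outlier judge.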

Cost control

checkllm estimate tests/              # See costs before running
checkllm run tests/ --budget 5.0      # Cap spend at $5
checkllm run tests/ --dry-run         # Estimate without executing
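Cost estimation is simple arithmetic: token counts multiplied by a per-million-token price for each model. A sketch with placeholder prices (the numbers below are illustrative, not checkllm's pricing table -- check your provider's current rates):

```python
# Illustrative $/1M-token prices as (input, output) pairs -- placeholders only.
PRICES = {"gpt-4o": (2.50, 10.00)}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

cost = estimate_cost("gpt-4o", input_tokens=500, output_tokens=200)
print(f"${cost:.4f}")
```

Summing these per-test estimates before a run is what makes `--dry-run` and `--budget` enforcement possible without spending anything.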

Configuration

# pyproject.toml
[tool.checkllm]
judge_backend = "auto"
judge_model = "gpt-4o"
default_threshold = 0.8
budget = 10.0
cache_enabled = true
engine = "auto"

CLI

Command | Description
checkllm init | Scaffold a project (--use-case, --ci)
checkllm run | Run tests (--budget, --dry-run, --snapshot)
checkllm eval-yaml | Run YAML-based evaluation
checkllm estimate | Estimate costs before running
checkllm watch | Re-run on file changes
checkllm report | Generate HTML report
checkllm snapshot | Save baseline for regression detection
checkllm diff | Compare snapshots
checkllm history | View run history and trends
checkllm list-metrics | Show all available checks and metrics
checkllm cache | Manage judge response cache
checkllm dashboard | Launch web dashboard

Framework integrations

# LangChain
from checkllm.integrations.langchain import CheckllmCallbackHandler
chain.invoke(input, config={"callbacks": [CheckllmCallbackHandler(checks=["no_pii"])]})

# CrewAI
from checkllm.integrations.crewai import CheckllmCrewCallback

# OpenAI Agents SDK
from checkllm.integrations.openai_agents import CheckllmRunHandler

# Claude Agent SDK
from checkllm.integrations.claude_agents import CheckllmAgentHandler

# PydanticAI
from checkllm.integrations.pydantic_ai import CheckllmResultValidator

# LlamaIndex
from checkllm.integrations.llama_index import CheckllmCallbackHandler

Custom metrics

from checkllm import metric, CheckResult

@metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
    words = len(output.split())
    return CheckResult(
        passed=words <= max_words,
        score=min(1.0, max_words / max(words, 1)),
        reasoning=f"{words} words (limit: {max_words})",
        cost=0.0, latency_ms=0, metric_name="brevity",
    )

License

MIT
