The most comprehensive LLM testing and evaluation framework for Python.

These details have not been verified by PyPI

Project description

checkllm

The most comprehensive LLM evaluation framework. The pytest of LLM testing.

pip install checkllm

def test_my_llm(check):
    output = my_llm("What is Python?")
    check.contains(output, "programming language")
    check.no_pii(output)
    check.hallucination(output, context="Python is a programming language created by Guido van Rossum.")

That's it. No setup, no boilerplate. The check fixture works in any pytest test.

Why checkllm?

Zero learning curve -- if you know pytest, you know checkllm. Just add a check parameter.
39 free deterministic checks run instantly with zero API calls. No API key needed to start.
72 LLM-as-judge metrics -- hallucination, faithfulness, trajectory, per-turn, dual-judge, and more.
151 red team vulnerability types with 25 attack strategies -- the most comprehensive adversarial testing suite available.
17 compliance frameworks -- OWASP LLM/API/Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, HIPAA, GDPR, and more.
Same checks everywhere -- use them in tests, CI, and production guardrails.

Quickstart

Install

pip install checkllm
checkllm init --use-case rag  # generates a tailored test file

1. Deterministic checks (free, instant)

def test_basic_quality(check):
    output = my_llm("Summarize this article.")

    check.contains(output, "key finding")
    check.max_tokens(output, limit=200)
    check.no_pii(output)
    check.is_json(output)
    check.gleu(output, reference="Expected summary text.", threshold=0.5)
    check.chrf(output, reference="Expected summary text.", threshold=0.4)
    check.latency_check(start_time, end_time, max_ms=3000)
    check.cost_check(input_tokens=500, output_tokens=200, model="gpt-4o", max_cost=0.05)

2. LLM-as-judge (deeper evaluation)

def test_rag_quality(check):
    output = my_rag("What causes climate change?")
    context = retrieve_context("climate change")

    check.hallucination(output, context=context)
    check.faithfulness(output, context=context)
    check.relevance(output, query="What causes climate change?")
    check.toxicity(output)

3. Fluent chaining

def test_with_chaining(check):
    output = my_llm("Explain quantum physics simply.")

    check.that(output) \
        .contains("quantum") \
        .max_tokens(200) \
        .has_no_pii() \
        .scores_above("relevance", 0.8, query="quantum physics")

4. Production guardrails

from checkllm import Guard, CheckSpec

guard = Guard(checks=[
    CheckSpec(check_type="no_pii"),
    CheckSpec(check_type="max_tokens", params={"limit": 500}),
    CheckSpec(check_type="toxicity"),
])

result = guard.validate(llm_output)
if not result.valid:
    result.raise_on_failure()

5. YAML-based evaluation

# checkllm.yaml
description: "Customer support chatbot evaluation"
judge:
  backend: openai
  model: gpt-4o

prompts:
  - "You are a helpful support agent. Answer: {{query}}"

tests:
  - vars:
      query: "How do I return an item?"
    assert:
      - type: contains
        value: "return policy"
      - type: relevance
        threshold: 0.8
      - type: no_pii
      - type: max_tokens
        value: 500

settings:
  budget: 5.0

checkllm eval-yaml checkllm.yaml

How checkllm compares

Independent benchmark, not just feature counts. On the public competitor leaderboard (docs/benchmarks/competitor-comparison.md) checkllm holds rank 1 on every published row against DeepEval and promptfoo: halubench/hallucination 0.783, ragtruth/hallucination 0.663, ragtruth/faithfulness 0.754, ragtruth/context_relevance 0.565, and truthfulqa/answer_relevancy 0.546 (ROC-AUC, gpt-4o-mini judge, 200 source rows per slice). Methodology is in docs/benchmarks/methodology.md; raw scores ship in benchmarks/competitor_comparison/.

Feature comparison

Feature	checkllm	DeepEval	Ragas	promptfoo
pytest native	Yes	Wrapper	No	No
Free deterministic checks	39	Limited	Limited	Yes
LLM-as-judge metrics	72	~50	~40	Custom
Red team vulnerability types	151	40+	0	100+
Attack strategies	25	10+	0	30+
Compliance frameworks	17	3	0	10+
Multi-provider judges	15+ backends	13+	~6	50+
Consensus judging	7 strategies	No	Dual-judge	No
Production guardrails	Built-in	No	No	API
Cost control & budgets	Built-in	No	No	Caching
Knowledge Graph synthesis	Full pipeline	No	Yes	No
Multilingual prompts	20 languages	No	Yes	No
Prompt optimization	4 algorithms	4	2	No
YAML config evaluation	Yes	No	No	Yes
Streaming evaluation	Token-by-token	No	No	No
Regression detection	Statistical (p-values)	No	No	No
DPO export	Yes	No	No	No
Telemetry / phoning home	None	PostHog + Sentry	None	Telemetry
Independence	Fully independent	YC-backed	YC-backed	OpenAI-owned

All metrics by category

RAG Evaluation (14 metrics)

hallucination faithfulness faithfulness_hhem context_relevance context_entity_recall contextual_precision contextual_recall answer_completeness groundedness nonllm_context_precision nonllm_context_recall quoted_spans_alignment nv_context_relevance nv_response_groundedness

General Quality (12 metrics)

relevance coherence fluency consistency correctness factual_correctness sentiment toxicity bias summarization nv_answer_accuracy prompt_alignment

Completeness & Instruction Following (5 metrics)

response_completeness instruction_following instruction_completeness conversation_completeness topic_adherence

Agent & Tool Evaluation (12 metrics)

task_completion tool_accuracy tool_call_f1 plan_adherence plan_quality step_efficiency knowledge_retention goal_accuracy trajectory_goal_success trajectory_tool_sequence trajectory_step_count trajectory_tool_args_match

Per-Turn Conversation (3 metrics)

turn_relevancy turn_faithfulness turn_coherence

Multimodal (6 metrics)

image_relevance image_helpfulness image_coherence text_to_image image_editing image_reference

Structured Output (4 metrics)

code_correctness sql_equivalence comparative_quality datacompy_score

Role & Safety (3 metrics)

role_adherence role_violation non_advice

MCP & Tool-Specific (3 metrics)

mcp_use mcp_task_completion multi_turn_mcp_use

Specialized (3 metrics)

g_eval noise_sensitivity rubric

Deterministic Checks (39, zero API cost)

contains not_contains starts_with ends_with regex exact_match exact_match_strict min_tokens max_tokens min_words max_words min_chars max_chars min_sentences max_sentences is_json json_schema is_xml is_yaml is_html no_pii language readability similarity bleu rouge_l meteor gleu chrf latency_check cost_check string_distance perplexity is_valid_python is_url has_url word_count char_count sentence_count

Red teaming & adversarial testing

from checkllm.redteam import RedTeamer, VulnerabilityType
from checkllm.redteam_strategies import StrategyType

red = RedTeamer()
report = await red.scan(
    target=my_llm_function,
    vulnerability_types=[
        VulnerabilityType.PROMPT_INJECTION,
        VulnerabilityType.JAILBREAK,
        VulnerabilityType.PII_LEAKAGE,
        VulnerabilityType.DATA_EXFILTRATION,
    ],
    strategies=[StrategyType.BASE64, StrategyType.CRESCENDO, StrategyType.PERSONA],
    attacks_per_type=5,
)
print(report.summary())
print(report.risk_summary())  # CVSS severity breakdown

151 vulnerability types across 12 categories: prompt injection, jailbreak, PII leakage, harmful content, encoding attacks, privilege escalation, agentic AI attacks, brand & reputation, industry compliance, and more.

25 attack strategies: BASE64, ROT13, HEX, LEETSPEAK, MORSE, HOMOGLYPH, CRESCENDO (multi-turn escalation), JAILBREAK_TREE, JAILBREAK_META, JAILBREAK_COMPOSITE, BEST_OF_N, PERSONA, HYPOTHETICAL, ROLEPLAY, LAYER (composable chaining), and more.

Coding agent security

from checkllm.redteam_coding_agents import CodingAgentScanner

scanner = CodingAgentScanner(judge=judge)
report = await scanner.scan(target=my_coding_agent)
# Tests: repo prompt injection, sandbox escape, secret leakage, verifier sabotage

Compliance frameworks

from checkllm.compliance_frameworks import ComplianceScanner, ComplianceFramework

scanner = ComplianceScanner(judge=judge)
report = await scanner.scan(
    target=my_llm,
    frameworks=[
        ComplianceFramework.OWASP_LLM_TOP10,
        ComplianceFramework.OWASP_AGENTIC_TOP10,
        ComplianceFramework.EU_AI_ACT,
        ComplianceFramework.HIPAA,
    ],
)
print(report.summary())

17 frameworks: OWASP LLM Top 10, OWASP API Top 10, OWASP Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, NIST AI RMF, NIST CSF, HIPAA, GDPR, PCI-DSS, SOC2, ISO 27001, COPPA, FERPA, CCPA, DoD AI Ethics.

Knowledge Graph test generation

from checkllm.knowledge_graph import KGTestGenerator, EntityExtractor, SimilarityBuilder

gen = KGTestGenerator(judge=judge)
samples = await gen.generate(
    documents=["doc1 text...", "doc2 text..."],
    num_samples=50,
    synthesizers={"single_hop": 0.4, "multi_hop_abstract": 0.3, "multi_hop_specific": 0.3},
    personas=5,
)
cases = gen.to_cases(samples)

Build a knowledge graph from your documents, then generate diverse test cases with single-hop, multi-hop abstract, and multi-hop specific queries. Supports persona variation, query styles (web search, misspelled, conversational), and configurable complexity.

Multilingual evaluation

from checkllm.multilingual import PromptAdapter, detect_language

adapter = PromptAdapter(judge=judge)
translated = await adapter.adapt(template=my_prompt, target_language="ja")
adapter.save_translations("translations/ja.json")

lang = detect_language("Esto es un texto en espanol.")  # "es"

Supports 20+ languages with automatic prompt adaptation. Language detection uses Unicode character-range analysis with LLM fallback.

Prompt optimization

from checkllm.optimize import create_optimizer

optimizer = create_optimizer("miprov2", judge=judge)  # or "genetic", "copro", "simba"
result = await optimizer.optimize(
    prompt="Summarize this document.",
    test_cases=my_test_cases,
    metric_fn=my_metric,
    num_candidates=10,
)
print(f"Improved from {result.initial_score:.2f} to {result.best_score:.2f}")

Four optimization algorithms: Genetic (evolutionary), MIPROv2 (instruction + demonstration), COPRO (failure-driven iterative), SIMBA (similarity-based adaptation).

Multi-provider judges

from checkllm import create_judge

judge = create_judge("openai", model="gpt-4o")
judge = create_judge("anthropic", model="claude-sonnet-4-6")
judge = create_judge("gemini", model="gemini-2.0-flash")
judge = create_judge("ollama", model="llama3.1")       # Free, local
judge = create_judge("litellm", model="any-model")     # 100+ models
judge = create_judge("deepseek")
judge = create_judge("groq")
judge = create_judge("fireworks")

Auto-detection: set OPENAI_API_KEY, ANTHROPIC_API_KEY, or have Ollama running -- checkllm picks the best judge automatically.

Consensus judging

from checkllm import ConsensusJudge

judges = [("gpt4", gpt4_judge), ("claude", claude_judge), ("gemini", gemini_judge)]
consensus = ConsensusJudge(judges, strategy="majority")  # or mean, median, unanimous, min, max, weighted

Cost control

checkllm estimate tests/              # See costs before running
checkllm run tests/ --budget 5.0      # Cap spend at $5
checkllm run tests/ --dry-run         # Estimate without executing

Configuration

# pyproject.toml
[tool.checkllm]
judge_backend = "auto"
judge_model = "gpt-4o"
default_threshold = 0.8
budget = 10.0
cache_enabled = true
engine = "auto"

CLI

Command	Description
`checkllm init`	Scaffold a project (`--use-case`, `--ci`)
`checkllm run`	Run tests (`--budget`, `--dry-run`, `--snapshot`)
`checkllm eval-yaml`	Run YAML-based evaluation
`checkllm estimate`	Estimate costs before running
`checkllm watch`	Re-run on file changes
`checkllm report`	Generate HTML report
`checkllm snapshot`	Save baseline for regression detection
`checkllm diff`	Compare snapshots
`checkllm history`	View run history and trends
`checkllm list-metrics`	Show all available checks and metrics
`checkllm cache`	Manage judge response cache
`checkllm dashboard`	Launch web dashboard

Framework integrations

# LangChain
from checkllm.integrations.langchain import CheckllmCallbackHandler
chain.invoke(input, config={"callbacks": [CheckllmCallbackHandler(checks=["no_pii"])]})

# CrewAI
from checkllm.integrations.crewai import CheckllmCrewCallback

# OpenAI Agents SDK
from checkllm.integrations.openai_agents import CheckllmRunHandler

# Claude Agent SDK
from checkllm.integrations.claude_agents import CheckllmAgentHandler

# PydanticAI
from checkllm.integrations.pydantic_ai import CheckllmResultValidator

# LlamaIndex
from checkllm.integrations.llama_index import CheckllmCallbackHandler

Custom metrics

from checkllm import metric, CheckResult

@metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
    words = len(output.split())
    return CheckResult(
        passed=words <= max_words,
        score=min(1.0, max_words / max(words, 1)),
        reasoning=f"{words} words (limit: {max_words})",
        cost=0.0, latency_ms=0, metric_name="brevity",
    )

License

MIT

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

5.1.0

Apr 23, 2026

5.0.1

Apr 18, 2026

5.0.0

Apr 10, 2026

3.2.0

Apr 6, 2026

0.1.0

Mar 28, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

checkllm-5.1.0.tar.gz (1.0 MB view details)

Uploaded Apr 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

checkllm-5.1.0-py3-none-any.whl (735.8 kB view details)

Uploaded Apr 23, 2026 Python 3

File details

Details for the file checkllm-5.1.0.tar.gz.

File metadata

Download URL: checkllm-5.1.0.tar.gz
Upload date: Apr 23, 2026
Size: 1.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for checkllm-5.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1aa1caefa5f81dba7b9f1c6eb011f06ce73d56e74dc1bba3f865dbc164cdc9c9`
MD5	`8a6b78186b26ed25c1db8de14dc5e77f`
BLAKE2b-256	`e61eea3e8d505dbe1f7d528d1fbc37440ca0f69bcf2f6ee970c99bf15b262eb6`

See more details on using hashes here.

Provenance

The following attestation bundles were made for checkllm-5.1.0.tar.gz:

Publisher: release.yml on javierdejesusda/checkllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: checkllm-5.1.0.tar.gz
- Subject digest: 1aa1caefa5f81dba7b9f1c6eb011f06ce73d56e74dc1bba3f865dbc164cdc9c9
- Sigstore transparency entry: 1364820612
- Sigstore integration time: Apr 23, 2026
Source repository:
- Permalink: javierdejesusda/checkllm@839ce34c3b1bc38b5226e95b7e900ed057872832
- Branch / Tag: refs/tags/v5.1.0
- Owner: https://github.com/javierdejesusda
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@839ce34c3b1bc38b5226e95b7e900ed057872832
- Trigger Event: push

File details

Details for the file checkllm-5.1.0-py3-none-any.whl.

File metadata

Download URL: checkllm-5.1.0-py3-none-any.whl
Upload date: Apr 23, 2026
Size: 735.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for checkllm-5.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da75897dbbed692d781ae4571774154f2e0840e9c631fafcec0614ad2175ff92`
MD5	`a4335016bcc1373083d0775253e3e72d`
BLAKE2b-256	`fabf9e7a9da7feec381c427e6e45b875824bd4ba7bdfed6b704917903f48c30a`

See more details on using hashes here.

Provenance

The following attestation bundles were made for checkllm-5.1.0-py3-none-any.whl:

Publisher: release.yml on javierdejesusda/checkllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: checkllm-5.1.0-py3-none-any.whl
- Subject digest: da75897dbbed692d781ae4571774154f2e0840e9c631fafcec0614ad2175ff92
- Sigstore transparency entry: 1364820617
- Sigstore integration time: Apr 23, 2026
Source repository:
- Permalink: javierdejesusda/checkllm@839ce34c3b1bc38b5226e95b7e900ed057872832
- Branch / Tag: refs/tags/v5.1.0
- Owner: https://github.com/javierdejesusda
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@839ce34c3b1bc38b5226e95b7e900ed057872832
- Trigger Event: push

checkllm 5.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

checkllm

Why checkllm?

Quickstart

Install

1. Deterministic checks (free, instant)

2. LLM-as-judge (deeper evaluation)

3. Fluent chaining

4. Production guardrails

5. YAML-based evaluation

How checkllm compares

Feature comparison

All metrics by category

RAG Evaluation (14 metrics)

General Quality (12 metrics)

Completeness & Instruction Following (5 metrics)

Agent & Tool Evaluation (12 metrics)

Per-Turn Conversation (3 metrics)

Multimodal (6 metrics)

Structured Output (4 metrics)

Role & Safety (3 metrics)

MCP & Tool-Specific (3 metrics)

Specialized (3 metrics)

Deterministic Checks (39, zero API cost)

Red teaming & adversarial testing

Coding agent security

Compliance frameworks

Knowledge Graph test generation

Multilingual evaluation

Prompt optimization

Multi-provider judges

Consensus judging

Cost control

Configuration

CLI

Framework integrations

Custom metrics

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance