The most comprehensive LLM testing and evaluation framework for Python.
Project description
checkllm
The most comprehensive LLM evaluation framework. The pytest of LLM testing.
pip install checkllm
def test_my_llm(check):
output = my_llm("What is Python?")
check.contains(output, "programming language")
check.no_pii(output)
check.hallucination(output, context="Python is a programming language created by Guido van Rossum.")
That's it. No setup, no boilerplate. The check fixture works in any pytest test.
Why checkllm?
- Zero learning curve -- if you know pytest, you know checkllm. Just add a
checkparameter. - 39 free deterministic checks run instantly with zero API calls. No API key needed to start.
- 72 LLM-as-judge metrics -- hallucination, faithfulness, trajectory, per-turn, dual-judge, and more.
- 151 red team vulnerability types with 25 attack strategies -- the most comprehensive adversarial testing suite available.
- 17 compliance frameworks -- OWASP LLM/API/Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, HIPAA, GDPR, and more.
- Same checks everywhere -- use them in tests, CI, and production guardrails.
Quickstart
Install
pip install checkllm
checkllm init --use-case rag # generates a tailored test file
1. Deterministic checks (free, instant)
def test_basic_quality(check):
output = my_llm("Summarize this article.")
check.contains(output, "key finding")
check.max_tokens(output, limit=200)
check.no_pii(output)
check.is_json(output)
check.gleu(output, reference="Expected summary text.", threshold=0.5)
check.chrf(output, reference="Expected summary text.", threshold=0.4)
check.latency_check(start_time, end_time, max_ms=3000)
check.cost_check(input_tokens=500, output_tokens=200, model="gpt-4o", max_cost=0.05)
2. LLM-as-judge (deeper evaluation)
def test_rag_quality(check):
output = my_rag("What causes climate change?")
context = retrieve_context("climate change")
check.hallucination(output, context=context)
check.faithfulness(output, context=context)
check.relevance(output, query="What causes climate change?")
check.toxicity(output)
3. Fluent chaining
def test_with_chaining(check):
output = my_llm("Explain quantum physics simply.")
check.that(output) \
.contains("quantum") \
.max_tokens(200) \
.has_no_pii() \
.scores_above("relevance", 0.8, query="quantum physics")
4. Production guardrails
from checkllm import Guard, CheckSpec
guard = Guard(checks=[
CheckSpec(check_type="no_pii"),
CheckSpec(check_type="max_tokens", params={"limit": 500}),
CheckSpec(check_type="toxicity"),
])
result = guard.validate(llm_output)
if not result.valid:
result.raise_on_failure()
5. YAML-based evaluation
# checkllm.yaml
description: "Customer support chatbot evaluation"
judge:
backend: openai
model: gpt-4o
prompts:
- "You are a helpful support agent. Answer: {{query}}"
tests:
- vars:
query: "How do I return an item?"
assert:
- type: contains
value: "return policy"
- type: relevance
threshold: 0.8
- type: no_pii
- type: max_tokens
value: 500
settings:
budget: 5.0
checkllm eval-yaml checkllm.yaml
How checkllm compares
Independent benchmark, not just feature counts. On the public competitor leaderboard (docs/benchmarks/competitor-comparison.md) checkllm holds rank 1 on every published row against DeepEval and promptfoo: halubench/hallucination 0.783, ragtruth/hallucination 0.663, ragtruth/faithfulness 0.754, ragtruth/context_relevance 0.565, and truthfulqa/answer_relevancy 0.546 (ROC-AUC, gpt-4o-mini judge, 200 source rows per slice). Methodology is in docs/benchmarks/methodology.md; raw scores ship in
benchmarks/competitor_comparison/.
Feature comparison
| Feature | checkllm | DeepEval | Ragas | promptfoo |
|---|---|---|---|---|
| pytest native | Yes | Wrapper | No | No |
| Free deterministic checks | 39 | Limited | Limited | Yes |
| LLM-as-judge metrics | 72 | ~50 | ~40 | Custom |
| Red team vulnerability types | 151 | 40+ | 0 | 100+ |
| Attack strategies | 25 | 10+ | 0 | 30+ |
| Compliance frameworks | 17 | 3 | 0 | 10+ |
| Multi-provider judges | 15+ backends | 13+ | ~6 | 50+ |
| Consensus judging | 7 strategies | No | Dual-judge | No |
| Production guardrails | Built-in | No | No | API |
| Cost control & budgets | Built-in | No | No | Caching |
| Knowledge Graph synthesis | Full pipeline | No | Yes | No |
| Multilingual prompts | 20 languages | No | Yes | No |
| Prompt optimization | 4 algorithms | 4 | 2 | No |
| YAML config evaluation | Yes | No | No | Yes |
| Streaming evaluation | Token-by-token | No | No | No |
| Regression detection | Statistical (p-values) | No | No | No |
| DPO export | Yes | No | No | No |
| Telemetry / phoning home | None | PostHog + Sentry | None | Telemetry |
| Independence | Fully independent | YC-backed | YC-backed | OpenAI-owned |
All metrics by category
RAG Evaluation (14 metrics)
hallucination faithfulness faithfulness_hhem context_relevance context_entity_recall contextual_precision contextual_recall answer_completeness groundedness nonllm_context_precision nonllm_context_recall quoted_spans_alignment nv_context_relevance nv_response_groundedness
General Quality (12 metrics)
relevance coherence fluency consistency correctness factual_correctness sentiment toxicity bias summarization nv_answer_accuracy prompt_alignment
Completeness & Instruction Following (5 metrics)
response_completeness instruction_following instruction_completeness conversation_completeness topic_adherence
Agent & Tool Evaluation (12 metrics)
task_completion tool_accuracy tool_call_f1 plan_adherence plan_quality step_efficiency knowledge_retention goal_accuracy trajectory_goal_success trajectory_tool_sequence trajectory_step_count trajectory_tool_args_match
Per-Turn Conversation (3 metrics)
turn_relevancy turn_faithfulness turn_coherence
Multimodal (6 metrics)
image_relevance image_helpfulness image_coherence text_to_image image_editing image_reference
Structured Output (4 metrics)
code_correctness sql_equivalence comparative_quality datacompy_score
Role & Safety (3 metrics)
role_adherence role_violation non_advice
MCP & Tool-Specific (3 metrics)
mcp_use mcp_task_completion multi_turn_mcp_use
Specialized (3 metrics)
g_eval noise_sensitivity rubric
Deterministic Checks (39, zero API cost)
contains not_contains starts_with ends_with regex exact_match exact_match_strict min_tokens max_tokens min_words max_words min_chars max_chars min_sentences max_sentences is_json json_schema is_xml is_yaml is_html no_pii language readability similarity bleu rouge_l meteor gleu chrf latency_check cost_check string_distance perplexity is_valid_python is_url has_url word_count char_count sentence_count
Red teaming & adversarial testing
from checkllm.redteam import RedTeamer, VulnerabilityType
from checkllm.redteam_strategies import StrategyType
red = RedTeamer()
report = await red.scan(
target=my_llm_function,
vulnerability_types=[
VulnerabilityType.PROMPT_INJECTION,
VulnerabilityType.JAILBREAK,
VulnerabilityType.PII_LEAKAGE,
VulnerabilityType.DATA_EXFILTRATION,
],
strategies=[StrategyType.BASE64, StrategyType.CRESCENDO, StrategyType.PERSONA],
attacks_per_type=5,
)
print(report.summary())
print(report.risk_summary()) # CVSS severity breakdown
151 vulnerability types across 12 categories: prompt injection, jailbreak, PII leakage, harmful content, encoding attacks, privilege escalation, agentic AI attacks, brand & reputation, industry compliance, and more.
25 attack strategies: BASE64, ROT13, HEX, LEETSPEAK, MORSE, HOMOGLYPH, CRESCENDO (multi-turn escalation), JAILBREAK_TREE, JAILBREAK_META, JAILBREAK_COMPOSITE, BEST_OF_N, PERSONA, HYPOTHETICAL, ROLEPLAY, LAYER (composable chaining), and more.
Coding agent security
from checkllm.redteam_coding_agents import CodingAgentScanner
scanner = CodingAgentScanner(judge=judge)
report = await scanner.scan(target=my_coding_agent)
# Tests: repo prompt injection, sandbox escape, secret leakage, verifier sabotage
Compliance frameworks
from checkllm.compliance_frameworks import ComplianceScanner, ComplianceFramework
scanner = ComplianceScanner(judge=judge)
report = await scanner.scan(
target=my_llm,
frameworks=[
ComplianceFramework.OWASP_LLM_TOP10,
ComplianceFramework.OWASP_AGENTIC_TOP10,
ComplianceFramework.EU_AI_ACT,
ComplianceFramework.HIPAA,
],
)
print(report.summary())
17 frameworks: OWASP LLM Top 10, OWASP API Top 10, OWASP Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, NIST AI RMF, NIST CSF, HIPAA, GDPR, PCI-DSS, SOC2, ISO 27001, COPPA, FERPA, CCPA, DoD AI Ethics.
Knowledge Graph test generation
from checkllm.knowledge_graph import KGTestGenerator, EntityExtractor, SimilarityBuilder
gen = KGTestGenerator(judge=judge)
samples = await gen.generate(
documents=["doc1 text...", "doc2 text..."],
num_samples=50,
synthesizers={"single_hop": 0.4, "multi_hop_abstract": 0.3, "multi_hop_specific": 0.3},
personas=5,
)
cases = gen.to_cases(samples)
Build a knowledge graph from your documents, then generate diverse test cases with single-hop, multi-hop abstract, and multi-hop specific queries. Supports persona variation, query styles (web search, misspelled, conversational), and configurable complexity.
Multilingual evaluation
from checkllm.multilingual import PromptAdapter, detect_language
adapter = PromptAdapter(judge=judge)
translated = await adapter.adapt(template=my_prompt, target_language="ja")
adapter.save_translations("translations/ja.json")
lang = detect_language("Esto es un texto en espanol.") # "es"
Supports 20+ languages with automatic prompt adaptation. Language detection uses Unicode character-range analysis with LLM fallback.
Prompt optimization
from checkllm.optimize import create_optimizer
optimizer = create_optimizer("miprov2", judge=judge) # or "genetic", "copro", "simba"
result = await optimizer.optimize(
prompt="Summarize this document.",
test_cases=my_test_cases,
metric_fn=my_metric,
num_candidates=10,
)
print(f"Improved from {result.initial_score:.2f} to {result.best_score:.2f}")
Four optimization algorithms: Genetic (evolutionary), MIPROv2 (instruction + demonstration), COPRO (failure-driven iterative), SIMBA (similarity-based adaptation).
Multi-provider judges
from checkllm import create_judge
judge = create_judge("openai", model="gpt-4o")
judge = create_judge("anthropic", model="claude-sonnet-4-6")
judge = create_judge("gemini", model="gemini-2.0-flash")
judge = create_judge("ollama", model="llama3.1") # Free, local
judge = create_judge("litellm", model="any-model") # 100+ models
judge = create_judge("deepseek")
judge = create_judge("groq")
judge = create_judge("fireworks")
Auto-detection: set OPENAI_API_KEY, ANTHROPIC_API_KEY, or have Ollama running -- checkllm picks the best judge automatically.
Consensus judging
from checkllm import ConsensusJudge
judges = [("gpt4", gpt4_judge), ("claude", claude_judge), ("gemini", gemini_judge)]
consensus = ConsensusJudge(judges, strategy="majority") # or mean, median, unanimous, min, max, weighted
Cost control
checkllm estimate tests/ # See costs before running
checkllm run tests/ --budget 5.0 # Cap spend at $5
checkllm run tests/ --dry-run # Estimate without executing
Configuration
# pyproject.toml
[tool.checkllm]
judge_backend = "auto"
judge_model = "gpt-4o"
default_threshold = 0.8
budget = 10.0
cache_enabled = true
engine = "auto"
CLI
| Command | Description |
|---|---|
checkllm init |
Scaffold a project (--use-case, --ci) |
checkllm run |
Run tests (--budget, --dry-run, --snapshot) |
checkllm eval-yaml |
Run YAML-based evaluation |
checkllm estimate |
Estimate costs before running |
checkllm watch |
Re-run on file changes |
checkllm report |
Generate HTML report |
checkllm snapshot |
Save baseline for regression detection |
checkllm diff |
Compare snapshots |
checkllm history |
View run history and trends |
checkllm list-metrics |
Show all available checks and metrics |
checkllm cache |
Manage judge response cache |
checkllm dashboard |
Launch web dashboard |
Framework integrations
# LangChain
from checkllm.integrations.langchain import CheckllmCallbackHandler
chain.invoke(input, config={"callbacks": [CheckllmCallbackHandler(checks=["no_pii"])]})
# CrewAI
from checkllm.integrations.crewai import CheckllmCrewCallback
# OpenAI Agents SDK
from checkllm.integrations.openai_agents import CheckllmRunHandler
# Claude Agent SDK
from checkllm.integrations.claude_agents import CheckllmAgentHandler
# PydanticAI
from checkllm.integrations.pydantic_ai import CheckllmResultValidator
# LlamaIndex
from checkllm.integrations.llama_index import CheckllmCallbackHandler
Custom metrics
from checkllm import metric, CheckResult
@metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
words = len(output.split())
return CheckResult(
passed=words <= max_words,
score=min(1.0, max_words / max(words, 1)),
reasoning=f"{words} words (limit: {max_words})",
cost=0.0, latency_ms=0, metric_name="brevity",
)
License
MIT
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file checkllm-5.0.1.tar.gz.
File metadata
- Download URL: checkllm-5.0.1.tar.gz
- Upload date:
- Size: 645.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
551a3a312a76e410195b44a75d3ad5e6b150ce340823340c3099f84b6338edce
|
|
| MD5 |
a475015d0d084ea26597a9ec2a2c65f5
|
|
| BLAKE2b-256 |
1f575b0c775bfa26de0f6a5f8da5b0aa174c43ab50a35cc7c6f3da471d555439
|
Provenance
The following attestation bundles were made for checkllm-5.0.1.tar.gz:
Publisher:
publish.yml on javierdejesusda/checkllm
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
checkllm-5.0.1.tar.gz -
Subject digest:
551a3a312a76e410195b44a75d3ad5e6b150ce340823340c3099f84b6338edce - Sigstore transparency entry: 1338663663
- Sigstore integration time:
-
Permalink:
javierdejesusda/checkllm@f4e95c596d427952fd36b136b527a3de7d35ac98 -
Branch / Tag:
refs/tags/v5.0.1 - Owner: https://github.com/javierdejesusda
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f4e95c596d427952fd36b136b527a3de7d35ac98 -
Trigger Event:
release
-
Statement type:
File details
Details for the file checkllm-5.0.1-py3-none-any.whl.
File metadata
- Download URL: checkllm-5.0.1-py3-none-any.whl
- Upload date:
- Size: 524.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
260ec920ca67b855ad22ceb4744cd3e619ea34ada9ebff9a14ffea6fa745dff9
|
|
| MD5 |
aa02ee6080d5850820e78cfb267e98c5
|
|
| BLAKE2b-256 |
b6f38b8272ce2b31fa9caef550101a2b97fb7760b31c7d4448d0044ad5851444
|
Provenance
The following attestation bundles were made for checkllm-5.0.1-py3-none-any.whl:
Publisher:
publish.yml on javierdejesusda/checkllm
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
checkllm-5.0.1-py3-none-any.whl -
Subject digest:
260ec920ca67b855ad22ceb4744cd3e619ea34ada9ebff9a14ffea6fa745dff9 - Sigstore transparency entry: 1338663690
- Sigstore integration time:
-
Permalink:
javierdejesusda/checkllm@f4e95c596d427952fd36b136b527a3de7d35ac98 -
Branch / Tag:
refs/tags/v5.0.1 - Owner: https://github.com/javierdejesusda
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f4e95c596d427952fd36b136b527a3de7d35ac98 -
Trigger Event:
release
-
Statement type: