
A library of checklist generation and scoring methods for LLM evaluation

Project description

AutoChecklist

Python 3.10+ · Apache-2.0 License

A library of composable pipelines for generating and scoring checklist criteria.

A checklist is a list of yes/no questions used to evaluate an LLM's output. autochecklist provides five generator abstractions, each representing a different reasoning approach to producing evaluation criteria, along with a configurable ChecklistScorer that consolidates three scoring strategies from the literature. You can mix, extend, and customize all components.

Terminology

  • input: The instruction, query, or task given to the LLM being evaluated (e.g., "Write a haiku about autumn").
  • target: The output being evaluated against the checklist (e.g., the haiku the LLM produced).
  • reference: An optional gold-standard response used by some methods to improve checklist generation.

Generator Abstractions

The core of the library is five generator classes, each implementing a distinct approach to producing checklists:

| Level | Generator | Approach | Analogy |
|----------|----------------------|--------------------------|---------------------------------|
| Instance | DirectGenerator | Prompt → checklist | Direct inference |
| Instance | ContrastiveGenerator | Candidates → checklist | Counterfactual reasoning |
| Corpus | InductiveGenerator | Observations → criteria | Inductive reasoning (bottom-up) |
| Corpus | DeductiveGenerator | Dimensions → criteria | Deductive reasoning (top-down) |
| Corpus | InteractiveGenerator | Eval sessions → criteria | Protocol analysis |

Instance-level generators produce one checklist per input — criteria are tailored to each specific task. Corpus-level generators produce one checklist for an entire dataset — criteria capture general quality patterns derived from higher-level signals.

Each generator is customizable via prompt templates (.md files with {input}, {target} placeholders). You can use the built-in paper implementations, write your own prompts, or chain generators with different refiners and scorers to build custom evaluation pipelines.
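The placeholder mechanics are simple to picture. A minimal sketch, assuming templates are plain text substituted with Python's str.format (the helper name here is illustrative, not part of the library's API):

```python
# Illustrative sketch of {input}/{target} placeholder substitution in a
# prompt template; render_prompt is a hypothetical helper, not library API.
TEMPLATE = (
    "You are an expert evaluator.\n"
    "Generate yes/no checklist questions to score the response below.\n\n"
    "Task: {input}\n"
    "Response: {target}\n"
)

def render_prompt(template: str, **fields: str) -> str:
    """Fill the template's named placeholders with the given fields."""
    return template.format(**fields)

prompt = render_prompt(
    TEMPLATE,
    input="Write a haiku about autumn.",
    target="Leaves fall gently down...",
)
```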

Built-in Pipelines

The library includes built-in pipelines implementing methods from research papers (TICK, RocketEval, RLCF, CheckEval, InteractEval, and more). See Supported Pipelines for the full list and configuration details.

Scoring

A single configurable ChecklistScorer class supports all scoring modes:

| Config | Description |
|------------------------------------------|--------------------------------------|
| mode="batch" | All items in one LLM call (efficient) |
| mode="batch", capture_reasoning=True | Batch with per-item explanations |
| mode="item" | One item per call |
| mode="item", capture_reasoning=True | One item per call with reasoning |
| mode="item", primary_metric="weighted" | Item weights (0-100) for importance |
| mode="item", use_logprobs=True | Logprob confidence calibration |
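To make the metrics concrete, here is a sketch (not the library's internals) of how per-item verdicts could aggregate into a plain pass rate versus a weighted score that uses 0-100 importance weights:

```python
def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of checklist items answered 'yes'."""
    return sum(verdicts) / len(verdicts)

def weighted_score(verdicts: list[bool], weights: list[float]) -> float:
    """Weight-normalized score: passed weight over total weight."""
    passed = sum(w for v, w in zip(verdicts, weights) if v)
    return passed / sum(weights)

verdicts = [True, True, False, True]
assert pass_rate(verdicts) == 0.75
# 0-100 weights mark item importance, as in primary_metric="weighted":
# passed weight 100 + 50 + 50 = 200 out of 300 total
assert abs(weighted_score(verdicts, [100, 50, 100, 50]) - 200 / 300) < 1e-9
```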

Refiners

Refiners are pipeline stages that clean up raw checklists before scoring. They're used by corpus-level generators internally, and can also be composed into custom pipelines:

  • Deduplicator — merges semantically similar items via embeddings
  • Tagger — filters by applicability and specificity
  • UnitTester — validates that items are enforceable
  • Selector — picks a diverse subset via beam search
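As a rough illustration of what the Deduplicator stage does, here is a stdlib sketch that merges near-duplicate items by string similarity (the actual refiner uses embeddings, and this threshold is made up):

```python
import difflib

def deduplicate(items: list[str], threshold: float = 0.85) -> list[str]:
    """Keep an item only if it isn't too similar to one already kept.
    Stand-in for embedding similarity, using stdlib string matching."""
    kept: list[str] = []
    for item in items:
        similar = any(
            difflib.SequenceMatcher(None, item.lower(), k.lower()).ratio() >= threshold
            for k in kept
        )
        if not similar:
            kept.append(item)
    return kept

items = [
    "Does the haiku follow a 5-7-5 syllable pattern?",
    "Does the haiku follow the 5-7-5 syllable pattern?",  # near-duplicate
    "Is the haiku about autumn?",
]
deduped = deduplicate(items)  # the two near-duplicates collapse into one
```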

Installation

pip install autochecklist

Optional extras

# ML dependencies for corpus-level refiners (embeddings, deduplication)
pip install "autochecklist[ml]"

# vLLM for offline GPU inference (no server needed)
pip install "autochecklist[vllm]"

# Everything
pip install "autochecklist[all]"

For development installation from source, see the GitHub repository.

Using the Package

Custom Prompts

Write a prompt template and generate a checklist:

from autochecklist import DirectGenerator, ChecklistScorer

gen = DirectGenerator(
    custom_prompt="You are an expert evaluator. Generate yes/no checklist questions to score:\n\n{input}",
    model="openai/gpt-5-mini",
)
checklist = gen.generate(input="Write a haiku about autumn.")

scorer = ChecklistScorer(mode="batch", model="openai/gpt-5-mini")
score = scorer.score(checklist, target="Leaves fall gently down...")
print(f"Pass rate: {score.pass_rate:.0%}")

Scorers accept custom prompts as well, and prompts can be loaded from .md files — see Custom Prompts for the full guide (placeholders, custom scorers, registration).

Custom Pipelines

Register a custom pipeline (generator + scorer + prompts) as a reusable unit:

from autochecklist import register_custom_pipeline, pipeline

# Register from config
register_custom_pipeline(
    "my_eval",
    generator_prompt="Generate yes/no questions for:\n\n{input}",
    scorer="weighted",
)
pipe = pipeline("my_eval", generator_model="openai/gpt-5-mini")

# Or register from an existing pipeline instance
register_custom_pipeline("my_eval_v2", pipe)

# Save/load pipeline configs as JSON
from autochecklist import save_pipeline_config, load_pipeline_config
save_pipeline_config("my_eval", "my_eval.json")
load_pipeline_config("my_eval.json")  # registers and returns the name
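The exact config schema belongs to the library; the fields below are hypothetical and only illustrate the kind of JSON round trip that save_pipeline_config/load_pipeline_config imply:

```python
import json
import os
import tempfile

# Hypothetical pipeline config: name, generator prompt, and scorer choice.
config = {
    "name": "my_eval",
    "generator_prompt": "Generate yes/no questions for:\n\n{input}",
    "scorer": "weighted",
}

# Save the config as JSON, then load it back.
path = os.path.join(tempfile.mkdtemp(), "my_eval.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

with open(path) as f:
    loaded = json.load(f)
```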

Built-in Pipelines

The library includes pipelines implementing methods from research papers. Use them via method_name or the pipeline() shorthand:

from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn", target="Leaves fall gently...")
print(f"Pass rate: {result.pass_rate:.0%}")

See Supported Pipelines for the full list of pipelines, paper details, and configuration options.

Batch Evaluation

data = [
    {"input": "Write a haiku", "target": "Leaves fall..."},
    {"input": "Write a limerick", "target": "There once was..."},
]
result = pipe.run_batch(data, show_progress=True)
print(f"Macro pass rate: {result.macro_pass_rate:.0%}")
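Macro averaging gives each input equal weight regardless of how many checklist items it has; a one-function sketch of that aggregation (not the library's code):

```python
def macro_pass_rate(per_instance_rates: list[float]) -> float:
    """Average of per-instance pass rates: each input counts equally,
    no matter how long its checklist is."""
    return sum(per_instance_rates) / len(per_instance_rates)

# e.g. a 4-item checklist at 75% and a 10-item checklist at 45%
rate = macro_pass_rate([0.75, 0.45])  # 0.6
```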

For pipeline composition, provider configuration, and the full API, see the Pipeline Guide.

Command-Line Interface

Run evaluations directly from the terminal:

# Full evaluation (generate + score)
autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
  --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini

# Generate checklists only
autochecklist generate --pipeline tick --data inputs.jsonl -o checklists.jsonl \
  --generator-model openai/gpt-4o-mini

# Score with existing checklist
autochecklist score --data eval_data.jsonl --checklist checklist.json \
  -o results.jsonl --scorer-model openai/gpt-4o-mini

# List available pipelines
autochecklist list

API keys can be set via --api-key, environment variables (OPENROUTER_API_KEY), or a .env file. See the CLI Guide for full details.

Examples

Detailed examples with runnable code:


Citation

TBA

License

Apache-2.0 (see LICENSE)



Download files

Download the file for your platform.

Source Distribution

autochecklist-0.2.0.tar.gz (4.0 MB)

Uploaded Source

Built Distribution


autochecklist-0.2.0-py3-none-any.whl (123.3 kB)

Uploaded Python 3

File details

Details for the file autochecklist-0.2.0.tar.gz.

File metadata

  • Download URL: autochecklist-0.2.0.tar.gz
  • Upload date:
  • Size: 4.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for autochecklist-0.2.0.tar.gz
Algorithm Hash digest
SHA256 8414b71186da61af3a9400c7e00700e2a4f2703789dac58faaaefd873c67d8dc
MD5 58bf34e911993e2640094eec638b0e3c
BLAKE2b-256 3fdb396d50cb2b1b9712f37aa691c21a4783a96121bcdafc6e4a441d5d9542dc


Provenance

The following attestation bundles were made for autochecklist-0.2.0.tar.gz:

Publisher: publish.yml on ChicagoHAI/AutoChecklist

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file autochecklist-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: autochecklist-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 123.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for autochecklist-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 117b0f2041d8324eb5ac56a5e58fc99b3887435660c3646158889700bcccb1c6
MD5 e8a7047dd4f9534c8a9446e11a1ee915
BLAKE2b-256 47e4690b87ee8a9731244f5cbf34e0b86642e3b612801bd964115567dae19aca


Provenance

The following attestation bundles were made for autochecklist-0.2.0-py3-none-any.whl:

Publisher: publish.yml on ChicagoHAI/AutoChecklist

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
