A library of checklist generation and scoring methods for LLM evaluation

Project description

AutoChecklist

A library of composable pipelines for generating and scoring checklist criteria.

A checklist is a list of yes/no questions used to evaluate an LLM response. autochecklist provides five generator abstractions, each representing a different reasoning approach to producing evaluation criteria, along with a configurable ChecklistScorer that consolidates three scoring strategies from the literature. You can mix, extend, and customize all components.

Terminology

  • input: The instruction, query, or task given to the LLM being evaluated (e.g., "Write a haiku about autumn").
  • target: The output being evaluated against the checklist (e.g., the haiku the LLM produced).
  • reference: An optional gold-standard response used by some methods to improve checklist generation.
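The three roles can be illustrated with a plain Python record. The field names follow the terminology above; this is illustrative data, not a library API:

```python
# One evaluation record: the task, the model output under evaluation,
# and an optional gold-standard reference.
record = {
    "input": "Write a haiku about autumn.",         # task given to the LLM
    "target": "Leaves fall gently down...",         # output being judged
    "reference": "Crisp leaves drift earthward...", # optional gold answer
}

# A checklist is just a list of yes/no questions about the target.
checklist = [
    "Does the response have three lines?",
    "Does the response follow a 5-7-5 syllable pattern?",
    "Is the poem about autumn?",
]
```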

Generator Abstractions

The core of the library is five generator classes, each implementing a distinct approach to producing checklists:

| Level | Generator | Approach | Analogy |
| --- | --- | --- | --- |
| Instance | DirectGenerator | Prompt → checklist | Direct inference |
| Instance | ContrastiveGenerator | Candidates → checklist | Counterfactual reasoning |
| Corpus | InductiveGenerator | Observations → criteria | Inductive reasoning (bottom-up) |
| Corpus | DeductiveGenerator | Dimensions → criteria | Deductive reasoning (top-down) |
| Corpus | InteractiveGenerator | Eval sessions → criteria | Protocol analysis |

Instance-level generators produce one checklist per input — criteria are tailored to each specific task. Corpus-level generators produce one checklist for an entire dataset — criteria capture general quality patterns derived from higher-level signals.

Each generator is customizable via prompt templates (.md files with {input}, {target} placeholders). You can use the built-in paper implementations, write your own prompts, or chain generators with different refiners and scorers to build custom evaluation pipelines.
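A minimal sketch of how such a template works: a .md file's {input} and {target} placeholders are filled before the LLM call. The template text here is illustrative, not one of the built-in prompts:

```python
# Illustrative prompt template in the style described above; in practice
# this text would live in a .md file loaded by a generator.
template = """Given the task and response below, write yes/no checklist questions.

Task: {input}
Response: {target}
"""

# Placeholders are substituted with the evaluation record's fields.
prompt = template.format(
    input="Write a haiku about autumn.",
    target="Leaves fall gently down...",
)
```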

Built-in Pipelines

The library includes built-in pipelines implementing methods from research papers (TICK, RocketEval, RLCF, OpenRubrics, CheckEval, InteractEval, and more). See Supported Pipelines for the full list and configuration details.

Scoring

A single configurable ChecklistScorer class supports all scoring modes:

| Config | Description |
| --- | --- |
| mode="batch" | All items in one LLM call (efficient) |
| mode="batch", capture_reasoning=True | Batch with per-item explanations |
| mode="item" | One item per call |
| mode="item", capture_reasoning=True | One item per call with reasoning |
| mode="item", primary_metric="weighted" | Item weights (0-100) for importance |
| mode="item", use_logprobs=True | Logprob confidence calibration |

Refiners

Refiners are pipeline stages that clean up raw checklists before scoring. They're used by corpus-level generators internally, and can also be composed into custom pipelines:

  • Deduplicator — merges semantically similar items via embeddings
  • Tagger — filters by applicability and specificity
  • UnitTester — validates that items are enforceable
  • Selector — picks a diverse subset via beam search
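To make the Deduplicator's role concrete, here is a crude stand-in that merges near-duplicate items by word overlap (Jaccard similarity) instead of embeddings; the real refiner uses embedding similarity, and this function is purely illustrative:

```python
def dedup(items, threshold=0.6):
    """Keep only checklist items whose word-level Jaccard similarity with
    every already-kept item is below threshold. A crude stand-in for the
    embedding-based similarity a real deduplicator would use."""
    kept = []
    for item in items:
        words = set(item.lower().split())
        similar = any(
            len(words & set(k.lower().split())) / len(words | set(k.lower().split()))
            >= threshold
            for k in kept
        )
        if not similar:
            kept.append(item)
    return kept

items = [
    "Does the response have three lines?",
    "Does the response have exactly three lines?",  # near-duplicate of the first
    "Is the poem about autumn?",
]
deduped = dedup(items)
```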

Installation

pip install autochecklist

Optional extras

# ML dependencies for corpus-level refiners (embeddings, deduplication)
pip install "autochecklist[ml]"

# vLLM for offline GPU inference (no server needed)
pip install "autochecklist[vllm]"

# Everything
pip install "autochecklist[all]"

For development installation from source, see the GitHub repository.

Quick Start

from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn.", target="Leaves fall gently down...")
print(f"Pass rate: {result.pass_rate:.0%}")

See the Quick Start guide for custom prompts, batch evaluation, and more.

CLI

autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
  --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini
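The data file is JSON Lines, one record per line. The field names below follow the terminology above ("input", "target"); the exact schema the CLI expects is an assumption here — check the CLI guide:

```python
import json

# Two illustrative records for eval_data.jsonl.
records = [
    {"input": "Write a haiku about autumn.", "target": "Leaves fall gently down..."},
    {"input": "Summarize the article in one sentence.", "target": "The study finds..."},
]

# JSONL: one JSON object per line.
with open("eval_data.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```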

See the CLI guide for all commands.

Citation

TBA

License

Apache-2.0 (see LICENSE)

Download files

  • Source distribution: autochecklist-0.2.1.tar.gz (4.9 MB)
  • Built distribution: autochecklist-0.2.1-py3-none-any.whl (122.7 kB)

File details: autochecklist-0.2.1.tar.gz

  • Size: 4.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | adee2c635ad16d48a7090d66dfc329dcd7b88e505de04bfd24ebc24743f75679 |
| MD5 | cb324dc38d903154023e3ebd12b5f2af |
| BLAKE2b-256 | ebc0bfa1fd52df7a211cd8f52ef8191d2b76a69b2ae82f7614cc5556dfce9d70 |

Provenance

Attestation bundles for autochecklist-0.2.1.tar.gz were published by publish.yml on ChicagoHAI/AutoChecklist.

File details: autochecklist-0.2.1-py3-none-any.whl

  • Size: 122.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 0d8fb766138393c2444d3fd0a0e21b59b4782d32b8a9fb202896401c083d9fda |
| MD5 | 7769c328fcae122913a659a58e867ad8 |
| BLAKE2b-256 | bcaa9016b19dbaeb4379da23d329826571f1700299cb35f7117497f6f1a93a15 |

Provenance

Attestation bundles for autochecklist-0.2.1-py3-none-any.whl were published by publish.yml on ChicagoHAI/AutoChecklist.
