A library of checklist generation and scoring methods for LLM evaluation

AutoChecklist

Requires Python 3.10+.

A library of composable pipelines for generating and scoring checklist criteria.

A checklist is a list of yes/no questions used to evaluate an LLM response. autochecklist provides 5 generator abstractions, each representing a different reasoning approach to producing evaluation criteria, along with a configurable ChecklistScorer that consolidates three scoring strategies from the literature. You can mix, extend, and customize all components.

Terminology

  • input: The instruction, query, or task given to the LLM being evaluated (e.g., "Write a haiku about autumn").
  • target: The output being evaluated against the checklist (e.g., the haiku the LLM produced).
  • reference: An optional gold-standard response used by some methods to improve checklist generation.
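As a rough sketch of how these three fields fit together (the dataclass below is illustrative only, not the library's actual data model):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalExample:
    """Illustrative container for one evaluation example; not part of the autochecklist API."""
    input: str                       # instruction given to the LLM being evaluated
    target: str                      # output judged against the checklist
    reference: Optional[str] = None  # optional gold-standard response

ex = EvalExample(
    input="Write a haiku about autumn.",
    target="Leaves fall gently down...",
)
```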

Generator Abstractions

The core of the library is 5 generator classes, each implementing a distinct approach to producing checklists:

| Level    | Generator             | Approach                 | Analogy                         |
|----------|-----------------------|--------------------------|---------------------------------|
| Instance | DirectGenerator       | Prompt → checklist       | Direct inference                |
| Instance | ContrastiveGenerator  | Candidates → checklist   | Counterfactual reasoning        |
| Corpus   | InductiveGenerator    | Observations → criteria  | Inductive reasoning (bottom-up) |
| Corpus   | DeductiveGenerator    | Dimensions → criteria    | Deductive reasoning (top-down)  |
| Corpus   | InteractiveGenerator  | Eval sessions → criteria | Protocol analysis               |

Instance-level generators produce one checklist per input — criteria are tailored to each specific task. Corpus-level generators produce one checklist for an entire dataset — criteria capture general quality patterns derived from higher-level signals.

Each generator is customizable via prompt templates (.md files with {input}, {target} placeholders). You can use the built-in paper implementations, write your own prompts, or chain generators with different refiners and scorers to build custom evaluation pipelines.
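Filling such a template is plain string formatting. The template text below is made up for illustration; only the {input} and {target} placeholder convention comes from the library:

```python
# A prompt template (normally stored in a .md file) with {input}/{target} placeholders.
template = (
    "Generate a checklist of yes/no questions to evaluate the response.\n"
    "Instruction: {input}\n"
    "Response: {target}\n"
)

prompt = template.format(
    input="Write a haiku about autumn.",
    target="Leaves fall gently down...",
)
print(prompt)
```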

Built-in Pipelines

The library includes built-in pipelines implementing methods from research papers (TICK, RocketEval, RLCF, OpenRubrics, CheckEval, InteractEval, and more). See Supported Pipelines for the full list and configuration details.

Scoring

A single configurable ChecklistScorer class supports all scoring modes:

| Config                                    | Description                            |
|-------------------------------------------|----------------------------------------|
| `mode="batch"`                            | All items in one LLM call (efficient)  |
| `mode="batch", capture_reasoning=True`    | Batch with per-item explanations       |
| `mode="item"`                             | One item per call                      |
| `mode="item", capture_reasoning=True`     | One item per call with reasoning       |
| `mode="item", primary_metric="weighted"`  | Item weights (0-100) for importance    |
| `mode="item", use_logprobs=True`          | Logprob confidence calibration         |
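To make the aggregation concrete, here is a minimal, library-independent sketch of a plain pass rate over binary verdicts and an importance-weighted variant where each item carries a 0-100 weight. The formulas are an assumption for illustration, not necessarily what ChecklistScorer computes internally:

```python
def pass_rate(verdicts):
    """Fraction of checklist items judged 'yes' (verdicts are booleans)."""
    return sum(verdicts) / len(verdicts)

def weighted_pass_rate(verdicts, weights):
    """Importance-weighted pass rate; each weight is an integer in 0-100."""
    return sum(w for v, w in zip(verdicts, weights) if v) / sum(weights)

verdicts = [True, True, False]
print(pass_rate(verdicts))                        # 2 of 3 items pass
print(weighted_pass_rate(verdicts, [80, 10, 10])) # failing item counts for little
```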

Refiners

Refiners are pipeline stages that clean up raw checklists before scoring. They're used by corpus-level generators internally, and can also be composed into custom pipelines:

  • Deduplicator — merges semantically similar items via embeddings
  • Tagger — filters by applicability and specificity
  • UnitTester — validates that items are enforceable
  • Selector — picks a diverse subset via beam search
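The Deduplicator's idea can be sketched without the library: keep an item only if its embedding is not too close to any already-kept item. The cosine threshold and the 2-D toy embeddings below are invented for illustration, not the refiner's actual implementation:

```python
def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def dedup(items, embeddings, threshold=0.9):
    """Greedy dedup: drop an item if it is too similar to any kept item."""
    kept, kept_vecs = [], []
    for item, vec in zip(items, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(item)
            kept_vecs.append(vec)
    return kept

items = ["Is it a haiku?", "Does it have haiku form?", "Is it about autumn?"]
embs = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]  # toy 2-D embeddings
print(dedup(items, embs))  # the near-duplicate second item is dropped
```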

Installation

pip install autochecklist

Optional extras

# ML dependencies for corpus-level refiners (embeddings, deduplication)
pip install "autochecklist[ml]"

# vLLM for offline GPU inference (no server needed)
pip install "autochecklist[vllm]"

# Everything
pip install "autochecklist[all]"

For development installation from source, see the GitHub repository.

Quick Start

from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn.", target="Leaves fall gently down...")
print(f"Pass rate: {result.pass_rate:.0%}")

See the Quick Start guide for custom prompts, batch evaluation, and more.

CLI

autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
  --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini
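Assuming each line of the data file carries the input/target fields described under Terminology (the exact schema may differ; see the CLI guide), eval_data.jsonl might look like:

```json
{"input": "Write a haiku about autumn.", "target": "Leaves fall gently down..."}
{"input": "Summarize this article in one sentence.", "target": "The study finds..."}
```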

See the CLI guide for all commands.

Citation

TBA

License

Apache-2.0 (see LICENSE)
