inspect-build-time-contract

Opt-in lint for Inspect AI tasks: warn when a task you've declared verifiable uses a model-graded scorer where a deterministic alternative is available.

The Inspect AI scorer documentation recommends "deterministic where possible, LLM where necessary." This package makes that recommendation mechanically checkable for tasks that opt in.

Install

pip install inspect-build-time-contract

Usage

from inspect_ai import Task, task
from inspect_ai.scorer import match, model_graded_qa
from inspect_build_time_contract import verifiable_task

# Deterministic scorer on a verifiable task: silent.
@verifiable_task
def my_factoid_eval():
    return Task(dataset=..., scorer=match())

# Model-graded scorer on a verifiable task: WARNING at task load.
@verifiable_task
def my_judged_eval():
    return Task(dataset=..., scorer=model_graded_qa())
# WARNING:inspect_build_time_contract:Task 'my_judged_eval' is decorated with
#         @verifiable_task but its scorer is classified as 'model_graded'.
#         Consider a deterministic alternative ... or use Inspect's @task directly.

# Task with no claim about verifiability: use Inspect's @task as normal.
@task
def my_genuinely_subjective_eval():
    return Task(dataset=..., scorer=model_graded_qa())

CI mode

Set INSPECT_BUILD_TIME_CONTRACT_STRICT=1 to escalate warnings to a RuntimeError:

INSPECT_BUILD_TIME_CONTRACT_STRICT=1 inspect eval my_eval.py
# Warnings now raise; CI fails on contract violations.
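
As a rough illustration of the escalation described above, the sketch below logs a warning by default and raises RuntimeError when the environment variable is set to 1. The helper names strict_mode_enabled and report_violation are hypothetical and are not taken from the package; only the variable name, the logger name, and the RuntimeError behaviour come from this README.

```python
import logging
import os

# Logger name matches the prefix shown in the warning output above.
logger = logging.getLogger("inspect_build_time_contract")

STRICT_ENV_VAR = "INSPECT_BUILD_TIME_CONTRACT_STRICT"


def strict_mode_enabled() -> bool:
    """Return True when the strict variable is set to "1" (hypothetical helper)."""
    return os.environ.get(STRICT_ENV_VAR) == "1"


def report_violation(message: str) -> None:
    """Log a warning, or raise RuntimeError when strict mode is on (hypothetical helper)."""
    if strict_mode_enabled():
        raise RuntimeError(message)
    logger.warning(message)
```

Reading the variable at report time rather than at import time means a CI job can set it per command, as in the shell line above. Whether the package itself reads it this way is an assumption.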

Scorer taxonomy

Class          Inspect built-ins
deterministic  match, includes, pattern, exact, f1, answer, choice, math
model_graded   model_graded_qa, model_graded_fact
unknown        Any custom or third-party scorer the package doesn't recognize

Custom scorers are classified as "unknown" and also trigger the warning. To suppress it, either use Inspect's @task directly (which opts the task out of the verifiable contract) or fork the package and add your scorer to DETERMINISTIC_BUILTINS / MODEL_GRADED_BUILTINS.
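
The taxonomy can be pictured as a lookup against two sets. The sketch below is a minimal illustration, assuming classification is done by scorer name. The set names and their members come from this README; the frozenset type, the function classify_scorer_name, and the idea of matching on a plain string are assumptions about how the package might work, not its confirmed implementation.

```python
# Members copied from the taxonomy table; the container type is an assumption.
DETERMINISTIC_BUILTINS = frozenset(
    {"match", "includes", "pattern", "exact", "f1", "answer", "choice", "math"}
)
MODEL_GRADED_BUILTINS = frozenset({"model_graded_qa", "model_graded_fact"})


def classify_scorer_name(name: str) -> str:
    """Map a scorer name to one of the three taxonomy classes (hypothetical helper)."""
    if name in DETERMINISTIC_BUILTINS:
        return "deterministic"
    if name in MODEL_GRADED_BUILTINS:
        return "model_graded"
    # Anything not listed, including custom and third-party scorers.
    return "unknown"
```

Under this sketch, a forked copy that adds a custom scorer's name to DETERMINISTIC_BUILTINS would classify it as "deterministic" and stay silent.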

What this is not

  • It does not force any task to use a deterministic scorer.
  • It does not override any existing Inspect API. @task continues to work exactly as before.
  • It does not run at eval time. It's a pre-flight check at task load.
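
To make "pre-flight check at task load" concrete, here is a minimal sketch of a decorator that inspects the task object once, when the task function is called, and never touches samples. It is self-contained and does not import Inspect: StubTask is a stand-in for inspect_ai.Task, verifiable_task_sketch is a hypothetical name, and the abbreviated name set is illustrative. The real decorator presumably also registers the task with Inspect, which this sketch omits.

```python
import functools
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("inspect_build_time_contract")

# Abbreviated, illustrative subset of deterministic scorer names.
KNOWN_DETERMINISTIC = frozenset({"match", "includes", "exact"})


@dataclass
class StubTask:
    """Stand-in for inspect_ai.Task that only carries a scorer name."""

    scorer_name: str


def verifiable_task_sketch(fn: Callable[[], StubTask]) -> Callable[[], StubTask]:
    """Wrap a task function and check its scorer when the task is built."""

    @functools.wraps(fn)
    def wrapper() -> StubTask:
        task = fn()  # builds the task object; no samples are scored here
        if task.scorer_name not in KNOWN_DETERMINISTIC:
            logger.warning(
                "Task %r declares itself verifiable but scorer %r is not "
                "known to be deterministic.",
                fn.__name__,
                task.scorer_name,
            )
        return task  # returned unchanged, so nothing is forced or overridden

    return wrapper
```

The wrapper returns the task untouched, which mirrors the first two points above: the check only reports, and undecorated tasks are unaffected.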

Why this exists

I built Jig around the idea that an LLM-eval framework should make "declare your deterministic check at build time" a first-class concept. A pre-registered N=50 study on BIRD-SQL (results) found that a Sonnet 4.6 LLM-as-judge had a 40% false-approval rate against the deterministic execution-based scorer; a Haiku 4.5 judge had a 10% false-approval rate. Even when the deterministic check is sitting right there, choosing a model-graded scorer carries a measurable accuracy cost.
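
For readers unfamiliar with the metric, the sketch below shows one plausible way to compute a false-approval rate from paired verdicts. The definition used here (among items the deterministic scorer fails, the fraction the judge approves) is an assumption; the study's exact definition is not stated in this README. The function name is hypothetical, and the toy verdicts in the usage lines are made up for illustration, not study data.

```python
def false_approval_rate(deterministic_pass: list[bool], judge_pass: list[bool]) -> float:
    """Fraction of deterministically failed items that the judge approved.

    Assumed definition; see the hedging note in the surrounding text.
    """
    if len(deterministic_pass) != len(judge_pass):
        raise ValueError("verdict lists must have the same length")
    # Keep only the judge's verdicts on items the deterministic check rejected.
    judge_on_failures = [j for d, j in zip(deterministic_pass, judge_pass) if not d]
    if not judge_on_failures:
        raise ValueError("no deterministic failures to measure against")
    return sum(judge_on_failures) / len(judge_on_failures)


# Toy verdicts (illustrative only): five deterministic failures, two approved by the judge.
toy_deterministic = [True, False, False, False, False, False]
toy_judge = [True, True, True, False, False, False]
toy_rate = false_approval_rate(toy_deterministic, toy_judge)
```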

This extension is a small experiment in surfacing that choice at task-definition time inside Inspect AI specifically. There's an upstream issue proposing the taxonomy + lint as in-core features at UKGovernmentBEIS/inspect_ai. If that lands, this package will be deprecated in favor of in-core support.

Compatibility

Tested against inspect-ai==0.3.212. Should work with any 0.3.200+ version. Requires Python 3.10+.
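
If you want to guard against unsupported environments in your own tooling, a check along these lines may help. It is a sketch, not part of the package: the function names are hypothetical, the minimums are taken from the sentence above, and the parser handles plain numeric versions such as 0.3.212 only (a suffix such as a dev tag would raise ValueError).

```python
import sys

MIN_INSPECT = (0, 3, 200)  # "any 0.3.200+ version"
MIN_PYTHON = (3, 10)  # "Requires Python 3.10+"


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain dotted version like "0.3.212" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def inspect_version_supported(version: str) -> bool:
    """True when the given inspect-ai version meets the stated minimum."""
    return parse_version(version) >= MIN_INSPECT


def python_supported(version_info=sys.version_info) -> bool:
    """True when the interpreter's major.minor meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON
```

The installed version string can be obtained with importlib.metadata.version("inspect-ai") from the standard library and passed to inspect_version_supported.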

Development

git clone https://github.com/smledbetter/inspect-build-time-contract
cd inspect-build-time-contract
uv venv --python 3.11 .venv
uv pip install --python .venv/bin/python -e ".[dev]"
.venv/bin/python -m pytest tests/

License

MIT — see LICENSE.
