Public-slice harness for the CONJURE transformative-creativity benchmark.

conjure-eval

Public-slice harness for the CONJURE transformative-creativity benchmark. It ships the 393-instance public split, roughly 70 percent of the 560-instance Phase 4.8 frozen corpus, so frontier-model developers can self-evaluate locally before submitting to the hidden split. The frozen corpus consists of 510 closed-problem instances across 17 Lakatos families plus the 50-instance C4-OPEN axis of formalised open mathematical conjectures. The SHA-256 given for the corpus is a8c9842ea4d59072802689603b1e38c679fd1695194aa1cf73f81c076903daf6.

This package contains:

  • The frozen public-slice corpus JSON (conjure_eval.data.public_corpus).
  • A CLI for inspecting the corpus, driving a model pass, and checking submission files before they are sent to the hidden-split adjudicator.
  • The deterministic split provenance, so any third party can re-derive the public/hidden split byte-for-byte from the source corpus.

What this package is and isn't

conjure-eval is a self-service developer convenience: it lets a model team inspect the public contracts, run their model against the public slice, and smoke-test their submission format before sending results to the benchmark author. It does not ship the hidden split, and it does not run the kernel-verified tight-mode adjudicator that produces the headline accept rate. Those live in the private blanc repository and are operated by the benchmark author against frozen model snapshots; the headline number reported in the brief is the hidden-split rate.

Install

pip install conjure-eval

Usage

# List all 393 public-slice instance IDs
conjure-eval list-public

# Inspect a single instance
conjure-eval show C1-bv-001

# Drive a model pass (OpenAI-compatible endpoint)
conjure-eval run \
    --base-url https://your-endpoint/v1 \
    --api-key-env MY_API_KEY \
    --model your-model-name \
    --out submissions.jsonl

# Check submission file well-formedness before sending
conjure-eval verify-submission submissions.jsonl

# Print corpus provenance fields
conjure-eval provenance
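
Before running conjure-eval verify-submission, a quick local check can catch files that are not valid JSON Lines at all. The sketch below assumes only that submissions.jsonl holds one JSON object per line. The actual submission schema is not described here, so no field names are checked, and jsonl_problems is a hypothetical helper name, not part of conjure-eval.

```python
import json
from pathlib import Path


def jsonl_problems(path):
    """Return (line_number, message) pairs for lines that are not JSON objects.

    Illustrative helper, not part of conjure-eval: it checks JSON Lines
    syntax only and knows nothing about the real submission schema.
    """
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                # Assumption: blank lines are treated as a problem here.
                problems.append((number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
    return problems
```

An empty list from jsonl_problems("submissions.jsonl") means every line parsed as a JSON object. It says nothing about whether the file passes verify-submission, which remains the authoritative check.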

Provenance

The public corpus is a deterministic 70/30 axis-stratified slice of the 560-instance Phase 4.8 frozen corpus maintained in the private blanc repository (510 closed-problem instances + 50 C4-OPEN instances). Seed: 4317. Anyone with the source corpus can reproduce both slices via scripts/build_conjure_split.py. Every C4-OPEN instance carries a snapshot-pinned open-status certificate generated by scripts/certify_open_status.py; certificate JSONs live under instances/open_status_certificates/.
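
To illustrate what a deterministic, axis-stratified 70/30 split with a fixed seed can look like, here is a minimal sketch. It is not scripts/build_conjure_split.py, whose logic is not reproduced here. The function stratified_split, the axis_of callback, the toy instance IDs, and the rounding rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(instance_ids, axis_of, public_fraction=0.7, seed=4317):
    """Split IDs into (public, hidden) lists, stratified by axis.

    Hypothetical sketch: members are sorted before shuffling and each axis
    gets its own seeded RNG, so the result does not depend on input order.
    """
    by_axis = defaultdict(list)
    for instance_id in instance_ids:
        by_axis[axis_of(instance_id)].append(instance_id)

    public, hidden = [], []
    for axis in sorted(by_axis):
        members = sorted(by_axis[axis])
        # String seeds are hashed deterministically by random.Random.
        random.Random(f"{seed}:{axis}").shuffle(members)
        # round() is one possible rounding rule; the real script may differ.
        cut = round(len(members) * public_fraction)
        public.extend(members[:cut])
        hidden.extend(members[cut:])
    return sorted(public), sorted(hidden)


# Toy data: two made-up axes, "A" with 10 instances and "B" with 3.
ids = [f"A-{i:03d}" for i in range(10)] + [f"B-{i:03d}" for i in range(3)]
public, hidden = stratified_split(ids, axis_of=lambda i: i.split("-")[0])
print(len(public), len(hidden))  # 9 4
```

Because the cut is rounded within each axis, the public share in a sketch like this need not equal exactly 70 percent of the total: the toy run puts 9 of 13 instances in the public slice.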

Download files

Source Distribution

conjure_eval-0.3.1.tar.gz (58.2 kB)

Uploaded Source

Built Distribution

conjure_eval-0.3.1-py3-none-any.whl (58.3 kB)

Uploaded Python 3

File details

Details for the file conjure_eval-0.3.1.tar.gz.

File metadata

  • Download URL: conjure_eval-0.3.1.tar.gz
  • Upload date:
  • Size: 58.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for conjure_eval-0.3.1.tar.gz
Algorithm Hash digest
SHA256 6bf9cdbbb6550067abb777c340d371f4f73128310513670835ef8203e43b1090
MD5 608e3fc8d1fd0081c51b67a162eaf6aa
BLAKE2b-256 7db8b3623f93bf235ed1c08dd7cbcf42562fbc8fdc714517d719cdf6a8263f0b

File details

Details for the file conjure_eval-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: conjure_eval-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 58.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for conjure_eval-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2eb895b3e4023516db083367fb2cb996ae4d6d29ecbfee31e4baab657faa4e76
MD5 79e369f99d1f6f390ebc1c0fff1bf1d1
BLAKE2b-256 1792137fed2f772d5e4ab4e5548cbbb6bc2b00fd846ec5e7f8e46aac8e8ba8ac
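
The SHA256 digests listed for the two distribution files can be checked locally after download. The following is a minimal sketch using only the Python standard library. The helper names sha256_of and matches are hypothetical and not part of conjure-eval, and the sketch assumes the file has already been downloaded to the given path.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("conjure_eval-0.3.1.tar.gz", "6bf9cdbbb6550067abb777c340d371f4f73128310513670835ef8203e43b1090") should return True for an intact copy of the source distribution.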
