Skip to main content

Python library for running the CLEVER Benchmark

Project description

CLEVER: Curated Lean Verified Code Generation Benchmark

Overview

CLEVER is a benchmark suite for end-to-end code generation and formal verification in Lean 4, adapted from the HumanEval dataset. The goal is to move beyond test-case-driven evaluation by requiring models to generate not only implementations but also formal specifications and proofs — all verifiable by Lean’s type checker.

CLEVER Framework

Benchmark Focus

CLEVER evaluates models across a staged verification pipeline:

  • Task 1: Generate a formal specification from a natural language description.
  • Task 2: Prove semantic equivalence to a hidden ground-truth specification.
  • Task 3: Synthesize a Lean implementation that satisfies the specification.
  • Task 4: Prove the implementation's correctness against the reference specification.

Each step is independently verified, allowing fine-grained diagnosis of model performance.

Key Features

  • Non-computable specifications that prevent implementation leakage and enforce semantic reasoning.
  • Staged certification design to isolate failures across spec generation, implementation, and proof.
  • Supports symbolic proof search with agents like COPRA, enabling deeper proof automation analysis.

🚀 Submitting Your Solutions to CLEVER

To evaluate your LLM-generated solutions against the CLEVER benchmark, use the Python API to package and submit them as LeanProblemView objects. Each submission is compiled and verified using Lean 4, and results are returned as structured ValidationResult objects.

🔧 Steps

  1. Installing the python package:

    First install the package by cloning the repository and then run python -m build (this will build the python package). Install the package thus built (should be in dist/). You can also use pip install -e ., if you don't want to use clever as an external package. PyPi package coming soon! After installing the package, run the command clever-bench-install (this will work on Linux, no prerequisite of installing Lean 4).

    If you are not on Linux, then you will have install the Lean 4 dependency by yourself. After Lean 4 installation, you can run cd <path-to-clever-package-installation/lean4-dir> && lake exe cache get && lake build (or equivalent command) to build the Lean 4 environment (This is a one time step only).

  2. Load the benchmark:

    from clever_bench.benchmark import Benchmark
    benchmark = Benchmark(is_sample=True)  # or is_sample=False for actual HumanEval problems in `src/lean4/human_eval`
    benchmark.load_all()
    
  3. Select a task (e.g., proof generation):

    from clever_bench.task import ProblemViewTask, TaskComponent
    task = ProblemViewTask(benchmark, TaskComponent.PROOF_GENERATION)
    
  4. Get a problem and fill in your solution:

    problem = task.get_view(3) # Abstraction to hide the staged problem details
    problem.implementation = "<your Lean implementation>"
    problem.correctness_proof = "<your proof>"
    
  5. Submit the solution:

    import asyncio
    result = asyncio.run(task.submit_async(problem, timeout_in_ms=30000))
    print(result.correctness_ok, result.error_message)
    

✅ Result

The returned ValidationResult will tell you whether your implementation compiled, and whether the proofs were accepted by Lean (i.e., no sorry).

CLEVER also supports multi-stage verification: the Python API automatically hides irrelevant fields during each task (e.g., only showing the natural language description field for spec generation), enabling clean task-specific prompting and evaluation.

This process allows you to validate your solutions programmatically—whether you're using LLMs, proof agents, or writing Lean by hand.

You can try our baselines here: Baseline Provers

Build Instructions

  • Install Lean 4: https://leanprover.github.io/lean4/doc/quickstart.html
  • Run lake build clever to build the project after changing the directory to src/lean4. OR use VS Code with Lean 4 extension to build the project.
  • All the formalizations are in the src/lean4/human_eval directory.
  • Some extra formalizations are provided in src/lean4/sample_examples/ directory, some of which were used in prompts created for evaluating the benchmark.

Paper

You can find the paper describing CLEVER at https://arxiv.org/abs/2505.13938.

@misc{thakur2025clever,
      title={CLEVER: A Curated Benchmark for Formally Verified Code Generation}, 
      author={Amitayush Thakur and Jasper Lee and George Tsoukalas and Meghana Sistla and Matthew Zhao and Stefan Zetzsche and Greg Durrett and Yisong Yue and Swarat Chaudhuri},
      year={2025},
      eprint={2505.13938},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.13938}, 
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

clever_bench-1.0.0.tar.gz (560.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

clever_bench-1.0.0-py3-none-any.whl (197.7 kB view details)

Uploaded Python 3

File details

Details for the file clever_bench-1.0.0.tar.gz.

File metadata

  • Download URL: clever_bench-1.0.0.tar.gz
  • Upload date:
  • Size: 560.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for clever_bench-1.0.0.tar.gz
Algorithm Hash digest
SHA256 cec6a62890f9ca5438ba171580640711465271ed66c432d306bf77880fd57956
MD5 5e4a0ab8046c75c051703c226cc79cda
BLAKE2b-256 c1be6876dd6f95205e80d670f26a90865d5b453b1750a1321555cdf95dc545bc

See more details on using hashes here.

File details

Details for the file clever_bench-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: clever_bench-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 197.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for clever_bench-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 351fa3e894e3757dd9575b65df27099a860db04d59236308cece6d5c0cec3a62
MD5 90c04b6e062126e954d9afc82cebfe46
BLAKE2b-256 0258307b8329cd84ecdededb7061c1bda6f91a680405d158d20244c3ed17aa76

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page