Python library for running the CLEVER Benchmark
CLEVER: Curated Lean Verified Code Generation Benchmark
Overview
CLEVER is a benchmark suite for end-to-end code generation and formal verification in Lean 4, adapted from the HumanEval dataset. The goal is to move beyond test-case-driven evaluation by requiring models to generate not only implementations but also formal specifications and proofs — all verifiable by Lean’s type checker.
Benchmark Focus
CLEVER evaluates models across a staged verification pipeline:
- Task 1: Generate a formal specification from a natural language description.
- Task 2: Prove semantic equivalence to a hidden ground-truth specification.
- Task 3: Synthesize a Lean implementation that satisfies the specification.
- Task 4: Prove the implementation's correctness against the reference specification.
Each step is independently verified, allowing fine-grained diagnosis of model performance.
Key Features
- Non-computable specifications that prevent implementation leakage and enforce semantic reasoning.
- Staged certification design to isolate failures across spec generation, implementation, and proof.
- Supports symbolic proof search with agents like COPRA, enabling deeper proof automation analysis.
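To make the four stages concrete, here is an illustrative Lean 4 sketch for a toy "add two integers" problem. This is not taken from the benchmark; the names, signatures, and the trivial proofs are all hypothetical, and actual CLEVER problems use their own structure:

```lean
-- Hypothetical illustration of the staged artifacts; actual CLEVER
-- problems use their own naming and structure.

-- Task 1: a specification generated from the natural-language description
def add_spec (a b result : Int) : Prop := result = a + b

-- Task 2: prove the generated spec equivalent to the (hidden) ground truth
theorem spec_equivalence (a b r : Int) :
    add_spec a b r ↔ (r = a + b) := Iff.rfl

-- Task 3: an implementation intended to satisfy the specification
def add_impl (a b : Int) : Int := a + b

-- Task 4: prove the implementation correct against the specification
theorem add_correct (a b : Int) : add_spec a b (add_impl a b) := rfl
```

Because each stage produces a separate Lean artifact, the type checker can accept or reject them independently, which is what enables the fine-grained diagnosis described above.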
🚀 Submitting Your Solutions to CLEVER
To evaluate your LLM-generated solutions against the CLEVER benchmark, use the Python API to package and submit them as LeanProblemView objects. Each submission is compiled and verified using Lean 4, and results are returned as structured ValidationResult objects.
🔧 Steps
- Install the Python package: you can simply use `pip install clever-bench` to install the package from PyPI. Alternatively, you can install from source by cloning the repository and then using `pip install -e .`. After installing the package, run the command `clever-bench-install` (this will work on Linux, with no prerequisite of installing Lean 4). If you are not on Linux, you will have to install the Lean 4 dependency yourself. After the Lean 4 installation, run `cd <path-to-clever-package-installation/lean4-dir> && lake exe cache get && lake build` (or an equivalent command) to build the Lean 4 environment. This is a one-time step.
- Load the benchmark:

```python
from clever_bench.benchmark import Benchmark

benchmark = Benchmark(is_sample=True)  # or is_sample=False for actual HumanEval problems in `src/lean4/human_eval`
benchmark.load_all()
```
- Select a task (e.g., proof generation):

```python
from clever_bench.task import ProblemViewTask, TaskComponent

task = ProblemViewTask(benchmark, TaskComponent.PROOF_GENERATION)
```
- Get a problem and fill in your solution:

```python
# Abstraction that hides the staged problem details and shows only the
# fields relevant to the selected task, for the problem with id 3
problem = task.get_view(3)
problem.implementation = "<your Lean implementation>"
problem.correctness_proof = "<your proof>"
```
- Submit the solution:

```python
import asyncio

result = asyncio.run(task.submit_async(problem, timeout_in_ms=30000))
print(result.correctness_ok, result.error_message)
```
✅ Result
The returned ValidationResult will tell you whether your implementation compiled and whether the proofs were accepted by Lean (i.e., contain no `sorry`).
CLEVER also supports multi-stage verification: the Python API automatically hides irrelevant fields during each task (e.g., only showing the natural language description field for spec generation), enabling clean task-specific prompting and evaluation.
This process lets you validate your solutions programmatically, whether you're using LLMs, proof agents, or writing Lean by hand.
You can try our baselines here: Baseline Provers
🏆 Leaderboard
View the latest results and submit your own: CLEVER Leaderboard
We welcome submissions from the community! To add your results to the leaderboard:
- Evaluate your approach using the Python API described above
- Document your methodology with a preprint or publication
- Submit your results by contacting us with your evaluation metrics and methodology details
For submissions, please contact: amitayush@utexas.edu
Build Instructions
- Install Lean 4: https://leanprover.github.io/lean4/doc/quickstart.html
- Run `lake build clever` to build the project after changing the directory to `src/lean4`, or use VS Code with the Lean 4 extension to build the project.
- All the formalizations are in the `src/lean4/human_eval` directory.
- Some extra formalizations are provided in the `src/lean4/sample_examples/` directory, some of which were used in prompts created for evaluating the benchmark.
Paper
You can find the paper describing CLEVER at https://arxiv.org/abs/2505.13938.
@inproceedings{thakur2025clever,
title={CLEVER: A Curated Benchmark for Formally Verified Code Generation},
author={Amitayush Thakur and Jasper Lee and George Tsoukalas and Meghana Sistla and Matthew Zhao and Stefan Zetzsche and Greg Durrett and Yisong Yue and Swarat Chaudhuri},
booktitle={39th Conference on Neural Information Processing Systems (NeurIPS 2025)},
year={2025}
}