A comprehensive library for AI-assisted theorem proving in Lean
Project description
LeanDojo-v2
LeanDojo-v2 is an end-to-end framework for training, evaluating, and deploying AI-assisted theorem provers for Lean 4. It combines repository tracing, lifelong dataset management, retrieval-augmented agents, Hugging Face fine-tuning, and external inference APIs into one toolkit.
Table of Contents
- Overview
- Key Features
- Repository Layout
- Requirements
- Installation
- Environment Setup
- Quick Start
- Working with Agents and Trainers
- Tracing and Dataset Generation
- LeanProgress Step-Prediction
- Proving Theorems
- Testing
- Troubleshooting & Tips
- Contributing
- License
Overview
LeanDojo-v2 extends the original LeanDojo stack with the LeanAgent lifelong learning pipeline. It automates the entire loop of:
- Cloning Lean repositories (GitHub or local) and tracing them with Lean instrumentation.
- Storing structured theorem information in a dynamic database.
- Training agent policies with supervised fine-tuning (SFT), GRPO-style RL, or retrieval objectives.
- Driving Pantograph-based provers to fill in sorrys or verify solutions.
- Using HuggingFace API for large model inference.
The codebase is modular: you can reuse the tracing pipeline without the agents, swap in custom trainers, or stand up your own inference service via the external API layer.
Key Features
- Unified Agent Abstractions:
BaseAgentorchestrates repository setup, training, and proving. Concrete implementations (HFAgent,LeanAgent, andExternalAgent) tailor the workflow to Hugging Face models, retrieval-based provers, or REST-backed models. - Powerful Trainers:
SFTTrainer,GRPOTrainer, andRetrievalTrainercover LoRA-enabled supervised fine-tuning, group-relative policy optimization, and retriever-only curriculum learning. - Multi-Modal Provers:
HFProver,RetrievalProver, andExternalProverrun on top of Pantograph’s Lean RPC server to search for tactics, generate whole proofs, or delegate to custom models. - Lean Tracing Pipeline:
lean_dojoincludes the Lean 4 instrumentation (ExtractData.lean) and Python utilities to trace commits, normalize ASTs, and cache proof states. - Dynamic Repository Database:
databasetracks repositories, theorems, curriculum difficulty, and sorry status, enabling lifelong training schedules. - External API: The
external_apifolder exposes HTTP endpoints (FastAPI + uvicorn) and Lean frontend snippets so you can query LLMs from Lean editors.
Repository Layout
| Path | Description |
|---|---|
lean_dojo_v2/agent/ |
Base class plus HFAgent, LeanAgent, and helpers to manage repositories and provers. |
lean_dojo_v2/trainer/ |
SFT, GRPO, and retrieval trainers with Hugging Face + DeepSpeed integration. |
lean_dojo_v2/prover/ |
Pantograph-based prover implementations (HF, retrieval, external). |
lean_dojo_v2/lean_dojo/ |
Lean tracing, dataset generation, caching, and AST utilities. |
lean_dojo_v2/lean_agent/ |
Lifelong learning pipeline (configs, database, retrieval stack, generator). |
lean_dojo_v2/external_api/ |
LeanCopilot code (Lean + Python server) to query external models. |
lean_dojo_v2/utils/ |
Shared helpers for Git, filesystem operations, and constants. |
lean_dojo_v2/tests/ |
Pytest regression suite. |
For deeper documentation on the lifelong learning component, see lean_dojo_v2/lean_agent/README.md.
Requirements
- Python ≥ 3.11.
- CUDA-capable GPU for training and inference (tested with CUDA 12.6).
- Git ≥ 2.25 and
wget. - elan Lean toolchain to trace repositories locally.
- Adequate disk space for the
raid/working directory (datasets, checkpoints, traces).
Python dependencies are declared in pyproject.toml and include PyTorch, PyTorch Lightning, Transformers, DeepSpeed, TRL, PEFT, and more.
Installation
Option 1: From PyPI
# Install the core package
pip install lean-dojo-v2
# Pantograph is required for Lean RPC
pip install git+https://github.com/stanford-centaur/PyPantograph
# Install a CUDA-enabled torch build (adjust the index URL for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
Option 2: From Source (development)
git clone https://github.com/lean-dojo/LeanDojo-v2.git
cd LeanDojo-v2
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ".[dev]"
pip install git+https://github.com/stanford-centaur/PyPantograph
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
Tip: You can use uv (
uv pip install lean-dojo-v2) as an alternative Python package manager.
Environment Setup
-
GitHub Access Token (required)
The tracing pipeline calls the GitHub API extensively. Create a personal access token and export it before running any agent:export GITHUB_ACCESS_TOKEN=<token>
-
Hugging Face Token (optional but needed for gated models)
export HF_TOKEN=<hf-token>
-
Working directories
By default all datasets, caches, and checkpoints live under<repo>/raid. Change the layout by editinglean_dojo_v2/utils/constants.pyor by pointingRAID_DIRto faster storage. -
Lean toolchains
Ensureelanis configured and Lean 4 (e.g.,leanprover/lean4:nightly) is available on your$PATH. The tracing scripts look under~/.elan/toolchains/.
Quick Start
from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer
url = "https://github.com/durant42040/lean4-example"
commit = "005de00d03f1aaa32cb2923d5e3cbaf0b954a192"
trainer = SFTTrainer(
model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
output_dir="outputs-deepseek",
epochs_per_repo=1,
batch_size=2,
lr=2e-5,
)
agent = HFAgent(trainer=trainer)
agent.setup_github_repository(url=url, commit=commit)
agent.train()
agent.prove()
This example:
- Downloads and traces the target Lean repository + commit.
- Builds a supervised dataset from sorry theorems.
- Fine-tunes the specified Hugging Face model (optionally with LoRA).
- Launches an
HFProverbacked by Pantograph to search for proofs.
Tracing and Dataset Generation
The lean_dojo_v2/lean_dojo/data_extraction package powers repository tracing:
lean.pyclones repositories (GitHub, remote, or local), validates Lean versions, and normalizes URLs.trace.pydrives Lean with the customExtractData.leaninstrumented module to capture theorem states.dataset.pyconverts traced files to JSONL datasets ready for trainers.cache.pymemoizes repository metadata to avoid redundant downloads.traced_data.pyexposes typed wrappers for traced AST nodes and sorrys.
Typical usage:
from lean_dojo_v2.database import DynamicDatabase
url = "https://github.com/durant42040/lean4-example"
commit = "005de00d03f1aaa32cb2923d5e3cbaf0b954a192"
database = DynamicDatabase()
database.trace_repository(
url=url,
commit=commit,
build_deps=False,
)
The generated artifacts flow into the DynamicDatabase, which keeps repositories sorted by difficulty and appends new sorrys without retracing everything.
Working with Agents and Trainers
Agents
Agents orchestrate the full workflow of repository setup, training, and theorem proving. Each agent pairs a trainer with a compatible prover.
HFAgent
Uses Hugging Face models fine-tuned with SFTTrainer or GRPOTrainer for theorem proving. Loads checkpoints locally and uses HFProver for proof search. Ideal for training custom models on your traced repositories. Does not build Lean dependencies by default.
from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer
trainer = SFTTrainer(model_name="deepseek-ai/DeepSeek-Prover-V2-7B", ...)
agent = HFAgent(trainer=trainer)
agent.setup_github_repository(url, commit)
agent.train()
agent.prove()
ExternalAgent
Uses the Hugging Face Inference API to access large models like DeepSeek-Prover-V2-671B without local model loading. Pairs with ExternalProver for whole-proof generation or proof search. Best for quick experiments or when you don't have GPU resources for local inference.
from lean_dojo_v2.agent.external_agent import ExternalAgent
agent = ExternalAgent()
agent.setup_github_repository(url, commit)
agent.prove()
LeanAgent
Implements the lifelong learning pipeline with retrieval-augmented generation. Uses RetrievalTrainer to train premise retrievers, then pairs with RetrievalProver for retrieval-augmented tactic generation. Maintains repository curricula and builds Lean dependencies by default.
from lean_dojo_v2.agent.lean_agent import LeanAgent
agent = LeanAgent()
agent.setup_github_repository(url, commit)
agent.train()
agent.prove()
Trainers
Supervised Fine-Tuning (SFTTrainer)
- Accepts any Hugging Face causal LM identifier.
- Supports LoRA by passing a
peft.LoraConfig. - Key arguments:
epochs_per_repo,batch_size,max_seq_len,lr,warmup_steps,gradient_checkpointing. - Produces checkpoints under
output_dirthat theHFProverconsumes.
GRPO Trainer (GRPOTrainer)
- Implements Group Relative Policy Optimization for reinforcement-style refinement.
- Accepts
reference_model,reward_weights, andkl_betasettings. - Useful for improving search policies on curated theorem batches.
Retrieval Trainer (RetrievalTrainer)
- Trains the dense retriever that scores prior proofs from the corpus.
- Used by
LeanAgentto build retrieval-augmented generation models. - Requires indexed corpus and generator checkpoints.
Each agent inherits BaseAgent, so you can implement your own by overriding _get_build_deps() and _setup_prover() to register new trainer/prover pairs.
LeanProgress Step-Prediction
-
Generate a JSONL dataset with remaining-step targets (or replace it with your own LeanProgress export):
python -m lean_dojo_v2.lean_progress.create_sample_dataset --output raid/data/sample_leanprogress_dataset.jsonl
-
Fine-tune a regression head that predicts
steps_remaining:from pathlib import Path from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer sample_dataset_path = Path("raid/data/sample_leanprogress_dataset.jsonl") trainer = ProgressTrainer( model_name="bert-base-uncased", data_path=str(sample_dataset_path), output_dir="outputs-progress", ) trainer.train()
Proving Theorems
LeanDojo-v2 provides three prover implementations, each for different use cases:
HFProver
Loads a fine-tuned Hugging Face model from a local checkpoint (supports full models and LoRA adapters) and generates tactics directly, used for locally trained Hugging Face model (e.g. with SFTTrainer and GRPOTrainer).
ExternalProver
Performs inference with the Hugging Face Inference API to access large models without local GPU resources. Defaults to DeepSeek-Prover-V2-671B. Supports both proof search and whole-proof generation.
RetrievalProver
Used directly with LeanAgent.
Proof Methods
LeanDojo-v2 supports two methods for theorem proving:
-
Whole-proof generation: generate complete proof in one forward pass of the prover.
from lean_dojo_v2.prover import ExternalProver theorem = "theorem my_and_comm : ∀ {p q : Prop}, And p q → And q p := by" prover = ExternalProver() proof = prover.generate_whole_proof(theorem)
-
Proof search: generate tactics sequentially and update the goal state through interaction with Pantograph until the proof is complete.
from pantograph.server import Server from lean_dojo_v2.prover import HFProver server = Server() prover = HFProver(ckpt_path="outputs-deepseek") result, used_tactics = prover.search( server=server, goal="∀ {p q : Prop}, p ∧ q → q ∧ p", verbose=False )
Testing
We use pytest for regression coverage.
pip install -e .[dev] # make sure dev extras like pytest/trl are present
export GITHUB_ACCESS_TOKEN=<token>
export HF_TOKEN=<hf-token> # only required for tests touching HF APIs
pytest -v
Troubleshooting & Tips
- 401 Bad Credentials / rate limits: Ensure
GITHUB_ACCESS_TOKENis exported and hasrepo+read:orgscopes. - Lean tracing failures: Confirm that the repo’s Lean version exists locally (
elan toolchain install <version>). - Missing CUDA libraries: Install the PyTorch wheel that matches your driver and CUDA version.
- Dataset location: The default
raid/directory can grow large. Point it to high-throughput storage or use symlinks. - Pantograph errors: Reinstall Pantograph from source (
pip install git+https://github.com/stanford-centaur/PyPantograph) whenever Lean upstream changes.
Contributing
Issues and pull requests are welcome! Please:
- Open an issue describing the bug or feature.
- Run formatters (
black,isort) andpytestbefore submitting. - Mention if your change touches Lean tracing files so reviewers can re-generate artifacts.
License
LeanDojo-v2 is released under the MIT License. See LICENSE for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file lean_dojo_v2-1.0.5.tar.gz.
File metadata
- Download URL: lean_dojo_v2-1.0.5.tar.gz
- Upload date:
- Size: 131.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9b1a820f02aee4a440ab54c415b0fa152b9053758753fd0fa8e1f21a46e9408d
|
|
| MD5 |
d2e351d184e6eaf58573e7a05f18c208
|
|
| BLAKE2b-256 |
745cf5cfed32ec3116e83f8cad494d61a18079ec45be105e0c4cb8560e095f86
|
File details
Details for the file lean_dojo_v2-1.0.5-py3-none-any.whl.
File metadata
- Download URL: lean_dojo_v2-1.0.5-py3-none-any.whl
- Upload date:
- Size: 156.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8b767b15f5a4fb95c1bb47f6312581c6988a9fb417aafbeb265f148a3063cfa0
|
|
| MD5 |
166e3dc749db43a3d1c87cced0baeaa4
|
|
| BLAKE2b-256 |
fb94cbdbbdfdd39070d08e1a6f23af2caa89df2d81250df503990ab5bfb5324a
|