
Formal verification as agentic training signal — CLI + self-hosted runner


Athanor

athanor-ai

Lean 4 proof verification as agentic training signal.
Turn formal proofs into reward functions. Score agent output with compilers, not judges.

athanor-ai.com


Your agent writes code. Then it writes a proof that the code is correct. The Lean 4 compiler checks the proof. The result is a training signal with no ambiguity.

import athanor

# Verify a Lean 4 proof
result = athanor.verify_proof("""
theorem add_comm (a b : Nat) : a + b = b + a := by
  omega
""")

print(result.compiles)    # True
print(result.has_sorry)   # False
print(result.score)       # 1.0

Install

pip install athanor-ai

What this solves

You have domain expertise. You know what correct code looks like. You want an AI agent to produce verified solutions, not guesses.

The problem: LLM judges are noisy. Unit tests are brittle. Benchmarks don't produce training signal.

The solution: Lean 4 formal proofs are deterministic and machine-checked, and they yield a graded reward signal (full proof = 1.0, partial = 0.35, broken = 0.25).

Verify proofs

Check if a Lean 4 proof compiles. Detect sorry placeholders. Catch banned constructs (axiom, import Mathlib, unsafe).

from athanor import verify_proof, check_sorry, score_proof

# Any Lean 4 source string works here
proof_code = "theorem one_add_one : 1 + 1 = 2 := by rfl"

# Full verification with detailed result
result = verify_proof(proof_code)
result.compiles      # did it compile?
result.has_sorry     # any incomplete proof markers?
result.sorry_count   # how many sorry placeholders?
result.score         # 0.0 - 1.0
result.status        # "full_proof" | "partial_proof" | "compile_error" | "banned"
result.errors        # compiler error messages

# Quick score (just the float)
score = score_proof(proof_code)  # 1.0, 0.35, 0.25, or 0.0

# Check for sorry without full compilation
has_sorry, count = check_sorry(proof_code)
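
To make the idea of a sorry check "without full compilation" concrete, here is a minimal sketch of a static scan. It is not Athanor's implementation: `strip_comments`, `scan_proof`, and `BANNED_PATTERNS` are hypothetical names, and the patterns only approximate the banned constructs listed above.

```python
import re

# Hypothetical patterns approximating the banned constructs named in this README.
BANNED_PATTERNS = {
    "axiom": re.compile(r"^\s*axiom\b", re.MULTILINE),
    "import Mathlib": re.compile(r"^\s*import\s+Mathlib\b", re.MULTILINE),
    "unsafe": re.compile(r"\bunsafe\b"),
}


def strip_comments(source: str) -> str:
    """Drop Lean line comments (-- ...) and block comments (/- ... -/, which may nest)."""
    out = []
    depth = 0  # current block-comment nesting depth
    i = 0
    while i < len(source):
        two = source[i:i + 2]
        if two == "/-":
            depth += 1
            i += 2
        elif two == "-/" and depth > 0:
            depth -= 1
            i += 2
        elif depth > 0:
            i += 1  # inside a block comment: discard the character
        elif two == "--":
            # Line comment: skip to the end of the line, keeping the newline itself
            while i < len(source) and source[i] != "\n":
                i += 1
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


def scan_proof(source: str):
    """Return (sorry_count, banned_names) found outside comments."""
    code = strip_comments(source)
    sorry_count = len(re.findall(r"\bsorry\b", code))
    banned = [name for name, pattern in BANNED_PATTERNS.items() if pattern.search(code)]
    return sorry_count, banned
```

A scan like this is fast but approximate: it does not understand string literals, so the Lean compiler remains the source of truth.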

Works with local Lean 4 installation or via Docker (ghcr.io/leanprover/lean4).
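
As an illustration of the local-or-Docker choice, the sketch below builds the command line for either backend. It is an assumption-laden example, not the package's own logic: `lean_command` is a hypothetical helper, and it assumes the Docker image exposes a `lean` binary on its PATH.

```python
import shutil

LEAN_IMAGE = "ghcr.io/leanprover/lean4"  # image name taken from this README


def lean_command(proof_path, workdir, which=shutil.which):
    """Return an argv list that checks proof_path (relative to workdir), or None.

    Prefers a local `lean` binary and falls back to Docker.
    `which` is injectable so the lookup can be stubbed in tests.
    """
    if which("lean"):
        return ["lean", proof_path]
    if which("docker"):
        # Assumption: the image provides `lean` on PATH.
        return [
            "docker", "run", "--rm",
            "-v", f"{workdir}:/work",
            "-w", "/work",
            LEAN_IMAGE,
            "lean", proof_path,
        ]
    return None
```

The resulting list can be passed to `subprocess.run(cmd, cwd=workdir, capture_output=True)`; a non-zero exit code indicates that the file did not compile.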

Score agent output

Pair code with a proof. Score both. Use the result as reward.

import athanor

env = athanor.make("my-environment", task="my-task")
env.reset()

result = env.score({
    "kernel.py": agent_code,
    "proof.lean": agent_proof,
})

# Scoring layers:
# 1. Does the code work? (verifier checks)
# 2. Does the proof compile? (Lean compiler)
# 3. Is the proof complete? (no sorry)
print(result.score)        # combined score
print(result.lean_status)  # proof status

Agent retry with verifier feedback

The agent receives the scoring output and tries again. No human in the loop. The verifier feedback is the teacher.

results = env.run(
    model="anthropic/claude-sonnet-4-6",
    api_key="...",
    max_retries=3,
    target_score=0.95,
)
# Attempt 1: 0.35 (code correct, proof has sorry)
# Attempt 2: 0.72 (proof compiles, 2 sorry remaining)
# Attempt 3: 0.98 (full proof, verified)
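
The retry behaviour can be pictured as a plain loop. The sketch below is an assumption about the control flow, not the body of `env.run`: `retry_until_target`, `generate`, and `score` are hypothetical stand-ins for the model call and the verifier, and `max_retries` is treated as the total number of attempts, matching the three attempts shown above.

```python
from typing import Callable, List, Optional, Tuple


def retry_until_target(
    generate: Callable[[Optional[str]], str],
    score: Callable[[str], Tuple[float, str]],
    max_retries: int = 3,
    target_score: float = 0.95,
) -> List[float]:
    """Run up to max_retries attempts, feeding each attempt's feedback into the next."""
    history: List[float] = []
    feedback: Optional[str] = None  # the first attempt has no feedback yet
    for _ in range(max_retries):
        attempt = generate(feedback)
        value, feedback = score(attempt)
        history.append(value)
        if value >= target_score:
            break  # good enough: stop early
    return history
```

Here `generate` would wrap the model call and `score` would wrap the verifier, returning the score together with the compiler messages to show the agent on the next attempt.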

RL training

Use proof scores as reward signal in any RL framework.

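import athanor
from trl import GRPOTrainer

env = athanor.make("my-environment")

# GRPOTrainer accepts plain Python reward functions via reward_funcs
def proof_reward(completions, **kwargs):
    return env.reward_fn(completions)

trainer = GRPOTrainer(
    reward_funcs=proof_reward,
    # ... remaining trainer arguments
)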

Compatible with TRL, veRL, NeMo-RL, or any custom training loop.
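
For a custom training loop, the usual shape is a callable that takes a batch of completions and returns one float per completion. The sketch below shows one way to build such a callable from any single-proof scorer. `make_reward_fn` is a hypothetical helper, and mapping scorer failures to 0.0 is a design choice of this example, not documented Athanor behaviour.

```python
from typing import Callable, List


def make_reward_fn(score_one: Callable[[str], float]) -> Callable[..., List[float]]:
    """Wrap a single-proof scorer into a batch reward function."""

    def reward_fn(completions: List[str], **kwargs) -> List[float]:
        rewards = []
        for text in completions:
            try:
                rewards.append(float(score_one(text)))
            except Exception:
                # Assumption: a scorer crash counts as zero reward rather than aborting the batch
                rewards.append(0.0)
        return rewards

    return reward_fn
```

Accepting and ignoring `**kwargs` lets the same function work with trainers that pass extra batch fields alongside the completions.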

Proof scoring

proof_multiplier:
  1.00  full proof (compiles, no sorry)
  0.35  partial proof (compiles with sorry)
  0.25  broken proof (does not compile)
  0.15  no proof submitted
  0.00  banned construct (axiom, Mathlib, unsafe)

Partial credit gives the agent a gradient to climb: an agent that proves 4 of 7 theorems scores higher than one that proves 0. This is the training signal.
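
The multiplier table can be written as a lookup, which also makes the 4-of-7 comparison easy to check. This is a sketch under stated assumptions: the status strings come from `result.status` above, `"no_proof"` is a hypothetical name for the 0.15 row, and averaging multipliers across theorems is an assumption about how per-theorem results combine.

```python
# Multipliers from the table above, keyed by status.
PROOF_MULTIPLIER = {
    "full_proof": 1.00,
    "partial_proof": 0.35,
    "compile_error": 0.25,
    "no_proof": 0.15,  # hypothetical status name for "no proof submitted"
    "banned": 0.00,
}


def mean_multiplier(statuses):
    """Average the multiplier over per-theorem statuses (an assumed aggregation)."""
    if not statuses:
        raise ValueError("at least one status is required")
    return sum(PROOF_MULTIPLIER[s] for s in statuses) / len(statuses)


# Four proved theorems plus three broken ones, versus seven broken ones
four_of_seven = ["full_proof"] * 4 + ["compile_error"] * 3
none_of_seven = ["compile_error"] * 7
assert mean_multiplier(four_of_seven) > mean_multiplier(none_of_seven)
```

Under this assumed averaging, every additional completed theorem raises the score, which is what makes the signal usable for training.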

Getting environments

The verify_proof and score_proof functions work standalone with any Lean 4 code. For full environment scoring (code + proof + property tests), contact athanor-ai.com.

Requirements

  • Python >= 3.9
  • Lean 4 or Docker (for proof verification)

License

Apache-2.0
