Formal verification as agentic training signal — CLI + self-hosted runner
Project description
kairos
The SDK from Athanor AI. Distributed as athanor-kairos on PyPI; imports as kairos.
Lean 4 proof verification as agentic training signal.
Turn formal proofs into reward functions. Score agent output with compilers, not judges.
athanor-ai.com
Your agent writes code. Then it writes a proof that the code is correct. The Lean 4 compiler checks the proof. The result is a training signal with no ambiguity.
import kairos
# Verify a Lean 4 proof
result = kairos.verify_proof("""
theorem add_comm (a b : Nat) : a + b = b + a := by
omega
""")
print(result.compiles) # True
print(result.has_sorry) # False
print(result.score) # 1.0
Install
pip install athanor-kairos
# or, from source:
pip install git+https://github.com/athanor-ai/athanor-sdk.git@main
What this solves
You have domain expertise. You know what correct code looks like. You want an AI agent to produce verified solutions, not guesses.
The problem: LLM judges are noisy. Unit tests are brittle. Benchmarks don't produce training signal.
The solution: Lean 4 formal proofs are deterministic, machine-checked, and produce continuous reward signal (full proof = 1.0, partial = 0.35, broken = 0.25).
Verify proofs
Check if a Lean 4 proof compiles. Detect sorry placeholders. Catch banned constructs (axiom, import Mathlib, unsafe).
from kairos import verify_proof, check_sorry, score_proof
# Full verification with detailed result
result = verify_proof(proof_code)
result.compiles # did it compile?
result.has_sorry # any incomplete proof markers?
result.sorry_count # how many sorry placeholders?
result.score # 0.0 - 1.0
result.status # "full_proof" | "partial_proof" | "compile_error" | "banned"
result.errors # compiler error messages
# Quick score (just the float)
score = score_proof(proof_code) # 1.0, 0.35, 0.25, or 0.0
# Check for sorry without full compilation
has_sorry, count = check_sorry(proof_code)
Works with local Lean 4 installation or via Docker (ghcr.io/leanprover/lean4).
Score agent output
Pair code with a proof. Score both. Use the result as reward.
import kairos
env = kairos.make("my-environment", task="my-task")
env.reset()
result = env.score({
"kernel.py": agent_code,
"proof.lean": agent_proof,
})
# Scoring layers:
# 1. Does the code work? (verifier checks)
# 2. Does the proof compile? (Lean compiler)
# 3. Is the proof complete? (no sorry)
print(result.score) # combined score
print(result.lean_status) # proof status
Agent retry with verifier feedback
Agent gets the scoring output and tries again. No human in the loop. The verifier feedback is the teacher.
results = env.run(
model="anthropic/claude-sonnet-4-6",
api_key="...",
max_retries=3,
target_score=0.95,
)
# Attempt 1: 0.35 (code correct, proof has sorry)
# Attempt 2: 0.72 (proof compiles, 2 sorry remaining)
# Attempt 3: 0.98 (full proof, verified)
RL training
Use proof scores as reward signal in any RL framework.
from trl import PPOTrainer
env = kairos.make("my-environment")
trainer = PPOTrainer(
reward_fn=lambda completions: env.reward_fn(completions),
...
)
Compatible with TRL, veRL, NeMo-RL, or any custom training loop.
Proof scoring
proof_multiplier:
1.00 full proof (compiles, no sorry)
0.35 partial proof (compiles with sorry)
0.25 broken proof (does not compile)
0.15 no proof submitted
0.00 banned construct (axiom, Mathlib, unsafe)
Partial proofs produce gradient. An agent that proves 4 of 7 theorems scores higher than one that proves 0. This is the training signal.
Getting environments
The verify_proof and score_proof functions work standalone with any Lean 4 code. For full environment scoring (code + proof + property tests), contact athanor-ai.com.
CLI
# Container sanity checks — 7 gates including shim re-export + reward-leak probe
kairos preflight --env <env_slug>
# Batch evaluate all tasks in an environment (or filter/patch)
kairos evaluate --env hw-cbmc --model anthropic/claude-sonnet-4-6
kairos evaluate --env hw-cbmc --tasks fix_arb_lock,fix_tlb_ctrl # subset
kairos evaluate --env hw-cbmc --patch runs/prev_run1.json # resume
# Score, solve, lint, estimate
kairos score --env <env> --task <slug> --file solution.py
kairos solve --env <env> --task <slug> --model claude-opus-4-6
kairos lint Proof.lean # positional
kairos estimate --model claude-sonnet-4-6 --tasks 26
# Run/calibrate/stats
kairos runs --env <env>
kairos run-summary <run_id> # pass-rate + cost + elapsed for one session
kairos calibrate --run runs/<model>_run1.json
kairos stats runs/*.json
kairos compare runs/a.json runs/b.json
kairos eval-status [env_dir]
Results save incrementally after each task (crash-resilient). --patch merges into an existing run file so you can resume interrupted sweeps.
Requirements
- Python >= 3.9
- Lean 4 or Docker (for proof verification)
License
Apache-2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file athanor_sdk-0.4.1.tar.gz.
File metadata
- Download URL: athanor_sdk-0.4.1.tar.gz
- Upload date:
- Size: 894.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
aeb4fba3e95363677ecc22ea080b4fc7c3e2299f48f5ba26daddccce64ab4ff8
|
|
| MD5 |
9f2f25e5b812f697702d12ebe7b82bbb
|
|
| BLAKE2b-256 |
80a879be1f32fe1acf24b9ad3fb47cb9bbd3bc6144820148f2985aeef18ba50f
|
File details
Details for the file athanor_sdk-0.4.1-py3-none-any.whl.
File metadata
- Download URL: athanor_sdk-0.4.1-py3-none-any.whl
- Upload date:
- Size: 767.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2a0001cefe2b7a0e5a41130eaa539765da7680d86e57340a791186c0fb218b86
|
|
| MD5 |
ace8c4a4116b975709b94bba3df10f51
|
|
| BLAKE2b-256 |
d00faf61ac52bc4e13905906baf757124bd1071789f4acb1dc20f16b2aa74715
|