Pencil Puzzle Bench
A benchmark for evaluating LLM reasoning through pencil puzzles — constraint-satisfaction problems closely related to NP-complete problems — with deterministic, step-level verification.
Paper: Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
Features
- 62,000+ puzzles across 94 puzzle types sourced from puzz.link, each with a unique solution verified by cspuz-solver2 (a SAT-based constraint solver)
- Step-level verification via pzpr.js — every intermediate board state is checked against variety-specific constraints, localizing errors to the exact rule violated (e.g., "Two shaded cells are adjacent" in Nurikabe, "Loop crosses itself" in Slitherlink)
- Dense reward signals — per-move constraint checking enables process supervision and reinforcement learning
- Gymnasium environment for RL training
- Verifiers environment for GRPO training with the verifiers library
- Benchmark harness with pluggable strategies, built on pydantic-ai
- Local model support — works with LM Studio, ollama, vLLM, or any OpenAI-compatible endpoint
Install
Prerequisites
Node.js is required — the puzzle engine (pzpr.js) runs in a Node.js subprocess via JSPyBridge.
# macOS
brew install node
# Ubuntu/Debian
apt install nodejs npm
# Or use nvm
nvm install 20
Install ppbench
pip install ppbench # Core: puzzles, gym env, pydantic-ai framework
pip install ppbench[all] # + OpenAI and Anthropic API clients
Install only the providers you need:
pip install ppbench[openai] # + OpenAI client
pip install ppbench[anthropic] # + Anthropic client
Local models (LM Studio, ollama, vLLM) work with just the base install — no provider extras needed.
Docker
Minimal Dockerfile for a clean environment:
FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends nodejs npm curl \
&& rm -rf /var/lib/apt/lists/*
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"
WORKDIR /app
COPY . .
RUN uv sync --all-extras
docker build -t ppbench .
docker run --rm ppbench uv run python -c \
"from ppbench import Puzzle, load_dataset; print(len(load_dataset('golden_30')), 'puzzles')"
See Dockerfile.test for a full smoke test.
Quick Start
from ppbench import Puzzle, load_dataset
# Load a puzzle from the benchmark
records = load_dataset("golden") # 300 curated puzzles
record = records[0]
# Create and interact with a puzzle
puzzle = Puzzle.from_url(record["puzzlink_url"])
print(puzzle.pid) # e.g., "sudoku"
print(puzzle.get_state()) # board state as text
# Apply moves and check
puzzle.send_move("mouse,left,3,5")
violations = puzzle.check() # [] if valid
solved = puzzle.is_complete() # True when solved
# Render as SVG
svg = puzzle.svg()
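Putting the pieces together, here is a minimal sketch that replays a record's stored solution and confirms the verifier accepts every intermediate state. It assumes the solution field's moves_required entry is a flat list of move strings in the Move Format documented below; check your records if the layout differs.

from ppbench import Puzzle, load_dataset

record = load_dataset("golden")[0]
puzzle = Puzzle.from_url(record["puzzlink_url"])

# Assumption: solution["moves_required"] is a flat list of move strings
for move in record["solution"]["moves_required"]:
    puzzle.send_move(move)
    assert puzzle.check() == [], "verifier flagged an intermediate state"

print(puzzle.is_complete())  # True if the recorded solution checks out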
Gymnasium Environment
from ppbench import PuzzleEnv, load_dataset
records = load_dataset("golden")
env = PuzzleEnv(puzzle_url=records[0]["puzzlink_url"])
obs, info = env.reset()
# Standard Gymnasium loop
obs, reward, terminated, truncated, info = env.step("mouse,left,3,5")
# reward = 1.0 when puzzle is solved
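A hedged sketch of a full episode, driving the environment with the dataset's recorded solution (as above, treating solution["moves_required"] as a list of move strings is an assumption about the record layout):

from ppbench import PuzzleEnv, load_dataset

record = load_dataset("golden")[0]
env = PuzzleEnv(puzzle_url=record["puzzlink_url"])
obs, info = env.reset()

# Replay the recorded solution; the episode should terminate with reward 1.0
for move in record["solution"]["moves_required"]:
    obs, reward, terminated, truncated, info = env.step(move)
    if terminated or truncated:
        break

print(reward, terminated)  # expect 1.0 True on the solving move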
Verifiers Environment (GRPO Training)
from ppbench.verifiers_env import load_environment
env = load_environment("golden")
# Use with verifiers GRPO training pipeline
Datasets
| Name | Size | Description |
|---|---|---|
| golden / golden_300 | 300 puzzles | Curated benchmark (20 types × 15 each), bundled |
| golden_30 | 30 puzzles | Small subset for expensive agentic strategies, bundled |
| full | 62,231 puzzles | All 94 puzzle types (HuggingFace) |
from ppbench import load_dataset
# Bundled datasets (no download needed)
records = load_dataset("golden") # 300 puzzles
records = load_dataset("golden_30") # 30 puzzles
Full dataset
Download from HuggingFace (one JSONL file):
# Using the huggingface-cli
pip install huggingface-hub
huggingface-cli download bluecoconut/pencil-puzzle-bench \
full_dataset.jsonl \
--repo-type dataset \
--local-dir ppbench/data
Then load it:
records = load_dataset("full") # 62,231 puzzles
Each record contains:
- puzzlink_url — canonical puzzle URL (encodes the puzzle state)
- pid — puzzle type (e.g., "sudoku", "slither", "tapa")
- number_required_moves — minimum moves to solve
- solution — decoded solution with moves_full, moves_required, moves_hint
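Because every record carries pid and number_required_moves, slicing the benchmark by type or difficulty needs nothing beyond the fields above. A small sketch over the bundled golden set:

from collections import Counter
from ppbench import load_dataset

records = load_dataset("golden")

# Count puzzles per type (the golden set is 20 types x 15 each)
print(Counter(r["pid"] for r in records))

# Keep only short puzzles, e.g. for cheap smoke tests
easy = [r for r in records if r["number_required_moves"] <= 20]
print(len(easy), "puzzles need at most 20 moves")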
Running the Benchmark
# Set API keys for the providers you want to use
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
# Quick test: 1 puzzle, both strategies
uv run python -u examples/quick_test.py
# Multi-model comparison
uv run python -u examples/multi_model.py
# Sweep an entire dataset
uv run python -u examples/dataset_sweep.py
# Analyze results
uv run python -u examples/analyze_results.py
Results are cached per (model, strategy, puzzle) — re-runs skip completed work.
Using a local model
Point the benchmark at any OpenAI-compatible endpoint (LM Studio, ollama, vLLM, etc.):
# Default: http://127.0.0.1:1234/v1 (LM Studio default)
export LOCAL_API_BASE=http://127.0.0.1:1234/v1
# Or for ollama:
export LOCAL_API_BASE=http://127.0.0.1:11434/v1
import asyncio
from ppbench.benchmarks import run, DirectAskStrategy
asyncio.run(run(
models=["local/qwen3.5-35b-a3b"],
strategies=[DirectAskStrategy],
dataset="golden_30",
))
The model name after local/ is passed directly to the server — use whatever model name your server expects.
Architecture Guide
Core primitive
ppbench.Puzzle wraps a headless pzpr.js puzzle instance running in Node.js. pzpr.js is the engine behind the puzz.link puzzle community — it implements 100+ puzzle varieties with full rule checking, error localization, and completion detection. You send moves, check the board against variety-specific constraints, and verify completeness — all deterministically, no browser needed.
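For instance, error localization falls out of check() directly: make a move, and any rule the board now violates comes back as a named constraint rather than a bare failure flag. A sketch (the exact wording and structure of the returned violations is whatever pzpr.js reports for the variety; the coordinates are illustrative):

from ppbench import Puzzle, load_dataset

record = load_dataset("golden")[0]
puzzle = Puzzle.from_url(record["puzzlink_url"])

# Shade two cells that may conflict under the variety's rules
puzzle.send_move("mouse,left,1,1")
puzzle.send_move("mouse,left,3,1")

# [] for a consistent board, otherwise the violated rules,
# e.g. "Two shaded cells are adjacent" in Nurikabe
print(puzzle.check())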
The benchmark harness uses pydantic-ai to build LLM agents that interact with puzzles.
Models
Models use provider/model-name@variant syntax, parsed by ppbench/benchmarks/model_list.py:
| Provider | Example | Notes |
|---|---|---|
| openai | openai/gpt-4o | Direct OpenAI API |
| openai | openai/gpt-5.2@medium | Responses API with reasoning effort |
| anthropic | anthropic/claude-sonnet-4-6 | Direct Anthropic API |
| anthropic | anthropic/claude-opus-4-6@thinking | Extended thinking |
| google | google/gemini-3-pro | Gemini API |
| xai | xai/grok-4-1-fast | xAI API (OpenAI-compatible) |
| openrouter | openrouter/deepseek/deepseek-v3.2 | OpenRouter (OpenAI-compatible) |
| local | local/my-model | Any local OpenAI-compatible server |
Each provider maps to a pydantic-ai model class. To add a new provider, add a _build_* function in ppbench/benchmarks/model_list.py.
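As a hedged sketch of what such a builder might look like for the local provider (illustrative, not the actual contents of model_list.py), pydantic-ai's OpenAI-compatible model class accepts a custom base URL:

import os
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

def _build_local(model_name: str) -> OpenAIModel:
    # Hypothetical builder: route "local/<name>" to an OpenAI-compatible server
    base_url = os.environ.get("LOCAL_API_BASE", "http://127.0.0.1:1234/v1")
    provider = OpenAIProvider(base_url=base_url, api_key="unused")
    return OpenAIModel(model_name, provider=provider)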
Strategies
A strategy defines what the agent does. The harness handles execution, retries, usage tracking, and caching.
Subclass ppbench.benchmarks.Strategy and implement two methods:
from ppbench.benchmarks import Strategy, AgentConfig, StrategyResult
from pydantic_ai import Agent
from ppbench import Puzzle
class MyStrategy(Strategy):
    requires_tools = False  # True if your agent uses tool calling

    def build_agent(self, puzzle, model_obj, model_name):
        """Create the agent and prompt. No execution happens here."""
        agent = Agent(model_obj, system_prompt="Solve this puzzle...")
        prompt = f"Puzzle: {puzzle.get_string_repr()}"
        return AgentConfig(agent=agent, prompt=prompt)

    def extract_result(self, puzzle, deps, output):
        """Interpret the agent's output. Replay moves, check success."""
        moves = parse_moves_somehow(output)  # placeholder: your own output parser
        fresh = Puzzle.from_url(puzzle.url)
        for m in moves:
            fresh.send_move(m)
        return StrategyResult(
            is_success=fresh.is_complete(),
            parsed_moves=moves,
            raw_output=output,
        )
Key concepts:
- build_agent() returns an AgentConfig with a pydantic-ai Agent, a prompt, and optional deps
- extract_result() replays moves on a fresh puzzle to verify the solution
- on_node() is an optional per-step hook (compactification, progress tracking, etc.)
- strategy_id is a hash of your strategy's source — harness changes don't invalidate the cache
Built-in strategies for reference:
- direct_ask.py — single-shot, no tools (simplest)
- basic_agentic.py — tool-calling agent with make_move, check_board, reset
The run() API
import asyncio
from ppbench.benchmarks import run, DirectAskStrategy, BasicAgenticSolve
results = asyncio.run(run(
models=["openai/gpt-4o", "local/qwen3.5-35b"],
strategies=[DirectAskStrategy, BasicAgenticSolve],
dataset="golden_30", # or "golden", "golden_300"
puzzle_types=["tapa"], # optional: filter by puzzle type
n_puzzles=5, # optional: limit count
concurrency=10, # max concurrent tasks
seed=42, # reproducible puzzle sampling
))
Results are saved as JSONL index + JSON artifacts in output/runs/. See examples/analyze_results.py for how to load and inspect them.
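For a quick tally without the example script, here is a sketch that walks the JSONL index files. The output/runs/ location comes from above, while the model and is_success field names are assumptions borrowed from StrategyResult:

import json
from collections import defaultdict
from pathlib import Path

solved = defaultdict(int)
total = defaultdict(int)

# Assumption: each run writes a JSONL index with one result per line
for index in Path("output/runs").glob("**/*.jsonl"):
    for line in index.read_text().splitlines():
        rec = json.loads(line)
        total[rec["model"]] += 1
        solved[rec["model"]] += bool(rec.get("is_success"))

for model in sorted(total):
    print(f"{model}: {solved[model]}/{total[model]} solved")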
Move Format
Puzzles use pzpr.js input commands:
| Move | Description |
|---|---|
| mouse,left,x,y | Left click at (x,y) |
| mouse,right,x,y | Right click at (x,y) |
| mouse,left,x1,y1,x2,y2 | Drag from (x1,y1) to (x2,y2) |
| mouse,leftx2,x,y | Double left-click at (x,y) |
| mouse,rightx2,x,y | Double right-click at (x,y) |
| key,1 | Press key '1' |
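To make the grammar concrete, a sketch for a number-entry variety such as Sudoku: click a cell to focus it, then send a key press. Which commands a variety actually uses (clicks, drags, or keys) is variety-specific, and the coordinates here are illustrative.

from ppbench import Puzzle, load_dataset

# Pick a record; key entry applies to number-entry varieties like Sudoku
record = load_dataset("golden")[0]
puzzle = Puzzle.from_url(record["puzzlink_url"])

puzzle.send_move("mouse,left,1,1")  # focus a cell
puzzle.send_move("key,1")           # enter the digit 1
print(puzzle.check())               # [] if the entry is consistent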
License
MIT
Citation
@article{waugh2026ppbench,
  title={Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning},
  author={Justin Waugh},
  year={2026},
  eprint={2603.02119},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.02119}
}