
Pencil Puzzle Bench

A benchmark for evaluating LLM reasoning through pencil puzzles — constraint-satisfaction problems closely related to NP-complete problems — with deterministic, step-level verification.

Paper: Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

Figure: model success rates by strategy (left) and a puzzle solve gallery (right).

Features

  • 62,000+ puzzles across 94 puzzle types sourced from puzz.link, each with a unique solution verified by cspuz-solver2 (SAT-based constraint solver)
  • Step-level verification via pzpr.js — every intermediate board state is checked against variety-specific constraints, localizing errors to the exact rule violated (e.g., "Two shaded cells are adjacent" in Nurikabe, "Loop crosses itself" in Slitherlink)
  • Dense reward signals — per-move constraint checking enables process supervision and reinforcement learning
  • Gymnasium environment for RL training
  • Verifiers environment — GRPO training via the verifiers library
  • Benchmark harness with pluggable strategies, built on pydantic-ai
  • Local model support — works with LM Studio, ollama, vLLM, or any OpenAI-compatible endpoint

Install

Prerequisites

Node.js is required — the puzzle engine (pzpr.js) runs in a Node.js subprocess via JSPyBridge.

# macOS
brew install node

# Ubuntu/Debian
apt install nodejs npm

# Or use nvm
nvm install 20

Install ppbench

pip install ppbench          # Core: puzzles, gym env, pydantic-ai framework
pip install ppbench[all]     # + OpenAI and Anthropic API clients

Install only the providers you need:

pip install ppbench[openai]      # + OpenAI client
pip install ppbench[anthropic]   # + Anthropic client

Local models (LM Studio, ollama, vLLM) work with just the base install — no provider extras needed.

Docker

Minimal Dockerfile for a clean environment:

FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends nodejs npm curl \
    && rm -rf /var/lib/apt/lists/*
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"
WORKDIR /app
COPY . .
RUN uv sync --all-extras
Build and run it:

docker build -t ppbench .
docker run --rm ppbench uv run python -c \
  "from ppbench import Puzzle, load_dataset; print(len(load_dataset('golden_30')), 'puzzles')"

See Dockerfile.test for a full smoke test.

Quick Start

from ppbench import Puzzle, load_dataset

# Load a puzzle from the benchmark
records = load_dataset("golden")  # 300 curated puzzles
record = records[0]

# Create and interact with a puzzle
puzzle = Puzzle.from_url(record["puzzlink_url"])
print(puzzle.pid)           # e.g., "sudoku"
print(puzzle.get_state())   # board state as text

# Apply moves and check
puzzle.send_move("mouse,left,3,5")
violations = puzzle.check()      # [] if valid
solved = puzzle.is_complete()    # True when solved

# Render as SVG
svg = puzzle.svg()

Gymnasium Environment

from ppbench import PuzzleEnv, load_dataset

records = load_dataset("golden")
env = PuzzleEnv(puzzle_url=records[0]["puzzlink_url"])
obs, info = env.reset()

# Standard Gymnasium loop
obs, reward, terminated, truncated, info = env.step("mouse,left,3,5")
# reward = 1.0 when puzzle is solved
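
A full episode can be driven by replaying a record's solution — a minimal sketch, assuming solution["moves_required"] is a list of move strings in the format described under Move Format below:

obs, info = env.reset()
for move in records[0]["solution"]["moves_required"]:
    obs, reward, terminated, truncated, info = env.step(move)
    if terminated:
        break   # reward == 1.0 once the puzzle is solved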

Verifiers Environment (GRPO Training)

from ppbench.verifiers_env import load_environment

env = load_environment("golden")
# Use with verifiers GRPO training pipeline

Datasets

Name                 Size             Description
golden / golden_300  300 puzzles      Curated benchmark (20 types × 15 each), bundled
golden_30            30 puzzles       Small subset for expensive agentic strategies, bundled
full                 62,231 puzzles   All 94 puzzle types (HuggingFace)

from ppbench import load_dataset

# Bundled datasets (no download needed)
records = load_dataset("golden")      # 300 puzzles
records = load_dataset("golden_30")   # 30 puzzles

Full dataset

Download from HuggingFace (one JSONL file):

# Using the huggingface-cli
pip install huggingface-hub
huggingface-cli download bluecoconut/pencil-puzzle-bench \
    full_dataset.jsonl \
    --repo-type dataset \
    --local-dir ppbench/data
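
The same download can also be done from Python with huggingface_hub (same repo and destination as the CLI command above):

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="bluecoconut/pencil-puzzle-bench",
    filename="full_dataset.jsonl",
    repo_type="dataset",
    local_dir="ppbench/data",
)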

Then load it:

records = load_dataset("full")  # 62,231 puzzles

Each record contains:

  • puzzlink_url — canonical puzzle URL (encodes the puzzle state)
  • pid — puzzle type (e.g., "sudoku", "slither", "tapa")
  • number_required_moves — minimum moves to solve
  • solution — decoded solution with moves_full, moves_required, moves_hint
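
For example, inspecting a record (field names as listed above; that solution is a plain dict keyed by the three move lists is an assumption):

from ppbench import load_dataset

record = load_dataset("golden")[0]
print(record["pid"])                    # e.g., "tapa"
print(record["number_required_moves"])  # minimum number of moves to solve
print(sorted(record["solution"]))       # assumed keys: moves_full, moves_hint, moves_required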

Running the Benchmark

# Set API keys for the providers you want to use
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...

# Quick test: 1 puzzle, both strategies
uv run python -u examples/quick_test.py

# Multi-model comparison
uv run python -u examples/multi_model.py

# Sweep an entire dataset
uv run python -u examples/dataset_sweep.py

# Analyze results
uv run python -u examples/analyze_results.py

Results are cached per (model, strategy, puzzle) — re-runs skip completed work.

Using a local model

Point the benchmark at any OpenAI-compatible endpoint (LM Studio, ollama, vLLM, etc.):

# Default: http://127.0.0.1:1234/v1 (LM Studio default)
export LOCAL_API_BASE=http://127.0.0.1:1234/v1

# Or for ollama:
export LOCAL_API_BASE=http://127.0.0.1:11434/v1
Then run against it:

import asyncio
from ppbench.benchmarks import run, DirectAskStrategy

asyncio.run(run(
    models=["local/qwen3.5-35b-a3b"],
    strategies=[DirectAskStrategy],
    dataset="golden_30",
))

The model name after local/ is passed directly to the server — use whatever model name your server expects.

Architecture Guide

Core primitive

ppbench.Puzzle wraps a headless pzpr.js puzzle instance running in Node.js. pzpr.js is the engine behind the puzz.link puzzle community — it implements 100+ puzzle varieties with full rule checking, error localization, and completion detection. You send moves, check the board against variety-specific constraints, and verify completeness — all deterministically, no browser needed.

The benchmark harness uses pydantic-ai to build LLM agents that interact with puzzles.

Models

Models use provider/model-name@variant syntax, parsed by ppbench/benchmarks/model_list.py:

Provider     Example                              Notes
openai       openai/gpt-4o                        Direct OpenAI API
openai       openai/gpt-5.2@medium                Responses API with reasoning effort
anthropic    anthropic/claude-sonnet-4-6          Direct Anthropic API
anthropic    anthropic/claude-opus-4-6@thinking   Extended thinking
google       google/gemini-3-pro                  Gemini API
xai          xai/grok-4-1-fast                    xAI API (OpenAI-compatible)
openrouter   openrouter/deepseek/deepseek-v3.2    OpenRouter (OpenAI-compatible)
local        local/my-model                       Any local OpenAI-compatible server

Each provider maps to a pydantic-ai model class. To add a new provider, add a _build_* function in ppbench/benchmarks/model_list.py.
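
As a rough illustration (the function name, signature, and use of pydantic-ai's OpenAI classes here are assumptions — mirror the existing builders in model_list.py), a builder for a local OpenAI-compatible endpoint might look like:

import os

from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

def _build_local(model_name: str) -> OpenAIModel:
    # Hypothetical sketch: wrap any OpenAI-compatible server as a pydantic-ai model,
    # reusing the LOCAL_API_BASE convention from the local-model section above.
    base_url = os.environ.get("LOCAL_API_BASE", "http://127.0.0.1:1234/v1")
    provider = OpenAIProvider(base_url=base_url, api_key="not-needed")
    return OpenAIModel(model_name, provider=provider)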

Strategies

A strategy defines what the agent does. The harness handles execution, retries, usage tracking, and caching.

Subclass ppbench.benchmarks.Strategy and implement two methods:

from ppbench.benchmarks import Strategy, AgentConfig, StrategyResult
from pydantic_ai import Agent
from ppbench import Puzzle

class MyStrategy(Strategy):
    requires_tools = False  # True if your agent uses tool calling

    def build_agent(self, puzzle, model_obj, model_name):
        """Create the agent and prompt. No execution happens here."""
        agent = Agent(model_obj, system_prompt="Solve this puzzle...")
        prompt = f"Puzzle: {puzzle.get_string_repr()}"
        return AgentConfig(agent=agent, prompt=prompt)

    def extract_result(self, puzzle, deps, output):
        """Interpret the agent's output. Replay moves, check success."""
        moves = parse_moves_somehow(output)  # your own parsing of the model output into move strings
        fresh = Puzzle.from_url(puzzle.url)
        for m in moves:
            fresh.send_move(m)
        return StrategyResult(
            is_success=fresh.is_complete(),
            parsed_moves=moves,
            raw_output=output,
        )

Key concepts:

  • build_agent() returns an AgentConfig with a pydantic-ai Agent, a prompt, and optional deps
  • extract_result() replays moves on a fresh puzzle to verify the solution
  • on_node() is an optional per-step hook (compactification, progress tracking, etc.)
  • strategy_id is a hash of your strategy's source — harness changes don't invalidate the cache
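
A custom strategy plugs into the same run() entry point as the built-ins — a minimal sketch, reusing the MyStrategy class defined above:

import asyncio
from ppbench.benchmarks import run

asyncio.run(run(
    models=["openai/gpt-4o"],
    strategies=[MyStrategy],   # pass the class itself, as with the built-in strategies
    dataset="golden_30",
))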

Built-in strategies such as DirectAskStrategy and BasicAgenticSolve (used in the examples below) serve as reference implementations.

The run() API

import asyncio
from ppbench.benchmarks import run, DirectAskStrategy, BasicAgenticSolve

results = asyncio.run(run(
    models=["openai/gpt-4o", "local/qwen3.5-35b"],
    strategies=[DirectAskStrategy, BasicAgenticSolve],
    dataset="golden_30",       # or "golden", "golden_300"
    puzzle_types=["tapa"],     # optional: filter by puzzle type
    n_puzzles=5,               # optional: limit count
    concurrency=10,            # max concurrent tasks
    seed=42,                   # reproducible puzzle sampling
))

Results are saved as JSONL index + JSON artifacts in output/runs/. See examples/analyze_results.py for how to load and inspect them.

Move Format

Puzzles use pzpr.js input commands:

Move                      Description
mouse,left,x,y            Left click at (x,y)
mouse,right,x,y           Right click at (x,y)
mouse,left,x1,y1,x2,y2    Drag from (x1,y1) to (x2,y2)
mouse,leftx2,x,y          Double left-click at (x,y)
mouse,rightx2,x,y         Double right-click at (x,y)
key,1                     Press key '1'
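
For example, selecting a cell and then entering a digit chains two commands — a sketch (that key input applies to the most recently selected cell is an assumption about the pzpr.js input model):

from ppbench import Puzzle, load_dataset

record = load_dataset("golden")[0]
puzzle = Puzzle.from_url(record["puzzlink_url"])
puzzle.send_move("mouse,left,3,5")   # select the cell at (3,5)
puzzle.send_move("key,1")            # then type '1' into it
print(puzzle.check())                # [] if no constraint is violated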

License

MIT

Citation

@article{waugh2026ppbench,
    title={Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning},
    author={Justin Waugh},
    year={2026},
    eprint={2603.02119},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2603.02119}
}
