
A benchmark for testing LLM performance at solving ASCII mazes.

Project description

Ascii Maze Benchmark

This is a benchmark for testing how capable different LLMs are at solving ASCII mazes. Here is an example 4x4 maze:

START
 v
# #######
#       #
# ##### #
# #     #
# #######
#     # #
##### # #
#       #
####### #
       ^
   FINISH

Here is the solution:

#.#######
#.      #
#.##### #
#.#     #
#.#######
#.....# #
#####.# #
#    ...#
#######.#
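For illustration, the path above can be recovered with a plain breadth-first search. This is a minimal sketch, not the benchmark's internal solver: it treats open cells as graph nodes, finds the shortest route from the entrance gap to the exit gap, and marks it with '.' characters.

```python
from collections import deque

MAZE = [
    "# #######",
    "#       #",
    "# ##### #",
    "# #     #",
    "# #######",
    "#     # #",
    "##### # #",
    "#       #",
    "####### #",
]

def solve(maze):
    grid = [list(row) for row in maze]
    h, w = len(grid), len(grid[0])
    start = (0, grid[0].index(" "))          # gap in the top wall
    goal = (h - 1, grid[h - 1].index(" "))   # gap in the bottom wall
    parent = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            break
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] == " " \
                    and (nr, nc) not in parent:
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    cell = goal
    while cell is not None:                  # walk back from exit to entrance
        grid[cell[0]][cell[1]] = "."
        cell = parent[cell]
    return ["".join(row) for row in grid]
```

Running solve(MAZE) reproduces the dotted solution shown above.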

The benchmark randomly generates mazes from a seed and evaluates each LLM's ability to solve them.
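Seeded generation can be sketched with a recursive backtracker driven by a seeded RNG; the benchmark's actual algorithm may differ, but the key property is the same: one seed always produces one maze.

```python
import random

def generate_maze(width, height, seed):
    # Recursive-backtracker sketch: cells live at odd grid coordinates,
    # walls between them get knocked down as the DFS carves passages.
    rng = random.Random(seed)  # seeded RNG => fully reproducible output
    grid = [["#"] * (2 * width + 1) for _ in range(2 * height + 1)]
    grid[1][1] = " "
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        options = [(nx, ny)
                   for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                   if 0 <= nx < width and 0 <= ny < height
                   and (nx, ny) not in visited]
        if not options:
            stack.pop()
            continue
        nx, ny = rng.choice(options)
        grid[y + ny + 1][x + nx + 1] = " "   # wall between the two cells
        grid[2 * ny + 1][2 * nx + 1] = " "   # the neighbouring cell itself
        visited.add((nx, ny))
        stack.append((nx, ny))
    grid[0][1] = " "                         # entrance (top-left)
    grid[2 * height][2 * width - 1] = " "    # exit (bottom-right)
    return ["".join(row) for row in grid]
```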

Some LLMs struggle to format the output exactly, so we report scores at varying string distances from the correct response.
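One common way to measure such string distance is Levenshtein edit distance; this sketch (assuming that is the metric, which the text does not specify) counts the minimum insertions, deletions, and substitutions between a model's answer and the reference solution.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only one row
    # of the DP table in memory at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

A response would then count as correct at threshold d whenever its distance to the reference is at most d.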

We evaluate all models through the OpenRouter API to keep things simple. If a model isn't available on OpenRouter, the benchmark won't run against it.
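A call to OpenRouter is an ordinary chat-completions request. This sketch only builds the HTTP pieces for one such request; the prompt wording is illustrative, not the benchmark's actual prompt.

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model_id, maze_text, api_key):
    """Assemble URL, headers, and JSON body for one maze-solving call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model_id,
        "messages": [
            {"role": "user", "content": "Solve this ASCII maze:\n" + maze_text},
        ],
    }
    return OPENROUTER_URL, headers, json.dumps(payload)
```

Sending it with any HTTP client and reading choices[0].message.content from the JSON response yields the model's attempted solution.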

Usage

Setup

  1. Copy .env.example to .env and add your OpenRouter API key:

    cp .env.example .env
    
  2. Edit the .env file and replace your_api_key_here with your actual OpenRouter API key.
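Under the hood, loading the key amounts to reading KEY=value lines from .env into the process environment. A library like python-dotenv does this robustly; a minimal sketch of the idea:

```python
import os

def load_env(path=".env"):
    # Tiny .env loader sketch: skips blanks and comments, splits on the
    # first "=", and never overwrites variables already in the environment.
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```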

Generate Example Mazes

To generate and solve an example maze:

uv run ascii-maze-benchmark generate-example WIDTH HEIGHT [--seed SEED]

Example:

uv run ascii-maze-benchmark generate-example 5 5 --seed 42

Run Benchmarks

To run benchmarks against a specific model:

uv run ascii-maze-benchmark run-benchmark MODEL_ID [OPTIONS]

Options:

  • --maze-sizes TEXT: Comma-separated list of maze sizes to test (format: WIDTHxHEIGHT)
  • --mazes-per-size INTEGER: Number of different mazes to generate per size
  • --seed INTEGER: Random seed for reproducible maze generation
  • --cache-dir TEXT: Directory to cache benchmark results
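The --maze-sizes value is a comma-separated list of WIDTHxHEIGHT pairs, which can be parsed with a few lines (a sketch of the format, not the project's parser):

```python
def parse_maze_sizes(spec):
    # "3x3,4x4" -> [(3, 3), (4, 4)]
    sizes = []
    for part in spec.split(","):
        width, height = part.strip().lower().split("x")
        sizes.append((int(width), int(height)))
    return sizes
```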

Example:

uv run ascii-maze-benchmark run-benchmark anthropic/claude-3-haiku-20240307 --maze-sizes 3x3,4x4 --mazes-per-size 2

Development Tips

  • Benchmark results are cached in the .cache/benchmark_results directory by default, so visualization code can be rerun without spending money to rerun the benchmark.
  • Test the benchmarking code on a cheap model on OpenRouter first, to save costs.
  • Use the .env file to manage OpenRouter credentials.
  • Use uv for package management and running commands.
  • There is a src/ascii_maze_benchmark/generate_maze_script.py file you can use as a reference for maze generation logic.
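The caching behaviour can be pictured as keying each result file on the run parameters, so a repeat run with the same model, size, and seed hits the cache instead of the API. The path scheme below is illustrative, not the project's exact layout:

```python
import hashlib
import json
from pathlib import Path

def cache_path(model_id, width, height, seed,
               cache_dir=".cache/benchmark_results"):
    # Hash the parameters that determine a run, so identical runs map
    # to the same file and can be served from disk.
    key = json.dumps([model_id, width, height, seed])
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"
```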

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ascii_maze_benchmark-0.1.0.tar.gz (41.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ascii_maze_benchmark-0.1.0-py3-none-any.whl (20.5 kB)

Uploaded Python 3

File details

Details for the file ascii_maze_benchmark-0.1.0.tar.gz.

File metadata

  • Download URL: ascii_maze_benchmark-0.1.0.tar.gz
  • Upload date:
  • Size: 41.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.6.14

File hashes

Hashes for ascii_maze_benchmark-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8f9ce728f524c736e888f8a8ee8000f22b9704e06083f266afc0098641694579
MD5 b89e2b8c5aa54f318177f5e78d778a31
BLAKE2b-256 6904c98c2b8d22694df24342f2c3f5992f963e7e1690f7cade8fa8e6b9dc6cc8

See more details on using hashes here.

File details

Details for the file ascii_maze_benchmark-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ascii_maze_benchmark-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9b6d30ac70c992dba9c6ac2df0272f31f3d5b5e59b7bf3191ddd9aa18818b171
MD5 b6cab93ece25fc006e7f9c31ba7cb41d
BLAKE2b-256 4c22726ebb332541d4930ca44999db93d225fa8679ef51c0cfdb50c39634a7f3

