# Ascii Maze Benchmark

A benchmark for testing how capable different LLMs are at solving ascii mazes. Here is an example 4x4 maze:
```
START
 v
# #######
#       #
# ##### #
# #     #
# #######
#     # #
##### # #
#       #
####### #
       ^
  FINISH
```
Here is the solution:
```
#.#######
#.      #
#.##### #
#.#     #
#.#######
#.....# #
#####.# #
#    ...#
#######.#
```
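Recovering the dotted path is an ordinary shortest-path problem over the character grid. The sketch below is illustrative, not the benchmark's own code: a breadth-first search from the START opening to the FINISH opening, marking the path with `.` as in the example above.

```python
from collections import deque


def solve_ascii_maze(maze_lines, start, finish):
    """Breadth-first search over the character grid.

    Open cells are spaces, walls are '#'. Returns the maze with the
    shortest path marked with '.' characters, or None if unreachable.
    """
    grid = [list(row) for row in maze_lines]
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == finish:
            # Walk back through the parent chain, marking the path.
            cell = (r, c)
            while cell is not None:
                grid[cell[0]][cell[1]] = "."
                cell = parents[cell]
            return ["".join(row) for row in grid]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == " " and (nr, nc) not in parents):
                parents[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None


maze = [
    "# #######",
    "#       #",
    "# ##### #",
    "# #     #",
    "# #######",
    "#     # #",
    "##### # #",
    "#       #",
    "####### #",
]
# (0, 1) is the gap under START; (8, 7) is the gap above FINISH.
solution = solve_ascii_maze(maze, start=(0, 1), finish=(8, 7))
```

Running this on the example maze reproduces the dotted solution shown above.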
The benchmark randomly generates mazes from a seed and evaluates each LLM's ability to solve them.
Some LLMs struggle to format the output exactly, so we report scores at varying string distances from the correct response.
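The benchmark's exact scoring metric is defined in the package; as an illustration of distance-based scoring, a standard Levenshtein edit distance can be computed with a short dynamic program:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum inserts/deletes/substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]


# A response with one stray space is "1 edit away" from the exact
# solution, so it can still be credited at distance <= 1.
assert levenshtein("#.##### #", "#.#####  #") == 1
```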
To keep things simple, we evaluate all models through the OpenRouter API; if a model is not available on OpenRouter, it is not benchmarked.
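OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so querying a model is a single HTTP request per maze. The sketch below is illustrative and not the benchmark's own client; the prompt text is a placeholder:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_request(model_id: str, maze_prompt: str,
                  api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for OpenRouter.

    The actual prompt the benchmark sends is defined in the package;
    maze_prompt here is whatever text you want the model to see.
    """
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": maze_prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Sending the request (needs a real key, e.g. from your .env file):
# with urllib.request.urlopen(build_request(
#         "anthropic/claude-3-haiku", "Solve this maze: ...",
#         api_key)) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```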
## Usage

### Setup

- Copy `.env.example` to `.env` and add your OpenRouter API key:

  ```shell
  cp .env.example .env
  ```

- Edit the `.env` file and replace `your_api_key_here` with your actual OpenRouter API key.
### Generate Example Mazes

To generate and solve an example maze:

```shell
uv run ascii-maze-benchmark generate-example WIDTH HEIGHT [--seed SEED]
```

Example:

```shell
uv run ascii-maze-benchmark generate-example 5 5 --seed 42
```
### Run Benchmarks

To run benchmarks against a specific model:

```shell
uv run ascii-maze-benchmark run-benchmark MODEL_ID [OPTIONS]
```

Options:

- `--maze-sizes TEXT`: Comma-separated list of maze sizes to test (format: WIDTHxHEIGHT)
- `--mazes-per-size INTEGER`: Number of different mazes to generate per size
- `--seed INTEGER`: Random seed for reproducible maze generation
- `--cache-dir TEXT`: Directory to cache benchmark results

Example:

```shell
uv run ascii-maze-benchmark run-benchmark anthropic/claude-3-haiku-20240307 --maze-sizes 3x3,4x4 --mazes-per-size 2
```
## Development Tips

- Benchmark results are cached in the `.cache/benchmark_results` directory by default, so visualization code can be rerun without spending money to rerun the benchmark.
- Test the benchmarking code on a cheap model on OpenRouter first, to save costs.
- Use the `.env` file to manage OpenRouter credentials.
- Use `uv` for package management and running commands.
- There is a `src/ascii_maze_benchmark/generate_maze_script.py` file you can use as a reference for maze generation logic.
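The package's actual generation logic lives in `generate_maze_script.py`; as a rough sketch of what seeded, reproducible generation involves, a recursive-backtracker carver over a `(2*width+1) x (2*height+1)` character grid might look like this:

```python
import random


def generate_maze(width: int, height: int, seed: int) -> list[str]:
    """Carve a perfect maze with the recursive-backtracker algorithm.

    Seeding random.Random makes the maze fully reproducible: the same
    (width, height, seed) triple always yields the same grid.
    """
    rng = random.Random(seed)
    grid = [["#"] * (2 * width + 1) for _ in range(2 * height + 1)]
    stack = [(0, 0)]
    visited = {(0, 0)}
    grid[1][1] = " "  # cell (x, y) maps to grid[2*y + 1][2*x + 1]
    while stack:
        x, y = stack[-1]
        neighbours = [
            (x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < width and 0 <= y + dy < height
            and (x + dx, y + dy) not in visited
        ]
        if neighbours:
            nx, ny = rng.choice(neighbours)
            # Knock out the wall between (x, y) and (nx, ny), then
            # open the neighbour cell itself.
            grid[y + ny + 1][x + nx + 1] = " "
            grid[2 * ny + 1][2 * nx + 1] = " "
            visited.add((nx, ny))
            stack.append((nx, ny))
        else:
            stack.pop()  # dead end: backtrack
    grid[0][1] = " "                        # entrance (START)
    grid[2 * height][2 * width - 1] = " "   # exit (FINISH)
    return ["".join(row) for row in grid]
```

A 4x4 maze produced this way is a 9x9 ascii grid, matching the example at the top of this README.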
## File details

Details for the file `ascii_maze_benchmark-0.1.0.tar.gz`.

### File metadata

- Download URL: ascii_maze_benchmark-0.1.0.tar.gz
- Upload date:
- Size: 41.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.6.14
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `8f9ce728f524c736e888f8a8ee8000f22b9704e06083f266afc0098641694579` |
| MD5 | `b89e2b8c5aa54f318177f5e78d778a31` |
| BLAKE2b-256 | `6904c98c2b8d22694df24342f2c3f5992f963e7e1690f7cade8fa8e6b9dc6cc8` |
## File details

Details for the file `ascii_maze_benchmark-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: ascii_maze_benchmark-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.6.14
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9b6d30ac70c992dba9c6ac2df0272f31f3d5b5e59b7bf3191ddd9aa18818b171` |
| MD5 | `b6cab93ece25fc006e7f9c31ba7cb41d` |
| BLAKE2b-256 | `4c22726ebb332541d4930ca44999db93d225fa8679ef51c0cfdb50c39634a7f3` |