A Gymnasium environment for benchmarking spatial reasoning capabilities of AI agents on grid-based puzzles
Spatial-Gym: A Gymnasium Environment for Spatial Reasoning Benchmarking
Abstract
Spatial-Gym is a Gymnasium-compatible environment designed for evaluating spatial reasoning capabilities of Large Language Models (LLMs) and other AI agents. Built upon the spatial puzzle dataset introduced in SPaRC (Kaesberg et al.), this environment provides a standardized interface for benchmarking agent performance on grid-based spatial reasoning tasks. The environment supports multiple observation formats, customizable rendering modes for human and LLM interaction, and comprehensive evaluation metrics for systematic analysis of spatial reasoning abilities.
Key Features
- Standardized RL Interface: Full Gymnasium API compliance for seamless integration with existing RL frameworks
- Dual Observation Modes: Structured tensor representation or JSON-based symbolic encoding
- Multi-Modal Rendering: Human-readable visualizations and LLM-optimized text representations
- Flexible Dataset Support: Compatible with HuggingFace datasets following the SPaRC format
- Comprehensive Metrics: Episode-level tracking of success rate, path efficiency, and reasoning patterns
- Backtracking Support: Optional state reversibility for exploring different solution strategies
Installation
From PyPI
pip install Spatial-Gym
From Source
git clone https://github.com/lkaesberg/Spatial-Gym.git
cd Spatial-Gym
pip install -e .
Quick Start
import gymnasium as gym
import Spatial_Gym

# Initialize the environment (render_mode, traceback, and max_steps override the defaults)
env = gym.make(
    "Spatial-Gym",
    df_name='lkaesberg/SPaRC',
    df_split='all',
    df_set='test',
    render_mode='human',
    observation='new',
    traceback=True,
    max_steps=1000
)

# Standard RL loop: stop on termination or when the step limit truncates the episode
observation, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # Replace with your agent
    observation, reward, terminated, truncated, info = env.step(action)
    env.render()
env.close()
Environment Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| df_name | str | 'lkaesberg/SPaRC' | HuggingFace dataset identifier |
| df_split | str | 'all' | Dataset split to use |
| df_set | str | 'test' | Subset of data (train/val/test) |
| render_mode | str | None | Visualization mode: 'human', 'llm', or None |
| observation | str | 'new' | Observation format: 'new' (tensor) or 'SPaRC' (JSON) |
| traceback | bool | False | Enable state reversibility |
| max_steps | int | 2000 | Maximum steps per episode |
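The parameters above can be collected in a small configuration object before they are passed to gym.make. The following is a minimal sketch, not part of the package: the SpatialGymConfig class is a hypothetical helper, and its validation rules only mirror the values listed in the table.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SpatialGymConfig:
    """Hypothetical helper mirroring the documented parameters and defaults."""
    df_name: str = "lkaesberg/SPaRC"
    df_split: str = "all"
    df_set: str = "test"
    render_mode: Optional[str] = None
    observation: str = "new"
    traceback: bool = False
    max_steps: int = 2000

    def __post_init__(self):
        # Reject values that the table above does not list.
        if self.render_mode not in (None, "human", "llm"):
            raise ValueError(f"unknown render_mode: {self.render_mode!r}")
        if self.observation not in ("new", "SPaRC"):
            raise ValueError(f"unknown observation format: {self.observation!r}")
        if self.max_steps <= 0:
            raise ValueError("max_steps must be positive")


config = SpatialGymConfig(render_mode="llm", observation="SPaRC")
kwargs = asdict(config)
print(kwargs)
# With the package installed, the dictionary could be forwarded as:
# env = gym.make("Spatial-Gym", **kwargs)
```

Keeping the settings in one object makes it easier to log the exact configuration next to benchmark results.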
Environment Specification
Action Space
Discrete(4): Four directional moves in the grid environment.
| Action | Value | Description |
|---|---|---|
| RIGHT | 0 | Move agent one cell to the right |
| UP | 1 | Move agent one cell upward |
| LEFT | 2 | Move agent one cell to the left |
| DOWN | 3 | Move agent one cell downward |
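The action table can be mirrored in code as an enumeration plus a lookup of grid offsets. This is an illustrative sketch only: the Action enum, the DELTAS mapping, and the coordinate convention (x grows to the right, y grows downward) are assumptions and may differ from the environment's internal representation.

```python
from enum import IntEnum


class Action(IntEnum):
    """Action values as listed in the table above."""
    RIGHT = 0
    UP = 1
    LEFT = 2
    DOWN = 3


# Assumed convention: positions are (x, y), x grows rightward, y grows downward.
DELTAS = {
    Action.RIGHT: (1, 0),
    Action.UP: (0, -1),
    Action.LEFT: (-1, 0),
    Action.DOWN: (0, 1),
}


def apply_action(position, action):
    """Return the position reached by taking one step from `position`."""
    dx, dy = DELTAS[Action(action)]
    x, y = position
    return (x + dx, y + dy)


print(apply_action((2, 2), Action.RIGHT))
```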
Observation Space
Tensor Format (observation='new')
A dictionary containing:
- base (Dict[str, np.ndarray]): One-hot encoded spatial features
  - visited: Binary grid marking visited cells
  - gaps: Binary grid indicating traversable/non-traversable cells
  - agent_location: One-hot encoding of agent position
  - target_location: One-hot encoding of goal position
  - Additional puzzle-specific properties (e.g., stars, triangles)
- color (np.ndarray): Integer grid (1-8) representing color properties
- additional_info (np.ndarray): Puzzle-specific metadata (polyshape IDs, counts)
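To make the one-hot encodings concrete, the sketch below builds a grid that marks a single cell, as agent_location and target_location are described to do. It uses plain nested lists for illustration, whereas the environment returns np.ndarray values; the one_hot_grid helper and the (row, column) indexing are assumptions.

```python
def one_hot_grid(rows, cols, position):
    """Return a rows x cols grid of zeros with a single 1 at `position`.

    `position` is assumed to be (row, column); the environment's own
    indexing convention may differ.
    """
    row, col = position
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError(f"position {position} is outside a {rows}x{cols} grid")
    grid = [[0] * cols for _ in range(rows)]
    grid[row][col] = 1
    return grid


agent_location = one_hot_grid(3, 4, (1, 2))
for line in agent_location:
    print(line)
```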
JSON Format (observation='SPaRC')
String-encoded JSON representing the grid state with symbolic notation, following the original SPaRC specification.
Reward Structure
- +1.0: Successfully solving the puzzle
- -1.0: Invalid termination or failure state
- +0.01: Incremental reward for remaining on valid solution path (encourages exploration while maintaining progress)
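As a worked example of how these rewards could add up over an episode, the sketch below sums the per-step bonus and the terminal reward. It assumes the +0.01 bonus is paid once for every step taken on a valid path, that the terminal reward is added on top, and that a truncated episode earns no terminal reward; none of these details are confirmed by the description above.

```python
import math

STEP_BONUS = 0.01  # incremental reward per step on a valid path
TERMINAL_REWARD = {
    "success": 1.0,    # puzzle solved
    "failure": -1.0,   # invalid termination or failure state
    "truncated": 0.0,  # assumption: no terminal reward at the step limit
}


def episode_return(valid_steps, outcome):
    """Undiscounted return of an episode under the assumptions above."""
    return valid_steps * STEP_BONUS + TERMINAL_REWARD[outcome]


# Ten valid steps followed by a solved puzzle: 10 * 0.01 + 1.0
print(round(episode_return(10, "success"), 2))
```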
Episode Termination
- Success: Agent reaches target location satisfying all puzzle constraints
- Failure: Agent enters invalid state or violates puzzle rules
- Truncation: Maximum step limit reached
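Because an episode can end through termination (success or failure) or through truncation (step limit), an episode loop should check both flags. The following sketch shows one way to do this for any object exposing the Gymnasium reset/step interface. StubEnv is a hypothetical stand-in used only so the example runs without the dataset; it is not the Spatial-Gym environment.

```python
def run_episode(env, policy):
    """Run one episode and report how it ended."""
    observation, info = env.reset()
    total_reward, steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
    return {
        "return": total_reward,
        "steps": steps,
        "terminated": terminated,
        "truncated": truncated,
    }


class StubEnv:
    """Hypothetical environment that never terminates and truncates at max_steps."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, options=None):
        self.steps = 0
        return None, {}

    def step(self, action):
        self.steps += 1
        truncated = self.steps >= self.max_steps
        return None, 0.0, False, truncated, {}


result = run_episode(StubEnv(max_steps=5), policy=lambda observation: 0)
print(result)
```

A loop that only checks terminated would never exit on this stub, which is the situation the truncation flag exists to prevent.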
API Reference
Core Methods
env.reset(options: Optional[Dict] = None) -> Tuple[Observation, Dict]
Initializes or resets the environment to a new puzzle state.
- Parameters:
  - options: Optional dictionary with a 'puzzle_id' key to load a specific puzzle
- Returns: Initial observation and info dictionary
env.step(action: int) -> Tuple[Observation, float, bool, bool, Dict]
Executes one environment step given an action.
- Parameters:
  - action: Integer in range [0, 3] representing directional move
- Returns: Observation, reward, terminated flag, truncated flag, info dictionary
env.render() -> Optional[np.ndarray]
Generates visual or textual representation of current state based on render_mode.
env.close()
Releases environment resources and closes rendering windows.
Experimental Setup
Dataset
The environment uses puzzles from the SPaRC dataset, which contains spatial reasoning challenges of varying complexity. Each puzzle is defined by:
- Grid dimensions (variable size)
- Initial agent position
- Target position
- Spatial constraints (gaps, regions, colored elements)
- Solution paths of varying lengths
Evaluation Metrics
The info dictionary returned by step() and reset() contains:
- Success: Binary indicator of puzzle completion (averaged over episodes, this yields the success rate)
- Path Length: Number of steps taken
- Optimality: Ratio of actual path length to shortest possible path
- Invalid Actions: Count of rule violations
- Puzzle Metadata: Difficulty rating, constraint types, grid size
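Per-episode values like these can be aggregated into benchmark-level numbers. The sketch below is one possible aggregation, not the package's own tooling. The 'success' and 'steps' keys match those used in the logging example in this README, while 'shortest_path' is a hypothetical key standing in for whatever field carries the optimal path length.

```python
def summarize(episodes):
    """Aggregate a list of per-episode info dictionaries."""
    if not episodes:
        raise ValueError("no episodes to summarize")
    n = len(episodes)
    success_rate = sum(1 for e in episodes if e["success"]) / n
    mean_steps = sum(e["steps"] for e in episodes) / n
    # Optimality is only meaningful for solved puzzles with a known optimum.
    ratios = [
        e["steps"] / e["shortest_path"]
        for e in episodes
        if e["success"] and e.get("shortest_path")
    ]
    mean_optimality = sum(ratios) / len(ratios) if ratios else None
    return {
        "success_rate": success_rate,
        "mean_steps": mean_steps,
        "mean_optimality": mean_optimality,
    }


episodes = [
    {"success": True, "steps": 12, "shortest_path": 10},
    {"success": False, "steps": 30, "shortest_path": 10},
]
summary = summarize(episodes)
print(summary)
```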
Use Cases
Benchmarking LLM Spatial Reasoning
import gymnasium as gym
import Spatial_Gym
from your_llm_wrapper import LLMAgent  # Placeholder: substitute your own LLM wrapper

env = gym.make("Spatial-Gym", render_mode='llm', observation='SPaRC')
agent = LLMAgent(model="gpt-4")

observation, info = env.reset()
for _ in range(100):  # Evaluate on 100 puzzles
    done = False
    while not done:
        action = agent.predict(env.render())  # LLM sees text representation
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    # Log metrics
    print(f"Puzzle {info['puzzle_id']}: Success={info['success']}, Steps={info['steps']}")
    observation, info = env.reset()
env.close()
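The loop above expects the agent to return an integer action. If an LLM wrapper returns free text instead, the reply has to be mapped to one of the four action values first. The sketch below shows one simple approach. The parse_action helper is hypothetical, and taking the last direction word in the reply is a design assumption rather than part of the package.

```python
import re

# Action values follow the action table in this README.
ACTION_IDS = {"right": 0, "up": 1, "left": 2, "down": 3}
_DIRECTION = re.compile(r"\b(right|up|left|down)\b", re.IGNORECASE)


def parse_action(reply):
    """Return the action id for the last direction word in `reply`, or None."""
    matches = _DIRECTION.findall(reply)
    if not matches:
        return None
    return ACTION_IDS[matches[-1].lower()]


print(parse_action("The path to the left is blocked, so I move UP."))
```

Returning None for unparseable replies lets the caller decide whether to re-prompt the model or count the step as an invalid action.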
Reinforcement Learning Training
import gymnasium as gym
import Spatial_Gym
from stable_baselines3 import PPO

env = gym.make("Spatial-Gym", observation='new', max_steps=500)
model = PPO("MultiInputPolicy", env, verbose=1)  # MultiInputPolicy handles dictionary observations
model.learn(total_timesteps=1_000_000)
model.save("spatial_reasoning_agent")
Repository Structure
Spatial-Gym/
├── Spatial_Gym/ # Core environment implementation
│ ├── __init__.py # Package initialization
│ ├── Spatial_Gym.py # Main environment class
│ ├── register_env.py # Gymnasium registration
│ └── render/ # Rendering modules
│ ├── human_renderer.py
│ └── llm_renderer.py
├── llm_testing/ # LLM evaluation utilities
│ ├── llm_host.py # LLM interaction wrapper
│ └── parse_logs.py # Result analysis tools
├── Final_Product.py # Interactive demo script
├── human_play.py # Human player interface
├── pyproject.toml # Package configuration
└── README.md # This file
Testing
Spatial-Gym includes a comprehensive test suite to ensure environment stability and correctness.
Running Tests
# Install with test dependencies
pip install -e ".[test]"
# Run all tests
pytest tests/ -v
# Run specific test categories
pytest tests/test_environment.py -v # Environment API tests
pytest tests/test_random_agent.py -v # Random agent tests
pytest tests/test_predefined_paths.py -v # Path validation tests
# Run with coverage
pytest tests/ --cov=Spatial_Gym --cov-report=html
Test Coverage
The test suite includes 43+ tests covering:
- ✅ Environment initialization and configuration
- ✅ Gymnasium API compliance
- ✅ Random agent behavior (stress tests)
- ✅ Predefined valid and invalid paths
- ✅ Multi-episode stability
- ✅ Different observation formats
- ✅ Rendering modes
Continuous Integration
Tests automatically run on:
- Every push and pull request
- Multiple OS (Ubuntu, macOS)
- Python versions 3.9, 3.10, 3.11
See tests/README.md for detailed testing documentation.
Citation
If you use Spatial-Gym in your research, please cite:
@software{spatial_gym2024,
title={Spatial-Gym: A Gymnasium Environment for Spatial Reasoning Benchmarking},
author={Kaesberg, Lars Benedikt and Mark, Tobias},
year={2024},
url={https://github.com/lkaesberg/Spatial-Gym}
}
For the underlying SPaRC dataset and puzzles:
@inproceedings{kaesberg2024sparc,
title={SPaRC: Spatial Reasoning Challenges for Large Language Models},
author={Kaesberg, Lars Benedikt and others},
booktitle={Proceedings of ACL},
year={2024},
url={https://sparc.gipplab.org/}
}
Contributing
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/new-feature)
- Commit changes with descriptive messages
- Add tests for new functionality
- Submit a pull request
For bug reports and feature requests, please use the GitHub issue tracker.
License
This project is licensed under the MIT License - see the LICENCE file for details.
Acknowledgments
- Lars Benedikt Kaesberg (l.kaesberg@uni-goettingen.de) - Project conception and supervision
- Jan Philip Wahle - Project supervision
- Tobias Mark - Initial implementation and environment design
- SPaRC Team - Original puzzle dataset and framework (sparc.gipplab.org)
Contact
For questions, suggestions, or collaboration inquiries, please contact:
- Lars Benedikt Kaesberg: l.kaesberg@uni-goettingen.de
- GitHub Issues: https://github.com/lkaesberg/Spatial-Gym/issues