
A Gymnasium environment for benchmarking spatial reasoning capabilities of AI agents on grid-based puzzles

Spatial-Gym: A Gymnasium Environment for Spatial Reasoning Benchmarking

PyPI · Tests · License: MIT · Python 3.9+

Abstract

Spatial-Gym is a Gymnasium-compatible environment designed for evaluating spatial reasoning capabilities of Large Language Models (LLMs) and other AI agents. Built upon the spatial puzzle dataset introduced in SPaRC (Kaesberg et al.), this environment provides a standardized interface for benchmarking agent performance on grid-based spatial reasoning tasks. The environment supports multiple observation formats, customizable rendering modes for human and LLM interaction, and comprehensive evaluation metrics for systematic analysis of spatial reasoning abilities.

Key Features

  • Standardized RL Interface: Full Gymnasium API compliance for seamless integration with existing RL frameworks
  • Dual Observation Modes: Structured tensor representation or JSON-based symbolic encoding
  • Multi-Modal Rendering: Human-readable visualizations and LLM-optimized text representations
  • Flexible Dataset Support: Compatible with HuggingFace datasets following the SPaRC format
  • Comprehensive Metrics: Episode-level tracking of success rate, path efficiency, and reasoning patterns
  • Backtracking Support: Optional state reversibility for exploring different solution strategies

Installation

From PyPI

pip install Spatial-Gym

From Source

git clone https://github.com/lkaesberg/Spatial-Gym.git
cd Spatial-Gym
pip install -e .

Quick Start

import gymnasium as gym
import Spatial_Gym

# Initialize environment with default configuration
env = gym.make(
    "Spatial-Gym",
    df_name='lkaesberg/SPaRC',
    df_split='all',
    df_set='test',
    render_mode='human',
    observation='new',
    traceback=True,
    max_steps=1000
)

# Standard RL loop
observation, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()  # Replace with your agent
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # Also stop when max_steps truncates the episode
    env.render()

env.close()

Environment Configuration

Parameter     Type   Default             Description
df_name       str    'lkaesberg/SPaRC'   HuggingFace dataset identifier
df_split      str    'all'               Dataset split to use
df_set        str    'test'              Subset of data (train/val/test)
render_mode   str    None                Visualization mode: 'human', 'llm', or None
observation   str    'new'               Observation format: 'new' (tensor) or 'SPaRC' (JSON)
traceback     bool   False               Enable state reversibility
max_steps     int    2000                Maximum steps per episode
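
The defaults in the table can be collected in one place and validated before they are passed to gym.make. The sketch below is only an illustration: SpatialGymConfig and to_kwargs are hypothetical names that are not part of Spatial-Gym, and the validation rules are inferred from the table rather than taken from the package.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SpatialGymConfig:
    """Hypothetical container mirroring the documented keyword arguments."""

    df_name: str = "lkaesberg/SPaRC"
    df_split: str = "all"
    df_set: str = "test"
    render_mode: Optional[str] = None  # 'human', 'llm', or None
    observation: str = "new"           # 'new' (tensor) or 'SPaRC' (JSON)
    traceback: bool = False
    max_steps: int = 2000

    def __post_init__(self):
        # Reject values that the table does not list.
        if self.render_mode not in (None, "human", "llm"):
            raise ValueError(f"unknown render_mode: {self.render_mode!r}")
        if self.observation not in ("new", "SPaRC"):
            raise ValueError(f"unknown observation: {self.observation!r}")
        if self.max_steps <= 0:
            raise ValueError("max_steps must be positive")

    def to_kwargs(self) -> dict:
        """Return the fields as keyword arguments."""
        return asdict(self)


llm_config = SpatialGymConfig(render_mode="llm", observation="SPaRC", max_steps=500)
# With the package installed, the intended use would be:
# env = gym.make("Spatial-Gym", **llm_config.to_kwargs())
```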

Environment Specification

Action Space

Discrete(4): Four directional moves in the grid environment.

Action   Value   Description
RIGHT    0       Move agent one cell to the right
UP       1       Move agent one cell upward
LEFT     2       Move agent one cell to the left
DOWN     3       Move agent one cell downward
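
For agent code it can help to name the four action values instead of passing bare integers. The following is a minimal sketch; the Action enum, DELTAS, and apply_action are hypothetical helpers, and the (row, column) offsets assume that row 0 is the top of the grid, which the environment itself may or may not follow.

```python
from enum import IntEnum


class Action(IntEnum):
    """Action values as listed in the table above."""

    RIGHT = 0
    UP = 1
    LEFT = 2
    DOWN = 3


# Assumed (row, col) offsets, with row 0 at the top of the grid.
DELTAS = {
    Action.RIGHT: (0, 1),
    Action.UP: (-1, 0),
    Action.LEFT: (0, -1),
    Action.DOWN: (1, 0),
}


def apply_action(position, action):
    """Return the (row, col) cell reached from position, ignoring walls and rules."""
    d_row, d_col = DELTAS[Action(action)]
    return (position[0] + d_row, position[1] + d_col)
```

Because Action is an IntEnum, a member such as Action.LEFT can be passed directly wherever an integer action is expected.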

Observation Space

Tensor Format (observation='new')

A dictionary containing:

  • base (Dict[str, np.ndarray]): One-hot encoded spatial features
    • visited: Binary grid marking visited cells
    • gaps: Binary grid indicating traversable/non-traversable cells
    • agent_location: One-hot encoding of agent position
    • target_location: One-hot encoding of goal position
    • Additional puzzle-specific properties (e.g., stars, triangles)
  • color (np.ndarray): Integer grid (1-8) representing color properties
  • additional_info (np.ndarray): Puzzle-specific metadata (polyshape IDs, counts)
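
Many models expect a single array rather than a dictionary of grids. The sketch below stacks the entries of base into one (channels, height, width) array. It assumes every entry in base is a 2D grid of the same shape; stack_base_channels and example_obs are hypothetical and the example values are made up for illustration.

```python
import numpy as np


def stack_base_channels(obs):
    """Stack the grids in obs['base'] into one (C, H, W) float32 array."""
    names = sorted(obs["base"])  # Sorted keys give a reproducible channel order
    planes = [np.asarray(obs["base"][name], dtype=np.float32) for name in names]
    return names, np.stack(planes, axis=0)


# Hand-written 2x3 observation, shaped like the description above (assumed layout).
example_obs = {
    "base": {
        "visited": np.array([[1, 0, 0], [0, 0, 0]]),
        "gaps": np.zeros((2, 3), dtype=int),
        "agent_location": np.array([[1, 0, 0], [0, 0, 0]]),
        "target_location": np.array([[0, 0, 0], [0, 0, 1]]),
    },
    "color": np.zeros((2, 3), dtype=int),
}

names, tensor = stack_base_channels(example_obs)

# Recover the agent cell from its one-hot grid.
agent_cell = np.argwhere(example_obs["base"]["agent_location"] == 1)[0]
agent_row, agent_col = int(agent_cell[0]), int(agent_cell[1])
```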

JSON Format (observation='SPaRC')

String-encoded JSON representing the grid state with symbolic notation, following the original SPaRC specification.

Reward Structure

  • +1.0: Successfully solving the puzzle
  • -1.0: Invalid termination or failure state
  • +0.01: Incremental reward for remaining on valid solution path (encourages exploration while maintaining progress)
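
As a worked example of how these values combine, the sketch below computes an undiscounted episode return. It assumes the per-step bonus and the terminal reward simply add up and that a truncated episode receives no terminal reward; episode_return is a hypothetical helper, not part of the environment.

```python
SUCCESS_REWARD = 1.0
FAILURE_REWARD = -1.0
STEP_REWARD = 0.01

# Assumed terminal reward per outcome; truncation is taken to add nothing.
TERMINAL_REWARD = {
    "success": SUCCESS_REWARD,
    "failure": FAILURE_REWARD,
    "truncated": 0.0,
}


def episode_return(valid_steps, outcome):
    """Undiscounted return for an episode.

    valid_steps is the number of steps that earned the +0.01 bonus.
    """
    return valid_steps * STEP_REWARD + TERMINAL_REWARD[outcome]
```

Under these assumptions, 12 rewarded steps followed by a solved puzzle give 12 × 0.01 + 1.0 = 1.12, and 5 rewarded steps followed by a failure give 5 × 0.01 − 1.0 = −0.95.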

Episode Termination

  • Success: Agent reaches target location satisfying all puzzle constraints
  • Failure: Agent enters invalid state or violates puzzle rules
  • Truncation: Maximum step limit reached
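
The three cases can be told apart from the values returned by step(). This is a small sketch, assuming that the sign of the final reward distinguishes success (+1.0) from failure (-1.0); classify_outcome is a hypothetical helper.

```python
def classify_outcome(terminated, truncated, reward):
    """Label the state of an episode after a step.

    Assumes a terminal step with positive reward is a success and any
    other terminal step is a failure.
    """
    if terminated:
        return "success" if reward > 0 else "failure"
    if truncated:
        return "truncated"
    return "running"
```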

API Reference

Core Methods

env.reset(options: Optional[Dict] = None) -> Tuple[Observation, Dict]

Initializes or resets the environment to a new puzzle state.

  • Parameters:
    • options: Optional dictionary with 'puzzle_id' key to load specific puzzle
  • Returns: Initial observation and info dictionary

env.step(action: int) -> Tuple[Observation, float, bool, bool, Dict]

Executes one environment step given an action.

  • Parameters:
    • action: Integer in range [0, 3] representing directional move
  • Returns: Observation, reward, terminated flag, truncated flag, info dictionary

env.render() -> Optional[np.ndarray]

Generates visual or textual representation of current state based on render_mode.

env.close()

Releases environment resources and closes rendering windows.
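
The methods above can be combined into a single episode helper. The sketch below relies only on the reset and step signatures listed in this section; run_episode and StubEnv are hypothetical, the stub exists solely so the loop can be exercised without the dataset, and the type of puzzle_id (a string here) is an assumption.

```python
import random


def run_episode(env, policy, puzzle_id=None):
    """Play one episode and return (total_reward, steps, last_info)."""
    options = {"puzzle_id": puzzle_id} if puzzle_id is not None else None
    observation, info = env.reset(options=options)
    total_reward, steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps, info


class StubEnv:
    """Stand-in with the same reset/step signatures; it is not a real puzzle."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0
        self.last_options = None

    def reset(self, options=None):
        self.t = 0
        self.last_options = options
        return {"t": 0}, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        reward = 1.0 if terminated else 0.01
        return {"t": self.t}, reward, terminated, False, {"steps": self.t}


stub = StubEnv()
total, steps, info = run_episode(
    stub, lambda obs: random.randrange(4), puzzle_id="example-id"
)
```

With a real environment, env.close() would follow once all episodes are finished.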

Experimental Setup

Dataset

The environment uses puzzles from the SPaRC dataset, which contains spatial reasoning challenges of varying complexity. Each puzzle is defined by:

  • Grid dimensions (variable size)
  • Initial agent position
  • Target position
  • Spatial constraints (gaps, regions, colored elements)
  • Solution paths of varying lengths

Evaluation Metrics

The info dictionary returned by step() and reset() contains:

  • Success: Binary indicator of puzzle completion (average across episodes to obtain the success rate)
  • Path Length: Number of steps taken
  • Optimality: Ratio of actual path length to shortest possible path
  • Invalid Actions: Count of rule violations
  • Puzzle Metadata: Difficulty rating, constraint types, grid size
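
Per-episode values can be aggregated into benchmark-level numbers. The sketch below works on plain dictionaries collected at the end of each episode. The 'success' and 'steps' keys follow the logging example in Use Cases, while 'shortest' (the length of the shortest solution) is a hypothetical key used here to illustrate the optimality ratio.

```python
def summarize(episodes):
    """Aggregate per-episode records into success rate, mean steps, mean optimality."""
    n = len(episodes)
    if n == 0:
        return {"success_rate": 0.0, "mean_steps": 0.0, "mean_optimality": None}

    success_rate = sum(1 for e in episodes if e["success"]) / n
    mean_steps = sum(e["steps"] for e in episodes) / n

    # Optimality = actual path length / shortest path length, solved puzzles only.
    ratios = [
        e["steps"] / e["shortest"]
        for e in episodes
        if e["success"] and e.get("shortest")
    ]
    mean_optimality = sum(ratios) / len(ratios) if ratios else None

    return {
        "success_rate": success_rate,
        "mean_steps": mean_steps,
        "mean_optimality": mean_optimality,
    }


# Made-up records for illustration only.
example_episodes = [
    {"success": True, "steps": 12, "shortest": 10},
    {"success": False, "steps": 40, "shortest": 8},
    {"success": True, "steps": 9, "shortest": 9},
]
summary = summarize(example_episodes)
```

With this definition an optimality of 1.0 means the agent matched the shortest path, and larger values mean longer detours.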

Use Cases

Benchmarking LLM Spatial Reasoning

import gymnasium as gym
import Spatial_Gym
from your_llm_wrapper import LLMAgent  # Placeholder: supply your own wrapper; predict() must return an int in [0, 3]

env = gym.make("Spatial-Gym", render_mode='llm', observation='SPaRC')
agent = LLMAgent(model="gpt-4")

observation, info = env.reset()
for _ in range(100):  # Evaluate on 100 puzzles
    done = False
    while not done:
        action = agent.predict(env.render())  # LLM sees text representation
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    
    # Log metrics
    print(f"Puzzle {info['puzzle_id']}: Success={info['success']}, Steps={info['steps']}")
    observation, info = env.reset()
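
An LLM usually answers in free text, so a wrapper has to turn the reply into one of the four action values before calling env.step. The following is a minimal sketch of such a parser; parse_action is a hypothetical helper, and taking the last direction word in the reply is a design assumption (models often reason first and answer last), not the behaviour of llm_host.py.

```python
import re

ACTION_IDS = {"RIGHT": 0, "UP": 1, "LEFT": 2, "DOWN": 3}
_ACTION_RE = re.compile(r"\b(RIGHT|UP|LEFT|DOWN)\b", re.IGNORECASE)


def parse_action(reply):
    """Return the action id of the last direction word in reply, or None."""
    matches = _ACTION_RE.findall(reply)
    if not matches:
        return None  # Caller decides how to handle an unparseable reply
    return ACTION_IDS[matches[-1].upper()]
```

Returning None instead of a default move keeps unparseable replies visible, so they can be counted separately from rule violations.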

Reinforcement Learning Training

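import gymnasium as gym
import Spatial_Gym  # Imported as in the Quick Start so that "Spatial-Gym" can be created
from stable_baselines3 import PPO

# MultiInputPolicy is the Stable-Baselines3 policy for dictionary observations
env = gym.make("Spatial-Gym", observation='new', max_steps=500)
model = PPO("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("spatial_reasoning_agent")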

Repository Structure

Spatial-Gym/
├── Spatial_Gym/           # Core environment implementation
│   ├── __init__.py        # Package initialization
│   ├── Spatial_Gym.py     # Main environment class
│   ├── register_env.py    # Gymnasium registration
│   └── render/            # Rendering modules
│       ├── human_renderer.py
│       └── llm_renderer.py
├── llm_testing/           # LLM evaluation utilities
│   ├── llm_host.py        # LLM interaction wrapper
│   └── parse_logs.py      # Result analysis tools
├── Final_Product.py       # Interactive demo script
├── human_play.py          # Human player interface
├── pyproject.toml         # Package configuration
└── README.md              # This file

Testing

Spatial-Gym includes a comprehensive test suite to ensure environment stability and correctness.

Running Tests

# Install with test dependencies
pip install -e ".[test]"

# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/test_environment.py -v       # Environment API tests
pytest tests/test_random_agent.py -v      # Random agent tests
pytest tests/test_predefined_paths.py -v  # Path validation tests

# Run with coverage
pytest tests/ --cov=Spatial_Gym --cov-report=html

Test Coverage

The test suite includes 43+ tests covering:

  • ✅ Environment initialization and configuration
  • ✅ Gymnasium API compliance
  • ✅ Random agent behavior (stress tests)
  • ✅ Predefined valid and invalid paths
  • ✅ Multi-episode stability
  • ✅ Different observation formats
  • ✅ Rendering modes

Continuous Integration

Tests automatically run on:

  • Every push and pull request
  • Multiple OS (Ubuntu, macOS)
  • Python versions 3.9, 3.10, 3.11

See tests/README.md for detailed testing documentation.

Citation

If you use Spatial-Gym in your research, please cite:

@software{spatial_gym2024,
  title={Spatial-Gym: A Gymnasium Environment for Spatial Reasoning Benchmarking},
  author={Kaesberg, Lars Benedikt and Mark, Tobias},
  year={2024},
  url={https://github.com/lkaesberg/Spatial-Gym}
}

For the underlying SPaRC dataset and puzzles:

@inproceedings{kaesberg2024sparc,
  title={SPaRC: Spatial Reasoning Challenges for Large Language Models},
  author={Kaesberg, Lars Benedikt and others},
  booktitle={Proceedings of ACL},
  year={2024},
  url={https://sparc.gipplab.org/}
}

Contributing

We welcome contributions! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit changes with descriptive messages
  4. Add tests for new functionality
  5. Submit a pull request

For bug reports and feature requests, please use the GitHub issue tracker.

License

This project is licensed under the MIT License - see the LICENCE file for details.

Acknowledgments

  • Lars Benedikt Kaesberg (l.kaesberg@uni-goettingen.de) - Project conception and supervision
  • Jan Philip Wahle - Project supervision
  • Tobias Mark - Initial implementation and environment design
  • SPaRC Team - Original puzzle dataset and framework (sparc.gipplab.org)

Contact

For questions, suggestions, or collaboration inquiries, please contact:
