A collection of partially-observable procedural gym environments

POPGym: Partially Observable Process Gym

POPGym is designed to benchmark memory in deep reinforcement learning. It contains a set of environments and a collection of memory model baselines. The full paper is available on OpenReview.

Table of Contents

  1. POPGym Environments
    1. Setup
    2. Usage
    3. Table of Environments
    4. Environment Descriptions
  2. POPGym Baselines
    1. Setup
    2. Usage
    3. Available Baselines
  3. Leaderboard
  4. Contributing
  5. Citing

POPGym Environments

POPGym contains Partially Observable Markov Decision Process (POMDP) environments following the OpenAI Gym interface. Our environments follow a few basic tenets:

  1. Painless Setup - popgym environments require only gymnasium, numpy, and mazelib as dependencies
  2. Laptop-Sized Tasks - Most tasks can be solved in less than a day on the CPU
  3. True Generalization - All environments are heavily randomized.

Setup

You may install popgym via pip or from source.

Pip

# Works with python <= 3.10 due to mazelib dependency
pip install popgym

From Source

To install the environments:

git clone https://github.com/smorad/popgym
cd popgym
pip install .

Usage

import gymnasium as gym
import popgym
from popgym.wrappers import PreviousAction, Antialias, Markovian
from popgym.core.observability import Observability, STATE
# List all envs, see popgym/__init__.py 
env_classes = popgym.ALL_ENVS.keys()
print(env_classes)
env_names = [e["id"] for e in popgym.ALL_ENVS.values()]
print(env_names)
# Create env
env = popgym.envs.stateless_cartpole.StatelessCartPoleEasy()
# In POMDPs, we often condition on the last action along with the observation.
# We can do this using the PreviousAction wrapper.
wrapped_env = PreviousAction(env)
# To prevent observation aliasing during the first timestep of
# each episode (where the previous action is undefined), we can also 
# combine the PreviousAction wrapper with the Antialias wrapper
wrapped_env = Antialias(wrapped_env)
# Finally, we can decide if we want the hidden Markov state.
# This can be part of the observation, placed in the info dict, etc.
wrapped_env = Markovian(wrapped_env, Observability.FULL_IN_INFO_DICT)

wrapped_env.reset()
obs, reward, terminated, truncated, info = wrapped_env.step(wrapped_env.action_space.sample())
print(obs)
# Outputs:
# (
  ## Original observation
  # array([0.0348076 , 0.02231686], dtype=float32), 
  ## Previous action
  # 1, 
  ## Is initial timestep (antialias)
  # 0
# )

# Print the hidden Markov state
print(info[STATE])
# Outputs:
# array([ 0.0348076 ,  0.14814377,  0.02231686, -0.31778395], dtype=float32)
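
Beyond inspecting a single transition, running a whole episode follows the standard gymnasium loop. Below is a minimal sketch using the same environment class as above; the random policy and reward bookkeeping are purely illustrative.

import popgym

# Construct the environment directly via its class, as in the snippet above
env = popgym.envs.stateless_cartpole.StatelessCartPoleEasy()

obs, info = env.reset()
done = False
episode_reward = 0.0
while not done:
    # Random policy for illustration; swap in a memory-based agent here
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    episode_reward += reward
    done = terminated or truncated
print("Episode reward:", episode_reward)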

Table of Environments

| Environment | Tags | Temporal Ordering | Colab FPS | Macbook Air (2020) FPS |
|---|---|---|---|---|
| Battleship (Code) | Game | None | 117,158 | 235,402 |
| Concentration (Code) | Game | Weak | 47,515 | 157,217 |
| Higher Lower (Code) | Game, Noisy | None | 24,312 | 76,903 |
| Labyrinth Escape (Code) | Navigation | Strong | 1,399 | 41,122 |
| Labyrinth Explore (Code) | Navigation | Strong | 1,374 | 30,611 |
| Minesweeper (Code) | Game | None | 8,434 | 32,003 |
| Multiarmed Bandit (Code) | Noisy | None | 48,751 | 469,325 |
| Autoencode (Code) | Diagnostic | Strong | 121,756 | 251,997 |
| Count Recall (Code) | Diagnostic, Noisy | None | 16,799 | 50,311 |
| Repeat First (Code) | Diagnostic | None | 23,895 | 155,201 |
| Repeat Previous (Code) | Diagnostic | Strong | 50,349 | 136,392 |
| Stateless Cartpole (Code) | Control | Strong | 73,622 | 218,446 |
| Noisy Stateless Cartpole (Code) | Control, Noisy | Strong | 6,269 | 66,891 |
| Stateless Pendulum (Code) | Control | Strong | 8,168 | 26,358 |
| Noisy Stateless Pendulum (Code) | Control, Noisy | Strong | 6,808 | 20,090 |

We report the frames per second (FPS) for a single instance of each of our environments in the table above. With multiprocessing, environment FPS scales roughly linearly with the number of processes. Feel free to rerun this benchmark using this colab notebook.
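
If you just want a rough local estimate rather than rerunning the notebook, a simple timing loop like the one below is enough. This is an illustrative sketch, not the benchmark script used for the table, and the numbers will vary with your hardware.

import time

import popgym

def benchmark_fps(env, num_steps=100_000):
    # Roughly estimate single-instance steps per second for an environment
    env.reset()
    start = time.perf_counter()
    for _ in range(num_steps):
        _, _, terminated, truncated, _ = env.step(env.action_space.sample())
        if terminated or truncated:
            env.reset()
    return num_steps / (time.perf_counter() - start)

print(benchmark_fps(popgym.envs.stateless_cartpole.StatelessCartPoleEasy()))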

Environment Descriptions

Concentration

The quintessential memory game, sometimes known as "memory". A deck of cards is shuffled and placed face-down. The agent picks two cards to flip face up. If the cards match in rank, they are removed from play and the agent receives a reward; if they don't match, they are placed back face-down. The agent must remember where it has seen cards in the past.

Higher Lower

Guess whether the next card drawn from the deck is higher or lower than the previously drawn card. The agent should keep a running count, as in blackjack, and adjust its bets accordingly, although this game is significantly simpler than blackjack.

Battleship

One-player battleship. Select a grid square to attack and receive confirmation of whether you hit the target. The agent should use memory to remember which grid squares were hits and which were misses, allowing it to complete the episode sooner.

Multiarmed Bandit

Over an episode, solve a multiarmed bandit problem by maximizing the expected reward. The agent should use memory to keep a running mean and variance of bandits.
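
For reference, a running mean and variance can be maintained online with Welford's algorithm; this is the kind of statistic a memory model would need to learn to approximate. The sketch below is generic and not part of the popgym API.

class RunningStats:
    # Welford's online algorithm for a running mean and variance

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, reward):
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self):
        # Sample variance; zero until at least two observations are seen
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0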

Minesweeper

Classic minesweeper, but with reduced vision range. The agent can only see the surroundings of its last sweep. The agent must use memory to remember where the bombs are.

Repeat Previous

Output the observation seen k timesteps ago (i.e., the observation from timestep t - k) for a reward.

Repeat First

Output the zeroth (very first) observation for a reward.

Autoencode

The agent receives k observations, then must output them in the same order.

Stateless Cartpole

Classic cartpole, except the velocity and angular velocity magnitudes are hidden. The agent must use memory to compute rates of change.

Noisy Stateless Cartpole

Stateless Cartpole with added Gaussian noise.

Stateless Pendulum

Classic pendulum, but the velocity and angular velocity are hidden from the agent. The agent must use memory to compute rates of change.

Noisy Stateless Pendulum

Stateless Pendulum with added Gaussian noise.

Labyrinth Escape

Escape randomly-generated labyrinths. The agent must remember wrong turns it has taken to find the exit.

Labyrinth Explore

Explore as much of the labyrinth as possible in the time given. The agent must remember where it has been to maximize reward.

Count Recall

The player is given a sequence of cards and is asked to recall how many times it has seen a specific card.
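
Every environment above can be constructed either through its class (as in the Usage section) or through its registered id. The sketch below assumes that importing popgym registers the popgym-* ids with gymnasium, matching the id pattern used elsewhere in this README.

import gymnasium as gym

import popgym

# Look up a registered id from the ALL_ENVS mapping rather than hard-coding one
env_id = next(iter(popgym.ALL_ENVS.values()))["id"]
env = gym.make(env_id)
obs, info = env.reset()
print(env_id, env.observation_space, env.action_space)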

POPGym Baselines

POPGym Baselines implements recurrent and memory model baselines efficiently. It is built on top of rllib using its custom model API.

Setup

To install the baselines and their dependencies, first install ray:

pip install "ray[rllib]==2.0.0"

ray must be installed separately, as it erroneously pins an old version of gym and will cause dependency issues. Once ray is installed, install popgym either via pip or from source.

Pip

pip install "popgym[baselines]"

From Source

git clone https://github.com/smorad/popgym
cd popgym
pip install ".[baselines]"

Usage

Our baselines exist in the ray_models directory. Here is how to use the GRU model with rllib.

import popgym
import ray
from torch import nn
from popgym.baselines.ray_models.ray_gru import GRU
# See what GRU-specific hyperparameters we can set
print(GRU.MODEL_CONFIG)
# Show other settable model hyperparameters like 
# what the actor/critic branches look like,
# what hidden size to use, 
# whether to add a positional embedding, etc.
print(GRU.BASE_CONFIG)
# How long the temporal window for backprop is
# This doesn't need to be longer than 1024
bptt_size = 1024
config = {
   "model": {
      "max_seq_len": bptt_size,
      "custom_model": GRU,
      "custom_model_config": {
        # Override defaults from BASE_CONFIG below
        # The input and output sizes of the MLP feeding the memory model
        "preprocessor_input_size": 128,
        "preprocessor_output_size": 64,
        "preprocessor": nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
        # this is the size of the recurrent state in most cases
        "hidden_size": 128,
        # We should also change other parts of the architecture to use
        # this new hidden size
        # For the GRU, the output is of size hidden_size
        "postprocessor": nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
        "postprocessor_output_size": 64,
        # Actor and critic networks
        "actor": nn.Linear(64, 64),
        "critic": nn.Linear(64, 64),
        # We can also override GRU-specific hyperparams
        "num_recurrent_layers": 1,
      },
   },
   # Some other rllib defaults you might want to change
   # See https://docs.ray.io/en/latest/rllib/rllib-training.html#common-parameters
   # for a full list of rllib settings
   # 
   # These should be a multiple of bptt_size
   "sgd_minibatch_size": bptt_size * 4,
   # Should be a multiple of sgd_minibatch_size
   "train_batch_size": bptt_size * 8,
   # The environment we are training on
   "env": "popgym-ConcentrationEasy-v0",
   # You probably don't want to change these values
   "rollout_fragment_length": bptt_size,
   "framework": "torch",
   "horizon": bptt_size,
   "batch_mode": "complete_episodes",
}
# Stop after 50k environment steps
ray.tune.run("PPO", config=config, stop={"timesteps_total": 50_000})

To add your own custom model, inherit from BaseModel and implement the initial_state and memory_forward functions, as well as define your model configuration using MODEL_CONFIG. To use any of these or your own custom model in ray, make it the custom_model in the rllib config.
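
A skeleton of a custom model might look like the sketch below. The base_model import path and the memory_forward argument list are assumptions modeled on the GRU baseline, so check popgym/baselines/ray_models/ for the actual interface before copying this.

import torch

# Assumed import path for the base class, alongside ray_gru and friends
from popgym.baselines.ray_models.base_model import BaseModel


class MyMemoryModel(BaseModel):
    # Model-specific hyperparameters, analogous to GRU.MODEL_CONFIG
    MODEL_CONFIG = {"my_hyperparam": 8}

    def initial_state(self):
        # Recurrent state(s) at the start of an episode; the shape here is
        # illustrative and depends on your model
        return [torch.zeros(128)]

    def memory_forward(self, z, state, t_starts, seq_lens):
        # z: features produced by the preprocessor, shaped [batch, time, feature]
        # state: list of recurrent state tensors from the previous call
        # The signature here is an assumption; return (features, new_state)
        output, new_state = z, state  # identity placeholder, i.e. no memory
        return output, new_state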

Available Baselines

  1. MLP
  2. Positional MLP
  3. Framestacking (Paper)
  4. Temporal Convolution Networks (Paper)
  5. Elman Networks (Paper)
  6. Long Short-Term Memory (Paper)
  7. Gated Recurrent Units (Paper)
  8. Independently Recurrent Neural Networks (Paper)
  9. Fast Autoregressive Transformers (Paper)
  10. Fast Weight Programmers (Paper)
  11. Legendre Memory Units (Paper)
  12. Diagonal State Space Models (Paper)
  13. Differentiable Neural Computers (Paper)
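
Swapping between these baselines only requires changing the custom_model in the rllib config from the Usage section. The ray_lstm module and LSTM class below are assumptions following the ray_gru naming pattern; check popgym/baselines/ray_models/ for the real file and class names.

import ray

from popgym.baselines.ray_models.ray_gru import GRU
# Assumed module/class name mirroring the GRU import above
from popgym.baselines.ray_models.ray_lstm import LSTM

# `config` is the rllib config dict from the Usage section above
for model_cls in (GRU, LSTM):
    config["model"]["custom_model"] = model_cls
    ray.tune.run("PPO", config=config, stop={"timesteps_total": 50_000})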

Leaderboard

We provide a leaderboard of the best-performing model on each environment. Using ppo.py, we run at least 3 trials of each model on each environment. Within each trial, we compute the mean episodic reward over each training batch and record the maximum reached. We report the mean and standard deviation of these maximums across the trials.
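
In other words, the reported statistic is the mean and standard deviation, across trials, of each trial's best batch-mean episodic reward. The sketch below shows that computation on toy numbers and is not the actual evaluation script.

import numpy as np

# One row per trial, one entry per training batch: the mean episodic reward
# of that batch (toy values, purely illustrative)
batch_mean_rewards = np.array([
    [0.1, 0.4, 0.6, 0.55],
    [0.0, 0.3, 0.5, 0.65],
    [0.2, 0.5, 0.45, 0.6],
])

per_trial_max = batch_mean_rewards.max(axis=1)    # best batch in each trial
print(per_trial_max.mean(), per_trial_max.std())  # values reported on the leaderboard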

The leaderboard is hosted at paperswithcode.

Contributing

Steps to follow:

  1. Fork this repo on GitHub
  2. Clone your fork to your machine
  3. Move your environment into the forked repo
  4. Install pre-commit in the fork (see below)
  5. Write a unittest in tests/; see the other tests for examples and the sketch after the pre-commit snippet below
  6. Add your environment to ALL_ENVS in popgym/__init__.py
  7. Make sure you don't break any tests by running pytest tests/
  8. Git commit and push to your fork
  9. Open a pull request on GitHub
# Step 4. Install pre-commit in the fork
pip install pre-commit
git clone https://github.com/smorad/popgym
cd popgym
pre-commit install
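
For step 5, a minimal test can simply construct your environment and step it with random actions, as in the sketch below. The module and class names are placeholders, and the existing files in tests/ are the authoritative reference.

from popgym.envs.my_new_env import MyNewEnv  # placeholder module/class name


def test_my_new_env_random_rollout():
    env = MyNewEnv()
    obs, info = env.reset()
    assert env.observation_space.contains(obs)
    for _ in range(100):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        assert env.observation_space.contains(obs)
        if terminated or truncated:
            obs, info = env.reset()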

Citing

@inproceedings{
morad2023popgym,
title={{POPG}ym: Benchmarking Partially Observable Reinforcement Learning},
author={Steven Morad and Ryan Kortvelesy and Matteo Bettini and Stephan Liwicki and Amanda Prorok},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=chDrutUTs0K}
}
