A collection of partially-observable procedural gym environments
POPGym: Partially Observable Process Gym
POPGym is designed to benchmark memory in deep reinforcement learning. It contains a set of environments and a collection of memory model baselines. The full paper is available on OpenReview.
POPGym Environments
POPGym contains Partially Observable Markov Decision Process (POMDP) environments following the OpenAI Gym interface. Our environments follow a few basic tenets:
- Painless Setup - popgym environments require only gymnasium, numpy, and mazelib as dependencies
- Laptop-Sized Tasks - Most tasks can be solved in less than a day on the CPU
- True Generalization - All environments are heavily randomized.
Setup
You may install popgym via pip or from source.
Pip
# Works with python <= 3.10 due to mazelib dependency
pip install popgym
From Source
To install the environments:
git clone https://github.com/smorad/popgym
cd popgym
pip install .
Usage
import gymnasium as gym
import popgym
from popgym.wrappers import PreviousAction, Antialias, Markovian
from popgym.core.observability import Observability, STATE
# List all envs, see popgym/__init__.py
env_classes = popgym.ALL_ENVS.keys()
print(env_classes)
env_names = [e["id"] for e in popgym.ALL_ENVS.values()]
print(env_names)
# Create env
env = popgym.envs.stateless_cartpole.StatelessCartPoleEasy()
# In POMDPs, we often condition on the last action along with the observation.
# We can do this using the PreviousAction wrapper.
wrapped_env = PreviousAction(env)
# To prevent observation aliasing during the first timestep of
# each episode (where the previous action is undefined), we can also
# combine the PreviousAction wrapper with the Antialias wrapper
wrapped_env = Antialias(wrapped_env)
# Finally, we can decide if we want the hidden Markov state.
# This can be part of the observation, placed in the info dict, etc.
wrapped_env = Markovian(wrapped_env, Observability.FULL_IN_INFO_DICT)
wrapped_env.reset()
obs, reward, terminated, truncated, info = wrapped_env.step(wrapped_env.action_space.sample())
print(obs)
# Outputs:
# (
## Original observation
# array([0.0348076 , 0.02231686], dtype=float32),
## Previous action
# 1,
## Is initial timestep (antialias)
# 0
# )
# Print the hidden Markov state
print(info[STATE])
# Outputs:
# array([ 0.0348076 , 0.14814377, 0.02231686, -0.31778395], dtype=float32)
Table of Environments
Environment | Tags | Temporal Ordering | Colab FPS | MacBook Air (2020) FPS |
---|---|---|---|---|
Battleship (Code) | Game | None | 117,158 | 235,402 |
Concentration (Code) | Game | Weak | 47,515 | 157,217 |
Higher Lower (Code) | Game, Noisy | None | 24,312 | 76,903 |
Labyrinth Escape (Code) | Navigation | Strong | 1,399 | 41,122 |
Labyrinth Explore (Code) | Navigation | Strong | 1,374 | 30,611 |
Minesweeper (Code) | Game | None | 8,434 | 32,003 |
Multiarmed Bandit (Code) | Noisy | None | 48,751 | 469,325 |
Autoencode (Code) | Diagnostic | Strong | 121,756 | 251,997 |
Count Recall (Code) | Diagnostic, Noisy | None | 16,799 | 50,311 |
Repeat First (Code) | Diagnostic | None | 23,895 | 155,201 |
Repeat Previous (Code) | Diagnostic | Strong | 50,349 | 136,392 |
Stateless Cartpole (Code) | Control | Strong | 73,622 | 218,446 |
Noisy Stateless Cartpole (Code) | Control, Noisy | Strong | 6,269 | 66,891 |
Stateless Pendulum (Code) | Control | Strong | 8,168 | 26,358 |
Noisy Stateless Pendulum (Code) | Control, Noisy | Strong | 6,808 | 20,090 |
We report the frames per second (FPS) for a single instance of each of our environments in the table above. With multiprocessing, environment FPS scales roughly linearly with the number of processes. Feel free to rerun this benchmark using this colab notebook.
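As a rough illustration, here is a minimal sketch of how one might time single-instance FPS (the linked colab notebook remains the canonical benchmark; the environment id below is just one example, assuming that importing popgym registers the popgym-* ids with gymnasium):
import time
import gymnasium as gym
import popgym  # assumed to register the popgym-* ids with gymnasium on import

env = gym.make("popgym-ConcentrationEasy-v0")
env.reset()
steps = 100_000
start = time.time()
for _ in range(steps):
    _, _, terminated, truncated, _ = env.step(env.action_space.sample())
    if terminated or truncated:
        env.reset()
print(f"FPS: {steps / (time.time() - start):.0f}")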
Environment Descriptions
Concentration
The quintessential memory game, sometimes known as "memory". A deck of cards is shuffled and placed face-down. The agent picks two cards to flip face-up; if their ranks match, the cards are removed from play and the agent receives a reward. If they do not match, they are placed back face-down. The agent must remember where it has seen cards in the past.
Higher Lower
Guess whether the next card drawn from the deck is higher or lower than the previously drawn card. The agent should keep a running count, as in blackjack, and adjust its bets accordingly, though this game is significantly simpler than blackjack.
Battleship
One-player battleship. Select a grid square to launch an attack and receive confirmation of whether you hit the target. The agent should use memory to remember which grid squares were hits and which were misses, allowing it to complete an episode sooner.
Multiarmed Bandit
Over an episode, solve a multiarmed bandit problem by maximizing the expected reward. The agent should use memory to keep a running mean and variance for each arm.
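For illustration, a hedged sketch of the kind of running statistic the agent's memory must track, using Welford's online update (the arm count and reward values here are made up):
import numpy as np

num_arms = 10  # illustrative only; not the actual environment setting

# Running statistics per arm: pull count, mean, and sum of squared deviations
counts = np.zeros(num_arms)
means = np.zeros(num_arms)
m2 = np.zeros(num_arms)

def update(arm: int, reward: float) -> None:
    # Welford's online update of mean and variance for one arm
    counts[arm] += 1
    delta = reward - means[arm]
    means[arm] += delta / counts[arm]
    m2[arm] += delta * (reward - means[arm])

update(arm=3, reward=1.0)
update(arm=3, reward=0.0)
variance = m2[3] / counts[3]  # population variance of arm 3's observed rewards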
Minesweeper
Classic minesweeper, but with reduced vision range. The agent only has vision of the surroundings near its last sweep. The agent must use memory to remember where the bombs are.
Repeat Previous
At timestep t, output the observation from k steps ago (the (t-k)th observation) for a reward.
Repeat First
Output the zeroth (very first) observation of the episode for a reward.
Autoencode
The agent receives k observations, then must output them in the same order.
Stateless Cartpole
Classic cartpole, except the velocity and angular velocity are hidden from the agent. The agent must use memory to compute rates of change.
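As a rough sketch of what the memory must recover, the hidden velocity can be approximated by finite differences over consecutive position observations (the timestep dt below is an assumed, illustrative value):
dt = 0.02  # illustrative timestep; the true environment timestep may differ
prev_position, position = 0.031, 0.035  # two consecutive cart positions
velocity_estimate = (position - prev_position) / dt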
Noisy Stateless Cartpole
Stateless Cartpole with added Gaussian noise.
Stateless Pendulum
Classic pendulum, but the velocity and angular velocity are hidden from the agent. The agent must use memory to compute rates of change.
Noisy Stateless Pendulum
Stateless Pendulum with added Gaussian noise.
Labyrinth Escape
Escape randomly-generated labyrinths. The agent must remember wrong turns it has taken to find the exit.
Labyrinth Explore
Explore as much of the labyrinth as possible in the time given. The agent must remember where it has been to maximize reward.
Count Recall
The agent is given a sequence of cards and is asked to recall how many times it has seen a specific card.
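Each environment family above is registered under an id of the form shown in the Usage section (e.g. popgym-ConcentrationEasy-v0); the Easy suffix suggests several difficulty variants per family, and the registered ids can be listed by filtering popgym.ALL_ENVS, as in this sketch:
import popgym

# List every registered id belonging to one environment family
concentration_ids = [
    v["id"] for v in popgym.ALL_ENVS.values() if "Concentration" in v["id"]
]
print(concentration_ids)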
POPGym Baselines
POPGym baselines implements recurrent and memory models in an efficient manner. The baselines are built on top of rllib using its custom model API.
Setup
To install the baselines and dependencies, first install ray
pip install "ray[rllib]==2.0.0"
ray must be installed separately, as it erroneously pins an old version of gym and will cause dependency issues. Once ray is installed, install popgym either via pip or from source.
Pip
pip install "popgym[baselines]"
From Source
git clone https://github.com/smorad/popgym
cd popgym
pip install ".[baselines]"
Usage
Our baselines exist in the ray_models directory. Here is how to use the GRU model with rllib.
import popgym
import ray
from torch import nn
from popgym.baselines.ray_models.ray_gru import GRU
# See what GRU-specific hyperparameters we can set
print(GRU.MODEL_CONFIG)
# Show other settable model hyperparameters like
# what the actor/critic branches look like,
# what hidden size to use,
# whether to add a positional embedding, etc.
print(GRU.BASE_CONFIG)
# How long the temporal window for backprop is
# This doesn't need to be longer than 1024
bptt_size = 1024
config = {
"model": {
"max_seq_len": bptt_size,
"custom_model": GRU,
"custom_model_config": {
# Override the hidden_size from BASE_CONFIG
# The input and output sizes of the MLP feeding the memory model
"preprocessor_input_size": 128,
"preprocessor_output_size": 64,
"preprocessor": nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
# this is the size of the recurrent state in most cases
"hidden_size": 128,
# We should also change other parts of the architecture to use
# this new hidden size
# For the GRU, the output is of size hidden_size
"postprocessor": nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
"postprocessor_output_size": 64,
# Actor and critic networks
"actor": nn.Linear(64, 64),
"critic": nn.Linear(64, 64),
# We can also override GRU-specific hyperparams
"num_recurrent_layers": 1,
},
},
# Some other rllib defaults you might want to change
# See https://docs.ray.io/en/latest/rllib/rllib-training.html#common-parameters
# for a full list of rllib settings
#
# These should be a factor of bptt_size
"sgd_minibatch_size": bptt_size * 4,
# Should be a factor of sgd_minibatch_size
"train_batch_size": bptt_size * 8,
# The environment we are training on
"env": "popgym-ConcentrationEasy-v0",
# You probably don't want to change these values
"rollout_fragment_length": bptt_size,
"framework": "torch",
"horizon": bptt_size,
"batch_mode": "complete_episodes",
}
# Stop after 50k environment steps
ray.tune.run("PPO", config=config, stop={"timesteps_total": 50_000})
To add your own custom model, inherit from BaseModel and implement the initial_state and memory_forward functions, as well as define your model configuration using MODEL_CONFIG. To use any of these or your own custom model in ray, make it the custom_model in the rllib config.
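A minimal sketch of that wiring, assuming BaseModel lives alongside the other ray_models (the import path and MyModel internals are illustrative, not the library's exact interface; see base_model.py and ray_gru.py for the real signatures to implement):
from popgym.baselines.ray_models.base_model import BaseModel  # assumed path


class MyModel(BaseModel):
    # Model-specific hyperparameters, analogous to GRU.MODEL_CONFIG
    MODEL_CONFIG = {"my_hyperparam": 1}
    # Implement initial_state() and memory_forward() here, mirroring the
    # signatures used by the shipped baselines (e.g. ray_gru.GRU)


config = {
    "model": {
        "max_seq_len": 1024,
        "custom_model": MyModel,
        "custom_model_config": {"my_hyperparam": 2},
    },
    "env": "popgym-ConcentrationEasy-v0",
    "framework": "torch",
}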
Available Baselines
- MLP
- Positional MLP
- Framestacking (Paper)
- Temporal Convolution Networks (Paper)
- Elman Networks (Paper)
- Long Short-Term Memory (Paper)
- Gated Recurrent Units (Paper)
- Independently Recurrent Neural Networks (Paper)
- Fast Autoregressive Transformers (Paper)
- Fast Weight Programmers (Paper)
- Legendre Memory Units (Paper)
- Diagonal State Space Models (Paper)
- Differentiable Neural Computers (Paper)
Leaderboard
We provide a leaderboard of the best model in each environment. Using ppo.py, we run at least 3 trials of each model-environment combination. We compute the mean episodic reward over each training batch and store the maximum attained over training. We report the mean and standard deviation of these maximums, taken from at least 3 distinct trials.
The leaderboard is hosted at paperswithcode.
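As a sketch of that computation (the reward values below are made up):
import numpy as np

# Hypothetical per-batch mean episodic rewards for 3 trials of one model
trial_curves = [
    np.array([0.1, 0.3, 0.5, 0.4]),
    np.array([0.0, 0.2, 0.6, 0.5]),
    np.array([0.2, 0.4, 0.4, 0.7]),
]
# Maximum mean episodic reward per trial, then mean and std across trials
maxes = np.array([curve.max() for curve in trial_curves])
print(maxes.mean(), maxes.std())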
Contributing
Steps to follow:
- Fork this repo on GitHub
- Clone your fork to your machine
- Move your environment into the forked repo
- Install pre-commit in the fork (see below)
- Write a unittest in tests/, see other tests for examples (a hedged test sketch follows the pre-commit snippet below)
- Add your environment to ALL_ENVS in popgym/__init__.py
- Make sure you don't break any tests by running pytest tests/
- Git commit and push to your fork
- Open a pull request on GitHub
# Step 4. Install pre-commit in the fork
pip install pre-commit
git clone https://github.com/smorad/popgym
cd popgym
pre-commit install
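For the unittest step, a minimal test sketch, assuming your environment follows the gymnasium API (the module and class names below are placeholders; mirror the existing tests in tests/ for the project's actual conventions):
# tests/test_my_env.py -- illustrative skeleton only
from popgym.envs.my_env import MyEnv  # placeholder module and class for your new env


def test_reset_and_step():
    env = MyEnv()
    obs, info = env.reset()
    assert obs in env.observation_space
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    assert obs in env.observation_space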
Citing
@inproceedings{
morad2023popgym,
title={{POPG}ym: Benchmarking Partially Observable Reinforcement Learning},
author={Steven Morad and Ryan Kortvelesy and Matteo Bettini and Stephan Liwicki and Amanda Prorok},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=chDrutUTs0K}
}