Safety benchmarks for reinforcement learning

## Project description

# SafeLife

SafeLife is a novel environment to test the safety of reinforcement learning agents. The long term goal of this project is to develop training environments and benchmarks for numerous technical reinforcement learning safety problems, with the following attributes:

• Controllable difficulty for the environment

• Controllable difficulty for safety constraints

• Procedurally generated levels with richly adjustable distributions of mechanics and phenomena to reduce overfitting

The initial SafeLife version 1.0 (and the roadmap for the next few releases) focuses at first on the problem of side effects: how can one specify that an agent does whatever it needs to do to accomplish its goals, but nothing more? In SafeLife, an agent is tasked with creating or removing certain specified patterns, but its reward function is indifferent to its effects on other pre-existing patterns. A safe agent will learn to minimize its effects on those other patterns without explicitly being told to do so.

The SafeLife code base includes

• the environment definition (observations, available actions, and transitions between states);

• [example levels](./safelife/levels/), including benchmark levels;

• methods to procedurally generate new levels of varying difficulty;

• an implementation of proximal policy optimization to train reinforcement learning agents;

• a set of scripts to simplify [training on Google Cloud](./gcloud).

Minimizing side effects is very much an unsolved problem, and our baseline trained agents do not necessarily do a good job of it! The goal of SafeLife is to allow others to easily test their algorithms and improve upon the current state.

## Quick start

### Standard installation

SafeLife requires Python 3.5 or better. If you wish to install in a clean environment, it’s recommended to use [python virtual environments](https://docs.python.org/3/library/venv.html). You can install SafeLife globally using

pip3 install safelife

If you wish to save training or benchmark videos (using env_wrappers.RecordingSafeLifeWrapper), you’ll also need to install [ffmpeg](https://ffmpeg.org) (e.g., sudo apt-get install ffmpeg or brew install ffmpeg).

### Local installation

pip3 install -r requirements.txt python3 setup.py build_ext –inplace

This will download all of the requirements and build the C extensions in the safelife source folder. Note that you must have have a C compiler installed on your system to compile the extensions! This can be useful if forking the project or running the standard training scripts.

When running locally, console commands will need to use python3 -m safelife [args] instead of just safelife [args].

### Interactive play

To jump into a game, run

safelife play puzzles

All of the puzzle levels are solvable. See if you can do it without disturbing the green patterns!

(You can run safelife play –help to get help on the command-line options. More detail of how the game works is provided below, but it can be fun to try to figure out the basic mechanics yourself.)

### Training an agent

The start-training script is an easy way to get agents up and running using the default proximal policy optimization implementation. Just run

./start-training my-training-run

to start training locally with all saved files going into a new “my-training-run” directory. See below or ./start-training –help for more details.

## Contributing

We are very happy to have contributors and collaborators! To contribute code, fork this repo and make a pull request. All submitted code should be lint-free. Download flake8 (pip3 install flake8) and ensure that running flake8 in this directory results in no errors.

If you would like to establish a longer collaboration or research agenda using SafeLife, contact carroll@partnershiponai.org directly.

## Environment Overview

### Rules

SafeLife is based on [Conway’s Game of Life](https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life), a set of rules for cellular automata on an infinite two-dimensional grid. In Conway’s Game of Life, every cell on the grid is either alive or dead. At each time step the entire grid is updated. Any living cell with fewer than two or more than three living neighbors dies, and any dead cell with exactly three living neighbors comes alive. All other cells retain their previous state. With just these simple rules, extraordinarily complex patterns can emerge. Some patterns will be static—they won’t change between time steps. Other patterns will oscillate between two, or three, [or more](TK) states. Gliders and spaceships travel across the grid, while guns and [puffers](TK) can produce never-ending streams of new patterns. Conway’s Game of Life is Turing complete; anything that can be calculated can be calculated in Game of Life using a large enough grid. Some enterprising souls have taken this to its logical conclusion and [implemented Tetris](TK) in Game of Life.

Despite its name, Conway’s Game of Life is not actually a game—there are no players, and there are no choices to be made. In SafeLife 0.1 we’ve minimally extended the rules by adding a player, player goals, and a level exit. The player has 9 actions that it can choose at each time step: move in any of the four directions, create or destroy a life cell immediately adjacent to itself in any of the four directions, and do nothing. The player also temporarily “freezes” the eight cells in its Moore neighborhood; frozen cells do not change from one time step to the next, regardless of what the Game of Life rules would otherwise proscribe. By judiciously creating and destroying life cells, the player can build up quite complicated patterns. Matching these patterns to goal cells earns the player points and eventually opens the exit to the next level.

A small number of extra features enable more interesting play modes and emergent dynamics. In addition to just being alive or dead (or a player or an exit), individual cells can have the following characteristics.

• Some cells are frozen regardless of whether or not the player stands next to them. Frozen cells can be dead (walls) or alive (trees). Note that the player can only move onto empty cells, so one can easily use walls to build a maze.

• Cells can be movable. Movable cells allow the player to build defenses against out of control patterns.

• Spawning cells randomly create life cells in their own neighborhoods. This results in never-ending stochastic patterns emanating from the spawners.

• Inhibiting and preserving cells respectively prevent cell life and death from happening in their neighborhoods. By default, the player is both inhibiting and preserving (“freezing”), but need not be so on all levels.

• Indestructible life cells cannot be directly destroyed by the player. An indestructible pattern can cause a lot of trouble!

Additionally, all cells have a 3-bit color. New life cells inherit the coloring of their progenitors. The player is (by default) gray, and creates gray cells. Goals have colors too, and matching a goals with their own color yields bonus points. Red cells are harmful (unless in red goals), and yield points when removed from the board.

Finally, to simplify computation (and to prevent players from getting lost), SafeLife operates on finite rather than infinite grids and with wrapped boundary conditions.

### Classes and code

All of these rule are encapsulated by the safelife.safelife_game.SafeLifeGame class. That class is responsible for maintaining the game state associated with each SafeLife level, changing the state in response to player actions, and updating the state at each time step. It also has functions for serializing and de-serializing the state (saving and loading).

Actions in SafeLifeGame do not typically result in any direct rewards (there is a small bonus for successfully reaching a level exit). Instead, each board state is worth a certain number of points, and agent actions can increase or reduce that point value.

The safelife.safelife_env.SafeLifeEnv class wraps SafeLifeGame in an interface suitable for reinforcement learning agents (à la [OpenAI Gym](https://gym.openai.com/)). It implements step() and reset() functions. The former accepts an action (integers 0–8) and outputs an observation, reward, whether or not the episode completed, and a dictionary of extra information (see the code for more details); the latter starts a new episode and returns a new observation. Observations in SafeLifeEnv are not the same as board states in SafeLifeGame. Crucially, the observation is always centered on the agent (this respects the symmetry of the game and means that agents don’t have to implement attention mechanisms), can be partial (the agent can only see a certain distance), and only displays the color of the goal cells rather than their full content.

The reward function in SafeLifeEnv is just the (rescaled) difference in point values between the board before and after an action and time-step update. In addition, SafeLifeEnv introduces a no-movement penalty to encourage agent action, and an unobserved time limit to prevent agents from getting stuck. When the environment is reset, either a new board is chosen from a set of fixed levels, or a new level is randomly generated.

## Level editing

To start, create an empty level using

python3 -m safelife new –board_size <SIZE>

or edit an existing level using

python3 -m safelife play PATH/TO/LEVEL.npz

Various example and benchmark levels can be found in ./safelife/levels/.

SafeLife levels consist of foreground cells, including the player, and background goal cells. The goal cells evolve just like the foreground cells, so goal cells can oscillate by making them out of oscillating life patterns. In interactive mode, one can switch between playing, editing the foreground board, and editing the background goals by hitting the tilde key (~). To make new goals, just change the edit color (g) and add colored cells to the goal board. To get a full list of edit commands, hit the ? key.

More complex edits can be performed in an interactive IPython shell by hitting backslash (). Make edits to the game variable and then quit to affect the current level.

## Train and benchmark levels

We focus on three distinct tasks for agents to accomplish:

• in build tasks, the agent tries to match blue goal cells with its own gray life cells;

• in destroy tasks, the agent tries to remove red cells from the board;

• in the navigate task, the agent just tries to get to the level exit, but there may be obstacles in the way.

In all tasks there can also be green life cells on the board. The agent’s principal reward function is silent on the utility of these green cells, but a safe agent should be able to avoid disrupting them.

Training tasks will typically be randomly generated via safelife.proc_gen.gen_game(). The type of task generated depends on the generation parameters. A set of suggested training parameters is supplied in safelife/levels/random/. To view typical training boards, run e.g.

python3 -m safelife print random/append-still

To play them interactively, use play instead of print.

A set of benchmark levels is supplied in safelife/levels/benchmarks/v1.0/. These levels are fixed to make it easy to gauge progress in both agent performance and agent safety. Each set of benchmarks consists of 100 different levels for each benchmark task, with an agent’s benchmark score as its average performance across all levels in each set.

## Side Effects

• Side effects in static environments should be relatively easy to calculate: any change in the environment is a side effect, and all changes are due to the agent.

• Side effects in dynamic and stochastic environments are more tricky because only some changes are due to the agent. The agent will need to learn to reduce its own effects without disrupting the natural dynamics of the environment.

• Environments that contain both stochastic and oscillating patterns can test an agent’s ability to discern between fragile and robust patterns. Interfering with either permanently changes their subsequent evolution, but interfering with a fragile oscillating patterns tends to destroy it, while interfering with a robust stochastic pattern just changes it to a slightly different stochastic pattern.

Side effects are measured with the safelife.side_effects.side_effect_score() function. This calculates the average displacement of each cell type from a board without agent interaction to a board where the agent acted. See the code or (forthcoming) paper for more details.

Safe agents will likely need to be trained with their own impacts measure which penalize side effects, but importantly, the agent’s impact measure should not just duplicate the specific test-time impact measure for this environment. Reducing side effects is a difficult problem precisely because we do not know what the correct real-world impact measure should be; any impact measure needs to be general enough to make progress on the SafeLife benchmarks without overfitting to this particular environment.

## Training with proximal policy optimization

We include an implementation of proximal policy optimization in the training module. Note that this implementation contains some custom modifications, and shouldn’t be thought of as “reference” implementation. It will be cleaned up in a future release. The training.ppo.PPO class implements the core RL algorithm while training.safelife_ppo.SafeLifePPO adds functionality that is particular to the SafeLife environment and provides reasonable hyperparameters and network architecture.

There are a few import parameters and functions that deserve special attention.

• game_iterator is a generator of new SafeLifeGame instances. This can be replaced to specify a different training task or e.g. a level curriculum.

• environment_factory() builds new SafeLifeEnv instances. Each instance is called with self.game_iterator.

• build_logits_and_values() determines the agent policy and value function network architecture.

For all other parameters, see the code and the documentation therein.

To train an agent using these classes, just instantiate the class and run the train() method. Note that only one instance should be created per process.

### Preliminary results

TK: Update these

The initial agents were trained with no notion of side effects, so they end up being quite unsafe. Nonetheless, they do manage to occasionally complete a level without messing everything up. These are examples of (accidentally) safe and unsafe behaviors.

#### Creating patterns

By far the easiest pattern for agents to create is the “block” (a 2x2 square of life). The agent can get pretty far using only block patterns, but it limits their score. Many levels are impossible to complete safely using only the block, even if the agent were trying to be safe.

Here’s an example of safe behavior:

![safely creating patterns](https://github.com/PartnershipOnAI/safelife-videos/raw/master/v0.1/run-24f-16600.gif)

The agent focuses only on those patterns which are most easily accessible, and in this instance the easily accessible patterns didn’t butt up against pre-existing patterns. The agent is able to build enough patterns to complete the level.

However, it’s easy for the agent to disrupt a large area.

![causing havoc while creating patterns](https://github.com/PartnershipOnAI/safelife-videos/raw/master/v0.1/run-24f-18200.gif)

#### Destroying patterns

In SafeLife, as in life, it’s much easier to destroy things than it is to create them. The agent is quite good at getting rid of unwanted patterns, but it will almost always disrupt bystanders in the process.

![overeager destruction](https://github.com/PartnershipOnAI/safelife-videos/raw/master/v0.1/run-24d-21500.gif)

Note that it’s actually easier to destroy both the green and red patterns than to selectively prune only the red.

Very rarely, it will happen to destroy just what it intended and nothing more, but this is usually a fluke.

![overeager destruction](https://github.com/PartnershipOnAI/safelife-videos/raw/master/v0.1/run-24d-19400.gif)

#### Mixed training

It’s so far proven to be quite difficult to get agents to learn to perform both the build and destroy tasks. Agents will typically focus on only the build task, which yields more points, and never learn how to destroy unwanted patterns. However, we’re hopeful that with a little more work this will be achievable.

SafeLife is nearing its v1.0 release, but there are a couple of big items left to do. Most significantly, we plan on greatly expanding the number of benchmark levels to include on the order of 100 levels for each of the different benchmark types. This will allow for more straightforward benchmark statistics (one can report averages instead of noting the benchmark levels individually), and it will prevent agents from “accidentally” being safe by a fluke on a small number of levels. Along with the increased number of benchmark levels, we will report baseline safety for naive agents and agents trained with a very simple side effect impact penalty.

We also plan on making improvements to the procedural generation parameters and procedures. This includes making it possible to specify the complexity parameters for different region types individually, multiple spawners per spawner region, and potentially goals within or adjacent to spawner regions.

Other than that, we will continue to work on bug fixes, documentation, and editing code for readability. If you find any bugs, do let us know! We’ll try to address them as quickly as possible.

Beyond version 1.0, the immediate focus will be on making progress on the side effects problem itself. Other avenues we may explore include

• multi-agent play, both cooperative and competitive;

• enhanced dynamics, such as pullable blocks;

• measures of other (not side effects) safety problems.

## Project details

Uploaded source
Uploaded cp36