Code for the ICLR 2022 paper "The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models"
The Boltzmann Policy Distribution
This repository contains code and data for the ICLR 2022 paper The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models. In particular, the repository contains an implementation of our algorithm for computing the Boltzmann Policy Distribution (BPD) which is based around RLlib.
Installation
The code can be downloaded as this GitHub repository or installed as a pip package.
As a repository
1. Install Python 3.8 or later (3.7 might work but may not be able to load pretrained checkpoints).

2. Clone the repository:

   git clone https://github.com/cassidylaidlaw/boltzmann-policy-distribution.git
   cd boltzmann-policy-distribution

3. Install the pip requirements:

   pip install -r requirements.txt
As a package
1. Install Python 3.

2. Install from PyPI:

   pip install boltzmann-policy-distribution

3. Import the package as follows:

   from bpd.agents.bpd_trainer import BPDTrainer
See getting_started.ipynb or the Colab notebook below for examples of how to use the package.
Data and Pretrained Models
Download human-human data from here.
Download pretrained models from here. The download includes a README describing which checkpoints are used where in the paper.
Usage
This section explains how to get started with using the code and how to run the Overcooked experiments from the paper.
Getting Started
The getting_started.ipynb notebook shows how to use the BPD to predict human behavior in a new environment. It is also available on Google Colab via the link below.
Experiments
Each of the subsections below describes how to run various experiments from the paper. All experiment configuration is done using Sacred, so parameters can be updated from the command line by adding param=value after the command. For instance, most of the experiments require setting the Overcooked layout by writing layout_name="cramped_room".
We used RLlib for reinforcement learning (RL) and many experiments output an RLlib checkpoint as the result. If a checkpoint from one experiment is needed for another experiment, you can find the checkpoint by looking at the output of the training run, which should look something like this:
INFO - main - Starting training iteration 0
INFO - main - Starting training iteration 1
...
INFO - main - Saved final checkpoint to data/logs/self_play/ppo/cramped_room/2022-01-01_12-00-00/checkpoint_000500/checkpoint-500
Many experiments also log metrics to TensorBoard during training. Logs and checkpoints are stored in data/logs by default. You can open TensorBoard by running:
pip install tensorboard
tensorboard --logdir data/logs
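Since later experiments need the checkpoint path printed at the end of an earlier run, it can be handy to locate the newest checkpoint under data/logs programmatically. The helper below is a sketch (not part of the repository) that assumes the directory layout shown in the sample log output above, i.e. files named like checkpoint-500 inside checkpoint_000500 directories:

```python
import glob
import os
from typing import Optional

def find_latest_checkpoint(log_dir: str) -> Optional[str]:
    """Return the most recently written checkpoint file under log_dir.

    Assumes paths of the form .../checkpoint_000500/checkpoint-500,
    as in the training log output shown above.
    """
    pattern = os.path.join(log_dir, "**", "checkpoint-*")
    # Keep only the checkpoint files themselves (e.g. "checkpoint-500"),
    # skipping metadata files such as "checkpoint-500.tune_metadata".
    files = [
        path
        for path in glob.glob(pattern, recursive=True)
        if os.path.basename(path).rsplit("-", 1)[-1].isdigit()
    ]
    return max(files, key=os.path.getmtime) if files else None
```

You could then pass the returned path as the checkpoint parameter of a later experiment.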
Calculating the BPD
To calculate the BPD for Overcooked, we used the following command:
python -m bpd.experiments.train_overcooked with run="bpd" num_workers=25 num_training_iters=2000 layout_name="cramped_room" temperature=0.1 prior_concentration=0.2 reward_shaping_horizon=20000000 latents_per_iteration=250 share_dense_reward=True train_batch_size=100000 discriminate_sequences=True max_seq_len=10 entropy_coeff_start=0 entropy_coeff_end=0 latent_size=1000 sgd_minibatch_size=8000 use_latent_attention=True
Some useful parameters include:

- temperature: the parameter $1 / \beta$ from the paper, which controls how irrational or suboptimal the human is.
- prior_concentration: the parameter $\alpha$ from the paper, which controls how inconsistent the human is.
- latent_size: $n$, the size of the Gaussian latent vector $z$.
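To build intuition for the temperature parameter, here is a toy sketch (not code from this repository) of a Boltzmann-rational action distribution, which samples actions proportionally to exp(Q / temperature); as temperature $1/\beta$ shrinks, the policy approaches the greedy optimum, and as it grows, the policy approaches uniform randomness:

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray, temperature: float) -> np.ndarray:
    """Action distribution proportional to exp(Q / temperature).

    Low temperature -> near-greedy (rational);
    high temperature -> near-uniform (highly suboptimal).
    """
    logits = q_values / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 0.5, 0.0])
nearly_greedy = boltzmann_policy(q, temperature=0.01)
nearly_uniform = boltzmann_policy(q, temperature=100.0)
```

The BPD generalizes this idea from a single Boltzmann policy to a distribution over policies, with prior_concentration controlling how far samples stray from each other.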
Training a predictive model for the BPD
In the paper, we describe training a sequence model (a transformer) to do online prediction of human actions using the BPD. We also experimented with using an RNN; the commands below can train either. The first step in training a prediction model is to roll out many episodes from the BPD:
python -m bpd.experiments.rollout with checkpoint=data/checkpoints/cramped_room/bpd_0.1_0.2_1000/checkpoint_000500/checkpoint-500 run=bpd num_workers=10 episodes=5000
Replace the checkpoint= parameter with the path to your BPD checkpoint. Then, look for a directory called rollouts_2022-... under the checkpoint directory. Use this to run the sequence model training:
python -m bpd.experiments.train_overcooked with run="distill" num_training_iters=5000 distill_random_policies=True layout_name="cramped_room" use_sequence_model=True use_lstm=False train_batch_size=16000 sgd_minibatch_size=16000 num_sgd_iter=1 size_hidden_layers=256 input="data/checkpoints/cramped_room/bpd_0.1_0.2_1000/checkpoint_000500/rollouts_2022-01-01_12-00-00" save_freq=1000
You can set use_lstm=True to use an LSTM instead of a transformer for prediction.
Evaluating prediction
We haven't used any human data up until now to train the BPD and the predictive model! However, to evaluate the predictive power of the BPD, we'll need the human trajectories included in the data download above. Assuming you've extracted them to data/human_data, you can run:
python -m bpd.experiments.evaluate_overcooked_prediction with checkpoint_path=data/checkpoints/cramped_room/bpd_0.1_0.2_1000_transformer/checkpoint_005000/checkpoint-5000 run=distill human_data_fname="data/human_data/human_data_state_dict_and_action_by_traj_test_inserted_fixed.pkl" out_tag="test"
You should replace the run=distill parameter with whatever run parameter you used to train the model you want to evaluate. For instance, to evaluate the BPD policy distribution directly using mean-field variational inference (MFVI), you could run:
python -m bpd.experiments.evaluate_overcooked_prediction with checkpoint_path=data/checkpoints/cramped_room/bpd_0.1_0.2_1000/checkpoint_000500/checkpoint-500 run=bpd human_data_fname="data/human_data/human_data_state_dict_and_action_by_traj_test_inserted_fixed.pkl" out_tag="test"
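The idea behind predicting with a policy distribution can be illustrated with a toy sketch (this is exact Bayesian filtering over a finite sample of candidate policies, not the repository's MFVI implementation): maintain a posterior over which sampled policy the human is following, update it after each observed action, and predict the next action as the posterior mixture:

```python
import numpy as np

def predict_next_action(policy_probs: np.ndarray,
                        observed_actions: list) -> np.ndarray:
    """Posterior-predictive action distribution over candidate policies.

    policy_probs: (num_policies, num_actions) array; each row is one
    candidate policy's action distribution (state-independent here,
    purely for simplicity of the sketch).
    observed_actions: actions the human has taken so far.
    """
    log_post = np.zeros(len(policy_probs))  # uniform prior over policies
    for a in observed_actions:
        log_post += np.log(policy_probs[:, a])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Mix the candidate policies weighted by the posterior.
    return post @ policy_probs

# Two toy candidate "human" policies over two actions.
policies = np.array([[0.8, 0.2],
                     [0.2, 0.8]])
pred = predict_next_action(policies, observed_actions=[0, 0])
```

After seeing action 0 twice, the prediction shifts strongly toward the first candidate policy.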
Training a best response
Besides using the BPD to predict human actions, we might also want to use it to enable human-AI cooperation. We can do this by training a best response to the BPD which will learn to cooperate with all the policies in the BPD and thus hopefully with real humans as well. To train a best response, run:
python -m bpd.experiments.train_overcooked with run="ppo" num_workers=10 num_training_iters=500 multiagent_mode="cross_play" checkpoint_to_load_policies=data/checkpoints/cramped_room/bpd_0.1_0.2_1000/checkpoint_000500/checkpoint-500 layout_name=cramped_room evaluation_interval=None entropy_coeff_start=0 entropy_coeff_end=0 share_dense_reward=True train_batch_size=100000 sgd_minibatch_size=8000
You can replace the checkpoint_to_load_policies parameter with any other checkpoint you want to train a best response to. For instance, human-aware RL (HARL) is just a best response to a behavior-cloned (BC) policy. To train a HARL policy, you can follow the instructions below to train a BC policy and then use that checkpoint with the command above.
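The principle behind a best response can be shown in a one-shot common-payoff matrix game (a toy sketch, not the repository's training code, which uses PPO in RLlib): choose the action that maximizes expected payoff against the average of the partner policies you might face:

```python
import numpy as np

# Payoff for "our" agent in a 2-player common-payoff game:
# payoff[i, j] = reward when we play action i and the partner plays j.
payoff = np.array([[3.0, 0.0],
                   [1.0, 1.0]])

# Two sampled partner policies (standing in for samples from the BPD).
partners = np.array([[0.9, 0.1],   # partner A mostly plays action 0
                     [0.5, 0.5]])  # partner B is uniform

# Best respond to the *mixture* of partners: maximize expected payoff
# against the average partner action distribution.
avg_partner = partners.mean(axis=0)
expected = payoff @ avg_partner  # expected payoff for each of our actions
best_action = int(np.argmax(expected))
```

Training against the whole distribution, rather than a single partner, is what makes the best response robust to the range of behaviors real humans might exhibit.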
Training a behavior cloning/human proxy policy
To train a behavior-cloned (BC) human policy from the human data, run:
python -m bpd.experiments.train_overcooked_bc with layout_name="cramped_room" human_data_fname="data/human_data/human_data_state_dict_and_action_by_traj_train_inserted_fixed.pkl" save_freq=10 num_training_iters=100 validation_prop=0.1
By default, this will use special hand-engineered features as the input to the policy network. To use the normal Overcooked features, add use_bc_features=False to the command. To train a BC policy on the test set instead, set human_data_fname="data/human_data/human_data_state_dict_and_action_by_traj_test_inserted_fixed.pkl" in the command.
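At its core, behavior cloning is supervised learning: fit a policy to observed (state, action) pairs by minimizing cross-entropy. The sketch below (illustrative only; the repository trains a neural network on Overcooked features) fits a linear softmax policy by gradient descent:

```python
import numpy as np

def train_bc(features: np.ndarray, actions: np.ndarray,
             num_actions: int, lr: float = 0.5, steps: int = 500) -> np.ndarray:
    """Fit a linear softmax policy to (state, action) pairs by gradient
    descent on the cross-entropy loss -- the core of behavior cloning."""
    W = np.zeros((features.shape[1], num_actions))
    for _ in range(steps):
        logits = features @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        onehot = np.eye(num_actions)[actions]
        # Gradient of mean cross-entropy w.r.t. W for a softmax policy.
        grad = features.T @ (probs - onehot) / len(actions)
        W -= lr * grad
    return W

# Toy data: the "human" picks action 1 when the feature is positive.
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1, 1, 0, 0])
W = train_bc(X, y, num_actions=2)
```

After training, the cloned policy should reproduce the demonstrated action for each state.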
Evaluating with a human proxy
We evaluated cooperative AI policies in the paper by testing how well they performed when paired with a human proxy policy trained via behavior cloning on the test set of human data. To test a best response policy, run:
python -m bpd.experiments.evaluate_overcooked with layout_name=cramped_room run_0=ppo checkpoint_path_0=data/checkpoints/cramped_room/bpd_0.1_0.2_1000_br/checkpoint_002000/checkpoint-2000 policy_id_0=ppo_0 run_1=bc checkpoint_path_1=data/checkpoints/cramped_room/bc_test/checkpoint_000500/checkpoint-500 num_games=100 evaluate_flipped=True ep_length=400 out_tag=hproxy
If you want to test a policy which isn't a best response with the human proxy, remove the policy_id_0=ppo_0 parameter and update the run_0 parameter to whatever run parameter you used when training the policy.
Baselines
To train a self-play policy, run:
python -m bpd.experiments.train_overcooked with run="ppo" num_workers=10 num_training_iters=500 layout_name="cramped_room" prior_concentration=1 reward_shaping_horizon=20000000 share_dense_reward=True train_batch_size=100000 entropy_coeff_start=0 entropy_coeff_end=0 sgd_minibatch_size=8000
To train a Boltzmann rational policy, use the same command but change the parameters to entropy_coeff_start=0.1 entropy_coeff_end=0.1 for $1 / \beta = 0.1$.
To train a human model using generative adversarial imitation learning (GAIL), run:
python -m bpd.experiments.train_overcooked with run="gail" num_workers=10 num_training_iters=500 layout_name=cramped_room prior_concentration=1 reward_shaping_horizon=20000000 share_dense_reward=True train_batch_size=100000 num_sgd_iter=1 entropy_coeff_start=0.1 entropy_coeff_end=0.1 human_data_fname="data/human_data/human_data_state_dict_and_action_by_traj_train_inserted_fixed.pkl" sgd_minibatch_size=8000
Citation
If you find this repository useful for your research, please cite our paper as follows:
@inproceedings{laidlaw2022boltzmann,
title={The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models},
author={Laidlaw, Cassidy and Dragan, Anca},
booktitle={ICLR},
year={2022}
}
Contact
For questions about the paper or code, please contact cassidy_laidlaw@berkeley.edu.