Reinforcement Learning: An Introduction

Introduction

This is an implementation of concepts and algorithms described in "Reinforcement Learning: An Introduction" (Sutton and Barto, 2018, 2nd edition). It is a work in progress that I started as a personal hobby, and its state reflects the extent to which I have progressed through the text. I have implemented it with the following objectives in mind.

  1. Complete conceptual and algorithmic coverage: Implement all concepts and algorithms described in the text, plus a few extras.
  2. Minimal dependencies: All computation specific to the text is implemented here.
  3. Complete test coverage: All implementations are paired with unit tests.
  4. Clean object-oriented design: The text often provides concise pseudocode that is straightforward to turn into a one-off program; however, it is an altogether different matter to architect a reusable and extensible codebase that achieves the goals listed above in an object-oriented fashion.

As with all objectives, none of the above are fully realized. In particular, (1) is not met since I decided to make this repository public well before finishing. But the remaining objectives are fairly well satisfied.

Figures

A list of figures can be found here. Most of these are reproductions of those shown in the text; however, even the reproductions typically provide detail not shown in the text.

Mancala

This is a simple game with many rule variations, and it provides a greater challenge in terms of implementation and state-space size than the gridworld. I have implemented a fairly common variation, summarized below; a minimal sketch of a possible board representation follows the list.

  • One row of 6 pockets per player, each starting with 4 seeds.
  • Landing in the store earns another turn.
  • Landing the final seed in one's own empty pocket steals.
  • Game terminates when a player's pockets are clear.
  • Winner determined by store count.
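
As a minimal sketch of how such a board might be represented (a hypothetical layout, not the actual `rlai.environments.mancala.MancalaState` structure):

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical Mancala board for the rule variant above; the actual MancalaState
    # representation in rlai may differ.
    @dataclass
    class Board:
        # Each player has 6 pockets (4 seeds each at the start) and 1 store.
        player_1_pockets: List[int] = field(default_factory=lambda: [4] * 6)
        player_1_store: int = 0
        player_2_pockets: List[int] = field(default_factory=lambda: [4] * 6)
        player_2_store: int = 0

        def game_over(self) -> bool:
            # The game terminates when either player's pockets are clear.
            return sum(self.player_1_pockets) == 0 or sum(self.player_2_pockets) == 0

        def winner(self) -> int:
            # The winner is determined by store count (0 indicates a tie).
            if self.player_1_store > self.player_2_store:
                return 1
            if self.player_2_store > self.player_1_store:
                return 2
            return 0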

A couple of hours of Monte Carlo optimization explores more than 1 million states when playing against an equiprobable random opponent (shown here).

Key files are listed below.

Chapter 2

rlai.environments.bandit.Arm

Bandit arm.

rlai.agents.q_value.EpsilonGreedy

Nonassociative, epsilon-greedy agent.

rlai.agents.q_value.QValue

Nonassociative, q-value agent.

rlai.environments.bandit.KArmedBandit

K-armed bandit.

rlai.utils.IncrementalSampleAverager

An incremental, constant-time and constant-memory sample averager. Supports both decreasing (i.e., unweighted sample average) and constant (i.e., exponential recency-weighted average, pp. 32-33) step sizes.
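
As a rough sketch of the idea (illustrative names, not the `IncrementalSampleAverager` API), the averager keeps a running estimate and moves it toward each new sample by a step size that is either 1/n or a constant:

    from typing import Optional

    # Minimal sketch of incremental sample averaging (pp. 32-33 of the text); the
    # names below are illustrative, not the rlai.utils.IncrementalSampleAverager API.
    def update(estimate: float, sample: float, n: int, alpha: Optional[float] = None) -> float:
        # Move the running estimate toward the new sample in O(1) time and memory.
        step_size = alpha if alpha is not None else 1.0 / n  # constant or decreasing (1/n)
        return estimate + step_size * (sample - estimate)

    estimate = 0.0
    for n, sample in enumerate([1.0, 2.0, 3.0], start=1):
        estimate = update(estimate, sample, n)  # unweighted sample average: ends at 2.0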

rlai.agents.q_value.UpperConfidenceBound

Nonassociative, upper-confidence-bound agent.

rlai.agents.h_value.PreferenceGradient

Preference-gradient agent.

Chapter 3

rlai.environments.mancala.MancalaState

State of the mancala game. In charge of representing the entirety of the game state and advancing to the next state.

rlai.environments.mancala.Mancala

Environment for the mancala game.

rlai.environments.mdp.MdpEnvironment

MDP environment.

rlai.states.mdp.MdpState

Model-free MDP state.

rlai.states.mdp.ModelBasedMdpState

Model-based MDP state. Adds the specification of a probability distribution over next states and rewards.

rlai.environments.mdp.Gridworld

Gridworld MDP environment.

Chapter 4

rlai.gpi.dynamic_programming.evaluation.evaluate_v_pi

Perform iterative policy evaluation of an agent's policy within an environment, returning state values.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.states.mdp.ModelBasedMdpState`.
    :param theta: Minimum tolerated change in state value estimates, below which evaluation terminates. Either `theta`
    or `num_iterations` (or both) can be specified, but passing neither will raise an exception.
    :param num_iterations: Number of evaluation iterations to execute.  Either `theta` or `num_iterations` (or both)
    can be specified, but passing neither will raise an exception.
    :param update_in_place: Whether or not to update value estimates in place.
    :param initial_v_S: Initial guess at state-value, or None for no guess.
    :return: 2-tuple of (1) dictionary of MDP states and their estimated values under the agent's policy, and (2) final
    value of delta.
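
The backup underlying this evaluation is the Bellman expectation update. Below is a minimal sketch of one in-place sweep; the dictionary structures are illustrative and not the rlai API.

    # v[s] is the current state-value estimate, pi[s][a] is the policy's probability of
    # action a in state s, and p[s][a] is a list of (probability, next_state, reward)
    # transitions from the model. All structures are illustrative.
    def evaluation_sweep(v, pi, p, gamma):
        delta = 0.0
        for s in v:
            v_new = sum(
                pi[s][a] * sum(prob * (reward + gamma * v[s_next]) for prob, s_next, reward in transitions)
                for a, transitions in p[s].items()
            )
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new  # update in place
        return delta

Repeating the sweep until the returned delta falls below `theta` (or until `num_iterations` sweeps have run) gives the termination behavior described above.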

rlai.gpi.dynamic_programming.evaluation.evaluate_q_pi

Perform iterative policy evaluation of an agent's policy within an environment, returning state-action values.

    :param agent: MDP agent.
    :param theta: Minimum tolerated change in state-action value estimates, below which evaluation terminates. Either `theta`
    or `num_iterations` (or both) can be specified, but passing neither will raise an exception.
    :param num_iterations: Number of evaluation iterations to execute.  Either `theta` or `num_iterations` (or both)
    can be specified, but passing neither will raise an exception.
    :param update_in_place: Whether or not to update value estimates in place.
    :param initial_q_S_A: Initial guess at state-action value, or None for no guess.
    :return: 2-tuple of (1) dictionary of MDP states, actions, and their estimated values under the agent's policy, and
    (2) final value of delta.

rlai.gpi.dynamic_programming.improvement.improve_policy_with_v_pi

Improve an agent's policy according to its state-value estimates. This makes the policy greedy with respect to the
    state-value estimates. In cases where multiple such greedy actions exist for a state, each of the greedy actions
    will be assigned equal probability.

    Note that the present function resides within `rlai.gpi.dynamic_programming.improvement` and requires state-value
    estimates of states that are model-based. This is the case because policy improvement from state values is only
    possible if we have a model of the environment. Compare with `rlai.gpi.improvement.improve_policy_with_q_pi`, which
    accepts model-free states since state-action values are estimated directly.

    :param agent: Agent.
    :param v_pi: State-value estimates for the agent's policy.
    :return: Number of states in which the policy was updated.
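
A minimal sketch of the one-step lookahead such an improvement performs (same illustrative model structure as the evaluation sketch above, not the rlai signature):

    # Greedy improvement from state values requires the model p to look one step ahead;
    # p[s][a] is a list of (probability, next_state, reward) transitions.
    def greedy_actions(s, v, p, gamma):
        action_values = {
            a: sum(prob * (reward + gamma * v[s_next]) for prob, s_next, reward in transitions)
            for a, transitions in p[s].items()
        }
        best = max(action_values.values())
        # All maximizing actions share probability equally in the improved policy.
        return [a for a, q in action_values.items() if q == best]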

rlai.gpi.improvement.improve_policy_with_q_pi

Improve an agent's policy according to its state-action value estimates. This makes the policy greedy with respect
    to the state-action value estimates. In cases where multiple such greedy actions exist for a state, each of the
    greedy actions will be assigned equal probability.

    :param agent: Agent.
    :param q_pi: State-action value estimates for the agent's policy.
    :param epsilon: Total probability mass to spread across all actions, resulting in an epsilon-greedy policy. Must
    be >= 0 if provided.
    :return: Number of states in which the policy was updated.
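
A minimal sketch of how the epsilon mass might be spread across actions (hypothetical helper, not the rlai implementation):

    # Epsilon-greedy probabilities for a single state; q_s maps each action to its
    # estimated state-action value.
    def epsilon_greedy_probabilities(q_s, epsilon):
        num_actions = len(q_s)
        best = max(q_s.values())
        greedy = [a for a, q in q_s.items() if q == best]
        return {
            a: epsilon / num_actions + ((1.0 - epsilon) / len(greedy) if a in greedy else 0.0)
            for a in q_s
        }

Each action receives epsilon / |A(s)| of the probability mass, and the remaining 1 - epsilon is split evenly among the greedy actions, so the probabilities sum to 1 and epsilon = 0 recovers the purely greedy policy.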

rlai.gpi.dynamic_programming.iteration.iterate_policy_q_pi

Run policy iteration on an agent using state-action value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.states.mdp.ModelBasedMdpState`.
    :param theta: See `evaluate_q_pi`.
    :param update_in_place: See `evaluate_q_pi`.
    :return: Final state-action value estimates.

rlai.gpi.dynamic_programming.iteration.iterate_policy_v_pi

Run policy iteration on an agent using state-value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.states.mdp.ModelBasedMdpState`.
    :param theta: See `evaluate_v_pi`.
    :param update_in_place: See `evaluate_v_pi`.
    :return: Final state-value estimates.

rlai.gpi.dynamic_programming.iteration.iterate_value_v_pi

Run dynamic programming value iteration on an agent using state-value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.states.mdp.ModelBasedMdpState`.
    :param theta: See `evaluate_v_pi`.
    :param evaluation_iterations_per_improvement: Number of policy evaluation iterations to execute for each iteration
    of improvement (e.g., passing 1 results in Equation 4.10).
    :param update_in_place: See `evaluate_v_pi`.
    :return: Final state-value estimates.
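
For reference, the backup that Equation 4.10 in the text denotes replaces the expectation over the policy with a max over actions:

    v_{k+1}(s) = \max_a \sum_{s',\,r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right]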

rlai.environments.mdp.GamblersProblem

Gambler's problem MDP environment.

rlai.gpi.dynamic_programming.iteration.iterate_value_q_pi

Run value iteration on an agent using state-action value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.states.mdp.ModelBasedMdpState`.
    :param theta: See `evaluate_q_pi`.
    :param evaluation_iterations_per_improvement: Number of policy evaluation iterations to execute for each iteration
    of improvement.
    :param update_in_place: See `evaluate_q_pi`.
    :return: Final state-action value estimates.

Chapter 5

rlai.gpi.monte_carlo.evaluation.evaluate_v_pi

Perform Monte Carlo evaluation of an agent's policy within an environment, returning state values. Uses a random
    action on the first time step to maintain exploration (exploring starts). This evaluation approach is only
    marginally useful in practice, as the state-value estimates require a model of the environmental dynamics (i.e.,
    the transition-reward probability distribution) in order to be applied. See `evaluate_q_pi` in this module for a
    more feature-rich and useful evaluation approach (i.e., state-action value estimation). This evaluation function
    operates over rewards obtained at the end of episodes, so it is only appropriate for episodic tasks.

    :param agent: Agent.
    :param environment: Environment.
    :param num_episodes: Number of episodes to execute.
    :return: Dictionary of MDP states and their estimated values under the agent's policy.

rlai.gpi.monte_carlo.evaluation.evaluate_q_pi

Perform Monte Carlo evaluation of an agent's policy within an environment, returning state-action values. This
    evaluation function operates over rewards obtained at the end of episodes, so it is only appropriate for episodic
    tasks.

    :param agent: Agent containing target policy to be optimized.
    :param environment: Environment.
    :param num_episodes: Number of episodes to execute.
    :param exploring_starts: Whether or not to use exploring starts, forcing a random action in the first time step.
    This maintains exploration in the first state; however, unless each state has some nonzero probability of being
    selected as the first state, there is no assurance that all state-action pairs will be sampled. If the initial state
    is deterministic, consider passing False here and shifting the burden of exploration to the improvement step with
    a nonzero epsilon (see `rlai.gpi.improvement.improve_policy_with_q_pi`).
    :param update_upon_every_visit: True to update each state-action pair upon each visit within an episode, or False to
    update each state-action pair upon the first visit within an episode.
    :param off_policy_agent: Agent containing behavioral policy used to generate learning episodes. To ensure that the
    state-action value estimates converge to those of the target policy, the policy of the `off_policy_agent` must be
    soft (i.e., have positive probability for all state-action pairs that have positive probabilities in the agent's
    target policy).
    :param initial_q_S_A: Initial guess at state-action value, or None for no guess.
    :return: 3-tuple of (1) dictionary of all MDP states and their action-value averagers under the agent's policy, (2)
    set of only those states that were evaluated, and (3) the average reward obtained per episode.
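
A minimal sketch of the first-visit versus every-visit distinction for a single episode (illustrative structures, not the rlai implementation):

    # episode is a list of (state, action, reward) tuples, where reward is the reward
    # received after taking the action in the state. Returns {(state, action): [returns]}.
    def episode_returns(episode, gamma, every_visit):
        first_visit_index = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit_index.setdefault((s, a), t)
        returns = {}
        g = 0.0
        # Work backward through the episode, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if every_visit or first_visit_index[(s, a)] == t:
                returns.setdefault((s, a), []).append(g)
        return returns

Because the returns are only known once the episode ends, this approach applies only to episodic tasks, as noted above.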

rlai.gpi.monte_carlo.iteration.iterate_value_q_pi

Run Monte Carlo value iteration on an agent using state-action value estimates. This iteration function operates
    over rewards obtained at the end of episodes, so it is only appropriate for episodic tasks.

    :param agent: Agent.
    :param environment: Environment.
    :param num_improvements: Number of policy improvements to make.
    :param num_episodes_per_improvement: Number of policy evaluation episodes to execute for each iteration of
    improvement. Passing `1` will result in the Monte Carlo ES (Exploring Starts) algorithm.
    :param update_upon_every_visit: See `rlai.gpi.monte_carlo.evaluation.evaluate_q_pi`.
    :param epsilon: Total probability mass to spread across all actions, resulting in an epsilon-greedy policy. Must
    be >= 0 if provided.
    :param off_policy_agent: See `rlai.gpi.monte_carlo.evaluation.evaluate_q_pi`. The policy of this agent will not
    be updated by this function.
    :param num_improvements_per_plot: Number of improvements to make before plotting the per-improvement average. Pass
    None to turn off all plotting.
    :param num_improvements_per_checkpoint: Number of improvements per checkpoint save.
    :param checkpoint_path: Checkpoint path. Must be provided if `num_improvements_per_checkpoint` is provided.
    :param initial_q_S_A: Initial state-action value estimates (primarily useful for restarting from a checkpoint).
    :return: State-action value estimates from final iteration of improvement.

Chapter 6

rlai.gpi.temporal_difference.evaluation.Mode

Evaluation modes for temporal-difference evaluation: SARSA (on-policy), Q-learning (off-policy), and Expected SARSA (off-policy).

rlai.gpi.temporal_difference.evaluation.evaluate_q_pi

Perform temporal-difference (TD) evaluation of an agent's policy within an environment, returning state-action
    values. This evaluation function implements both on-policy TD learning (SARSA) as well as off-policy TD learning
    (Q-learning and expected SARSA).

    :param agent: Agent containing target policy to be optimized.
    :param environment: Environment.
    :param num_episodes: Number of episodes to execute.
    :param alpha: Constant step size to use when updating Q-values, or None for 1/n step size.
    :param mode: Evaluation mode (see `rlai.gpi.temporal_difference.evaluation.Mode`).
    :param initial_q_S_A: Initial guess at state-action value, or None for no guess.
    :return: 3-tuple of (1) dictionary of all MDP states and their action-value averagers under the agent's policy, (2)
    set of only those states that were evaluated, and (3) the average reward obtained per episode.
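
A minimal sketch of the three TD targets (mode names and structures are illustrative, not the actual `Mode` values):

    # q_s_next maps each action available in the next state to its current Q estimate,
    # and pi_s_next maps each such action to its probability under the target policy.
    def td_target(mode, reward, gamma, q_s_next, pi_s_next=None, a_next=None):
        if mode == 'SARSA':            # on-policy: bootstrap from the action actually taken
            bootstrap = q_s_next[a_next]
        elif mode == 'Q_LEARNING':     # off-policy: bootstrap from the greedy action
            bootstrap = max(q_s_next.values())
        else:                          # expected SARSA: expectation under the target policy
            bootstrap = sum(pi_s_next[a] * q for a, q in q_s_next.items())
        return reward + gamma * bootstrap

The Q-value for the current state-action pair is then moved toward this target by the step size `alpha` (or 1/n when `alpha` is None): Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a)).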

rlai.gpi.temporal_difference.iteration.iterate_value_q_pi

Run temporal-difference value iteration on an agent using state-action value estimates.

    :param agent: Agent.
    :param environment: Environment.
    :param num_improvements: Number of policy improvements to make.
    :param num_episodes_per_improvement: Number of policy evaluation episodes to execute for each iteration of
    improvement.
    :param alpha: Constant step size to use when updating Q-values, or None for 1/n step size.
    :param mode: Evaluation mode (see `rlai.gpi.temporal_difference.evaluation.Mode`).
    :param epsilon: Total probability mass to spread across all actions, resulting in an epsilon-greedy policy. Must
    be strictly > 0.
    :param num_improvements_per_plot: Number of improvements to make before plotting the per-improvement average. Pass
    None to turn off all plotting.
    :param num_improvements_per_checkpoint: Number of improvements per checkpoint save.
    :param checkpoint_path: Checkpoint path. Must be provided if `num_improvements_per_checkpoint` is provided.
    :param initial_q_S_A: Initial state-action value estimates (primarily useful for restarting from a checkpoint).
    :return: State-action value estimates from final iteration of improvement.

