
A Python library for reinforcement learning algorithms.


PyMAB


Python Multi-Armed Bandit Library. Tame the randomness, pull the right levers! PyMAB: your trusty sidekick in the wild world of exploration and exploitation.

PyMAB offers an exploratory framework for comparing the performance of multiple Multi-Armed Bandit algorithms in a variety of scenarios. The library is designed to be flexible and easy to use, allowing users to quickly set up and run experiments with different configurations.

Simple Example

from pymab.policies.greedy import GreedyPolicy
from pymab.policies.thompson_sampling import ThompsonSamplingPolicy
from pymab.game import Game

n_bandits = 5

# Define the policies
greedy_policy = GreedyPolicy(
    optimistic_initialization=1,
    n_bandits=n_bandits
)
ts_policy = ThompsonSamplingPolicy(n_bandits=n_bandits)

# Define the game
game = Game(
    n_episodes=2000,
    n_steps=1000,
    policies=[greedy_policy, ts_policy],
    n_bandits=n_bandits
)

# Run the game
game.game_loop()

# Plot the results
game.plot_average_reward_by_step()

Other examples

In ./examples/ you can find detailed examples of how to use the library:

  • Basic Usage: A simple example of how to use the library with the Greedy Policy.
  • Bayesian UCB Bernoulli: An example comparing the multiple configurations of the UCB and the Bayesian UCB policies with a Bernoulli reward distribution.
  • Comparing Policies Gaussian: An example comparing the multiple configurations of the Greedy, the Epsilon Greedy, the Bayesian UCB, and the Thompson Sampling policies with a Gaussian reward distribution.
  • Contextual Bandits Proxies: An example comparing the multiple configurations of the Greedy and the Contextual Bandits policies for proxy server selection based on latency, bandwidth, downtime rate, success rate, proximity, and load.
  • Non-Stationary Policies: An example comparing the multiple configurations of the UCB, the Sliding Window UCB, and the Discounted UCB policies in a stationary environment, non-stationary with gradual changes, non-stationary with abrupt changes, and non-stationary with random arm swapping.
  • Thompson Sampling Bernoulli: An example comparing the multiple configurations of the Greedy, and the Thompson Sampling policies with a Bernoulli reward distribution, showing the evolution of the reward distribution estimations.
  • Thompson Sampling Gaussian: An example comparing the multiple configurations of the Greedy, and the Thompson Sampling policies with a Gaussian reward distribution, showing the evolution of the reward distribution estimations.

Features

  • Designed to compare different algorithms in the same environment.
  • Built-in plotting functions to visualize the results.
  • Support for several Multi-Armed Bandit algorithms.
  • Support for different types of reward distributions (Gaussian, Bernoulli).
  • Support for different types of environments (Stationary, Non-Stationary).

Environments

Stationary

  • The reward distribution remains the same during the whole execution.

Non-Stationary

  • The reward distribution changes during the execution. The library includes a mixin for building non-stationary environments that alter the reward distribution in several ways, and it is easily extensible (a small illustrative sketch follows this list).

  • Gradual Change: The reward distribution drifts slightly at each step.

  • Abrupt Change: The reward distribution changes sharply at periodic intervals.

  • Random Arm Swapping: The reward distribution changes by swapping the rewards between arms at random steps.
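
The three non-stationary modes above can be pictured with a small, library-independent sketch (plain NumPy; this is not PyMAB's actual mixin API, whose class and method names may differ):

import numpy as np

rng = np.random.default_rng(42)
q_true = rng.normal(0, 1, size=5)  # true mean reward of each arm

def gradual_change(q, scale=0.01):
    # each step, every arm's mean drifts by a small random amount
    return q + rng.normal(0, scale, size=q.shape)

def abrupt_change(q, step, period=500):
    # every `period` steps, the means are redrawn entirely
    return rng.normal(0, 1, size=q.shape) if step % period == 0 else q

def random_arm_swap(q, p=0.005):
    # with a small probability, two arms exchange their reward distributions
    if rng.random() < p:
        i, j = rng.choice(len(q), size=2, replace=False)
        q[i], q[j] = q[j], q[i]
    return q

for step in range(1, 2001):
    q_true = gradual_change(q_true)  # swap in abrupt_change or random_arm_swap for the other modes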

Policies

Multi-Armed Bandit algorithms and reward distributions

Basic Exploration-Exploitation Algorithms
  • Greedy:

    • Always selects the arm with the highest estimated reward.
    • Very simple but can get stuck on suboptimal arms if initial estimates are inaccurate.
    • No exploration, all exploitation.
  • Epsilon-Greedy:

    • With probability 1-ε, selects the best arm, like Greedy.
    • With probability ε, selects an arm uniformly at random to explore.
    • Balances exploration and exploitation, but the amount of exploration doesn't decrease over time (a short sketch follows this list).
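
As a rough, library-independent illustration of the two rules above (not PyMAB's internal implementation), greedy and epsilon-greedy selection differ only in the occasional random pull:

import numpy as np

rng = np.random.default_rng(0)

def select_arm(q_estimates, epsilon=0.1):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))  # random arm (exploration)
    return int(np.argmax(q_estimates))              # best-looking arm (exploitation)

def update(q_estimates, counts, arm, reward):
    # incremental update of the running mean reward for the pulled arm
    counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

Setting epsilon to 0 recovers the plain Greedy policy.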
Upper Confidence Bound (UCB) Algorithms
  • UCB:

    • Selects the arm with the highest upper confidence bound.
    • Automatically balances exploration and exploitation.
    • Explores less-pulled arms more, but focuses on promising arms over time.
    • Has adaptations for non-stationary environments.
      • SlidingWindowUCB:
        • Like UCB, but only considers recent observations.
        • Adapts better to abrupt changes in reward distributions.
      • DiscountedUCB:
        • Like UCB, but gives more weight to recent observations.
        • Adapts better to gradual changes in reward distributions.
  • Bayesian UCB:

    • Can incorporate prior knowledge about reward distributions.
    • Has adaptations for Bernoulli and Gaussian reward distributions.
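
For reference, the selection rule behind the UCB family above can be sketched in a few lines; this is the textbook UCB1 rule, not the library's own UCB classes:

import numpy as np

def ucb1_select(q_estimates, counts, t, c=2.0):
    # pick the arm with the highest upper confidence bound at time step t (>= 1)
    counts = np.asarray(counts, dtype=float)
    # unpulled arms get an infinite bonus so every arm is tried at least once
    bonus = np.where(counts > 0,
                     np.sqrt(c * np.log(t) / np.maximum(counts, 1)),
                     np.inf)
    return int(np.argmax(np.asarray(q_estimates) + bonus))

SlidingWindowUCB and DiscountedUCB apply the same rule, but compute the mean estimates and pull counts over a recent window or with exponentially decayed weights.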
Bayesian Methods
  • Thompson Sampling:
    • Samples from estimated reward distributions and picks the highest sample.
    • Naturally balances exploration and exploitation.
    • Often performs very well in practice, especially with enough samples.
    • Has adaptations for Bernoulli and Gaussian reward distributions.
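
A minimal Beta-Bernoulli sketch of the sampling step (illustrative only; PyMAB's Thompson Sampling policies also cover the Gaussian case):

import numpy as np

rng = np.random.default_rng(1)

# one Beta(alpha, beta) posterior per arm, starting from a uniform prior
alphas = np.ones(5)
betas = np.ones(5)

def ts_select():
    # sample a plausible success probability for each arm and pick the best sample
    samples = rng.beta(alphas, betas)
    return int(np.argmax(samples))

def ts_update(arm, reward):
    # a Bernoulli reward in {0, 1} updates the chosen arm's posterior
    alphas[arm] += reward
    betas[arm] += 1 - reward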
Softmax and Gradient-Based Methods
  • Softmax Selection:

    • To be implemented
    • Selects arms probabilistically based on their estimated values.
    • Higher estimated values have higher probability of being selected.
    • Temperature parameter controls exploration-exploitation trade-off.
  • Gradient:

    • To be implemented
    • Updates a preference for each arm based on the reward received.
    • Doesn't maintain estimates of actual reward values.
    • Can work well in relative reward scenarios.
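
Neither policy is implemented in the library yet; as a preview of the planned behavior, textbook softmax selection looks roughly like this (illustrative only):

import numpy as np

rng = np.random.default_rng(2)

def softmax_select(q_estimates, temperature=0.5):
    # sample an arm with probability proportional to exp(Q / temperature)
    prefs = np.asarray(q_estimates, dtype=float) / temperature
    prefs -= prefs.max()  # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))

A low temperature behaves almost greedily, while a high temperature approaches uniform random exploration.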
Contextual Bandits
  • Contextual Bandits:
    • Takes into account additional information (context) when making decisions.
    • Can learn to select different arms in different contexts.
    • More powerful but also more complex than standard bandits.
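
A bare-bones sketch of the contextual idea (one linear reward model per arm, with selection conditioned on the current context; this is not PyMAB's Contextual Bandits API, whose feature handling is richer):

import numpy as np

rng = np.random.default_rng(3)
n_arms, n_features = 3, 4

# per-arm ridge-regression statistics: A = X^T X + I, b = X^T r
A = np.stack([np.eye(n_features) for _ in range(n_arms)])
b = np.zeros((n_arms, n_features))

def select(context, epsilon=0.1):
    # predict each arm's reward for this particular context, then act epsilon-greedily
    theta = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    preds = theta @ context
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    return int(np.argmax(preds))

def update(arm, context, reward):
    # fold the observed (context, reward) pair into the chosen arm's model
    A[arm] += np.outer(context, context)
    b[arm] += reward * context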

Roadmap

This is an ongoing project, and we are always looking for suggestions and contributions. If you have any ideas or want to help, please reach out to us!

