A Python library for reinforcement learning algorithms.
PyMAB
Python Multi-Armed Bandit Library. Tame the randomness, pull the right levers! PyMAB: your trusty sidekick in the wild world of exploration and exploitation.
PyMAB offers an exploratory framework to compare the performance of multiple Multi-Armed Bandit algorithms in a variety of scenarios. The library is designed to be flexible and easy to use, allowing users to quickly set up and run experiments with different configurations.
Simple Example
from pymab.policies.greedy import GreedyPolicy
from pymab.policies.thompson_sampling import ThompsonSamplingPolicy
from pymab.game import Game
n_bandits = 5
# Define the policies
greedy_policy = GreedyPolicy(
    optimistic_initialization=1,
    n_bandits=n_bandits
)
ts_policy = ThompsonSamplingPolicy(n_bandits=n_bandits)

# Define the game
game = Game(
    n_episodes=2000,
    n_steps=1000,
    policies=[greedy_policy, ts_policy],
    n_bandits=n_bandits
)
# Run the game
game.game_loop()
# Plot the results
game.plot_average_reward_by_step()
Other examples
In ./examples/ you can find detailed examples of how to use the library:
- Basic Usage: A simple example of how to use the library with the Greedy Policy.
- Bayesian UCB Bernoulli: An example comparing the multiple configurations of the UCB and the Bayesian UCB policies with a Bernoulli reward distribution.
- Comparing Policies Gaussian: An example comparing the multiple configurations of the Greedy, the Epsilon Greedy, the Bayesian UCB, and the Thompson Sampling policies with a Gaussian reward distribution.
- Contextual Bandits Proxies: An example comparing the multiple configurations of the Greedy and the Contextual Bandits policies for proxy server selection based on latency, bandwidth, downtime rate, success rate, proximity and load.
- Non-Stationary Policies: An example comparing the multiple configurations of the UCB, the Sliding Window UCB, and the Discounted UCB policies in a stationary environment, non-stationary with gradual changes, non-stationary with abrupt changes, and non-stationary with random arm swapping.
- Thompson Sampling Bernoulli: An example comparing the multiple configurations of the Greedy, and the Thompson Sampling policies with a Bernoulli reward distribution, showing the evolution of the reward distribution estimations.
- Thompson Sampling Gaussian: An example comparing the multiple configurations of the Greedy, and the Thompson Sampling policies with a Gaussian reward distribution, showing the evolution of the reward distribution estimations.
Features
- Designed to compare different algorithms in the same environment.
- Built-in plotting functions to visualize the results.
- Support for several Multi-Armed Bandit algorithms.
- Support for different types of reward distributions (Gaussian, Bernoulli).
- Support for different types of environments (Stationary, Non-Stationary).
Environments
Stationary
- The reward distribution remains the same during the whole execution.
Non-Stationary
- The reward distribution changes during the execution. The library contains a mixin for creating non-stationary environments that change the reward distribution in several ways, and it is easily extensible (a conceptual sketch follows this list):
  - Gradual Change: the reward distribution changes slightly at each step.
  - Abrupt Change: the reward distribution changes by a larger amount, periodically.
  - Random Arm Swapping: the reward distribution changes by swapping the rewards between arms at random steps.
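A minimal, library-agnostic sketch of what each kind of change looks like on a vector of arm mean rewards (the function names below are illustrative, not PyMAB's mixin API):

import numpy as np

rng = np.random.default_rng(42)

def gradual_change(means, drift_scale=0.01):
    # Nudge every arm's mean reward by a small random amount each step.
    return means + rng.normal(0.0, drift_scale, size=means.shape)

def abrupt_change(means, step, period=500, shift_scale=1.0):
    # Every `period` steps, apply a larger random shift to the means.
    if step > 0 and step % period == 0:
        return means + rng.normal(0.0, shift_scale, size=means.shape)
    return means

def random_arm_swapping(means, swap_probability=0.005):
    # Occasionally permute the means, so the identity of the best arm changes.
    if rng.random() < swap_probability:
        return rng.permutation(means)
    return means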
Policies
Multi-Armed Bandit algorithms and reward distributions
Basic Exploration-Exploitation Algorithms
- Greedy:
  - Always selects the arm with the highest estimated reward.
  - Very simple but can get stuck on suboptimal arms if initial estimates are inaccurate.
  - No exploration, all exploitation.
- Epsilon-Greedy (see the sketch after this list):
  - Most of the time (1-ε), selects the best arm like Greedy.
  - Sometimes (ε), randomly selects an arm to explore.
  - Balances exploration and exploitation, but exploration doesn't decrease over time.
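The epsilon-greedy selection rule reduces to a coin flip between exploiting the best current estimate and exploring at random; setting ε to 0 recovers pure Greedy. A minimal sketch using NumPy (illustrative only, not PyMAB's internals):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_select(estimated_rewards, epsilon=0.1):
    # With probability epsilon, explore a uniformly random arm;
    # otherwise exploit the arm with the highest current estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(estimated_rewards)))
    return int(np.argmax(estimated_rewards))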
Upper Confidence Bound (UCB) Algorithms
- UCB (see the sketch after this list):
  - Selects the arm with the highest upper confidence bound.
  - Automatically balances exploration and exploitation.
  - Explores less-pulled arms more, but focuses on promising arms over time.
  - Has adaptations for non-stationary environments:
    - SlidingWindowUCB:
      - Like UCB, but only considers recent observations.
      - Adapts better to abrupt changes in reward distributions.
    - DiscountedUCB:
      - Like UCB, but gives more weight to recent observations.
      - Adapts better to gradual changes in reward distributions.
- Bayesian UCB:
  - Can incorporate prior knowledge about reward distributions.
  - Has adaptations for Bernoulli and Gaussian reward distributions.
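The classic UCB1 rule adds to each arm's estimated reward an exploration bonus that grows with the total number of pulls and shrinks with how often that arm has been pulled. A rough sketch of the selection step (an illustration of the general UCB1 formula, not the library's implementation):

import numpy as np

def ucb1_select(estimated_rewards, pull_counts, total_pulls):
    # Pull each arm once before the confidence bound is well defined.
    if np.any(pull_counts == 0):
        return int(np.argmax(pull_counts == 0))
    # Exploration bonus: sqrt(2 * ln(t) / n_i) for arm i after t total pulls.
    bonus = np.sqrt(2.0 * np.log(total_pulls) / pull_counts)
    return int(np.argmax(estimated_rewards + bonus))

SlidingWindowUCB computes the same kind of bound over only the most recent observations, while DiscountedUCB down-weights old observations with a decay factor.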
Bayesian Methods
- Thompson Sampling (sketched below):
  - Samples from estimated reward distributions and picks the highest sample.
  - Naturally balances exploration and exploitation.
  - Often performs very well in practice, especially with enough samples.
  - Has adaptations for Bernoulli and Gaussian reward distributions.
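For Bernoulli rewards, Thompson Sampling typically keeps a Beta posterior per arm, draws one sample from each posterior, and pulls the arm with the highest draw. A minimal self-contained sketch of that loop (conceptual, not PyMAB's code; the arm probabilities are made up for the demo):

import numpy as np

rng = np.random.default_rng(1)

n_bandits = 5
true_probs = rng.uniform(size=n_bandits)  # unknown to the policy
successes = np.ones(n_bandits)            # Beta prior: alpha = 1
failures = np.ones(n_bandits)             # Beta prior: beta = 1

for step in range(1000):
    # Draw one sample from each arm's Beta posterior and pick the best.
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = int(rng.random() < true_probs[arm])
    # Update the chosen arm's posterior with the observed outcome.
    successes[arm] += reward
    failures[arm] += 1 - reward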
Softmax and Gradient-Based Methods
- Softmax Selection (to be implemented):
  - Selects arms probabilistically based on their estimated values.
  - Higher estimated values have a higher probability of being selected.
  - A temperature parameter controls the exploration-exploitation trade-off (see the sketch after this list).
- Gradient (to be implemented):
  - Updates a preference for each arm based on the reward received.
  - Doesn't maintain estimates of actual reward values.
  - Can work well in relative reward scenarios.
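Softmax selection converts the value estimates into a probability distribution whose sharpness is set by a temperature: low temperatures behave almost like Greedy, high temperatures approach uniform random exploration. A conceptual sketch of the selection step (this policy is listed above as not yet implemented, so the code below is only an illustration):

import numpy as np

rng = np.random.default_rng(2)

def softmax_select(estimated_rewards, temperature=0.1):
    # Divide by the temperature, subtract the max for numerical stability.
    preferences = np.asarray(estimated_rewards, dtype=float) / temperature
    preferences -= preferences.max()
    probabilities = np.exp(preferences)
    probabilities /= probabilities.sum()
    # Sample an arm index according to the softmax probabilities.
    return int(rng.choice(len(probabilities), p=probabilities))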
Contextual Bandits
- Contextual Bandits (see the toy sketch below):
  - Takes into account additional information (context) when making decisions.
  - Can learn to select different arms in different contexts.
  - More powerful but also more complex than standard bandits.
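A simple way to picture a contextual bandit is to keep a separate set of estimates per discrete context and run an ordinary policy within each one. The toy sketch below uses epsilon-greedy per context; it only illustrates the idea and is not PyMAB's contextual implementation, which handles richer context features such as the proxy-server metrics in the examples above:

import numpy as np

rng = np.random.default_rng(3)

n_bandits = 5
n_contexts = 3

# One row of value estimates and pull counts per context.
estimates = np.zeros((n_contexts, n_bandits))
pulls = np.zeros((n_contexts, n_bandits))

def select_arm(context, epsilon=0.1):
    # Epsilon-greedy restricted to the estimates for this context.
    if rng.random() < epsilon:
        return int(rng.integers(n_bandits))
    return int(np.argmax(estimates[context]))

def update(context, arm, reward):
    # Incremental mean update for the (context, arm) pair.
    pulls[context, arm] += 1
    estimates[context, arm] += (reward - estimates[context, arm]) / pulls[context, arm]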
Roadmap
- Add implementations of optimised Greedy policies for non-stationary environments.
- Add more complex policy adaptations for non-stationary environments:
  - https://towardsdatascience.com/reinforcement-learning-basics-stationary-and-non-stationary-multi-armed-bandit-problem-cfe06d33b815
  - https://gdmarmerola.github.io/non-stationary-bandits/
  - Exponentially weighted means (see the sketch after this list)
  - Weighted block means
  - Fitting a time series model to find an indicator of when a distribution changes and tune the exploration rate accordingly
- Add more complex non-stationary environments, e.g. changing the variance or the mean, random abrupt changes, machine-learning-driven changes, ...
- Add unit tests for non-stationary environments and policies.
- Make a mixin for optimistic initialisation, since not all policies use it.
- Add an implementation of the softmax_selection policy.
- Add an implementation of the gradient policy.
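For reference, the exponentially weighted mean mentioned in the roadmap can be maintained incrementally with a single decay factor, which is also the intuition behind DiscountedUCB's weighting of recent rewards (a sketch, not library code):

def exponentially_weighted_mean(old_mean, reward, alpha=0.1):
    # Larger alpha forgets the past faster; smaller alpha averages over a longer history.
    return old_mean + alpha * (reward - old_mean)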
This is an ongoing project, and we are always looking for suggestions and contributions. If you have any ideas or want to help, please reach out to us!