
A Python library for reinforcement learning algorithms.


PyMAB


Python Multi-Armed Bandit Library. Tame the randomness, pull the right levers! PyMAB: your trusty sidekick in the wild world of exploration and exploitation.

PyMAB offers an exploratory framework for comparing the performance of multiple Multi-Armed Bandit algorithms in a variety of scenarios. The library is designed to be flexible and easy to use, allowing users to quickly set up and run experiments with different configurations.

Simple Example

from pymab.policies.greedy import GreedyPolicy
from pymab.policies.thompson_sampling import ThompsonSamplingPolicy
from pymab.game import Game

n_bandits = 5

# Define the policies
greedy_policy = GreedyPolicy(
    optimistic_initialization=1,
    n_bandits=n_bandits
)
ts_policy = ThompsonSamplingPolicy(n_bandits=n_bandits)

# Define the game
game = Game(
    n_episodes=2000,
    n_steps=1000,
    policies=[greedy_policy, ts_policy],
    n_bandits=n_bandits
)

# Run the game
game.game_loop()

# Plot the results
game.plot_average_reward_by_step()

Other examples

In ./examples/ you can find detailed examples of how to use the library:

  • Basic Usage: A simple example of how to use the library with the Greedy Policy.
  • Bayesian UCB Bernoulli: An example comparing the multiple configurations of the UCB and the Bayesian UCB policies with a Bernoulli reward distribution.
  • Comparing Policies Gaussian: An example comparing the multiple configurations of the Greedy, the Epsilon Greedy, the Bayesian UCB, and the Thompson Sampling policies with a Gaussian reward distribution.
  • Contextual Bandits Proxies: An example comparing the multiple configurations of the Greedy and the Contextual Bandits policies for proxy server selection based on latency, bandwidth, downtime rate, success rate, proximity, and load.
  • Non-Stationary Policies: An example comparing the multiple configurations of the UCB, the Sliding Window UCB, and the Discounted UCB policies in a stationary environment, non-stationary with gradual changes, non-stationary with abrupt changes, and non-stationary with random arm swapping.
  • Thompson Sampling Bernoulli: An example comparing the multiple configurations of the Greedy, and the Thompson Sampling policies with a Bernoulli reward distribution, showing the evolution of the reward distribution estimations.
  • Thompson Sampling Gaussian: An example comparing the multiple configurations of the Greedy, and the Thompson Sampling policies with a Gaussian reward distribution, showing the evolution of the reward distribution estimations.

Features

  • Designed to compare different algorithms in the same environment.
  • Built-in plotting functions to visualize the results.
  • Support for several Multi-Armed Bandit algorithms.
  • Support for different types of reward distributions (Gaussian, Bernoulli).
  • Support for different types of environments (Stationary, Non-Stationary).

Environments

Stationary

  • The reward distribution remains the same during the whole execution.

Non-Stationary

  • The reward distribution changes during the execution. The library includes a mixin for building non-stationary environments that alter the reward distribution in several ways, and it is easily extensible (a small illustrative sketch follows this list).

  • Gradual Change: The reward distribution drifts slightly at each step.

  • Abrupt Change: The reward distribution changes sharply at periodic intervals.

  • Random Arm Swapping: The reward distribution changes by swapping the rewards between arms at random steps.
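
The three non-stationary modes above can be pictured with a small, library-independent sketch (plain NumPy; this is not PyMAB's actual mixin API, whose class and method names may differ):

import numpy as np

rng = np.random.default_rng(42)
q_true = rng.normal(0, 1, size=5)  # true mean reward of each arm

def gradual_change(q, scale=0.01):
    # each step, every arm's mean drifts by a small random amount
    return q + rng.normal(0, scale, size=q.shape)

def abrupt_change(q, step, period=500):
    # every `period` steps, the means are redrawn entirely
    return rng.normal(0, 1, size=q.shape) if step % period == 0 else q

def random_arm_swap(q, p=0.005):
    # with a small probability, two arms exchange their reward distributions
    if rng.random() < p:
        i, j = rng.choice(len(q), size=2, replace=False)
        q[i], q[j] = q[j], q[i]
    return q

for step in range(1, 2001):
    q_true = gradual_change(q_true)  # swap in abrupt_change or random_arm_swap for the other modes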

Policies

Multi-Armed Bandit algorithms and reward distributions

Basic Exploration-Exploitation Algorithms
  • Greedy:

    • Always selects the arm with the highest estimated reward.
    • Very simple but can get stuck on suboptimal arms if initial estimates are inaccurate.
    • No exploration, all exploitation.
  • Epsilon-Greedy:

    • With probability 1-ε, selects the best arm, like Greedy.
    • With probability ε, selects an arm uniformly at random to explore.
    • Balances exploration and exploitation, but the amount of exploration doesn't decrease over time (a short sketch follows this list).
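
As a rough, library-independent illustration of the two rules above (not PyMAB's internal implementation), greedy and epsilon-greedy selection differ only in the occasional random pull:

import numpy as np

rng = np.random.default_rng(0)

def select_arm(q_estimates, epsilon=0.1):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))  # random arm (exploration)
    return int(np.argmax(q_estimates))              # best-looking arm (exploitation)

def update(q_estimates, counts, arm, reward):
    # incremental update of the running mean reward for the pulled arm
    counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

Setting epsilon to 0 recovers the plain Greedy policy.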
Upper Confidence Bound (UCB) Algorithms
  • UCB:

    • Selects the arm with the highest upper confidence bound.
    • Automatically balances exploration and exploitation.
    • Explores less-pulled arms more, but focuses on promising arms over time.
    • Has adaptations for non-stationary environments.
      • SlidingWindowUCB:
        • Like UCB, but only considers recent observations.
        • Adapts better to abrupt changes in reward distributions.
      • DiscountedUCB:
        • Like UCB, but gives more weight to recent observations.
        • Adapts better to gradual changes in reward distributions.
  • Bayesian UCB:

    • Can incorporate prior knowledge about reward distributions.
    • Has adaptations for Bernoulli and Gaussian reward distributions.
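
For reference, the selection rule behind the UCB family above can be sketched in a few lines; this is the textbook UCB1 rule, not the library's own UCB classes:

import numpy as np

def ucb1_select(q_estimates, counts, t, c=2.0):
    # pick the arm with the highest upper confidence bound at time step t (>= 1)
    counts = np.asarray(counts, dtype=float)
    # unpulled arms get an infinite bonus so every arm is tried at least once
    bonus = np.where(counts > 0,
                     np.sqrt(c * np.log(t) / np.maximum(counts, 1)),
                     np.inf)
    return int(np.argmax(np.asarray(q_estimates) + bonus))

SlidingWindowUCB and DiscountedUCB apply the same rule, but compute the mean estimates and pull counts over a recent window or with exponentially decayed weights.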
Bayesian Methods
  • Thompson Sampling:
    • Samples from estimated reward distributions and picks the highest sample.
    • Naturally balances exploration and exploitation.
    • Often performs very well in practice, especially with enough samples.
    • Has adaptations for Bernoulli and Gaussian reward distributions.
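
A minimal Beta-Bernoulli sketch of the sampling step (illustrative only; PyMAB's Thompson Sampling policies also cover the Gaussian case):

import numpy as np

rng = np.random.default_rng(1)

# one Beta(alpha, beta) posterior per arm, starting from a uniform prior
alphas = np.ones(5)
betas = np.ones(5)

def ts_select():
    # sample a plausible success probability for each arm and pick the best sample
    samples = rng.beta(alphas, betas)
    return int(np.argmax(samples))

def ts_update(arm, reward):
    # a Bernoulli reward in {0, 1} updates the chosen arm's posterior
    alphas[arm] += reward
    betas[arm] += 1 - reward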
Softmax and Gradient-Based Methods
  • Softmax Selection:

    • To be implemented
    • Selects arms probabilistically based on their estimated values.
    • Higher estimated values have higher probability of being selected.
    • Temperature parameter controls exploration-exploitation trade-off.
  • Gradient:

    • To be implemented
    • Updates a preference for each arm based on the reward received.
    • Doesn't maintain estimates of actual reward values.
    • Can work well in relative reward scenarios.
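
Neither policy is implemented in the library yet; as a preview of the planned behavior, textbook softmax selection looks roughly like this (illustrative only):

import numpy as np

rng = np.random.default_rng(2)

def softmax_select(q_estimates, temperature=0.5):
    # sample an arm with probability proportional to exp(Q / temperature)
    prefs = np.asarray(q_estimates, dtype=float) / temperature
    prefs -= prefs.max()  # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))

A low temperature behaves almost greedily, while a high temperature approaches uniform random exploration.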
Contextual Bandits
  • Contextual Bandits:
    • Takes into account additional information (context) when making decisions.
    • Can learn to select different arms in different contexts.
    • More powerful but also more complex than standard bandits.
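
A bare-bones sketch of the contextual idea (one linear reward model per arm, with selection conditioned on the current context; this is not PyMAB's Contextual Bandits API, whose feature handling is richer):

import numpy as np

rng = np.random.default_rng(3)
n_arms, n_features = 3, 4

# per-arm ridge-regression statistics: A = X^T X + I, b = X^T r
A = np.stack([np.eye(n_features) for _ in range(n_arms)])
b = np.zeros((n_arms, n_features))

def select(context, epsilon=0.1):
    # predict each arm's reward for this particular context, then act epsilon-greedily
    theta = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    preds = theta @ context
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    return int(np.argmax(preds))

def update(arm, context, reward):
    # fold the observed (context, reward) pair into the chosen arm's model
    A[arm] += np.outer(context, context)
    b[arm] += reward * context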

Roadmap

This is an ongoing project, and we are always looking for suggestions and contributions. If you have any ideas or want to help, please reach out to us!

