Buffalo Gym environment

Buffalo Gym

A multi-armed bandit (MAB) environment for the gymnasium API. "One-armed bandit" is a nickname for slot machines, and Buffalo is a reference to one such slot machine that I am fond of. MABs are an excellent playground for theoretical exercises and for debugging RL agents, as they provide an environment that can be reasoned about easily. It once helped me to step back and write an MAB to debug my DQN agent. But there was a lack of native gymnasium environments, so I wrote Buffalo, an easy-to-use environment, in the hope that it might help someone else.

Check out the WIKI for fuller documentation.

Standard Bandit Problems

Buffalo ("Buffalo-v0" | "Bandit-v0")

Default multi-armed bandit environment. Arm center values are drawn from a normal distribution (0, arms). When an arm is pulled, a random value is drawn from a normal distribution (0, 1) and added to the chosen arm's center value. The environment is not intended to challenge an agent; it is intended to be easy to reason about while debugging.
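The reward mechanics described above can be sketched in a few lines of plain Python. This is an illustration only, not buffalo_gym's actual code: the class name GaussianBanditSketch is hypothetical, and reading "(0, arms)" as mean 0 and standard deviation arms is an assumption.

```python
import random


class GaussianBanditSketch:
    """Illustrative k-armed Gaussian bandit; not buffalo_gym's implementation."""

    def __init__(self, arms=10, seed=None):
        self.rng = random.Random(seed)
        # Assumption: "(0, arms)" is read as mean 0, standard deviation `arms`.
        self.centers = [self.rng.gauss(0.0, arms) for _ in range(arms)]

    def pull(self, arm):
        # Reward = the chosen arm's center plus unit-variance Gaussian noise.
        return self.centers[arm] + self.rng.gauss(0.0, 1.0)

    def best_arm(self):
        # Index of the arm with the highest center value.
        return max(range(len(self.centers)), key=self.centers.__getitem__)


bandit = GaussianBanditSketch(arms=5, seed=0)
samples = [bandit.pull(bandit.best_arm()) for _ in range(2000)]
mean_best = sum(samples) / len(samples)  # close to the best arm's center
```

Because the noise has unit variance, the sample mean of any arm converges quickly to its center, which is what makes an agent's value estimates easy to check by hand.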

Multi-Buffalo ("MultiBuffalo-v0" | "ContextualBandit-v0")

This serves as a contextual bandit implementation. It is a k-armed bandit with n states. The current state is indicated to the agent in the observation, and each state has different reward offsets for each arm. The goal of the agent is to learn the best action for each state. This is a good stepping stone to Markov Decision Processes.

This environment has an extra parameter, pace. By default (None), a new state is chosen at every step of the environment. It can be set to any integer to determine how many steps pass between random choices of a new state. Of course, a transition to a different state is not guaranteed, as the next state is chosen at random.
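A rough sketch of how pace could gate state changes is shown below. The class ContextualBanditSketch and its arguments are hypothetical stand-ins written for this illustration, and the Gaussian offsets are an assumption carried over from the default bandit; only the pace behaviour mirrors the description above.

```python
import random


class ContextualBanditSketch:
    """Illustrative contextual bandit with a `pace` parameter (hypothetical)."""

    def __init__(self, arms=4, states=2, pace=None, seed=None):
        self.rng = random.Random(seed)
        self.states = states
        # pace=None is treated as "resample the state at every step".
        self.pace = 1 if pace is None else pace
        # One row of per-arm reward offsets for each state (assumed Gaussian).
        self.offsets = [
            [self.rng.gauss(0.0, arms) for _ in range(arms)]
            for _ in range(states)
        ]
        self.state = self.rng.randrange(states)
        self.steps = 0

    def step(self, arm):
        reward = self.offsets[self.state][arm] + self.rng.gauss(0.0, 1.0)
        self.steps += 1
        if self.steps % self.pace == 0:
            # The new state is random, so it may equal the old one.
            self.state = self.rng.randrange(self.states)
        return self.state, reward


env = ContextualBanditSketch(pace=5, seed=1)
trace = [env.state]
for _ in range(20):
    state, _ = env.step(0)
    trace.append(state)
# trace[i] is the state after i steps; it can only change when i % 5 == 0.
```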

DuelingBuffalo ("DuelingBuffalo-v0" | "DuelingBandit-v0")

Yue et al. (2012) introduced the dueling bandit variant to model situations with only relative feedback. The agent pulls two levers simultaneously; the feedback is which lever provided the better reward. This restriction means the agent cannot observe the rewards themselves and must continually compare arms to determine the best. Given the reward-centric structure of gymnasium returns, we instead give a reward of 1 if the first chosen arm's reward was higher than the second's. The agent must choose two arms, which cannot be the same.
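The comparison rule can be sketched as a single function. The function name duel is hypothetical, the unit-variance noise is borrowed from the default bandit, and returning 0 when the first arm does not win is an assumption; the text only specifies the reward of 1.

```python
import random


def duel(centers, first, second, rng):
    """Return 1 if the first arm's sampled reward beats the second's."""
    if first == second:
        raise ValueError("the two arms must differ")
    r1 = centers[first] + rng.gauss(0.0, 1.0)
    r2 = centers[second] + rng.gauss(0.0, 1.0)
    # Assumption: the reward is 0 when the first arm does not win.
    return 1 if r1 > r2 else 0


rng = random.Random(0)
centers = [0.0, 3.0, -3.0]
# Arm 1 is far better than arm 2, so it should win almost every duel.
wins = sum(duel(centers, 1, 2, rng) for _ in range(1000))
```

An agent only ever sees the 0/1 outcome, so it has to estimate pairwise win rates rather than arm values.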

BoundlessBuffalo ("BoundlessBuffalo-v0" | "InfiniteArmedBandit-v0")

Built from the Wikipedia entry, which is based on Agrawal (1995; paywalled), BoundlessBuffalo approximates the infinite-armed bandit problem.
The reward for this bandit is the value of a polynomial of degree n evaluated at the action, with the coefficients randomly sampled from (-0.1, 0.1).
This environment tests the ability of an algorithm to find an optimal input in a continuous space. The dynamic drawing of new coefficients challenges algorithms to adapt continually to a changing landscape.
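As a small illustration of the reward function, the sketch below evaluates a randomly drawn polynomial at a continuous action. The helper names are hypothetical, and uniform sampling on (-0.1, 0.1) is an assumption, since the text gives only the range.

```python
import random


def sample_coefficients(degree, rng):
    # Assumption: coefficients are drawn uniformly from (-0.1, 0.1).
    return [rng.uniform(-0.1, 0.1) for _ in range(degree + 1)]


def polynomial_reward(action, coefficients):
    # Horner's rule; coefficients run from the highest degree to the constant.
    value = 0.0
    for c in coefficients:
        value = value * action + c
    return value


rng = random.Random(0)
coefficients = sample_coefficients(3, rng)
reward = polynomial_reward(0.5, coefficients)
```

Redrawing the coefficients with sample_coefficients changes the location of the best action, which is the moving target the paragraph above refers to.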

Nonstandard Bandit Problems

Buffalo Trail ("BuffaloTrail-v0" | "StatefulBandit-v0")

A Stateful Bandit builds on the Contextual Bandit by relaxing the assumption that rewards depend only on the current state. In this framework, the environment keeps a memory of past states and gives the agent the maximum reward only if it has encountered a specific sequence of states and selects the correct action.

This setup isolates an agent's ability to track history and infer belief states, without introducing the confounding factor of exploration, as the agent cannot control state transitions. Stateful Bandits provide a targeted environment for studying history-dependent decision-making and state estimation.
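One way to picture the history dependence is a fixed-length window over recent states. The sketch below is a simplified illustration with hypothetical names (SequenceRewardSketch, target_sequence, correct_action), and paying 0 outside the target sequence is an assumption made to keep the example short.

```python
from collections import deque


class SequenceRewardSketch:
    """Pays the maximum reward only after a specific run of states."""

    def __init__(self, target_sequence, correct_action, max_reward=1.0):
        self.target = tuple(target_sequence)
        self.correct_action = correct_action
        self.max_reward = max_reward
        # Keep only as many past states as the target sequence is long.
        self.history = deque(maxlen=len(self.target))

    def step(self, state, action):
        self.history.append(state)
        hit = (
            tuple(self.history) == self.target
            and action == self.correct_action
        )
        # Assumption: anything other than a hit pays 0.
        return self.max_reward if hit else 0.0


env = SequenceRewardSketch(target_sequence=[0, 1, 1], correct_action=2)
steps = [(0, 2), (1, 2), (1, 2), (1, 2), (0, 0)]
rewards = [env.step(state, action) for state, action in steps]
```

The states are supplied from outside in this sketch, matching the point that the agent cannot steer transitions and can only track them.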

Symbolic State ("SymbolicStateBandit-v0")

In real slots, the state of the bandit has little to no impact on the underlying rewards. Plenty of flashing lights and game modes serve only to keep the player engaged. This SymbolicStateBandit (SSB) formulation simulates this. The states do not correlate with the underlying rewards in this contextual bandit.

Setting dynamic_rate to None keeps the rewards the same despite the changing states; setting dynamic_rate == pace randomly changes the arms with each state; any other value produces further uncorrelated behavior. This configuration serves as a test bed for the "worst case" scenario for a bandit/reinforcement learner. It measures the agent's ability to generalize well and/or how it performs when the environment breaks the typical assumptions.
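The interaction between pace and dynamic_rate might look roughly like the sketch below. This is one possible reading of the description, not the library's code: the class name, the default pace of 1, and the choice to redraw all arm centers every dynamic_rate steps are assumptions.

```python
import random


class SymbolicStateSketch:
    """States are shown to the agent but never used to compute rewards."""

    def __init__(self, arms=3, states=2, pace=1, dynamic_rate=None, seed=None):
        self.rng = random.Random(seed)
        self.arms, self.states = arms, states
        self.pace, self.dynamic_rate = pace, dynamic_rate
        self.centers = self._draw_centers()
        self.state = self.rng.randrange(states)
        self.steps = 0

    def _draw_centers(self):
        return [self.rng.gauss(0.0, self.arms) for _ in range(self.arms)]

    def step(self, arm):
        # The state is never consulted when computing the reward.
        reward = self.centers[arm] + self.rng.gauss(0.0, 1.0)
        self.steps += 1
        if self.steps % self.pace == 0:
            self.state = self.rng.randrange(self.states)
        # Assumed reading: arms are redrawn every `dynamic_rate` steps.
        if self.dynamic_rate is not None and self.steps % self.dynamic_rate == 0:
            self.centers = self._draw_centers()
        return self.state, reward


static_env = SymbolicStateSketch(pace=2, dynamic_rate=None, seed=0)
static_before = list(static_env.centers)
for _ in range(50):
    static_env.step(0)

dynamic_env = SymbolicStateSketch(pace=5, dynamic_rate=5, seed=0)
dynamic_before = list(dynamic_env.centers)
for _ in range(5):
    dynamic_env.step(0)
```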

Tired Buffalo ("TiredBuffalo-v0" | "FatigueBandit-v0")

I asked ChatGPT for a novel bandit formulation. This bandit is what it came up with. It's hardly novel, though, as it is a special case of the "Recovering Bandits" (Pike-Burke & Grünewälder, 2019) problem where all arms have the same function. It is listed under nonstandard bandits because it is not exactly their problem, though it is hardly new.

This bandit problem models resource depletion and recovery. Pulling an arm reduces its expected reward ("fatigue"), while unused arms gradually recover. Each arm has a unique maximum mean reward, requiring the agent to balance immediate rewards against long-term sustainability.
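The depletion and recovery dynamics can be sketched with simple linear updates. The class name and the fatigue and recovery parameters are hypothetical, the linear update rule is an assumption, and reward noise is left out so the arithmetic is easy to follow.

```python
class FatigueBanditSketch:
    """Pulled arms tire; resting arms recover up to their own maximum mean."""

    def __init__(self, max_means, fatigue=0.5, recovery=0.25):
        self.max_means = list(max_means)
        self.means = list(max_means)
        self.fatigue = fatigue      # hypothetical per-pull drop in the mean
        self.recovery = recovery    # hypothetical per-step gain while resting

    def pull(self, arm):
        # Return the expected reward of the pulled arm (noise omitted).
        reward_mean = self.means[arm]
        for i, cap in enumerate(self.max_means):
            if i == arm:
                self.means[i] -= self.fatigue
            else:
                self.means[i] = min(cap, self.means[i] + self.recovery)
        return reward_mean


env = FatigueBanditSketch(max_means=[2.0, 1.0])
rewards = [env.pull(0), env.pull(0), env.pull(1), env.pull(0)]
```

Pulling arm 0 twice drops its mean to the level of arm 1, so resting it for a step pays off on the next pull; this is the trade-off between immediate reward and sustainability.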

Using

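Install via pip (pip install buffalo-gym) and import buffalo_gym along with gymnasium.

import gymnasium as gym
import buffalo_gym  # importing the package makes the Buffalo environments available to gym.make

env = gym.make("Buffalo-v0")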
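Once an environment is created, it can be driven through the standard gymnasium reset/step loop. The helper below is a minimal sketch with a hypothetical name, run_random_steps; it relies only on the generic gymnasium interface (reset, step, action_space.sample) and makes no assumptions about Buffalo-specific options.

```python
def run_random_steps(env, steps=100, seed=0):
    """Step a Gymnasium-style env with random actions; return the mean reward."""
    obs, info = env.reset(seed=seed)
    total = 0.0
    for _ in range(steps):
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        total += float(reward)
        if terminated or truncated:
            # Start a new episode if the environment signals an end.
            obs, info = env.reset()
    return total / steps
```

With the environment created above, run_random_steps(env) gives a random-policy baseline against which a learning agent can be compared.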

Project details

Release history

0.4.0 (this version): Sep 6, 2025
0.3.1: Mar 20, 2025
0.3.0: Feb 28, 2025
0.2.0: Dec 26, 2024
0.1.0: Nov 30, 2024
0.0.3: Apr 26, 2024
0.0.2: Apr 14, 2024
0.0.1: Apr 14, 2024

Download files

Source Distribution

buffalo_gym-0.4.0.tar.gz (14.5 kB), uploaded Sep 6, 2025, Source

Built Distribution

buffalo_gym-0.4.0-py3-none-any.whl (17.0 kB), uploaded Sep 6, 2025, Python 3

File details

Details for the file buffalo_gym-0.4.0.tar.gz.

File metadata

Download URL: buffalo_gym-0.4.0.tar.gz
Upload date: Sep 6, 2025
Size: 14.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for buffalo_gym-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`bd1be96871bedf6a181f73024271708db4e8c0a51ffb5023aa7748a2128a4cc1`
MD5	`5db09d465cace0cbdef08844c374526b`
BLAKE2b-256	`85e8bb76864154dc19d98bea26b9bd35166b614434a957e3f10f617b823e5adb`

File details

Details for the file buffalo_gym-0.4.0-py3-none-any.whl.

File metadata

Download URL: buffalo_gym-0.4.0-py3-none-any.whl
Upload date: Sep 6, 2025
Size: 17.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for buffalo_gym-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`01b6ccd060ca1a6c25f0ebc3a26e552fef79010a04be197e23e68922bacd2b31`
MD5	`cfc04d78350a83a753a623ed4484a4c9`
BLAKE2b-256	`a87f16c2d76f234078e1cb515ce888a9e700d503d1437035cd04ff265a80d8c5`

buffalo-gym 0.4.0
