A No Limit Hold'em environment for training RL agents.

Pokerenv

Pokerenv is an OpenAI Gym (https://gym.openai.com/docs/) compliant reinforcement learning environment for No Limit Texas Hold'em. It supports tables of 2 to 6 players.

The environment can be configured to write hand history files, which can be viewed with any PokerStars-compatible tracking software (Holdem Manager, PokerTracker, etc.), so you can easily track the learning process.

Installation and dependencies

pip install numpy
pip install treys
pip install pokerenv

Usage information

Rewards are returned as a NumPy array whose nth element is the reward for the agent identified by the acting player flag value n in the observation.

The acting player flag in the observation is not the agent's position at the table. Each player at the table gets a unique id when the table instance is created, and that id is passed as the acting player flag in the observation. This way, an agent can keep responding to the same acting player flag value while still playing from every possible table position.
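To make the id-versus-seat distinction concrete, here is a toy model in plain Python. It is a hypothetical illustration, not pokerenv's internals: each player keeps a fixed id, seats rotate every hand, and the reward list is indexed by id, so an agent that always reads its own index gets its own reward regardless of where it sat. The functions `play_toy_hand` and `run_toy_session` are made up for this sketch.

```python
import random

# Toy model (hypothetical, not pokerenv's internals): every player keeps a
# fixed id for the lifetime of the table, while its seat changes from hand to hand.


def play_toy_hand(seat_order, rng):
    """Return (rewards, winner_id) with rewards indexed by player id, not by seat.

    seat_order[s] is the id of the player sitting in seat s. The "hand" is a
    stand-in for real play: a random seat wins 0.5 from every other player.
    """
    n = len(seat_order)
    winning_seat = rng.randrange(n)
    rewards = [0.0] * n
    for seat, player_id in enumerate(seat_order):
        rewards[player_id] = 0.5 * (n - 1) if seat == winning_seat else -0.5
    return rewards, seat_order[winning_seat]


def run_toy_session(n_players, n_hands, seed=0):
    """Play n_hands, rotating seats each hand, and total each id's rewards."""
    rng = random.Random(seed)
    seat_order = list(range(n_players))
    totals = [0.0] * n_players
    for _ in range(n_hands):
        rewards, _winner = play_toy_hand(seat_order, rng)
        # Each agent reads only rewards[its own id], whatever seat it occupied.
        for player_id in range(n_players):
            totals[player_id] += rewards[player_id]
        # Move everyone one seat over; the ids themselves never change.
        seat_order = seat_order[1:] + seat_order[:1]
    return totals
```

Because the lookup key is the id rather than the seat, the same agent object can be reused for every position without any bookkeeping about where it sat.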

Invalid actions

The environment handles invalid actions by ignoring them and automatically checking or folding instead. If configured to do so, it also applies an invalid action penalty to the corresponding reward. The observation contains entries that can be used to implement invalid action masking.

All observation entries required by the learning loop have human-readable index definitions in the obs_indices.py module.
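Invalid action masking can be sketched in plain Python as follows. This assumes a policy that outputs one score per action (the `scores` list is hypothetical) and a 0/1 mask in the same order as the action ids, which is how the example agent below treats `observation[indices.VALID_ACTIONS]`. Masked actions are excluded from both the greedy choice and the softmax, so the environment's automatic check/fold fallback is never triggered.

```python
import math
import random


def masked_argmax(scores, valid_mask):
    """Greedy choice: the highest-scoring action whose mask entry is 1."""
    best_action, best_score = None, -math.inf
    for action, (score, valid) in enumerate(zip(scores, valid_mask)):
        if valid == 1 and (best_action is None or score > best_score):
            best_action, best_score = action, score
    if best_action is None:
        raise ValueError("the mask contains no valid action")
    return best_action


def masked_softmax(scores, valid_mask):
    """Softmax over valid actions only; invalid actions get probability 0."""
    valid_scores = [s for s, v in zip(scores, valid_mask) if v == 1]
    if not valid_scores:
        raise ValueError("the mask contains no valid action")
    # Subtract the largest valid score for numerical stability.
    top = max(valid_scores)
    exps = [math.exp(s - top) if v == 1 else 0.0 for s, v in zip(scores, valid_mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_masked(scores, valid_mask, rng):
    """Stochastic choice that can never return an invalid action."""
    probs = masked_softmax(scores, valid_mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

The comparison `valid == 1` also works element by element on a NumPy float mask such as the one read from the observation.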

Toy example

Define an agent

import numpy as np
import pokerenv.obs_indices as indices
from pokerenv.table import Table
from pokerenv.common import PlayerAction, Action, action_list


class ExampleRandomAgent:
    def __init__(self):
        self.actions = []
        self.observations = []
        self.rewards = []

    def get_action(self, observation):
        self.observations.append(observation)
        # Ids of the actions currently marked as valid (mask entry == 1)
        valid_actions = np.argwhere(observation[indices.VALID_ACTIONS] == 1).flatten()
        # Legal bet size bounds for this decision
        valid_bet_low = observation[indices.VALID_BET_LOW]
        valid_bet_high = observation[indices.VALID_BET_HIGH]
        # Sample uniformly among the valid actions only (invalid action masking)
        chosen_action = PlayerAction(np.random.choice(valid_actions))
        bet_size = 0
        if chosen_action is PlayerAction.BET:
            # Pick a bet size anywhere inside the legal range
            bet_size = np.random.uniform(valid_bet_low, valid_bet_high)
        table_action = Action(chosen_action, bet_size)
        self.actions.append(table_action)
        return table_action

    def reset(self):
        self.actions = []
        self.observations = []
        self.rewards = []
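The random agent draws its bet size uniformly from the legal range. A learning agent with a categorical action output often discretizes bet sizes instead. The helper below is a hypothetical sketch of that idea (not part of pokerenv): it maps a choice among `n_bins` evenly spaced sizes onto the bounds read from `VALID_BET_LOW` and `VALID_BET_HIGH`, and its result could be passed as the `bet_size` of `Action(PlayerAction.BET, bet_size)` just as the example agent does.

```python
def discrete_bet_size(bin_index, n_bins, valid_bet_low, valid_bet_high):
    """Map a discrete choice in [0, n_bins) to a bet in [valid_bet_low, valid_bet_high].

    Bin 0 is the smallest legal bet and bin n_bins - 1 the largest; the bins
    in between are spaced evenly.
    """
    if not 0 <= bin_index < n_bins:
        raise ValueError("bin_index must be in [0, n_bins)")
    if n_bins == 1:
        return valid_bet_low
    fraction = bin_index / (n_bins - 1)
    return valid_bet_low + fraction * (valid_bet_high - valid_bet_low)
```

For example, with `n_bins=5` and a legal range of 2 to 10, the available sizes are 2, 4, 6, 8 and 10. Because the bins are recomputed from the bounds at every decision, they always stay legal.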

Create an environment

active_players = 6
# One agent per seat of the largest (6-player) table
agents = [ExampleRandomAgent() for _ in range(6)]
player_names = {0: 'TrackedAgent1', 1: 'Agent2'}  # The rest default to player3, player4, ...
# If True, only player 0's (here TrackedAgent1) private cards are written to hand history files
track_single_player = True
# Bounds, in big blinds, for randomizing player stack sizes in reset()
low_stack_bbs = 50
high_stack_bbs = 200
hand_history_location = 'hands/'
invalid_action_penalty = 0
table = Table(active_players,
              player_names=player_names,
              track_single_player=track_single_player,
              stack_low=low_stack_bbs,
              stack_high=high_stack_bbs,
              hand_history_location=hand_history_location,
              invalid_action_penalty=invalid_action_penalty)
table.seed(1)
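The `stack_low` and `stack_high` bounds control how reset() randomizes starting stacks. The exact distribution is not documented here, so the sketch below assumes uniform draws and a hypothetical `big_blind` chip value. It only illustrates what the two bounds mean and why a seeded generator, which plays the role `table.seed(1)` plays for the table, makes runs reproducible.

```python
import random


def sample_stacks(n_players, low_bbs, high_bbs, big_blind, rng):
    """Draw one starting stack per player, uniformly in [low_bbs, high_bbs]
    big blinds (an assumption), and return the stacks in chips."""
    return [rng.uniform(low_bbs, high_bbs) * big_blind for _ in range(n_players)]
```

Calling `sample_stacks(6, 50, 200, 2.0, random.Random(1))` twice returns the same six stacks, each between 100 and 400 chips.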

Implement learning loop

iteration = 1
while True:
    # Write hand histories only for every 50th hand
    if iteration % 50 == 0:
        table.hand_history_enabled = True
    # Play each hand at a random table size of 2 to 6 players
    active_players = np.random.randint(2, 7)
    table.n_players = active_players
    obs = table.reset()
    for agent in agents:
        agent.reset()
    acting_player = int(obs[indices.ACTING_PLAYER])
    while True:
        action = agents[acting_player].get_action(obs)
        obs, reward, done, _ = table.step(action)
        if done:
            # Distribute the final rewards to every player in the hand
            for i in range(active_players):
                agents[i].rewards.append(reward[i])
            break
        else:
            # This step can be skipped unless the invalid action penalty is enabled,
            # since otherwise a reward is only given when the pot is distributed and done is set
            agents[acting_player].rewards.append(reward[acting_player])
            acting_player = int(obs[indices.ACTING_PLAYER])
    iteration += 1
    table.hand_history_enabled = False
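The loop only collects rewards; turning them into training targets is left to the learning algorithm. One common choice (an assumption here, not something pokerenv prescribes) is the discounted return-to-go for each entry of an agent's `rewards` list, computed after the inner loop breaks and before the next `agent.reset()` clears the list. With `invalid_action_penalty = 0` every entry except the final payoff is zero, so with `gamma=1.0` each return equals the hand's final reward.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return-to-go for each step: rewards[t] + gamma * rewards[t + 1] + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([0.0, 0.0, 1.0], gamma=0.5)` gives `[0.25, 0.5, 1.0]`.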

Release history

1.0.1 (this version): Oct 12, 2021
1.0.0: Oct 11, 2021
0.1.8: Jul 11, 2021
0.1.6: Jul 1, 2021
0.1.5: Jun 13, 2021
0.1.4: Jun 13, 2021
0.1.3: Jun 12, 2021
0.1.2: Jun 11, 2021
0.1.1: Jun 9, 2021
0.1.0: Jun 9, 2021

Download files

Source distribution: pokerenv-1.0.1.tar.gz (11.4 kB), uploaded Oct 12, 2021

Built distribution: pokerenv-1.0.1-py3-none-any.whl (11.0 kB, Python 3), uploaded Oct 12, 2021

Hashes for pokerenv-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`b34a5c14e3f8460a540543d2fa1f0638ba55c96f6ee7aa3c5e21d2180d4ea0fa`
MD5	`1a33077216ef106a7eca465de4425340`
BLAKE2b-256	`70a39c170eb35599580e42566967ece85b4e72ad4daa49cd7c0673fa6127b362`

Hashes for pokerenv-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e548b91a57b9dc4478292e435f30ba84a53a2c5c464ca896ff96190ca983866e`
MD5	`85aad3226ee928a6c1048809348b01f0`
BLAKE2b-256	`e0b343a0c7c3fe3e47dd6776bf70b278cb429cfc807158fb4fb47b88c2f9939c`