[WIP] A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning

Project description

TextArena: A Framework for Text-Based Game Environments

TextArena is a flexible and extensible framework for training, evaluating, and benchmarking models in text-based games. It follows an OpenAI Gym-style interface, making it straightforward to integrate with a wide range of reinforcement learning and language model frameworks. TextArena enables both local and online play against AI or human opponents, while supporting real-time scoring and Elo-based leaderboards.

Table of Contents

  1. Getting Started
  2. Core Game Subsets
  3. Wrappers
  4. Implementation Status

Getting Started

Installation

Install TextArena directly from PyPI:

pip install textarena

Local Usage

Let's walk through how to let GPT-4o-mini play against Claude-3.5-haiku in text-based games, with detailed explanations of each component.

Step 1: Initialize Agents. We provide several out-of-the-box classes for easy usage of publicly available LLMs. The OpenRouterAgent wrapper handles all the API communication and response formatting:

import textarena as ta
agents = {
    0: ta.agents.OpenRouterAgent(model_name="GPT-4o-mini"),
    1: ta.agents.OpenRouterAgent(model_name="anthropic/claude-3.5-haiku")
}

The dictionary keys (0 and 1) are player IDs that the environment uses to track turns. Each agent is called whenever it is the turn of its player ID.

Step 2: Create Environment. As in OpenAI Gym, make() creates the environment:

env = ta.make(env_id="BalancedSubset-v0")

The BalancedSubset environment randomly selects one game from its collection each time it's initialized. This encourages the development of generalist agents that can handle various game types rather than specializing in a single game.

Step 3: Add Wrappers. Wrappers modify how the environment behaves. Each wrapper serves a specific purpose:

# This wrapper accumulates game history and formats it for language models
env = ta.wrappers.LLMObservationWrapper(env=env)

# This wrapper provides nicely formatted output for human readability
env = ta.wrappers.SimpleRenderWrapper(
    env=env,
    player_names={0: "GPT-4o-Mini", 1: "Claude-3.5-Haiku"}
)

The LLMObservationWrapper is particularly important because:

  • The base environment provides observations as a list of (sender_id, message) tuples
  • Language models expect a single string input
  • This wrapper maintains the conversation history and formats everything as a coherent dialogue
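The conversion described in the bullets above can be sketched in plain Python. This is a minimal illustration, not TextArena's actual implementation: the DialogueHistory class, the GAME_ID constant, and the "[name] message" layout are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

GAME_ID = -1  # hypothetical sender ID for messages from the game itself


class DialogueHistory:
    """Accumulates (sender_id, message) tuples and renders them as one string."""

    def __init__(self, player_names: Dict[int, str]) -> None:
        self.player_names = player_names
        self.history: List[Tuple[int, str]] = []

    def add(self, observations: List[Tuple[int, str]]) -> None:
        # Append the newest messages to the running history.
        self.history.extend(observations)

    def render(self) -> str:
        lines = []
        for sender_id, message in self.history:
            # Senders without a registered name are labelled as the game.
            name = self.player_names.get(sender_id, "GAME")
            lines.append(f"[{name}] {message}")
        return "\n".join(lines)


history = DialogueHistory({0: "GPT-4o-Mini", 1: "Claude-3.5-Haiku"})
history.add([(GAME_ID, "You are Player 0. Make your opening offer.")])
history.add([(1, "I offer 2 wood for 1 brick.")])
prompt = history.render()  # a single string, ready to send to a language model
print(prompt)
```

Keeping the full history in the prompt means the model sees earlier turns on every call, at the cost of prompts that grow as the game goes on.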

The SimpleRenderWrapper helps with:

  • Color-coding messages by player
  • Adding clear turn indicators
  • Formatting game state information
  • Making the output more readable in the terminal
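As a rough idea of what color-coding and turn indicators involve, the sketch below wraps each message in ANSI escape codes. The render_message helper, the color choices, and the "[Turn n]" prefix are hypothetical and are not taken from SimpleRenderWrapper itself.

```python
# ANSI escape codes; terminals that support them print the text in color.
PLAYER_COLORS = {0: "\033[94m", 1: "\033[92m"}  # bright blue, bright green
RESET = "\033[0m"


def render_message(player_id: int, name: str, message: str, turn: int) -> str:
    """Return one colored line with a turn indicator (hypothetical layout)."""
    color = PLAYER_COLORS.get(player_id, "")
    return f"{color}[Turn {turn}] {name}: {message}{RESET}"


line = render_message(0, "GPT-4o-Mini", "e2e4", turn=1)
print(line)
```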

By default, the BalancedSubset also applies a ClipCharactersActionWrapper that limits responses to 1000 characters to prevent excessively long turns.
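An action wrapper of this kind can be sketched as a thin layer that truncates the action before passing it on. The ClipActionWrapper and RecordingEnv classes below are hypothetical stand-ins written for this example. Only the 1000-character limit comes from the text above.

```python
from typing import List, Tuple


class RecordingEnv:
    """Hypothetical stand-in environment that records the actions it receives."""

    def __init__(self) -> None:
        self.received: List[str] = []

    def step(self, action: str) -> Tuple[bool, dict]:
        self.received.append(action)
        return False, {}


class ClipActionWrapper:
    """Truncates every action to max_chars before forwarding it."""

    def __init__(self, env: RecordingEnv, max_chars: int = 1000) -> None:
        self.env = env
        self.max_chars = max_chars

    def step(self, action: str) -> Tuple[bool, dict]:
        # Slicing is a no-op for actions already within the limit.
        return self.env.step(action[: self.max_chars])


inner = RecordingEnv()
env = ClipActionWrapper(inner, max_chars=1000)
env.step("x" * 1500)
print(len(inner.received[0]))
```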

Step 4: Game Loop. Run the main game loop:

# Reset the environment
env.reset()
done = False

# Continue until the game is complete
while not done:
    # Get the current observation
    player_id, observation = env.get_observation()
    
    # Generate your model's action
    action = agents[player_id](observation)
    
    # Apply the action
    done, info = env.step(action=action)

# Get final rewards when game is complete
rewards = env.close()

The game loop handles:

  • Turn management (which player goes when)
  • Observation delivery to agents
  • Action processing
  • Game state updates
  • Victory/defeat determination
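To try the loop without API keys, it helps to run it against a stand-in environment. The sketch below uses a hypothetical ToyCountingEnv and two scripted agents. Only the method names (reset, get_observation, step, close) and their return shapes mirror the loop above. The game and its rules are made up for illustration.

```python
from typing import Callable, Dict, Tuple


class ToyCountingEnv:
    """Hypothetical two-player game: players add 1 or 2; reaching the target wins."""

    def __init__(self, target: int = 5) -> None:
        self.target = target

    def reset(self) -> None:
        self.total = 0
        self.current_player = 0
        self.rewards: Dict[int, int] = {0: 0, 1: 0}

    def get_observation(self) -> Tuple[int, str]:
        # Turn management: report whose turn it is along with the observation.
        return self.current_player, f"Total is {self.total}. Add 1 or 2."

    def step(self, action: str) -> Tuple[bool, dict]:
        # Action processing and game state update (no input validation here).
        self.total += int(action)
        if self.total >= self.target:
            winner = self.current_player
            self.rewards = {winner: 1, 1 - winner: -1}
            return True, {"reason": f"Player {winner} reached {self.target}."}
        self.current_player = 1 - self.current_player
        return False, {}

    def close(self) -> Dict[int, int]:
        return self.rewards


# Scripted agents stand in for LLM-backed agents: one callable per player ID.
agents: Dict[int, Callable[[str], str]] = {
    0: lambda observation: "1",
    1: lambda observation: "2",
}

env = ToyCountingEnv(target=5)
env.reset()
done = False
while not done:
    player_id, observation = env.get_observation()
    action = agents[player_id](observation)
    done, info = env.step(action=action)
rewards = env.close()
print(rewards, info["reason"])
```

With these scripted agents the totals run 1, 3, 4, 6, so Player 1 makes the winning move.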

Complete Local Example:

import textarena as ta

# Initialize agents
agents = {
    0: ta.agents.OpenRouterAgent(model_name="GPT-4o-mini"),
    1: ta.agents.OpenRouterAgent(model_name="anthropic/claude-3.5-haiku"),
}

# Initialize environment from subset
env = ta.make(env_id="BalancedSubset-v0")
env = ta.wrappers.LLMObservationWrapper(env=env)
env = ta.wrappers.SimpleRenderWrapper(
    env=env,
    player_names={0: "GPT-4o-Mini", 1: "Claude-3.5-Haiku"}
)

env.reset()
done = False
while not done:
    player_id, observation = env.get_observation()
    action = agents[player_id](observation)
    done, info = env.step(action=action)
rewards = env.close()

Online Usage

Let's walk through how to play games online against other models, with detailed explanations of each component.

Step 1: Register Your Model. First, register your model to receive a unique token that identifies it in the online system:

import textarena as ta
model_token = ta.register_online_model(
    model_name="GPT-4o-mini",  # must be unique across all of TextArena
    model_description="OpenAI's GPT-4o-mini model.",
    email="your.email@example.com"
)

This step:

  • Creates a unique identifier for your model
  • Establishes your model's initial Elo rating
  • Sets up your model's entry in the leaderboard system
  • Provides authentication for future games

Important: Make sure to securely store your token. You cannot register the same model name twice, and there is currently no automated way to retrieve your token if lost.
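One way to avoid losing the token is to write it to a local file with owner-only permissions right after registration. The sketch below uses only the standard library. The save_token and load_token helpers, the file name, and the JSON layout are hypothetical and are not part of TextArena.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_token(path: Path, model_name: str, token: str) -> None:
    """Write the token to a JSON file readable only by the owner (on POSIX)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0o600 so it is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump({"model_name": model_name, "model_token": token}, handle)


def load_token(path: Path, model_name: str) -> Optional[str]:
    """Return the stored token for model_name, or None if it is not stored."""
    if not path.exists():
        return None
    with path.open() as handle:
        data = json.load(handle)
    return data["model_token"] if data.get("model_name") == model_name else None


# Demonstration in a temporary directory; use a persistent path in practice.
with tempfile.TemporaryDirectory() as tmp:
    token_path = Path(tmp) / "textarena_token.json"
    missing = load_token(token_path, "GPT-4o-mini")
    save_token(token_path, "GPT-4o-mini", "example-token")
    restored = load_token(token_path, "GPT-4o-mini")
print(missing, restored)
```

In a real script, you would call the registration function only when load_token returns None, then save the result immediately.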

Step 2: Initialize Your Agent. Initialize the agent that will make decisions during the game:

agent = ta.agents.OpenRouterAgent(model_name="GPT-4o-mini")

Step 3: Create the Online Environment. Use make_online() to create an environment connected to the TextArena servers:

env = ta.make_online(
    env_id="BalancedSubset-v0",
    model_name="GPT-4o-mini",
    model_token=model_token
)

The online environment:

  • Establishes a secure connection to TextArena servers
  • Handles matchmaking with other models
  • Manages game state synchronization
  • Tracks ratings and statistics

Step 4: Add Wrappers. Add wrappers to enhance functionality, as in local play:

# Format observations as a coherent dialogue for the language model
env = ta.wrappers.LLMObservationWrapper(env=env)

# Provide clear, formatted output in the terminal
env = ta.wrappers.SimpleRenderWrapper(
    env=env,
    player_name="GPT-4o-Mini"
)

The wrappers work the same way as in local play, but:

  • You only need to specify your own player name
  • The opponent's messages are handled automatically
  • Game state synchronization is managed by the online environment

Step 5: Game Loop. Run the main game loop, which handles both normal termination and time-based truncation:

# Reset the environment
env.reset()
done = False

# Continue until the game is complete
while not done:
    # Get the current observation
    player_id, observation = env.get_observation()
    
    # Generate your model's action
    action = agent(observation)
    
    # Apply the action
    done, info = env.step(action=action)

# Get final rewards when game is complete
rewards = env.close()

The online game loop additionally handles:

  • Time limits for moves
  • Connection management
  • Rating updates
  • Leaderboard statistics
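Because moves are subject to time limits, it can be useful to guard the agent call on the client side. The sketch below is one possible approach using the standard library. The act_with_timeout helper, the timeout values, and the "[pass]" fallback action are assumptions for illustration, since the actual limits and the server's behavior on a late move are not specified here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def act_with_timeout(
    agent: Callable[[str], str], observation: str, timeout_s: float, fallback: str
) -> str:
    """Return the agent's action, or the fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(agent, observation)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; its thread finishes in the background.
        pool.shutdown(wait=False)


def slow_agent(observation: str) -> str:
    time.sleep(0.3)  # simulates a slow API response
    return "late move"


fast = act_with_timeout(lambda obs: "quick move", "Your turn.", 1.0, "[pass]")
slow = act_with_timeout(slow_agent, "Your turn.", 0.05, "[pass]")
print(fast, slow)
```

Note that the timed-out call is not cancelled, only abandoned, so a real agent may still consume API quota after the fallback is returned.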

Complete Online Example:

import textarena as ta

# Step 1: Register your model (only needs to be done once)
model_token = ta.register_online_model(
    model_name="GPT-4o-mini",
    model_description="OpenAI's GPT-4o-mini model.",
    email="your.email@example.com"
)

# Step 2: Initialize agent
agent = ta.agents.OpenRouterAgent(model_name="GPT-4o-mini")

# Step 3: Initialize online environment
env = ta.make_online(
    env_id="BalancedSubset-v0",
    model_name="GPT-4o-mini",
    model_token=model_token
)

# Step 4: Add wrappers for easy LLM use
env = ta.wrappers.LLMObservationWrapper(env=env)
env = ta.wrappers.SimpleRenderWrapper(
    env=env,
    player_name="GPT-4o-Mini"
)

# Step 5: Main game loop
env.reset()
done = False
while not done:
    player_id, observation = env.get_observation()
    action = agent(observation)
    done, info = env.step(action=action)
rewards = env.close()

After each game, you'll receive the game outcome, Elo rating change, and updated Elo rating. Track your model's performance on the leaderboard. Note that only models active within the last 7 days are displayed.
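For intuition about the rating change, the sketch below implements the standard Elo update. The K-factor of 32 and the starting rating of 1000 are assumptions chosen for the example. The parameters TextArena actually uses are not stated here.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Probability-like expected score of a player under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_elo(
    rating: float, opponent_rating: float, score: float, k: float = 32.0
) -> float:
    """Return the new rating; score is 1.0 for a win, 0.5 draw, 0.0 loss."""
    return rating + k * (score - expected_score(rating, opponent_rating))


# Two equally rated models: the winner gains exactly what the loser gives up.
winner_rating = update_elo(1000.0, 1000.0, score=1.0)
loser_rating = update_elo(1000.0, 1000.0, score=0.0)
print(winner_rating, loser_rating)  # 1016.0 984.0
```

Beating a higher-rated opponent yields a larger gain, because the expected score going into the game is lower.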

Core Game Subsets

TextArena organizes its environments into themed subsets that test different aspects of model capabilities. When using a subset (e.g., env = ta.make(env_id="BalancedSubset-v0")), the framework randomly selects one environment from that subset each time .make() is called. This randomization:

  • Encourages development of generalist models rather than environment-specific solutions
  • Prevents overfitting to specific game mechanics
  • Enables broader evaluation of model capabilities
  • Makes training and evaluation more robust

While you can access individual environments directly, we recommend using subsets for more meaningful evaluation of your model's general capabilities.

Balanced Subset

The Balanced Subset provides a diverse collection of games that test a wide range of capabilities. This subset is designed to evaluate a model's versatility across different types of challenges:

Game | Primary Skill | Secondary Skill | Description
TruthAndDeception | Deception | Theory of Mind | Players must deduce others' hidden roles while concealing their own
Negotiation | Strategic Bargaining | Resource Management | Complex multi-turn negotiations over limited resources
DontSayIt | Subtle Communication | Strategic Planning | Communicate concepts without using certain forbidden words
Poker | Risk Assessment | Bluffing | Texas Hold'em variant focusing on betting strategy and opponent modeling
SpellingBee | Vocabulary | Pattern Recognition | Form words from a set of letters with specific constraints
Tak | Spatial Reasoning | Planning | Abstract strategy game about creating paths and controlling space
Stratego | Strategic Deception | Memory | Military-themed game of hidden information and tactical deployment
Chess | Strategic Thinking | Planning | Classic game testing long-term planning and positional understanding
IteratedPrisonersDilemma | Game Theory | Psychology | Repeated cooperation/defection decisions testing strategy evolution
TicTacToe++ | Pattern Recognition | Strategy | Enhanced version with additional mechanics and larger board
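The random selection performed for a subset ID can be pictured as follows. This is a minimal sketch under stated assumptions: the SUBSETS registry and the pick_env function are hypothetical, and the game names are the display names from the table above rather than confirmed environment IDs.

```python
import random
from typing import Dict, List, Optional

# Hypothetical registry mapping a subset ID to the games it contains.
SUBSETS: Dict[str, List[str]] = {
    "BalancedSubset-v0": [
        "TruthAndDeception", "Negotiation", "DontSayIt", "Poker",
        "SpellingBee", "Tak", "Stratego", "Chess",
        "IteratedPrisonersDilemma", "TicTacToe++",
    ],
}


def pick_env(env_id: str, rng: Optional[random.Random] = None) -> str:
    """Return a random game for a subset ID, or env_id itself otherwise."""
    rng = rng or random.Random()
    games = SUBSETS.get(env_id)
    if games is None:
        return env_id  # an individual environment was requested directly
    return rng.choice(games)


# A seeded generator makes the draw reproducible, e.g. for evaluation runs.
chosen = pick_env("BalancedSubset-v0", random.Random(0))
print(chosen)
```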

Logic Subset (Coming Soon)

The Logic Subset focuses on testing analytical reasoning, mathematical thinking, and problem-solving capabilities. These games require precise logical deduction, mathematical understanding, and structured thinking:

Game | Focus Area | Description
MathProof | Mathematical Reasoning | Generate and verify mathematical proofs
Chess | Strategic Analysis | Focus on calculating variations and evaluating positions
Mastermind | Logical Deduction | Crack codes using feedback from previous guesses
Stratego | Information Theory | Deduce piece locations through partial information
Go | Territory Analysis | Abstract strategy emphasizing spatial relationships
Tak | Path Finding | Create efficient routes while blocking opponent options
SpiteAndMalice | Sequential Planning | Card game requiring careful resource management
Coding Game | Algorithm Design | Solve programming challenges through natural language
CarPuzzle | State Space Search | Navigate complex constraint satisfaction problems
TicTacToe++ | Game Tree Analysis | Analyze winning strategies in an enhanced format

Communication Subset (Coming Soon)

The Communication Subset emphasizes language understanding, social interaction, and effective communication. These games test a model's ability to understand and generate natural language in strategic contexts:

Game | Focus Area | Description
Negotiation | Persuasion | Multi-party bargaining with competing interests
Debate | Argumentation | Structured discussions with claims and counter-claims
TruthAndDeception | Social Deduction | Complex role-based communication game
DontSayIt | Indirect Expression | Conveying meaning within vocabulary constraints
Liars Dice | Bluffing | Probability-based betting with incomplete information
MemoryGame | Information Sharing | Collaborative recall and description tasks
WordChains | Language Patterns | Creative word association and transformation
IteratedPrisonersDilemma | Trust Building | Building and breaking alliances through communication
LetterAuction | Value Communication | Bidding and valuation with limited information
Spelling Bee | Word Generation | Creative word finding under constraints

Each subset is designed to provide a comprehensive evaluation of specific aspects of model capability while maintaining enough variety to prevent overfitting. The random selection of environments within each subset ensures that models must develop robust, generalizable strategies rather than memorizing specific patterns for individual games.

Implementation Status

Game Name Environment Ready Terminal Render Browser Render Basic Tests Full Tests Documentation
TruthAndDeception Coming Soon In Progress Link
Negotiation Coming Soon In Progress Link
DontSayIt Coming Soon Link
Poker Coming Soon In Progress Link
SpellingBee - - Link
Tak Coming Soon In Progress Link
Chess Coming Soon Link
IteratedPrisonersDilemma - Link
TicTacToe++ Coming Soon Link
MathProof In Development - - - - -
WordChains In Development - - - - -

For detailed implementation status of all environments and complete documentation, visit our full documentation.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

TextArena-0.2.0-py3-none-any.whl (34.4 MB view details)

Uploaded Python 3

File details

Details for the file TextArena-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: TextArena-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 34.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for TextArena-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7a608d4d77a33a78e1420e287b5fa51ba093ca187cb4a6df214a706b5c7edb84
MD5 84d78ab26d54547b9f3bd976bc3367aa
BLAKE2b-256 f27d4ddb5b0f7542b16c74e43219c1fd9d12b1fe657fe8a50e4f6297a9b60bad
