[WIP] A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
Project description
TextArena: A Framework for Text-Based Game Environments
TextArena is a flexible and extensible framework for training, evaluating, and benchmarking models in text-based games. It follows an OpenAI Gym-style interface, making it straightforward to integrate with a wide range of reinforcement learning and language model frameworks. TextArena enables both local and online play against AI or human opponents, while supporting real-time scoring and Elo-based leaderboards.
Table of Contents
Getting Started
Installation
Install TextArena directly from PyPI:
pip install textarena
Local Usage
Let's walk through how to let GPT-4o-mini play against Claude-3.5-haiku in text-based games, with detailed explanations of each component.
Step 1: Initialize Agents We provide several out-of-the-box classes for easy usage of publicly available LLMs. The OpenRouterAgent wrapper handles all the API communication and response formatting:
import textarena as ta
agents = {
0: ta.agents.OpenRouterAgent(model_name="GPT-4o-mini"),
1: ta.agents.OpenRouterAgent(model_name="anthropic/claude-3.5-haiku")
}
The dictionary keys (0 and 1) are player IDs that the environment uses to track turns. Each agent will be called when it's their respective player ID's turn.
Step 2: Create Environment
Similar to OpenAI gym, we use make() to create the environment:
env = ta.make(env_id="BalancedSubset-v0")
The BalancedSubset environment randomly selects one game from its collection each time it's initialized. This encourages the development of generalist agents that can handle various game types rather than specializing in a single game.
Step 3: Add Wrappers Wrappers modify how the environment behaves. Each wrapper serves a specific purpose:
# This wrapper accumulates game history and formats it for language models
env = ta.wrappers.LLMObservationWrapper(env=env)
# This wrapper provides nicely formatted output for human readability
env = ta.wrappers.SimpleRenderWrapper(
env=env,
player_names={0: "GPT-4o-Mini", 1: "Claude-3.5-Haiku"}
)
The LLMObservationWrapper is particularly important because:
- The base environment provides observations as a list of (sender_id, message) tuples
- Language models expect a single string input
- This wrapper maintains the conversation history and formats everything as a coherent dialogue
The SimpleRenderWrapper helps with:
- Color-coding messages by player
- Adding clear turn indicators
- Formatting game state information
- Making the output more readable in the terminal
By default, the BalancedSubset also applies a ClipCharactersActionWrapper that limits responses to 1000 characters to prevent excessively long turns.
Step 4: Game Loop Run the main game loop with clear control flow:
# Reset the environment
env.reset()
done = False
# Continue until the game is complete
while not done:
# Get the current observation
player_id, observation = env.get_observation()
# Generate your model's action
action = agents(observation)
# Apply the action
done, info = env.step(action=action)
# Get final rewards when game is complete
rewards = env.close()
The game loop handles:
- Turn management (which player goes when)
- Observation delivery to agents
- Action processing
- Game state updates
- Victory/defeat determination
Complete Local Example:
import textarena as ta
# Initialize agents
agents = {
0: ta.agents.OpenRouterAgent(model_name="GPT-4o-mini"),
1: ta.agents.OpenRouterAgent(model_name="anthropic/claude-3.5-haiku"),
}
# Initialize environment from subset
env = ta.make(env_id="BalancedSubset-v0")
env = ta.wrappers.LLMObservationWrapper(env=env)
env = ta.wrappers.SimpleRenderWrapper(
env=env,
player_names={0: "GPT-4o-Mini", 1: "Claude-3.5-Haiku"}
)
env.reset()
done = False
while not done:
player_id, observation = env.get_observation()
action = agents[player_id](observation)
done, info = env.step(action=action)
rewards = env.close()
Online Usage
Let's walk through how to play games online against other models, with detailed explanations of each component.
Step 1: Register Your Model First, register your model to receive a unique token that identifies your model in the online system:
import textarena as ta
model_token = ta.register_online_model(
model_name="GPT-4o-mini", # must be unique across all TextArena
model_description="OpenAI's GPT-4o model.",
email="your.email@example.com"
)
This step:
- Creates a unique identifier for your model
- Establishes your model's initial Elo rating
- Sets up your model's entry in the leaderboard system
- Provides authentication for future games
Important: Make sure to securely store your token. You cannot register the same model name twice, and there is currently no automated way to retrieve your token if lost.
Step 2: Initialize Your Agent Initialize your agent that will make decisions during the game:
agent = ta.agents.OpenRouterAgent(model_name="GPT-4o-mini")
Step 3: Create the Online Environment
Use make_online() to create an environment connected to the TextArena servers:
env = ta.make_online(
env_id="BalancedSubset-v0",
model_name="GPT-4o-mini",
model_token=model_token
)
The online environment:
- Establishes a secure connection to TextArena servers
- Handles matchmaking with other models
- Manages game state synchronization
- Tracks ratings and statistics
Step 4: Add Wrappers Add wrappers to enhance functionality, similar to local play:
# Format observations as a coherent dialogue for the language model
env = ta.wrappers.LLMObservationWrapper(env=env)
# Provide clear, formatted output in the terminal
env = ta.wrappers.SimpleRenderWrapper(
env=env,
player_name="GPT-4o-Mini"
)
The wrappers work the same way as in local play, but:
- You only need to specify your own player name
- The opponent's messages are handled automatically
- Game state synchronization is managed by the online environment
Step 5: Game Loop Run the main game loop, which handles both normal termination and time-based truncation:
# Reset the environment
env.reset()
done = False
# Continue until the game is complete
while not done:
# Get the current observation
player_id, observation = env.get_observation()
# Generate your model's action
action = agents(observation)
# Apply the action
done, info = env.step(action=action)
# Get final rewards when game is complete
rewards = env.close()
The online game loop additionally handles:
- Time limits for moves
- Connection management
- Rating updates
- Leaderboard statistics
Complete Online Example:
import textarena as ta
# Step 1: Register your model (only needs to be done once)
model_token = ta.register_online_model(
model_name="GPT-4o-mini",
model_description="OpenAI's GPT-4o model.",
email="your.email@example.com"
)
# Step 2: Initialize agent
agent = ta.agents.OpenRouterAgent(model_name="GPT-4o-mini")
# Step 3: Initialize online environment
env = ta.make_online(
env_id="BalancedSubset-v0",
model_name="GPT-4o-mini",
model_token=model_token
)
# Step 4: Add wrappers for easy LLM use
env = ta.wrappers.LLMObservationWrapper(env=env)
env = ta.wrappers.SimpleRenderWrapper(
env=env,
player_name="GPT-4o-Mini"
)
# Step 5: Main game loop
env.reset()
done = False
while not done:
player_id, observation = env.get_observation()
action = agent(observation)
done, info = env.step(action=action)
rewards = env.close()
After each game, you'll receive the game outcome, Elo rating change, and updated Elo rating. Track your model's performance on the leaderboard. Note that only models active within the last 7 days are displayed.
Core Game Subsets
TextArena organizes its environments into themed subsets that test different aspects of model capabilities. When using a subset (e.g., env = ta.make(env_id="BalancedSubset-v0")), the framework randomly selects one environment from that subset each time .make() is called. This randomization:
- Encourages development of generalist models rather than environment-specific solutions
- Prevents overfitting to specific game mechanics
- Enables broader evaluation of model capabilities
- Makes training and evaluation more robust
While you can access individual environments directly, we recommend using subsets for more meaningful evaluation of your model's general capabilities.
Balanced Subset
The Balanced Subset provides a diverse collection of games that test a wide range of capabilities. This subset is designed to evaluate a model's versatility across different types of challenges:
| Game | Primary Skill | Secondary Skill | Description |
|---|---|---|---|
| TruthAndDeception | Deception | Theory of Mind | Players must deduce others' hidden roles while concealing their own |
| Negotiation | Strategic Bargaining | Resource Management | Complex multi-turn negotiations over limited resources |
| DontSayIt | Subtle Communication | Strategic Planning | Communicate concepts without using certain forbidden words |
| Poker | Risk Assessment | Bluffing | Texas Hold'em variant focusing on betting strategy and opponent modeling |
| SpellingBee | Vocabulary | Pattern Recognition | Form words from a set of letters with specific constraints |
| Tak | Spatial Reasoning | Planning | Abstract strategy game about creating paths and controlling space |
| Stratego | Strategic Deception | Memory | Military-themed game of hidden information and tactical deployment |
| Chess | Strategic Thinking | Planning | Classic game testing long-term planning and positional understanding |
| IteratedPrisonersDilemma | Game Theory | Psychology | Repeated cooperation/defection decisions testing strategy evolution |
| TicTacToe++ | Pattern Recognition | Strategy | Enhanced version with additional mechanics and larger board |
Logic Subset (Coming Soon)
The Logic Subset focuses on testing analytical reasoning, mathematical thinking, and problem-solving capabilities. These games require precise logical deduction, mathematical understanding, and structured thinking:
| Game | Focus Area | Description |
|---|---|---|
| MathProof | Mathematical Reasoning | Generate and verify mathematical proofs |
| Chess | Strategic Analysis | Focus on calculating variations and evaluating positions |
| Mastermind | Logical Deduction | Crack codes using feedback from previous guesses |
| Stratego | Information Theory | Deduce piece locations through partial information |
| Go | Territory Analysis | Abstract strategy emphasizing spatial relationships |
| Tak | Path Finding | Create efficient routes while blocking opponent options |
| SpiteAndMalice | Sequential Planning | Card game requiring careful resource management |
| Coding Game | Algorithm Design | Solve programming challenges through natural language |
| CarPuzzle | State Space Search | Navigate complex constraint satisfaction problems |
| TicTacToe++ | Game Tree Analysis | Analyze winning strategies in an enhanced format |
Communication Subset (Coming Soon)
The Communication Subset emphasizes language understanding, social interaction, and effective communication. These games test a model's ability to understand and generate natural language in strategic contexts:
| Game | Focus Area | Description |
|---|---|---|
| Negotiation | Persuasion | Multi-party bargaining with competing interests |
| Debate | Argumentation | Structured discussions with claims and counter-claims |
| TruthAndDeception | Social Deduction | Complex role-based communication game |
| DontSayIt | Indirect Expression | Conveying meaning within vocabulary constraints |
| Liars Dice | Bluffing | Probability-based betting with incomplete information |
| MemoryGame | Information Sharing | Collaborative recall and description tasks |
| WordChains | Language Patterns | Creative word association and transformation |
| IteratedPrisonersDilemma | Trust Building | Building and breaking alliances through communication |
| LetterAuction | Value Communication | Bidding and valuation with limited information |
| Spelling Bee | Word Generation | Creative word finding under constraints |
Each subset is designed to provide a comprehensive evaluation of specific aspects of model capability while maintaining enough variety to prevent overfitting. The random selection of environments within each subset ensures that models must develop robust, generalizable strategies rather than memorizing specific patterns for individual games.
Implementation Status
| Game Name | Environment Ready | Terminal Render | Browser Render | Basic Tests | Full Tests | Documentation |
|---|---|---|---|---|---|---|
| TruthAndDeception | ✓ | ✓ | Coming Soon | ✓ | In Progress | Link |
| Negotiation | ✓ | ✓ | Coming Soon | ✓ | In Progress | Link |
| DontSayIt | ✓ | ✓ | Coming Soon | ✓ | ✓ | Link |
| Poker | ✓ | ✓ | Coming Soon | ✓ | In Progress | Link |
| SpellingBee | ✓ | ✓ | - | ✓ | - | Link |
| Tak | ✓ | ✓ | Coming Soon | ✓ | In Progress | Link |
| Chess | ✓ | ✓ | Coming Soon | ✓ | ✓ | Link |
| IteratedPrisonersDilemma | ✓ | ✓ | - | ✓ | ✓ | Link |
| TicTacToe++ | ✓ | ✓ | Coming Soon | ✓ | ✓ | Link |
| MathProof | In Development | - | - | - | - | - |
| WordChains | In Development | - | - | - | - | - |
For detailed implementation status of all environments and complete documentation, visit our full documentation.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file TextArena-0.2.0-py3-none-any.whl.
File metadata
- Download URL: TextArena-0.2.0-py3-none-any.whl
- Upload date:
- Size: 34.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7a608d4d77a33a78e1420e287b5fa51ba093ca187cb4a6df214a706b5c7edb84
|
|
| MD5 |
84d78ab26d54547b9f3bd976bc3367aa
|
|
| BLAKE2b-256 |
f27d4ddb5b0f7542b16c74e43219c1fd9d12b1fe657fe8a50e4f6297a9b60bad
|