SPIRAL: Self-Play Reinforcement Learning framework for training LLMs on competitive games
SPIRAL-on-Tinker
Self-play reinforcement learning framework for training language models on competitive games, powered by Tinker.
Overview
SPIRAL-on-Tinker provides a scalable implementation of self-play RL training for LLMs using Tinker's distributed training infrastructure. The system trains models to play competitive two-player zero-sum games, developing reasoning and strategic capabilities through continuous self-improvement.
Key Features
- Actor-Learner Architecture: Parallel actors sample trajectories while a centralized learner processes them
- Role-conditioned Advantage Estimation (RAE): Separate advantage calculation for each player role
- Population-based Self-Play (FSP): Train against historical checkpoints for robust policies
- Multi-environment Support: TicTacToe, Kuhn Poker, Liar's Dice, Simple Negotiation, and more
- Async Training: Optional actor-learner decoupling with replay buffer
- Tinker Integration: Leverage Tinker's LoRA training, vLLM inference, and distributed infrastructure
Installation
From PyPI (Recommended)
# Create environment
conda create -y -n spiral python=3.10
conda activate spiral
# Install with Tinker backend (lightweight, recommended)
pip install spiral-rl[tinker]
# Install with OAT backend (requires GPU)
pip install spiral-rl[oat]
# Install with both backends
pip install spiral-rl[all]
# Install with development tools
pip install spiral-rl[full]
From Source
# Clone repository
git clone https://github.com/spiral-rl/spiral-on-tinker.git
cd spiral-on-tinker
# Create environment
conda create -y -n spiral python=3.10
conda activate spiral
# Install in editable mode
pip install -e .
# Or install with extras
pip install -e ".[tinker]" # Tinker backend only
pip install -e ".[full]" # Everything including dev tools
Quick Start
Basic Training
python train_spiral_tinker.py \
model_name="Qwen/Qwen3-8B-Base" \
renderer_name=qwen3 \
env_ids='TicTacToe-v0,KuhnPoker-v1' \
batch_size=128 \
learning_rate=4e-5 \
wandb_project=spiral
Population-based Self-Play (FSP)
python train_spiral_tinker.py \
model_name="Qwen/Qwen3-8B-Base" \
env_ids='KuhnPoker-v1,LiarsDice-v1' \
fsp_enabled=True \
fsp_pool_size=25 \
fsp_update_interval=5 \
wandb_project=spiral
Resume Training with FSP
python train_spiral_tinker.py \
model_name="Qwen/Qwen3-8B-Base" \
load_checkpoint_path="tinker://xxx/weights/000180" \
fsp_resume_checkpoint_base="tinker://xxx/sampler_weights/" \
fsp_enabled=True \
fsp_pool_size=25
See RESUME_FSP.md for detailed instructions on resuming FSP training.
Architecture
Package Structure
The codebase uses a modular three-tier architecture:
spiral/
├── core/ # Shared components (used by both backends)
│ ├── envs/ # Custom TextArena game implementations
│ ├── agents/ # Agent implementations (RandomAgent, utils)
│ ├── template.py # Prompt templates for different model types
│ └── utils.py # Basic utilities (EMA, GameState, extract_boxed_answer)
├── oat/ # OAT-specific implementation (vLLM-based)
│ ├── components.py # SelfPlayCollector, MATHOracle
│ └── metrics.py # EvaluationMetrics
└── tinker/ # Tinker-specific implementation (imports from spiral.core)
├── dataset.py # SpiralRLDatasetBuilder
├── renderer.py # Prompt rendering with template selection
├── utils.py # Tinker-specific utils (logging, metrics)
├── training/ # Training loops and environment management
│ ├── env.py # SpiralTwoPlayerEnv, TwoPlayerCoordinator
│ ├── rollouts.py # Trajectory collection with draw retry
│ ├── train.py # Main training loop factory
│ ├── train_step.py # Single training step logic
│ ├── population.py # PopulationManager for FSP
│ └── async_actor_learner/ # Async architecture with replay buffer
└── eval/ # Evaluation framework
├── evaluator.py # GameEvaluator for game-based evaluation
└── math_test.py # Math benchmark evaluation
Key Architecture Points:
- spiral/core contains all shared components: game environments, agents, templates, and basic utilities
- spiral/tinker and spiral/oat import from spiral.core (no code duplication)
- spiral/tinker/utils.py contains Tinker-specific utilities (logging, trajectory metrics, JSON serialization)
Two Training Backends
- train_spiral.py: OAT backend with vLLM, multi-GPU support, original SPIRAL implementation
- train_spiral_tinker.py: Tinker backend with distributed training, LoRA, FSP support
Training Pipeline (Tinker Backend)
- Environment Setup: SpiralTwoPlayerEnv wraps TextArena games with observation formatters
- Self-Play Collection: do_group_rollout() generates trajectories with both players using the current policy
- Dataset Building: SpiralRLDatasetBuilder processes trajectories into training data
- Advantage Estimation: RAE computes separate advantages for each role using role-specific baselines
- Policy Updates: Tinker's PPO learner updates the policy using collected trajectories
- Evaluation: GameEvaluator tracks win rates against opponents, with optional math test evaluation
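To make the data flow in the pipeline concrete, here is a minimal self-contained sketch of self-play collection with zero-sum returns. The stub game, function names, and record fields are illustrative, not the real API (the actual pipeline lives under spiral/tinker/training/):

```python
# Hypothetical sketch: collect one self-play game, tag each turn with
# its player's role, and assign zero-sum returns at the end.
import random

def play_stub_game(policy_a, policy_b, max_turns=4):
    """Alternate turns between two policies and return per-turn records."""
    records = []
    for turn in range(max_turns):
        role = turn % 2
        policy = policy_a if role == 0 else policy_b
        records.append({"role": role, "turn": turn, "action": policy(turn)})
    # Zero-sum outcome: +1 for every turn by the winner's role, -1 otherwise.
    winner = random.choice([0, 1])
    for r in records:
        r["return"] = 1.0 if r["role"] == winner else -1.0
    return records

random.seed(0)
policy = lambda turn: f"move-{turn}"  # in self-play, both seats share one policy
traj = play_stub_game(policy, policy)
```

Downstream, the dataset builder turns such role-tagged records into training datums, and RAE converts the returns into per-role advantages.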
Configuration
Key training parameters:
# Model settings
model_name: str = "Qwen/Qwen3-8B-Base"
renderer_name: str = "qwen3" # qwen3, llama3, deepseek, etc.
lora_rank: int = 64
# Training
batch_size: int = 128
learning_rate: float = 4e-5
max_tokens: int = 16384
loss_fn: str = "importance_sampling" # or "ppo"
# SPIRAL-specific
use_role_baseline: bool = True
role_baseline_ema_gamma: float = 0.95
filter_draw: bool = False
max_draw_retries: int = 5
# FSP (Population-based)
fsp_enabled: bool = False
fsp_pool_size: int = 25
fsp_start_from: int = 0
fsp_update_interval: int = 5
# Async Actor-Learner
use_async_actor_learner: bool = False
replay_buffer_max_staleness: int = 5
See train_spiral_tinker.py for full configuration options.
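To illustrate what replay_buffer_max_staleness controls in the async actor-learner mode, the sketch below drops trajectories collected by a policy more than N updates behind the learner. The class and method names are hypothetical, not the real replay-buffer API:

```python
# Illustrative sketch of staleness-bounded replay for async training.
from collections import deque

class StalenessAwareBuffer:
    """Keep (policy_version, trajectory) pairs; evict pairs whose version
    lags the learner by more than max_staleness updates."""

    def __init__(self, max_staleness: int = 5):
        self.max_staleness = max_staleness
        self.items = deque()  # (policy_version, trajectory), oldest first

    def add(self, policy_version: int, trajectory: str) -> None:
        self.items.append((policy_version, trajectory))

    def sample_fresh(self, learner_version: int) -> list:
        # Evict anything collected too many policy updates ago.
        while self.items and learner_version - self.items[0][0] > self.max_staleness:
            self.items.popleft()
        return [traj for _, traj in self.items]

buf = StalenessAwareBuffer(max_staleness=5)
for v in range(10):
    buf.add(v, f"traj-{v}")
fresh = buf.sample_fresh(learner_version=9)  # keeps versions 4..9
```

A tighter staleness bound keeps training closer to on-policy at the cost of discarding more actor throughput.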
Examples
Training scripts for different model sizes are in examples/:
# Qwen3-4B training
bash examples/qwen3_4b/train.sh
# Qwen3-8B training
bash examples/qwen3_8b/train.sh
# Qwen3-8B with FSP (pool=25)
bash examples/qwen3_8b/train_fsp_pool_25.sh
# Qwen3-8B with async actor-learner
bash examples/qwen3_8b/train_async_actor_learner.sh
# Resume FSP training
bash examples/qwen3_8b/resume_fsp.sh
Supported Environments
From TextArena:
- TicTacToe-v0: Classic tic-tac-toe
- KuhnPoker-v1: Simplified poker variant
- LiarsDice-v1: Bluffing dice game
- SimpleNegotiation-v2: Resource negotiation
- ConnectFour-v0: Connect 4 game
- And more...
Key Algorithms
Role-conditioned Advantage Estimation (RAE)
In self-play, both players' trajectories come from the same policy, but with different roles. RAE computes separate advantages for each role:
# Player 0 advantages
adv_P0 = returns_P0 - baseline_P0
# Player 1 advantages
adv_P1 = returns_P1 - baseline_P1
This prevents conflating the two roles and improves training stability.
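A minimal self-contained sketch of RAE with per-role EMA baselines (matching the role_baseline_ema_gamma setting above; the class and names here are illustrative, not the actual implementation):

```python
class RoleBaseline:
    """Per-role EMA baseline for Role-conditioned Advantage Estimation."""

    def __init__(self, gamma: float = 0.95):
        self.gamma = gamma
        self.baselines = {}  # role -> running baseline, starts at 0.0

    def advantage(self, role: int, ret: float) -> float:
        b = self.baselines.get(role, 0.0)
        # EMA update: b <- gamma * b + (1 - gamma) * ret
        self.baselines[role] = self.gamma * b + (1.0 - self.gamma) * ret
        return ret - b

rae = RoleBaseline(gamma=0.95)
# Player 0 wins (+1), player 1 loses (-1) in a zero-sum game.
adv_p0 = rae.advantage(role=0, ret=1.0)   # 1.0 - 0.0 = 1.0
adv_p1 = rae.advantage(role=1, ret=-1.0)  # -1.0 - 0.0 = -1.0
```

Because each role keeps its own baseline, a systematic first-mover advantage in a game is absorbed by the role-0 baseline instead of appearing as a spurious advantage signal.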
Population-based Self-Play (FSP)
Instead of pure self-play, FSP trains against a pool of historical checkpoints:
- Current policy plays against randomly sampled opponents from the pool
- Pool is updated at regular intervals with new checkpoints
- Provides more diverse training signal and robustness
See spiral/tinker/training/population.py for the implementation.
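The pool mechanics above can be sketched as follows. This is an illustrative stand-in for the PopulationManager, with hypothetical names and checkpoint paths:

```python
# Illustrative sketch of FSP opponent-pool management.
import random

class PopulationPool:
    """Keep up to pool_size checkpoint paths, add a new one every
    update_interval steps, and sample opponents uniformly."""

    def __init__(self, pool_size: int = 25, update_interval: int = 5):
        self.pool_size = pool_size
        self.update_interval = update_interval
        self.checkpoints = []

    def maybe_add(self, step: int, checkpoint_path: str) -> None:
        if step % self.update_interval == 0:
            self.checkpoints.append(checkpoint_path)
            # Evict the oldest checkpoint once the pool is full.
            if len(self.checkpoints) > self.pool_size:
                self.checkpoints.pop(0)

    def sample_opponent(self, current_path: str) -> str:
        # Fall back to pure self-play while the pool is empty.
        if not self.checkpoints:
            return current_path
        return random.choice(self.checkpoints)

pool = PopulationPool(pool_size=3, update_interval=5)
for step in range(0, 30, 5):
    pool.maybe_add(step, f"tinker://ckpt/{step:06d}")
```

Uniform sampling over recent checkpoints keeps opponents diverse without letting the current policy overfit to its own latest quirks.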
Development
Testing
# Run tests
pytest tests/
# Run specific test
pytest tests/test_training.py -k test_population_manager
Linting
# Format code
black spiral/
isort spiral/
# Check
flake8 spiral/
Citation
If you use this code in your research, please cite:
@software{spiral_tinker2025,
title={SPIRAL-on-Tinker: Self-play RL for LLMs},
author={SPIRAL Team},
year={2025},
url={https://github.com/spiral-rl/spiral-on-tinker}
}
License
Apache 2.0 - See LICENSE for details.
Links
- Main SPIRAL Repository: https://github.com/spiral-rl/spiral
- Tinker Platform: https://tinker-docs.thinkingmachines.ai/
- TextArena: https://github.com/LeonGuertler/TextArena
- Documentation: docs/
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.