SPIRAL: Self-Play Reinforcement Learning framework for training LLMs on competitive games

SPIRAL-on-Tinker

Self-play reinforcement learning framework for training language models on competitive games, powered by Tinker.

Overview

SPIRAL-on-Tinker provides a scalable implementation of self-play RL training for LLMs using Tinker's distributed training infrastructure. The system trains models to play competitive two-player zero-sum games, developing reasoning and strategic capabilities through continuous self-improvement.

Key Features

  • Actor-Learner Architecture: Parallel actors sample trajectories while a centralized learner processes them
  • Role-conditioned Advantage Estimation (RAE): Separate advantage calculation for each player role
  • Population-based Self-Play (FSP): Train against historical checkpoints for robust policies
  • Multi-environment Support: TicTacToe, Kuhn Poker, Liars Dice, Simple Negotiation, etc.
  • Async Training: Optional actor-learner decoupling with replay buffer
  • Tinker Integration: Leverage Tinker's LoRA training, vLLM inference, and distributed infrastructure

Installation

From PyPI (Recommended)

# Create environment
conda create -y -n spiral python=3.10
conda activate spiral

# Install with Tinker backend (lightweight, recommended)
pip install spiral-rl[tinker]

# Install with OAT backend (requires GPU)
pip install spiral-rl[oat]

# Install with both backends
pip install spiral-rl[all]

# Install with development tools
pip install spiral-rl[full]

From Source

# Clone repository
git clone https://github.com/spiral-rl/spiral-on-tinker.git
cd spiral-on-tinker

# Create environment
conda create -y -n spiral python=3.10
conda activate spiral

# Install in editable mode
pip install -e .

# Or install with extras
pip install -e ".[tinker]"  # Tinker backend only
pip install -e ".[full]"    # Everything including dev tools

Quick Start

Basic Training

python train_spiral_tinker.py \
    model_name="Qwen/Qwen3-8B-Base" \
    renderer_name=qwen3 \
    env_ids='TicTacToe-v0,KuhnPoker-v1' \
    batch_size=128 \
    learning_rate=4e-5 \
    wandb_project=spiral

Population-based Self-Play (FSP)

python train_spiral_tinker.py \
    model_name="Qwen/Qwen3-8B-Base" \
    env_ids='KuhnPoker-v1,LiarsDice-v1' \
    fsp_enabled=True \
    fsp_pool_size=25 \
    fsp_update_interval=5 \
    wandb_project=spiral

Resume Training with FSP

python train_spiral_tinker.py \
    model_name="Qwen/Qwen3-8B-Base" \
    load_checkpoint_path="tinker://xxx/weights/000180" \
    fsp_resume_checkpoint_base="tinker://xxx/sampler_weights/" \
    fsp_enabled=True \
    fsp_pool_size=25

See RESUME_FSP.md for detailed instructions on resuming FSP training.

Architecture

Package Structure

The codebase uses a modular three-tier architecture:

spiral/
├── core/           # Shared components (used by both backends)
│   ├── envs/      # Custom TextArena game implementations
│   ├── agents/    # Agent implementations (RandomAgent, utils)
│   ├── template.py # Prompt templates for different model types
│   └── utils.py   # Basic utilities (EMA, GameState, extract_boxed_answer)
├── oat/           # OAT-specific implementation (vLLM-based)
│   ├── components.py # SelfPlayCollector, MATHOracle
│   └── metrics.py    # EvaluationMetrics
└── tinker/        # Tinker-specific implementation (imports from spiral.core)
    ├── dataset.py        # SpiralRLDatasetBuilder
    ├── renderer.py       # Prompt rendering with template selection
    ├── utils.py          # Tinker-specific utils (logging, metrics)
    ├── training/         # Training loops and environment management
    │   ├── env.py           # SpiralTwoPlayerEnv, TwoPlayerCoordinator
    │   ├── rollouts.py      # Trajectory collection with draw retry
    │   ├── train.py         # Main training loop factory
    │   ├── train_step.py    # Single training step logic
    │   ├── population.py    # PopulationManager for FSP
    │   └── async_actor_learner/ # Async architecture with replay buffer
    └── eval/          # Evaluation framework
        ├── evaluator.py  # GameEvaluator for game-based evaluation
        └── math_test.py  # Math benchmark evaluation

Key Architecture Points:

  • spiral/core contains all shared components: game environments, agents, templates, and basic utilities
  • spiral/tinker and spiral/oat import from spiral.core (no code duplication)
  • spiral/tinker/utils.py contains Tinker-specific utilities (logging, trajectory metrics, JSON serialization)

Two Training Backends

  • train_spiral.py: OAT backend with vLLM, multi-GPU support, original SPIRAL implementation
  • train_spiral_tinker.py: Tinker backend with distributed training, LoRA, FSP support

Training Pipeline (Tinker Backend)

  1. Environment Setup: SpiralTwoPlayerEnv wraps TextArena games with observation formatters
  2. Self-Play Collection: do_group_rollout() generates trajectories with both players using current policy
  3. Dataset Building: SpiralRLDatasetBuilder processes trajectories into training data
  4. Advantage Estimation: RAE computes separate advantages for each role using role-specific baselines
  5. Policy Updates: Tinker's PPO learner updates policy using collected trajectories
  6. Evaluation: GameEvaluator tracks win rates against opponents, optional math test evaluation
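The six stages above can be sketched end to end. Everything below is simplified stand-in code (stubbed rollouts, placeholder names), not the actual SPIRAL or Tinker APIs:

```python
# Illustrative sketch of the pipeline stages; all names here are
# simplified stand-ins, not the real SPIRAL/Tinker interfaces.
import random

def collect_selfplay_games(policy, num_games):
    """Stages 1-2: both players share the current policy (stubbed here)."""
    games = []
    for _ in range(num_games):
        # Zero-sum outcome: +1 win for player 0, -1 loss, 0 draw.
        r0 = random.choice([1, 0, -1])
        games.append({"returns": {0: r0, 1: -r0}})
    return games

def build_dataset(games):
    """Stage 3: flatten games into (role, return) training examples."""
    return [(role, g["returns"][role]) for g in games for role in (0, 1)]

def role_advantages(dataset, baselines):
    """Stage 4: RAE -- subtract a per-role baseline from each return."""
    return [(role, ret - baselines[role]) for role, ret in dataset]

random.seed(0)
policy = object()  # placeholder for the LoRA policy
games = collect_selfplay_games(policy, num_games=4)
dataset = build_dataset(games)
advs = role_advantages(dataset, baselines={0: 0.0, 1: 0.0})
# Stage 5 would feed `advs` to the PPO / importance-sampling learner;
# stage 6 evaluates the updated checkpoint against fixed opponents.
print(len(advs))  # one example per role per game
```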

Configuration

Key training parameters:

# Model settings
model_name: str = "Qwen/Qwen3-8B-Base"
renderer_name: str = "qwen3"  # qwen3, llama3, deepseek, etc.
lora_rank: int = 64

# Training
batch_size: int = 128
learning_rate: float = 4e-5
max_tokens: int = 16384
loss_fn: str = "importance_sampling"  # or "ppo"

# SPIRAL-specific
use_role_baseline: bool = True
role_baseline_ema_gamma: float = 0.95
filter_draw: bool = False
max_draw_retries: int = 5

# FSP (Population-based)
fsp_enabled: bool = False
fsp_pool_size: int = 25
fsp_start_from: int = 0
fsp_update_interval: int = 5

# Async Actor-Learner
use_async_actor_learner: bool = False
replay_buffer_max_staleness: int = 5
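As a rough sketch of how `replay_buffer_max_staleness` might gate async training data, the hypothetical buffer below drops trajectories whose collecting policy version lags too far behind the learner. This is illustrative only, not the repository's implementation:

```python
# Hypothetical staleness filter for the async actor-learner setup.
from collections import deque

class ReplayBuffer:
    def __init__(self, max_staleness=5):
        self.max_staleness = max_staleness
        self.items = deque()  # (policy_version, trajectory) pairs

    def add(self, policy_version, trajectory):
        self.items.append((policy_version, trajectory))

    def sample_fresh(self, current_version):
        # Keep only trajectories collected within `max_staleness`
        # policy updates of the learner's current version.
        fresh = [(v, t) for v, t in self.items
                 if current_version - v <= self.max_staleness]
        self.items = deque(fresh)
        return [t for _, t in fresh]

buf = ReplayBuffer(max_staleness=5)
for v in range(10):
    buf.add(policy_version=v, trajectory=f"traj-{v}")
print(buf.sample_fresh(current_version=10))  # versions 5..9 survive
```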

See train_spiral_tinker.py for full configuration options.

Examples

Training scripts for different model sizes are in examples/:

# Qwen3-4B training
bash examples/qwen3_4b/train.sh

# Qwen3-8B training
bash examples/qwen3_8b/train.sh

# Qwen3-8B with FSP (pool=25)
bash examples/qwen3_8b/train_fsp_pool_25.sh

# Qwen3-8B with async actor-learner
bash examples/qwen3_8b/train_async_actor_learner.sh

# Resume FSP training
bash examples/qwen3_8b/resume_fsp.sh

Supported Environments

From TextArena:

  • TicTacToe-v0: Classic tic-tac-toe
  • KuhnPoker-v1: Simplified poker variant
  • LiarsDice-v1: Bluffing dice game
  • SimpleNegotiation-v2: Resource negotiation
  • ConnectFour-v0: Connect 4 game
  • And more...

Key Algorithms

Role-conditioned Advantage Estimation (RAE)

In self-play, both players' trajectories come from the same policy, but with different roles. RAE computes separate advantages for each role:

# Player 0 advantages
adv_P0 = returns_P0 - baseline_P0

# Player 1 advantages
adv_P1 = returns_P1 - baseline_P1

This prevents conflating the two roles and improves training stability.
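A minimal way to maintain those per-role baselines is an exponential moving average over each role's returns, mirroring the `use_role_baseline` and `role_baseline_ema_gamma` settings above. The class below is a sketch, not the actual `spiral.core` code:

```python
# Hedged sketch of role-conditioned EMA baselines; illustrative only.

class RoleBaseline:
    """Per-role EMA of returns, used as the advantage baseline."""
    def __init__(self, gamma=0.95):
        self.gamma = gamma
        self.values = {0: 0.0, 1: 0.0}

    def update(self, role, ret):
        v = self.values[role]
        self.values[role] = self.gamma * v + (1 - self.gamma) * ret

    def advantage(self, role, ret):
        return ret - self.values[role]

rae = RoleBaseline(gamma=0.95)
# Suppose player 0 has been winning (+1) and player 1 losing (-1):
for _ in range(50):
    rae.update(0, 1.0)
    rae.update(1, -1.0)

# A draw (return 0) now yields a negative advantage for player 0 and a
# positive one for player 1, because each role is judged against its
# own running baseline rather than a shared one.
print(rae.advantage(0, 0.0), rae.advantage(1, 0.0))
```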

Population-based Self-Play (FSP)

Instead of pure self-play, FSP trains against a pool of historical checkpoints:

  • Current policy plays against randomly sampled opponents from the pool
  • Pool is updated at regular intervals with new checkpoints
  • Provides more diverse training signal and robustness
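The pool mechanics above can be sketched with a bounded checkpoint queue; the names and details below are hypothetical, chosen only to mirror the `fsp_pool_size` and `fsp_update_interval` settings:

```python
# Illustrative FSP opponent pool; see PopulationManager in the repo
# for the real logic.
import random
from collections import deque

class CheckpointPool:
    def __init__(self, pool_size=25, update_interval=5):
        self.pool = deque(maxlen=pool_size)  # oldest checkpoints evicted
        self.update_interval = update_interval

    def maybe_add(self, step, checkpoint_path):
        # Snapshot the current policy every `update_interval` steps.
        if step % self.update_interval == 0:
            self.pool.append(checkpoint_path)

    def sample_opponent(self):
        # Fall back to pure self-play until the pool is non-empty.
        return random.choice(self.pool) if self.pool else "current"

pool = CheckpointPool(pool_size=3, update_interval=5)
for step in range(21):
    pool.maybe_add(step, f"ckpt/{step:06d}")
print(list(pool.pool))  # only the 3 most recent snapshots survive
```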

See spiral/tinker/training/population.py for the implementation.

Development

Testing

# Run tests
pytest tests/

# Run specific test
pytest tests/test_training.py -k test_population_manager

Linting

# Format code
black spiral/
isort spiral/

# Check
flake8 spiral/

Citation

If you use this code in your research, please cite:

@software{spiral_tinker2025,
  title={SPIRAL-on-Tinker: Self-play RL for LLMs},
  author={SPIRAL Team},
  year={2025},
  url={https://github.com/spiral-rl/spiral-on-tinker}
}

License

Apache 2.0 - See LICENSE for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Download files

Download the file for your platform.

Source Distribution

spiral_rl-0.2.0.tar.gz (96.4 kB)

Uploaded Source

Built Distribution


spiral_rl-0.2.0-py3-none-any.whl (102.3 kB)

Uploaded Python 3

File details

Details for the file spiral_rl-0.2.0.tar.gz.

File metadata

  • Download URL: spiral_rl-0.2.0.tar.gz
  • Upload date:
  • Size: 96.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for spiral_rl-0.2.0.tar.gz:

  • SHA256: c49f1b2a6728c27dcbebd5d05d964c9130bc71e763490bb745c5c62eadb34438
  • MD5: c7e1bd581943d3d3b23f9b6a0bafb48e
  • BLAKE2b-256: 847e59fd6906bb38d3ae4361bf3fba1469536b43e8abc2aad947d3a1e8d714ce


File details

Details for the file spiral_rl-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: spiral_rl-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 102.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for spiral_rl-0.2.0-py3-none-any.whl:

  • SHA256: 74b3a4cb96078c138dc3148311aac03ee4a607a057f2d2fa7b3b7b4e682627af
  • MD5: dca27e1dc48b7fe49a26f15378ace487
  • BLAKE2b-256: 2dbd5289210fb82c5a1c3a4f3abe858bbbd3fbec8269390c12b00f88110ec32b

