SPIRAL: Self-Play Reinforcement Learning framework for training LLMs on competitive games

SPIRAL-on-Tinker

Self-play reinforcement learning framework for training language models on competitive games, powered by Tinker.

Overview

SPIRAL-on-Tinker provides a scalable implementation of self-play RL training for LLMs using Tinker's distributed training infrastructure. The system trains models to play competitive two-player zero-sum games, developing reasoning and strategic capabilities through continuous self-improvement.

Key Features

  • Actor-Learner Architecture: Parallel actors sample trajectories while a centralized learner processes them
  • Role-conditioned Advantage Estimation (RAE): Separate advantage calculation for each player role
  • Population-based Self-Play (FSP): Train against historical checkpoints for robust policies
  • Multi-environment Support: TicTacToe, Kuhn Poker, Liars Dice, Simple Negotiation, etc.
  • Async Training: Optional actor-learner decoupling with replay buffer
  • Tinker Integration: Leverage Tinker's LoRA training, vLLM inference, and distributed infrastructure

Installation

From PyPI (Recommended)

# Create environment
conda create -y -n spiral python=3.10
conda activate spiral

# Install with Tinker backend (lightweight, recommended)
pip install spiral-rl[tinker]

# Install with OAT backend (requires GPU)
pip install spiral-rl[oat]

# Install with both backends
pip install spiral-rl[all]

# Install with development tools
pip install spiral-rl[full]

From Source

# Clone repository
git clone https://github.com/spiral-rl/spiral-on-tinker.git
cd spiral-on-tinker

# Create environment
conda create -y -n spiral python=3.10
conda activate spiral

# Install in editable mode
pip install -e .

# Or install with extras
pip install -e ".[tinker]"  # Tinker backend only
pip install -e ".[full]"    # Everything including dev tools

Quick Start

Basic Training

python train_spiral_tinker.py \
    model_name="Qwen/Qwen3-8B-Base" \
    renderer_name=qwen3 \
    env_ids='TicTacToe-v0,KuhnPoker-v1' \
    batch_size=128 \
    learning_rate=4e-5 \
    wandb_project=spiral

Population-based Self-Play (FSP)

python train_spiral_tinker.py \
    model_name="Qwen/Qwen3-8B-Base" \
    env_ids='KuhnPoker-v1,LiarsDice-v1' \
    fsp_enabled=True \
    fsp_pool_size=25 \
    fsp_update_interval=5 \
    wandb_project=spiral

Resume Training with FSP

python train_spiral_tinker.py \
    model_name="Qwen/Qwen3-8B-Base" \
    load_checkpoint_path="tinker://xxx/weights/000180" \
    fsp_resume_checkpoint_base="tinker://xxx/sampler_weights/" \
    fsp_enabled=True \
    fsp_pool_size=25

See RESUME_FSP.md for detailed instructions on resuming FSP training.

Architecture

Package Structure

The codebase uses a modular three-tier architecture:

spiral/
├── core/           # Shared components (used by both backends)
│   ├── envs/      # Custom TextArena game implementations
│   ├── agents/    # Agent implementations (RandomAgent, utils)
│   ├── template.py # Prompt templates for different model types
│   └── utils.py   # Basic utilities (EMA, GameState, extract_boxed_answer)
├── oat/           # OAT-specific implementation (vLLM-based)
│   ├── components.py # SelfPlayCollector, MATHOracle
│   └── metrics.py    # EvaluationMetrics
└── tinker/        # Tinker-specific implementation (imports from spiral.core)
    ├── dataset.py        # SpiralRLDatasetBuilder
    ├── renderer.py       # Prompt rendering with template selection
    ├── utils.py          # Tinker-specific utils (logging, metrics)
    ├── training/         # Training loops and environment management
    │   ├── env.py           # SpiralTwoPlayerEnv, TwoPlayerCoordinator
    │   ├── rollouts.py      # Trajectory collection with draw retry
    │   ├── train.py         # Main training loop factory
    │   ├── train_step.py    # Single training step logic
    │   ├── population.py    # PopulationManager for FSP
    │   └── async_actor_learner/ # Async architecture with replay buffer
    └── eval/          # Evaluation framework
        ├── evaluator.py  # GameEvaluator for game-based evaluation
        └── math_test.py  # Math benchmark evaluation

Key Architecture Points:

  • spiral/core contains all shared components: game environments, agents, templates, and basic utilities
  • spiral/tinker and spiral/oat import from spiral.core (no code duplication)
  • spiral/tinker/utils.py contains Tinker-specific utilities (logging, trajectory metrics, JSON serialization)

Two Training Backends

  • train_spiral.py: OAT backend with vLLM, multi-GPU support, original SPIRAL implementation
  • train_spiral_tinker.py: Tinker backend with distributed training, LoRA, FSP support

Training Pipeline (Tinker Backend)

  1. Environment Setup: SpiralTwoPlayerEnv wraps TextArena games with observation formatters
  2. Self-Play Collection: do_group_rollout() generates trajectories with both players using current policy
  3. Dataset Building: SpiralRLDatasetBuilder processes trajectories into training data
  4. Advantage Estimation: RAE computes separate advantages for each role using role-specific baselines
  5. Policy Updates: Tinker's PPO learner updates policy using collected trajectories
  6. Evaluation: GameEvaluator tracks win rates against opponents, optional math test evaluation
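The six stages above can be sketched end to end. Everything below is simplified stand-in code (stubbed rollouts, placeholder names), not the actual SPIRAL or Tinker APIs:

```python
# Illustrative sketch of the pipeline stages; all names here are
# simplified stand-ins, not the real SPIRAL/Tinker interfaces.
import random

def collect_selfplay_games(policy, num_games):
    """Stages 1-2: both players share the current policy (stubbed here)."""
    games = []
    for _ in range(num_games):
        # Zero-sum outcome: +1 win for player 0, -1 loss, 0 draw.
        r0 = random.choice([1, 0, -1])
        games.append({"returns": {0: r0, 1: -r0}})
    return games

def build_dataset(games):
    """Stage 3: flatten games into (role, return) training examples."""
    return [(role, g["returns"][role]) for g in games for role in (0, 1)]

def role_advantages(dataset, baselines):
    """Stage 4: RAE -- subtract a per-role baseline from each return."""
    return [(role, ret - baselines[role]) for role, ret in dataset]

random.seed(0)
policy = object()  # placeholder for the LoRA policy
games = collect_selfplay_games(policy, num_games=4)
dataset = build_dataset(games)
advs = role_advantages(dataset, baselines={0: 0.0, 1: 0.0})
# Stage 5 would feed `advs` to the PPO / importance-sampling learner;
# stage 6 evaluates the updated checkpoint against fixed opponents.
print(len(advs))  # one example per role per game
```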

Configuration

Key training parameters:

# Model settings
model_name: str = "Qwen/Qwen3-8B-Base"
renderer_name: str = "qwen3"  # qwen3, llama3, deepseek, etc.
lora_rank: int = 64

# Training
batch_size: int = 128
learning_rate: float = 4e-5
max_tokens: int = 16384
loss_fn: str = "importance_sampling"  # or "ppo"

# SPIRAL-specific
use_role_baseline: bool = True
role_baseline_ema_gamma: float = 0.95
filter_draw: bool = False
max_draw_retries: int = 5

# FSP (Population-based)
fsp_enabled: bool = False
fsp_pool_size: int = 25
fsp_start_from: int = 0
fsp_update_interval: int = 5

# Async Actor-Learner
use_async_actor_learner: bool = False
replay_buffer_max_staleness: int = 5
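As a rough sketch of how `replay_buffer_max_staleness` might gate async training data, the hypothetical buffer below drops trajectories whose collecting policy version lags too far behind the learner. This is illustrative only, not the repository's implementation:

```python
# Hypothetical staleness filter for the async actor-learner setup.
from collections import deque

class ReplayBuffer:
    def __init__(self, max_staleness=5):
        self.max_staleness = max_staleness
        self.items = deque()  # (policy_version, trajectory) pairs

    def add(self, policy_version, trajectory):
        self.items.append((policy_version, trajectory))

    def sample_fresh(self, current_version):
        # Keep only trajectories collected within `max_staleness`
        # policy updates of the learner's current version.
        fresh = [(v, t) for v, t in self.items
                 if current_version - v <= self.max_staleness]
        self.items = deque(fresh)
        return [t for _, t in fresh]

buf = ReplayBuffer(max_staleness=5)
for v in range(10):
    buf.add(policy_version=v, trajectory=f"traj-{v}")
print(buf.sample_fresh(current_version=10))  # versions 5..9 survive
```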

See train_spiral_tinker.py for full configuration options.

Examples

Training scripts for different model sizes are in examples/:

# Qwen3-4B training
bash examples/qwen3_4b/train.sh

# Qwen3-8B training
bash examples/qwen3_8b/train.sh

# Qwen3-8B with FSP (pool=25)
bash examples/qwen3_8b/train_fsp_pool_25.sh

# Qwen3-8B with async actor-learner
bash examples/qwen3_8b/train_async_actor_learner.sh

# Resume FSP training
bash examples/qwen3_8b/resume_fsp.sh

Supported Environments

From TextArena:

  • TicTacToe-v0: Classic tic-tac-toe
  • KuhnPoker-v1: Simplified poker variant
  • LiarsDice-v1: Bluffing dice game
  • SimpleNegotiation-v2: Resource negotiation
  • ConnectFour-v0: Connect 4 game
  • And more...

Key Algorithms

Role-conditioned Advantage Estimation (RAE)

In self-play, both players' trajectories come from the same policy, but with different roles. RAE computes separate advantages for each role:

# Player 0 advantages
adv_P0 = returns_P0 - baseline_P0

# Player 1 advantages
adv_P1 = returns_P1 - baseline_P1

This prevents conflating the two roles and improves training stability.
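A minimal way to maintain those per-role baselines is an exponential moving average over each role's returns, mirroring the `use_role_baseline` and `role_baseline_ema_gamma` settings above. The class below is a sketch, not the actual `spiral.core` code:

```python
# Hedged sketch of role-conditioned EMA baselines; illustrative only.

class RoleBaseline:
    """Per-role EMA of returns, used as the advantage baseline."""
    def __init__(self, gamma=0.95):
        self.gamma = gamma
        self.values = {0: 0.0, 1: 0.0}

    def update(self, role, ret):
        v = self.values[role]
        self.values[role] = self.gamma * v + (1 - self.gamma) * ret

    def advantage(self, role, ret):
        return ret - self.values[role]

rae = RoleBaseline(gamma=0.95)
# Suppose player 0 has been winning (+1) and player 1 losing (-1):
for _ in range(50):
    rae.update(0, 1.0)
    rae.update(1, -1.0)

# A draw (return 0) now yields a negative advantage for player 0 and a
# positive one for player 1, because each role is judged against its
# own running baseline rather than a shared one.
print(rae.advantage(0, 0.0), rae.advantage(1, 0.0))
```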

Population-based Self-Play (FSP)

Instead of pure self-play, FSP trains against a pool of historical checkpoints:

  • Current policy plays against randomly sampled opponents from the pool
  • Pool is updated at regular intervals with new checkpoints
  • Provides more diverse training signal and robustness
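The pool mechanics above can be sketched with a bounded checkpoint queue; the names and details below are hypothetical, chosen only to mirror the `fsp_pool_size` and `fsp_update_interval` settings:

```python
# Illustrative FSP opponent pool; see PopulationManager in the repo
# for the real logic.
import random
from collections import deque

class CheckpointPool:
    def __init__(self, pool_size=25, update_interval=5):
        self.pool = deque(maxlen=pool_size)  # oldest checkpoints evicted
        self.update_interval = update_interval

    def maybe_add(self, step, checkpoint_path):
        # Snapshot the current policy every `update_interval` steps.
        if step % self.update_interval == 0:
            self.pool.append(checkpoint_path)

    def sample_opponent(self):
        # Fall back to pure self-play until the pool is non-empty.
        return random.choice(self.pool) if self.pool else "current"

pool = CheckpointPool(pool_size=3, update_interval=5)
for step in range(21):
    pool.maybe_add(step, f"ckpt/{step:06d}")
print(list(pool.pool))  # only the 3 most recent snapshots survive
```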

See spiral/tinker/training/population.py for the implementation.

Development

Testing

# Run tests
pytest tests/

# Run specific test
pytest tests/test_training.py -k test_population_manager

Linting

# Format code
black spiral/
isort spiral/

# Check
flake8 spiral/

Citation

If you use this code in your research, please cite:

@software{spiral_tinker2025,
  title={SPIRAL-on-Tinker: Self-play RL for LLMs},
  author={SPIRAL Team},
  year={2025},
  url={https://github.com/spiral-rl/spiral-on-tinker}
}

License

Apache 2.0 - See LICENSE for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Download files

Download the file for your platform.

Source Distribution

spiral_rl-0.2.0.tar.gz (96.4 kB)

Uploaded Source

Built Distribution


spiral_rl-0.2.0-py3-none-any.whl (102.3 kB)

Uploaded Python 3

File details

Details for the file spiral_rl-0.2.0.tar.gz.

File metadata

  • Download URL: spiral_rl-0.2.0.tar.gz
  • Upload date:
  • Size: 96.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for spiral_rl-0.2.0.tar.gz:

  • SHA256: c49f1b2a6728c27dcbebd5d05d964c9130bc71e763490bb745c5c62eadb34438
  • MD5: c7e1bd581943d3d3b23f9b6a0bafb48e
  • BLAKE2b-256: 847e59fd6906bb38d3ae4361bf3fba1469536b43e8abc2aad947d3a1e8d714ce


File details

Details for the file spiral_rl-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: spiral_rl-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 102.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for spiral_rl-0.2.0-py3-none-any.whl:

  • SHA256: 74b3a4cb96078c138dc3148311aac03ee4a607a057f2d2fa7b3b7b4e682627af
  • MD5: dca27e1dc48b7fe49a26f15378ace487
  • BLAKE2b-256: 2dbd5289210fb82c5a1c3a4f3abe858bbbd3fbec8269390c12b00f88110ec32b

