Enterprise-grade reinforcement learning for large-scale model training.

Enterprise-Grade Reinforcement Learning for Large-Scale Model Training

High-Performance Rollout • Low Precision Training • Production Stability

Latest Updates | Quick Start | Key Features | Documentation


Latest Updates

  • [2026/02] 💡 Miles Detailed Arguments: We've added a detailed guide to the command-line arguments used to configure Miles for RL training and inference. These arguments give precise control over cluster resources, training backends (Megatron/FSDP), inference optimization via SGLang, and RL algorithmic hyperparameters. [Link]
  • [2026/01] 💎 INT4 Quantization-Aware Training (QAT): Inspired by the Kimi K2-Thinking report, Miles now features a full-stack INT4 W4A16 QAT pipeline. This allows 1TB-scale models to fit into single-machine VRAM (e.g., NVIDIA H200), doubling rollout efficiency by eliminating cross-node bottlenecks while maintaining BF16-equivalent accuracy. [Blog]
  • [2026/01] 💎 Unified VLM/LLM Multi-Turn Training: We provide an implementation of the VLM multi-turn sampling paradigm. Developers only need to write a custom rollout function to start multi-turn RL for a VLM, just as they would for an LLM. [Blog]
  • [2026/01] 🤖 Multi-Agent Co-Evolution: Miles now supports MrlX, a novel asynchronous co-evolutionary framework for Multi-Agent RL. Achieve superior performance in complex tasks like Doctor-Patient simulations and DeepResearch pipelines by enabling specialized agents to evolve together symbiotically. [Link]
  • [2025/12] 🔄 Rollout Routing Replay (R3): In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [Paper] [Docs]
  • [2025/11] 🔥 Unified FP8 Release: Solves the stability issues in MoE RL by ensuring training and inference use the exact same FP8 quantization logic. [Blog]
  • [2025/11] Speculative Decoding in RL: Integrated speculative rollout with online SFT for draft models, achieving massive throughput gains. [Blog]
  • [2025/11] 🎉 Miles Project Launch: A joint effort by InfiXAI, Ant Group, SGLang RL Team, and the Miles community. [Announcement]

What is Miles?

Miles is a high-performance, enterprise-ready reinforcement learning (RL) framework optimized for post-training of large-scale models. Built as a fork of slime, Miles bridges the gap between research-grade RL and production-grade reliability by integrating SGLang for high-throughput rollout and Megatron-LM for scalable training.

"A journey of a thousand miles begins with a single rollout." — Miles focuses on the low-level system optimizations that make large-scale RL stable, efficient, and reproducible.


Key Features

🌪️ Advanced MoE & Low-Precision Training

  • Unified FP8 Pipeline: The first framework to implement end-to-end FP8 sampling and training. By unifying precision across rollout and training, Miles eliminates the quantization-induced discrepancy that causes RL collapse in large MoE models.
  • Rollout Routing Replay (R3): Records expert routing decisions during SGLang inference and replays them during training to ensure bit-wise expert alignment.
  • INT4 QAT Support: Recommended for 1TB+ models. It significantly reduces the memory footprint so that such models can be deployed on a single machine (e.g., H200).
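
To make the routing-replay idea concrete, here is a minimal Python sketch. It is not Miles' or SGLang's actual code: it assumes a plain top-k softmax router, and the helper names (`top_k_experts`, `gate_weights`, `route`) and the toy logits are hypothetical. It shows how a tiny numerical difference between the inference and training passes can flip the selected experts, and how replaying the recorded indices keeps both passes on the same experts.

```python
import math


def top_k_experts(router_logits, k):
    """Indices of the k largest logits (ties go to the lower index), sorted."""
    order = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    return sorted(order[:k])


def gate_weights(router_logits, expert_ids):
    """Softmax restricted to the selected experts."""
    m = max(router_logits[i] for i in expert_ids)
    exps = [math.exp(router_logits[i] - m) for i in expert_ids]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k, replay_ids=None):
    """Route one token; if replay_ids is given, reuse the recorded experts."""
    expert_ids = replay_ids if replay_ids is not None else top_k_experts(router_logits, k)
    return expert_ids, gate_weights(router_logits, expert_ids)


# Rollout pass: the inference engine routes each token and records the choice.
rollout_logits = [[2.0, 1.00, 1.01, -1.0], [1.5, 0.2, 1.4, 0.3]]
recorded = [route(logits, k=2)[0] for logits in rollout_logits]

# Training pass: slightly different numerics flip experts 1 and 2 for token 0.
train_logits = [[2.0, 1.01, 1.00, -1.0], [1.5, 0.2, 1.4, 0.3]]
free = [route(logits, k=2)[0] for logits in train_logits]
replayed = [route(logits, k=2, replay_ids=ids)
            for logits, ids in zip(train_logits, recorded)]

print("recorded:", recorded)
print("free    :", free)
print("replayed:", [ids for ids, _ in replayed])
```

In this sketch the gate weights are still recomputed from the training-side logits, so the router can keep learning while the expert selection stays fixed; whether Miles handles the gate weights the same way is an assumption here.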

🛡️ Eliminating Train-Inference Mismatch

  • Bit-wise Identical Training and Inference Log Probs: System-level solution achieving deterministic forward/backward passes through kernel-level optimization (FlashAttention-3, DeepGEMM).
  • Algorithmic Correction (TIS/MIS): When mismatch is unavoidable, Miles provides Truncated Importance Sampling (TIS) and Masked Importance Sampling (MIS) to mitigate off-policy bias and prevent training divergence.
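
The two corrections can be sketched in a few lines of Python. This is an illustrative reading of the terms, not Miles' implementation: it assumes a per-token importance ratio exp(train log-prob minus rollout log-prob), that TIS caps this ratio from above, and that MIS zeroes out tokens whose ratio leaves a trust band. The function names, the thresholds (`clip_max`, `low`, `high`), and the sample log-probs are hypothetical.

```python
import math


def importance_ratios(train_logprobs, rollout_logprobs):
    """Per-token ratio pi_train(token) / pi_rollout(token)."""
    return [math.exp(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]


def tis_weights(ratios, clip_max=2.0):
    """Truncated importance sampling: cap each ratio at clip_max."""
    return [min(r, clip_max) for r in ratios]


def mis_weights(ratios, low=0.5, high=2.0):
    """Masked importance sampling: drop tokens whose ratio leaves [low, high]."""
    return [r if low <= r <= high else 0.0 for r in ratios]


# Toy log-probs for three tokens of one sampled response.
rollout_logprobs = [-1.0, -2.0, -0.5]   # reported by the inference engine
train_logprobs = [-1.0, -1.0, -2.0]     # recomputed by the training backend

ratios = importance_ratios(train_logprobs, rollout_logprobs)
tis = tis_weights(ratios)
mis = mis_weights(ratios)

print("ratios:", [round(r, 3) for r in ratios])
print("TIS   :", [round(w, 3) for w in tis])
print("MIS   :", [round(w, 3) for w in mis])
```

Each weight would multiply the corresponding token's policy-gradient loss term. Truncation keeps every token but bounds its influence, while masking removes badly mismatched tokens from the update entirely.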

⚡ Extreme Performance & Efficiency

  • Speculative RL Training: Achieve 25%+ rollout speedup by using an Online SFT Draft Model. Unlike frozen draft models, Miles updates the draft policy during RL to prevent policy drift.
  • Zero-Copy Weight Sync: Optimized weight refit via CUDA IPC zero-copy mapping, async tensor gathering, and bucketed flattening. Sync time reduced by 50% compared to standard HTTP/RPC transfers.
  • Partial Rollout & Over-Sampling: Handles the "Long-Tail Effect" in multi-turn RL by over-sampling requests and recycling half-finished trajectories to maximize GPU utilization.
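
The over-sampling and recycling described in the last bullet can be pictured with a small simulation. This is a sketch under simplifying assumptions, not Miles' scheduler: every active trajectory gains one token per step, a round ends once enough trajectories have finished, and the `Trajectory` class, `rollout_round` function, and the lengths used are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt_id: int
    target_len: int     # hypothetical: tokens needed before the sample ends
    generated: int = 0  # tokens produced so far, kept across rounds

    @property
    def done(self):
        return self.generated >= self.target_len


def rollout_round(new_prompts, carry_over, needed):
    """Decode until at least `needed` trajectories finish.

    Returns (finished, partial); `partial` keeps its generated tokens so the
    next round can resume it instead of starting over.
    """
    active = carry_over + new_prompts
    finished = []
    while len(finished) < needed and active:
        for traj in active:
            traj.generated += 1  # one decoding step for every active request
        finished += [t for t in active if t.done]
        active = [t for t in active if not t.done]
    return finished, active


# Round 1: over-sample 6 requests although only 4 samples are needed.
lengths = [2, 3, 50, 4, 3, 60]
batch = [Trajectory(i, n) for i, n in enumerate(lengths)]
finished1, partial1 = rollout_round(batch, carry_over=[], needed=4)
progress_after_round1 = [t.generated for t in partial1]

# Round 2: the two long-tail requests resume alongside 4 fresh prompts.
fresh = [Trajectory(6 + i, n) for i, n in enumerate([5, 2, 6, 3])]
finished2, partial2 = rollout_round(fresh, carry_over=partial1, needed=4)

print("round 1 finished:", sorted(t.prompt_id for t in finished1))
print("round 1 partial :", [(t.prompt_id, g) for t, g in zip(partial1, progress_after_round1)])
print("round 2 partial :", [(t.prompt_id, t.generated) for t in partial2])
```

The round ends after 4 steps instead of waiting 60 steps for the longest request, and the two long-tail trajectories keep accumulating tokens in later rounds rather than being discarded.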

Model Support & Training Diversity

🏗️ Supported Models

Miles supports a wide range of state-of-the-art architectures, with a special emphasis on DeepSeek, Qwen, Llama, and other mainstream models.

Family   | Supported Models
DeepSeek | R1, V3, V3.2
Qwen     | Qwen 2, 2.5, 3
Llama    | Llama 3, 3.1, 3.3, 4
Gemma    | Gemma 2, 3, 3N
GLM      | GLM-4.5, GLM-4.6, GLM-4.7
MiniMax  | M2, M2.1
Others   | Mistral, Mixtral, Phi, gpt-oss, and any model supported by SGLang and Megatron

🧩 Diverse Training Scenarios

Miles is designed to handle the complexity of modern RL workloads across various dimensions:

  • Multi-Turn Interaction: Optimized for complex, multi-round conversations and tool-use scenarios.
  • VLM & LLM Support: Unified framework for both Vision-Language and pure Text models.
  • Reasoning & Coding: Specific recipes and optimizations for Reasoning (Math/Logic) and Coding Agent tasks.
  • Multi-Agent Training: Support for advanced co-training and collaborative multi-agent reinforcement learning.

Quick Start

Installation

We recommend using our official Docker image for the best performance and compatibility:

# Pull the latest image
docker pull radixark/miles:latest

# Or install from source
pip install -r requirements.txt
pip install -e .

Launch Training

Miles provides a unified entry point for complex RL tasks. Here is an example of FP8 GRPO training for Qwen3:

python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8

For comprehensive guides on environment setup and custom reward functions, see the Quick Start Guide.
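
As background for the `--advantage-estimator grpo` and `--n-samples-per-prompt` flags in the command above, the following sketch shows the group-relative advantage that GRPO is generally described as using: each reward is normalized by the mean and standard deviation of the samples drawn for the same prompt. It is a standalone illustration, not Miles' implementation; the function name, the `eps` term, the use of the population standard deviation, and the toy rewards are assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, n_samples_per_prompt, eps=1e-6):
    """Normalize rewards within each group of samples that share a prompt."""
    assert len(rewards) % n_samples_per_prompt == 0
    advantages = []
    for start in range(0, len(rewards), n_samples_per_prompt):
        group = rewards[start:start + n_samples_per_prompt]
        mu, sigma = mean(group), pstdev(group)
        # eps avoids division by zero when every sample gets the same reward.
        advantages += [(r - mu) / (sigma + eps) for r in group]
    return advantages


# Two prompts with 4 samples each (the command above uses 8 per prompt).
rewards = [1.0, 0.0, 0.0, 1.0,   # prompt A: mixed outcomes
           0.0, 0.0, 0.0, 0.0]   # prompt B: all samples failed
adv = grpo_advantages(rewards, n_samples_per_prompt=4)
print([round(a, 3) for a in adv])
```

A group in which all samples receive the same reward contributes zero advantage and therefore no gradient signal. If `--rollout-batch-size` counts prompts (an assumption about the flag's meaning), the command above would produce 512 × 8 = 4096 samples per rollout.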


Roadmap

✅ Completed

  • Unified FP8 E2E Training & Rollout
  • INT4 Quantization-Aware Training (QAT): Single-machine 1TB models
  • Speculative RL with Online SFT
  • Multi-Agent RL (Co-evolutionary frameworks like MrlX)
  • Support DeepSeek V3.2 Models
  • VLM Multi-Turn Training
  • Aligning SGLang with Megatron in Dense Models
  • Rollout Routing Replay (R3)

🏗️ In Progress & Planned

  • Zero mismatch for MoE RL
  • Aligning SGLang with Megatron in MoE Models
  • Diffusion RL Support
  • Omni RL Support
  • Diffusion LLM RL Support
  • Elastic Resource Scheduling: Dynamic scaling of rollout vs. training workers

Acknowledgements

Miles stands on the shoulders of giants in the LLM infrastructure ecosystem:

  • slime: The core modular architecture and inspiration.
  • SGLang: The high-performance inference engine.
  • Megatron-LM: Robust large-scale training components.

Special thanks to InfiXAI Team, Ant Group AQ Team, SGLang RL Team, and the Miles Team. We also thank DataCrunch for compute sponsorship and NVIDIA for technical support on Transformer Engine (TE).


Links

Give Miles a ⭐️ Star if it helps your RL journey!
