
LightRFT


Light, Efficient, Omni-modal & Reward-model Driven Reinforcement Fine-Tuning Framework


English | 简体中文


📖 Introduction

LightRFT (Light Reinforcement Fine-Tuning) is a reinforcement learning fine-tuning framework for Large Language Models (LLMs) and Vision-Language Models (VLMs). It provides efficient, scalable RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) training, and supports multiple state-of-the-art algorithms and distributed training strategies.

✨ Key Features

  • 🚀 High-Performance Inference Engines

    • Integrated vLLM and SGLang for efficient sampling and inference
    • FP8 inference optimization for significantly reduced latency and memory usage
    • Flexible engine sleep/wake mechanisms for optimal resource utilization
  • 🧠 Rich Algorithm Ecosystem

    • Policy Optimization: GRPO, GSPO, GMPO, Dr.GRPO
    • Advantage Estimation: REINFORCE++, CPGD
    • Reward Processing: Reward Norm/Clip
    • Sampling Strategy: FIRE Sampling, Token-Level Policy
    • Stability Enhancement: DAPO, select_high_entropy_tokens
  • 🔧 Flexible Training Strategies

    • FSDP (Fully Sharded Data Parallel) v2 support
    • DeepSpeed ZeRO (Stage 1/2/3) support
    • Gradient checkpointing and mixed precision training (BF16/FP16)
    • Adam Offload and memory optimization techniques
  • 🎯 Innovative Resource Collaboration

    • Colocate Anything: Co-locate reward models with training models to maximize GPU utilization (see the flag sketch after this feature list)
      • Support multiple reward models for parallel inference on the same device
      • Dynamic memory management with automatic training/inference phase switching
      • Reduced cross-device communication overhead for improved end-to-end training efficiency
    • Balance Anything 🚧 (Under Development): Intelligent load balancing system
      • Adaptive task scheduling and resource allocation
      • Automatic load balancing for multi-node training
      • Performance optimization for heterogeneous hardware environments
  • 🌐 Comprehensive Multimodal Support

    • Native Vision-Language Model (VLM) Training
      • Support for mainstream VLMs like Qwen-VL
      • Parallel processing of multimodal image-text data
      • Efficient multimodal tokenization and batching
    • Multimodal Reward Modeling
      • Support for multiple visual reward models working in collaboration
      • Joint optimization of image understanding and text generation
    • Complete Vision-Language Alignment Training Pipeline
      • Optimized for multimodal RLVR/RLHF training
      • Built-in support for vision-language model fine-tuning
  • 📊 Complete Experimental Toolkit

    • Weights & Biases (W&B) integration
    • Math capability benchmarking (GSM8K, Geo3K, etc.)
    • Trajectory saving and analysis tools
    • Automatic checkpoint management
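
As a flag-level sketch of the resource collaboration described above, the following combines the engine-related options documented under "Key Configuration Parameters" below; exact values depend on your hardware, so treat this as illustrative rather than canonical:

--rm_use_engine                   # route reward model inference through vLLM/SGLang
--engine_mem_util 0.4             # cap engine memory so training and inference can share a GPU
--enable_engine_sleep             # release engine memory during the training phase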

🎯 Supported Algorithms

For detailed algorithm descriptions, implementation details, and usage guides, see the Algorithm Documentation.

Algorithm     | Type                 | Key Improvement                                | Paper
------------- | -------------------- | ---------------------------------------------- | ----------------
GRPO          | Policy Optimization  | Group-normalized advantage estimation          | arXiv:2402.03300
GSPO          | Policy Optimization  | Sequence-level importance ratios and clipping  | arXiv:2507.18071
GMPO (WIP)    | Policy Optimization  | Geometric-mean policy optimization             | arXiv:2507.20673
Dr.GRPO       | Policy Optimization  | Length-bias mitigation                         | arXiv:2503.20783
DAPO          | Policy Optimization  | Decoupled clip and dynamic sampling            | arXiv:2503.14476
REINFORCE++   | Advantage Estimation | Improved baseline estimation                   | arXiv:2501.03262
CPGD          | Advantage Estimation | KL-based drift constraint                      | arXiv:2505.12504
FIRE Sampling | Sampling Strategy    | High-temperature first-token initiation        | arXiv:2410.21236
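
As a concrete example of the first entry, GRPO dispenses with a learned critic and instead standardizes rewards within a group of G responses sampled from the same prompt (notation follows the common presentation of arXiv:2402.03300):

A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)

Each response's advantage A_i is its reward relative to its own group, which appears to correspond to the group_norm advantage estimator in the configuration section below.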

🚀 Quick Start

Requirements

  • Python >= 3.10
  • CUDA >= 12.8
  • PyTorch >= 2.5.1

Docker Images

TO BE DONE

Installation

Clone and install LightRFT:

# Clone the repository
git clone https://github.com/opendilab/LightRFT.git
cd LightRFT

# Install dependencies
pip install -r requirements.txt

# Install LightRFT
pip install -e .
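
A quick way to confirm the installation succeeded is an import check (this assumes the package imports cleanly without a GPU, which is typical but not guaranteed):

# Verify the package is importable
python -c "import lightrft; print('LightRFT installed successfully')"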

📚 Usage Guide

Basic Example: GRPO Training

# Single node, 8 GPU training example
cd LightRFT

# Run GRPO training (GSM8K math reasoning task)
bash examples/gsm8k_geo3k/run_grpo_gsm8k_qwen2.5_0.5b.sh

# Or run Geo3K geometry problem training (VLM multimodal)
bash examples/gsm8k_geo3k/run_grpo_geo3k_qwen2.5_vl_7b.sh

🏗️ Project Structure

LightRFT/
├── lightrft/                      # Core library
│   ├── strategy/                  # Training & inference strategies
│   │   ├── fsdp/                  # FSDP implementation
│   │   ├── deepspeed/             # DeepSpeed implementation
│   │   ├── vllm_utils/            # vLLM utilities
│   │   └── sglang_utils/          # SGLang utilities
│   ├── models/                    # Model definitions
│   │   ├── actor_language.py      # Language model actor
│   │   ├── actor_vl.py            # Vision-language model actor
│   │   └── monkey_patch/          # Model adaptation patches
│   ├── trainer/                   # Trainer implementations
│   │   ├── ppo_trainer.py         # PPO trainer
│   │   ├── ppo_trainer_vl.py      # VLM PPO trainer
│   │   ├── fast_exp_maker.py      # Experience generator
│   │   ├── experience_maker.py    # Base experience generator
│   │   ├── experience_maker_vl.py # VLM experience generator
│   │   └── spmd_ppo_trainer.py    # SPMD PPO trainer
│   ├── datasets/                  # Dataset processing
│   └── utils/                     # Utility functions
│       └── ckpt_scripts/          # Checkpoint processing scripts
│
├── examples/                      # Usage examples
│   ├── gsm8k_geo3k/               # GSM8K/Geo3K math reasoning training examples
│   ├── grm_training/              # Generative reward model training examples
│   ├── srm_training/              # Scalar reward model training examples
│   └── chat/                      # Model dialogue examples
│
├── docs/                          # 📚 Sphinx documentation
│   └── source/
│       ├── installation/          # Installation guides
│       ├── quick_start/           # Quick start & user guides
│       │   ├── algorithms.md      # Algorithm documentation (English)
│       │   ├── algorithms_cn.md   # Algorithm documentation (Chinese)
│       │   └── configuration.md   # Configuration reference
│       └── best_practice/         # Best practices & resources
│           ├── strategy_usage.rst   # Training strategy usage (English)
│           ├── strategy_usage_zh.md # Training strategy usage (Chinese)
│           ├── faq.md              # Frequently asked questions
│           ├── troubleshooting.md  # Troubleshooting guide
│           └── contributing.md     # Contribution guidelines
│
├── assets/                        # Assets
│   └── logo.png                   # Project logo
│
├── results/                       # Training results
├── rft_logs/                      # Training logs
└── README.md                      # Project documentation

🔑 Key Directory Descriptions

  • lightrft/: LightRFT core library, providing training strategies, model definitions, and trainer implementations
  • examples/: Complete training examples and scripts
    • gsm8k_geo3k/: GSM8K and Geo3K math reasoning training examples
    • grm_training/: Generative reward model training examples
    • srm_training/: Scalar reward model training examples
    • chat/: Model dialogue examples
  • docs/: Sphinx documentation with complete user guides and API documentation

⚙️ Key Configuration Parameters

Batch Size Configuration

TBS=128                           # Training batch size
RBS=128                           # Rollout batch size
micro_train_batch_size=1          # Micro batch size per GPU
micro_rollout_batch_size=2        # Rollout micro batch size
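
As in most RLHF frameworks, the global training batch is presumably realized through gradient accumulation over per-GPU micro batches; the sketch below shows the usual arithmetic (NUM_GPUS is an illustrative variable here, and you should confirm the exact relationship in the training scripts):

# Common convention: accumulation steps = global batch / (micro batch x world size)
NUM_GPUS=8
ACC_STEPS=$((TBS / (micro_train_batch_size * NUM_GPUS)))   # 128 / (1 * 8) = 16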

Algorithm Parameters

--advantage_estimator group_norm  # Advantage estimator: group_norm, reinforce, cpgd
--n_samples_per_prompt 8          # Number of samples per prompt
--max_epochs 1                    # Training epochs per episode
--num_episodes 3                  # Total training episodes
--kl_estimator k3                 # KL estimator type
--init_kl_coef 0.001              # KL penalty coefficient

Distributed Training

--fsdp                            # Enable FSDP
--zero_stage 3                    # DeepSpeed ZeRO Stage
--gradient_checkpointing          # Gradient checkpointing
--adam_offload                    # Adam optimizer offload
--bf16                            # BF16 mixed precision

Inference Engine

--rm_use_engine                   # Use inference engine (vLLM/SGLang)
--engine_mem_util 0.4             # Engine memory utilization
--engine_tp_size 1                # Engine tensor parallelism degree
--enable_engine_sleep             # Enable engine sleep mechanism
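
Putting the four groups together, here is a hedged sketch of a full launch command; the entry point below is hypothetical, so copy the real invocation (and the required model/data arguments) from the scripts under examples/gsm8k_geo3k/:

# Illustrative only: every flag is documented above, but the module
# "lightrft.train" is a placeholder for the real entry point in examples/.
python -m lightrft.train \
    --advantage_estimator group_norm \
    --n_samples_per_prompt 8 \
    --num_episodes 3 \
    --init_kl_coef 0.001 \
    --fsdp \
    --gradient_checkpointing \
    --bf16 \
    --rm_use_engine \
    --engine_mem_util 0.4 \
    --enable_engine_sleep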

🔧 Troubleshooting

See the training scripts under examples/ for the detailed parameter validation logic.

1. OOM (Out of Memory)

Solutions:

  • Reduce micro_train_batch_size and micro_rollout_batch_size
  • Enable --gradient_checkpointing
  • Lower --engine_mem_util
  • Use ZeRO Stage 3
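
A minimal sketch of those mitigations applied together (the values are starting points to tune from, not authoritative recommendations):

# Memory-saving overrides for an OOM run
micro_train_batch_size=1          # smallest per-GPU micro batch
micro_rollout_batch_size=1        # reduced from the 2 shown above
--gradient_checkpointing          # trade compute for activation memory
--engine_mem_util 0.3             # lowered from the 0.4 shown above
--zero_stage 3                    # full parameter/optimizer sharding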

2. Training Instability

Solutions:

  • Enable Reward Normalization: --normalize_reward
  • Lower learning rate
  • Use --advantage_estimator group_norm
  • Try DAPO algorithm
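
A corresponding sketch for a more stable run (this document does not list a learning-rate flag, so consult your training script for that one):

# Stability-oriented overrides
--normalize_reward                # reward normalization, as suggested above
--advantage_estimator group_norm  # group-normalized advantages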


📖 Documentation


Build Documentation Locally

Install documentation dependencies:

pip install -r requirements-doc.txt

Generate HTML documentation:

make docs
# Open docs/build/index.html to view documentation

Live documentation preview:

make docs-live
# Visit http://localhost:8000

🤝 Contributing

We welcome community contributions! Please follow these steps:

  1. Fork this repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Code Standards

# Install development dependencies
pip install -r requirements-dev.txt

# Code formatting (YAPF)
yapf -i -r lightrft/

# Code linting (Pylint)
pylint lightrft/

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


🙏 Acknowledgments

LightRFT is built on OpenRLHF, and some files and implementations in this project are adapted and reused from it. We extend our sincere gratitude to the OpenRLHF team for their excellent work.

Collaboration

This project is developed in collaboration with colleagues from the System Platform Center and Safe and Trustworthy AI Center at Shanghai AI Laboratory. We sincerely thank them for their contributions and support.

Open Source Dependencies

This project builds upon the following outstanding open-source projects (including but not limited to):

  • OpenRLHF, verl - Core RL framework foundation (several key components adapted and reused)
  • vLLM - High-performance inference engine
  • SGLang - Structured generation language runtime
  • DeepSpeed - Distributed training optimization
  • PyTorch FSDP - Fully Sharded Data Parallel

Thanks to all contributors and supporters!


🗓️ Roadmap

We are actively working on the following improvements and features:

Core Feature Enhancements

  • Trajectory Functionality Extension

    • Add more analysis metrics
    • Enhanced trajectory saving and analysis capabilities
  • Reward Mechanism Refactoring

    • Refactor rule-based and model-based reward computation
    • Optimize reward dataset processing pipeline

Algorithm Optimization & Integration

  • More Algorithm Integration

    • Entropy-based token selection
    • GMPO (Geometric-Mean Policy Optimization)
    • GSPO (Group Sequence Policy Optimization)
  • Advantage Computation Refactoring

    • Optimize advantage estimation module architecture
    • Unify advantage computation interface across algorithms
  • Loss-Filter Mechanism Optimization

    • Refactor loss filtering implementation
    • Complete GSM8K/Geo3K benchmark experiments
    • Document experimental results and analysis

Community contributions and feedback are welcome!


📮 Contact

For questions or suggestions, please contact us via:


⭐ If this project helps you, please give us a star!

Made with ❤️ by LightRFT Team

