oxRL
Post-train any model in under 10 lines of code.
A lightweight post-training framework for LLMs, VLMs, and VLAs, built to maximize developer speed. It scales to billions of parameters with DeepSpeed, vLLM, and Ray.
New in v1.1: Reasoning & Multimodal RL
We've significantly expanded oxRL's capabilities to support the latest trending architectures and training recipes:
- Verifiable Reasoning (Open-R1): Native support for reasoning models with `<thought>` and `<answer>` tag enforcement and rule-based correctness rewards.
- Simple Preference Optimization (SimPO): State-of-the-art reference-free alignment that reduces VRAM by 40% and improves logical reasoning.
- Multimodal RL: Support for Vision-Language (VLM) and Audio-Language models. Seamless base64-to-tensor pipeline for on-policy rollouts.
- GPQA & ScienceQA: Integrated high-difficulty reasoning and multimodal datasets.
- Memory-Efficient LoRA: Built-in PEFT integration allows post-training 14B+ models on restricted hardware.
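To make "reference-free" concrete, here is a minimal sketch of the SimPO objective for a single preference pair, written in plain Python. It is not the code in `algs/simpo.py`; the function name `simpo_loss` and the default `beta` and `gamma` values are illustrative assumptions. The loss depends only on the policy's own length-normalized log-probabilities, so no frozen reference model has to be kept in memory.

```python
import math


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) pair.

    chosen_logps / rejected_logps: per-token log-probabilities of each
    response under the current policy. beta and gamma are illustrative
    defaults, not values taken from oxRL.
    """
    # Length-normalized implicit reward: mean token log-prob scaled by beta.
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)

    # The chosen response must beat the rejected one by a margin gamma.
    margin = r_chosen - r_rejected - gamma

    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the chosen response becomes more likely per token than the rejected one, regardless of how long either response is.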
Usage (Python API)
Post-train any model in under 10 lines of code. oxRL auto-detects your hardware, auto-prepares datasets, and scales to multi-GPU automatically.
from oxrl import Trainer
# 1. Initialize with any HuggingFace model
trainer = Trainer(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
# 2. Start reasoning post-training (Open-R1 recipe)
trainer.train(task="reasoning")
Supported Models
The following models have been verified and onboarded using our automated pipeline. You can find ready-to-use scripts in the examples/onboarded_models/ directory.
| Model | Size | Task | Strategy | Status |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B | 8.0B | Reasoning | LoRA | ✅ Verified |
| DeepSeek-R1-Distill-Qwen-7B | 7.0B | Reasoning | LoRA | ✅ Verified |
| Qwen2.5-Coder-7B-Instruct | 7.6B | Coding | LoRA | ✅ Verified |
| Qwen2-Audio-7B-Instruct | 7.0B | Audio | LoRA | ✅ Verified |
| Qwen2-VL-7B-Instruct | 7.0B | Vision | LoRA | ✅ Verified |
| Gemma-3-1b-it | 1.0B | Multimodal | Full-tuning | ✅ Verified |
| Mistral-7B-Instruct-v0.3 | 7.0B | Instruct | LoRA | ✅ Verified |
| Qwen2.5-7B-Instruct | 7.0B | Math | LoRA | ✅ Verified |
| SmolLM2-1.7B-Instruct | 1.7B | Instruct | Full-tuning | ✅ Verified |
System Architecture
┌────────────────────────────────────────────────────────────────────┐
│                           oxRL Framework                           │
├─────────────────────┬─────────────────────┬────────────────────────┤
│  Training Engines   │  Rollout Engines    │  Config + Data         │
│  (Ray + DeepSpeed)  │  (Ray + vLLM)       │  (Pydantic + HF)       │
├─────────────────────┼─────────────────────┼────────────────────────┤
│                     │                     │                        │
│  algs/grpo.py       │  rollouts/          │  configs/load.py       │
│    SGRPO loss       │    vllm_engine.py   │  configs/*.yaml        │
│    LoRA / PEFT      │    replay_buffer.py │                        │
│  algs/PPO/ppo.py    │                     │  datasets/             │
│  algs/SFT/sft.py    │                     │    prompt_only.py      │
│                     │                     │    (Multimodal Ready)  │
├─────────────────────┼─────────────────────┼────────────────────────┤
│  swarm/             │  utils/logging.py   │  rewards/compute_score │
│    orchestrator.py  │  utils/setup.py     │    (Reasoning / Code)  │
└─────────────────────┴─────────────────────┴────────────────────────┘
RL Training Workflow
- Scout Agent: Discovers model metadata and ensures `chat_template` compatibility.
- Multimodal Pipeline: Converts base64 images/audio into PIL/NumPy for vLLM rollouts.
- LoRA Lifecycle: Train with adapters, save with gathered ZeRO-3 weights, and auto-strip PEFT prefixes for immediate vLLM compatibility.
- Verifiable Rewards: Programmatic verification of CoT tags and mathematical correctness.
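The prefix-stripping part of the LoRA lifecycle can be illustrated with a small sketch over a plain dictionary of weights. This is not oxRL's implementation: the helper name `strip_peft_prefixes` is hypothetical, and the sketch assumes the adapter has already been merged into the base weights, so the remaining `lora_` tensors can be dropped. It relies on the naming PEFT uses when it wraps a model: a `base_model.model.` prefix on every key and a `.base_layer.` segment inside adapted layers.

```python
def strip_peft_prefixes(state_dict, prefix="base_model.model."):
    """Return a copy of state_dict with PEFT wrapper naming removed.

    Assumption: LoRA deltas were merged into the base weights beforehand,
    so adapter-only tensors carry no extra information.
    """
    cleaned = {}
    for key, value in state_dict.items():
        # Drop adapter-only tensors such as lora_A / lora_B.
        if "lora_" in key:
            continue
        # Remove the wrapper prefix added around the original model.
        if key.startswith(prefix):
            key = key[len(prefix):]
        # Adapted layers keep the original weight under ".base_layer.".
        key = key.replace(".base_layer.", ".")
        cleaned[key] = value
    return cleaned
```

After this renaming, the keys match the layout of the original HuggingFace checkpoint, which is the layout an inference engine expects when loading the saved weights.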
Quick Start
Installation
pip install oxrl
Post-train a Reasoning Model
# config.yaml
model:
  name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
lora:
  enabled: true
reward:
  reward_func: "reasoning_reward_func"
data:
  dataset: "openr1_math"
python main_rl.py --config-file config.yaml
Algorithms
| Algorithm | File | Description |
|---|---|---|
| SGRPO | algs/grpo.py | Stable GRPO: clipped surrogate loss with LoRA support and reference-free variants. |
| SimPO | algs/simpo.py | Simple Preference Optimization: reference-free, length-normalized alignment. |
| CISPO | algs/grpo.py | Clipped importance-sampling policy optimization. |
| PPO | algs/PPO/ppo.py | Proximal Policy Optimization with GAE and value clipping. |
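For readers new to this family of algorithms, the following sketch shows the two ingredients that GRPO-style methods commonly combine: advantages normalized within a group of rollouts for the same prompt, and a PPO-style clipped surrogate loss. It is written in plain Python for a single sample and is only an illustration under those assumptions; the exact loss in `algs/grpo.py` may differ, and the function names and the `clip_eps` default are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate loss for one sample (to be minimized)."""
    # Importance ratio between the current and the rollout-time policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Pessimistic bound: take the smaller objective, then negate for a loss.
    return -min(ratio * advantage, clipped * advantage)
```

Because the advantage is computed relative to the group mean, no learned value function is needed, which is one reason this family pairs well with rule-based rewards.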
Project Structure
oxRL/
├── main_rl.py          RL training loop (Ray + DeepSpeed)
├── swarm/              Autonomous model onboarding (Scout, Bugfixer)
├── preprocessing/      Reasoning (OpenR1), Multimodal (Vision/Audio) preprocessors
├── rollouts/           vLLM inference with structured prompt support
└── rewards/            Verifiable reasoning and coding rewards
Design Principles
Debuggability over Pipelining. oxRL avoids complex async pipelining to ensure that failure states are 100% reproducible and logs are clear.
LoRA-first for 7B+. We default to LoRA for larger models to enable high-quality research on consumer-grade and restricted high-end hardware.
Verification-driven RL. We prioritize datasets where the reward is verifiable (Math, Code, Format) to drive logical discovery.
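A verifiable reward of the kind described above can be as small as a regular expression plus a string comparison. The sketch below is an assumption-laden illustration, not the `reasoning_reward_func` shipped with oxRL: the function name `sketch_reasoning_reward`, the exact-match comparison, and the `format_bonus` value are all hypothetical. It gives full credit for a well-formed response with the right answer, partial credit for correct formatting alone, and nothing otherwise.

```python
import re

# A response must be exactly one <thought> block followed by one <answer> block.
_FORMAT = re.compile(
    r"^\s*<thought>(.+?)</thought>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def sketch_reasoning_reward(completion, reference, format_bonus=0.2):
    """Rule-based reward: tag format check plus exact-match correctness."""
    match = _FORMAT.match(completion)
    if match is None:
        # Missing or malformed tags earn no reward.
        return 0.0
    answer = match.group(2).strip()
    # Full reward for a correct answer, a small bonus for format only.
    return 1.0 if answer == reference.strip() else format_bonus
```

For example, `sketch_reasoning_reward("<thought>2+2 is 4</thought><answer>4</answer>", "4")` scores the response as fully correct, while an untagged "The answer is 4" scores zero even though the number is right. A production reward for math would normally compare answers symbolically rather than by string equality.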
Contributing
Contributions are welcome. Please follow the existing architectural patterns and style.
FAQ
Check out the FAQ for details on LoRA merging and Multimodal input formatting.