A lightweight post-training framework for LLMs and VLMs
An agent-friendly framework for any post-training
oxRL is built to maximize developer speed and scales to billions of parameters with DeepSpeed, vLLM, and Ray.
Usage
Post-train any model in under 10 lines of code. oxRL auto-detects your hardware, auto-prepares datasets, and scales to multi-GPU automatically.
from oxrl import Trainer
# 1. Initialize with any HuggingFace model
trainer = Trainer(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
# 2. Start reasoning post-training (Open-R1 recipe)
trainer.train(task="reasoning")
Supported Models
The following models have been verified and onboarded using our automated pipeline. You can find ready-to-use scripts in the examples/recipes/ directory.
| Model | Size | Task | Strategy | Status |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | Instruct | Full-tuning | ✅ Verified |
| Qwen3.5-35B-A3B | 35.0B (3B active) | Reasoning | LoRA | ✅ Verified |
| DeepSeek-R1-Distill-Llama-8B | 8.0B | Reasoning | LoRA | ✅ Verified |
| DeepSeek-R1-Distill-Qwen-7B | 7.0B | Reasoning | LoRA | ✅ Verified |
| Qwen2.5-Coder-7B-Instruct | 7.6B | Coding | LoRA | ✅ Verified |
| Qwen2-Audio-7B-Instruct | 7.0B | Audio | LoRA | ✅ Verified |
| Qwen2-VL-7B-Instruct | 7.0B | Vision | LoRA | ✅ Verified |
| Gemma-3-1b-it | 1.0B | Multimodal | Full-tuning | ✅ Verified |
| Mistral-7B-Instruct-v0.3 | 7.0B | Instruct | LoRA | ✅ Verified |
| Qwen2.5-7B-Instruct | 7.0B | Math | LoRA | ✅ Verified |
| SmolLM2-1.7B-Instruct | 1.7B | Instruct | Full-tuning | ✅ Verified |
System Architecture
┌──────────────────────────────────────────────────────────────────┐
│ oxRL Framework │
├────────────────────────────────┬─────────────────────────────────┤
│ RL Path (main_rl.py) │ SL Path (main_sl.py) │
│ SGRPO / CISPO / PPO │ SFT / DPO / ORPO / KTO │
│ RLHF / RLAIF │ CPT / KD / RM / RFT │
│ Ray actors + vLLM rollouts │ OnlineDPO / SPIN / IPO / SimPO │
│ │ DeepSpeed distributed training │
├────────────────────────────────┴─────────────────────────────────┤
│ oxrl/algs/ Algorithms │ oxrl/rollouts/ vLLM + Replay│
│ oxrl/configs/ Pydantic cfg │ oxrl/rewards/ Verifiable │
│ oxrl/datasets/ HF loaders │ oxrl/utils/ Setup + Logs │
└──────────────────────────────────────────────────────────────────┘
RL Training Workflow
- Scout Agent: Discovers model metadata and ensures chat_template compatibility.
- Multimodal Pipeline: Converts base64 images/audio into PIL/NumPy for vLLM rollouts.
- LoRA Lifecycle: Train with adapters, save with gathered ZeRO-3 weights, and auto-strip PEFT prefixes for immediate vLLM compatibility.
- Verifiable Rewards: Programmatic verification of CoT tags and mathematical correctness.
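The prefix-stripping step of the LoRA lifecycle above can be sketched as follows. This is a minimal illustration, not oxRL's actual implementation: the function name `strip_peft_prefixes` is hypothetical, and it assumes the checkpoint keys carry the `base_model.model.` prefix and `.base_layer.` infix that PEFT-wrapped models typically use, and that the adapters were already merged into the base weights.

```python
from typing import Any, Dict

# Assumption: prefix added by the PEFT wrapper around the base model.
PEFT_PREFIX = "base_model.model."


def strip_peft_prefixes(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of state_dict with PEFT wrapper naming removed.

    Adapter-only tensors (keys containing "lora_") are dropped, on the
    assumption that they were merged into the base weights beforehand.
    """
    cleaned: Dict[str, Any] = {}
    for key, tensor in state_dict.items():
        if "lora_" in key:
            continue  # skip adapter tensors
        if key.startswith(PEFT_PREFIX):
            key = key[len(PEFT_PREFIX):]
        # LoRA-wrapped linear layers keep the original weight under "base_layer".
        key = key.replace(".base_layer.", ".")
        cleaned[key] = tensor
    return cleaned
```

After this renaming, the keys match the plain HuggingFace layout, which is what an inference engine such as vLLM expects when loading a checkpoint directory.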
Getting Started
Installation
# From source (recommended for development)
git clone https://github.com/warlockee/oxRL.git
cd oxRL
pip install -e .
# Or from PyPI
pip install oxrl
Run Tests
pip install pytest
pytest tests/test_bugs.py -v
Environment Diagnostics
Before starting a long training run, verify your environment (GPUs, CUDA Toolkit, DeepSpeed, Ray) with our diagnostic tool:
oxrl doctor
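For a rough idea of what such a check involves, here is a standard-library sketch. It is not the implementation of `oxrl doctor`; the function name `check_environment` and the exact set of probes are assumptions chosen to mirror the components listed above.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which GPU tools and Python packages are discoverable."""
    return {
        # Command-line tools looked up on PATH.
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "nvcc": shutil.which("nvcc") is not None,
        # Python packages located without importing them.
        "deepspeed": importlib.util.find_spec("deepspeed") is not None,
        "ray": importlib.util.find_spec("ray") is not None,
        "vllm": importlib.util.find_spec("vllm") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

A check like this only confirms that the components are present; it does not detect version mismatches between the CUDA Toolkit and DeepSpeed.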
Configuration
oxRL uses YAML config files. See oxrl/configs/rl_args.yaml (RL) and oxrl/configs/sl_args.yaml (SL) for all available options with documentation. Example configs are in examples/.
Key environment variables:
- OXRL_DATA_DIR: Override default data directory (default: ./data)
- OXRL_CHECKPOINT_DIR: Override default checkpoint directory (default: ./checkpoints)
- HF_TOKEN: HuggingFace token for gated models
- GITHUB_TOKEN: For autonomous bug reporting (optional)
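The directory overrides can be mirrored in your own scripts with a small helper. This is a sketch under the defaults stated above; `resolve_dirs` is a hypothetical name, not part of the oxRL API.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_dirs(env: Optional[Mapping[str, str]] = None) -> Tuple[Path, Path]:
    """Return (data_dir, checkpoint_dir), honoring the documented overrides."""
    env = os.environ if env is None else env
    data_dir = Path(env.get("OXRL_DATA_DIR", "./data"))
    checkpoint_dir = Path(env.get("OXRL_CHECKPOINT_DIR", "./checkpoints"))
    return data_dir, checkpoint_dir
```

Passing a plain dict as `env` makes the helper easy to test without touching the real process environment.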
Post-train a Reasoning Model
# config.yaml
model:
  name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
lora:
  enabled: true
reward:
  reward_func: "reasoning_reward_func"
data:
  dataset: "openr1_math"
python main_rl.py --config-file config.yaml
Algorithms
Reinforcement Learning (via Ray + vLLM rollouts)
| Algorithm | File | Description |
|---|---|---|
| SGRPO | oxrl/algs/grpo.py | Stable GRPO: clipped surrogate loss with LoRA support and reference-free variants. |
| CISPO | oxrl/algs/grpo.py | Clipped importance-sampling policy optimization. |
| PPO | oxrl/algs/ppo.py | Proximal Policy Optimization with GAE, value clipping, and shared-backbone critic. |
| RLHF | oxrl/algs/grpo.py | Reinforcement Learning from Human Feedback: GRPO with a trained reward model. |
| RLAIF | oxrl/algs/grpo.py | Reinforcement Learning from AI Feedback: GRPO with AI-generated rewards. |
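To make the "clipped surrogate loss" in the table concrete, here is a minimal pure-Python sketch of the textbook GRPO recipe: rewards are normalized within a group of rollouts for the same prompt, and the per-sample loss clips the importance ratio. It is not taken from oxrl/algs/grpo.py; the function names, the `eps` stabilizer, and the default `clip_eps=0.2` are assumptions, and the SGRPO and CISPO variants may differ in detail.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(
    logp_new: float, logp_old: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """Per-sample clipped surrogate loss (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Take the pessimistic (smaller) objective, then negate to get a loss.
    return -min(ratio * advantage, clipped * advantage)
```

Because advantages are computed relative to the group mean, this formulation needs no learned value function, which is the main practical difference from the PPO row above.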
Supervised Learning (via DeepSpeed)
| Algorithm | File | Description |
|---|---|---|
| SFT | oxrl/algs/sft.py | Supervised Fine-Tuning: cross-entropy loss with masking and normalization. |
| DPO | oxrl/algs/dpo.py | Direct Preference Optimization: pairwise preference learning with a reference model. |
| ORPO | oxrl/algs/orpo.py | Odds Ratio Preference Optimization: reference-free preference alignment via log-odds. |
| KTO | oxrl/algs/kto.py | Kahneman-Tversky Optimization: prospect-theory-inspired alignment with moving-average KL baseline. |
| CPT | oxrl/algs/cpt.py | Continued Pre-Training: full-sequence language modeling on domain-specific text. |
| KD | oxrl/algs/kd.py | Knowledge Distillation: teacher-student training with combined CE and KL divergence loss. |
| RM | oxrl/algs/rm.py | Reward Model Training: Bradley-Terry pairwise ranking with a learned scalar head. |
| OnlineDPO | oxrl/algs/online_dpo.py | Online DPO: DPO with on-the-fly rejection sampling in the data pipeline. |
| RFT | oxrl/algs/rft.py | Rejection Sampling Fine-Tuning: SFT on reward-filtered responses above a threshold. |
| SPIN | oxrl/algs/spin.py | Self-Play Improvement: DPO where rejected samples are the model's own prior outputs. |
| IPO | oxrl/algs/ipo.py | Identity Preference Optimization: squared-loss variant of DPO for improved stability. |
| SimPO | oxrl/algs/simpo.py | Simple Preference Optimization: reference-free, length-normalized preference alignment. |
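The three DPO-family rows differ mainly in how they turn sequence log-probabilities into a loss. The sketch below follows the commonly published formulas for a single preference pair, using plain floats instead of tensors. It is illustrative only and not copied from oxrl/algs/; the function names and the default values of `beta`, `tau`, and `gamma` are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_c: float, pi_r: float, ref_c: float, ref_r: float,
             beta: float = 0.1) -> float:
    """DPO: logistic loss on the policy-vs-reference log-ratio margin."""
    h = (pi_c - ref_c) - (pi_r - ref_r)
    return -log_sigmoid(beta * h)


def ipo_loss(pi_c: float, pi_r: float, ref_c: float, ref_r: float,
             tau: float = 0.1) -> float:
    """IPO: squared loss pulling the same margin toward 1 / (2 * tau)."""
    h = (pi_c - ref_c) - (pi_r - ref_r)
    return (h - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(pi_c: float, pi_r: float, len_c: int, len_r: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO: reference-free, length-normalized margin with target gamma."""
    margin = beta * (pi_c / len_c - pi_r / len_r) - gamma
    return -log_sigmoid(margin)
```

Here `pi_c` and `pi_r` are the summed log-probabilities of the chosen and rejected responses under the policy, and `ref_c` and `ref_r` are the same quantities under the frozen reference model. SimPO takes response lengths instead of reference log-probabilities, which is why it needs no second model in memory.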
Project Structure
oxRL/
├── oxrl/ # Core Framework Package
│ ├── trainer.py # High-level Trainer API
│ ├── rewards/ # Verifiable reasoning and coding rewards (math, code, etc.)
│ ├── algs/ # 17 algorithm implementations (see tables above)
│ ├── swarm/ # Autonomous model onboarding (Scout, Bugfixer)
│ ├── preprocessing/ # Reasoning (OpenR1), Multimodal (Vision/Audio) preprocessors
│ ├── rollouts/ # vLLM inference with structured prompt support
│ └── datasets/ # Dataset loaders and samplers
├── main_rl.py # RL training loop (Ray + DeepSpeed)
├── main_sl.py # SL training loop (DeepSpeed) — 12 algorithms
├── examples/ # Ready-to-use recipes and training scripts
└── setup.py # Packaging and Installation
Design Principles
Debuggability over Pipelining. oxRL avoids complex async pipelining to ensure that failure states are 100% reproducible and logs are clear.
Robust Environment Handling. oxRL is designed to work even in constrained environments. It automatically handles common CUDA/DeepSpeed mismatches by providing actionable warnings instead of fatal crashes.
Autonomous Bug Reporting. On framework failure, oxRL provides structured diagnostic signals for AI agents to automatically generate and submit GitHub issues (requires GITHUB_TOKEN environment variable).
LoRA-first for 7B+. We default to LoRA for larger models to enable high-quality research on consumer-grade and restricted high-end hardware.
Verification-driven RL. We prioritize datasets where the reward is verifiable (Math, Code, Format) to drive logical discovery.
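A verifiable reward in this sense is a plain function that scores a response without a learned judge. The sketch below combines a format check with an exact-match answer check. It is a toy example, not oxRL's reward function: the name `toy_reasoning_reward`, the `<think>` tag convention, the `\boxed{}` answer format, and the 0.5/0.5 weighting are all assumptions.

```python
import re

# Assumed conventions: reasoning inside <think>...</think>, answer in \boxed{...}.
THINK_RE = re.compile(r"<think>(.+?)</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def toy_reasoning_reward(response: str, reference_answer: str) -> float:
    """Score a response: 0.5 for well-formed CoT tags, 0.5 for a correct answer."""
    format_ok = THINK_RE.search(response) is not None
    # Only look for the final answer after the reasoning block.
    final_part = response.split("</think>")[-1]
    answers = BOXED_RE.findall(final_part)
    answer_ok = bool(answers) and answers[-1].strip() == reference_answer.strip()
    return 0.5 * format_ok + 0.5 * answer_ok
```

Exact string matching is the simplest possible verifier; real math rewards usually normalize expressions first so that equivalent answers such as `1/2` and `0.5` compare equal.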
LLM Developer Map
This repository is optimized for LLM-assisted development (Claude/Gemini). If you are asking an AI to work on this framework, point it to these high-signal files:
- Bug Reporting: See BUG_REPORTING.md for instructions on autonomous issue submission.
- Adding a New Algorithm: See oxrl/algs/base.py (Base Class) and oxrl/algs/grpo.py (Implementation).
- Adding a Reward Function: Add to oxrl/rewards/ using the signature in oxrl/rewards/base.py.
- Changing Model Loading: See oxrl/utils/setup.py -> load_model_and_ref.
- Training Logic: The main loop resides in main_rl.py.
- Config Validation: Logic is in oxrl/configs/load.py.
Contributing
Contributions are welcome. Please follow the existing architectural patterns and style.
FAQ
Check out the FAQ for details on LoRA merging and Multimodal input formatting.