A lightweight post-training framework for LLMs and VLMs

oxRL

An agent-friendly framework for any post-training

A lightweight post-training framework for LLMs and VLMs, built to maximize developer speed. It scales to billions of parameters with DeepSpeed, vLLM, and Ray.


Usage

Post-train any model in under 10 lines of code. oxRL auto-detects your hardware, auto-prepares datasets, and scales to multi-GPU automatically.

from oxrl import Trainer

# 1. Initialize with any HuggingFace model
trainer = Trainer(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

# 2. Start reasoning post-training (Open-R1 recipe)
trainer.train(task="reasoning")

Supported Models

The following models have been verified and onboarded using our automated pipeline. You can find ready-to-use scripts in the examples/recipes/ directory.

Model                         Size               Task        Strategy     Status
Qwen3-0.6B                    0.6B               Instruct    Full-tuning  ✅ Verified
Qwen3.5-35B-A3B               35.0B (3B active)  Reasoning   LoRA         ✅ Verified
DeepSeek-R1-Distill-Llama-8B  8.0B               Reasoning   LoRA         ✅ Verified
DeepSeek-R1-Distill-Qwen-7B   7.0B               Reasoning   LoRA         ✅ Verified
Qwen2.5-Coder-7B-Instruct     7.6B               Coding      LoRA         ✅ Verified
Qwen2-Audio-7B-Instruct       7.0B               Audio       LoRA         ✅ Verified
Qwen2-VL-7B-Instruct          7.0B               Vision      LoRA         ✅ Verified
Gemma-3-1b-it                 1.0B               Multimodal  Full-tuning  ✅ Verified
Mistral-7B-Instruct-v0.3      7.0B               Instruct    LoRA         ✅ Verified
Qwen2.5-7B-Instruct           7.0B               Math        LoRA         ✅ Verified
SmolLM2-1.7B-Instruct         1.7B               Instruct    Full-tuning  ✅ Verified

System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                          oxRL Framework                          │
├────────────────────────────────┬─────────────────────────────────┤
│     RL Path (main_rl.py)       │     SL Path (main_sl.py)        │
│  SGRPO / CISPO / PPO          │  SFT / DPO / ORPO / KTO         │
│  RLHF / RLAIF                 │  CPT / KD / RM / RFT            │
│  Ray actors + vLLM rollouts    │  OnlineDPO / SPIN / IPO / SimPO │
│                                │  DeepSpeed distributed training  │
├────────────────────────────────┴─────────────────────────────────┤
│  oxrl/algs/        Algorithms    │  oxrl/rollouts/   vLLM + Replay│
│  oxrl/configs/     Pydantic cfg  │  oxrl/rewards/    Verifiable   │
│  oxrl/datasets/    HF loaders    │  oxrl/utils/      Setup + Logs │
└──────────────────────────────────────────────────────────────────┘

RL Training Workflow

  1. Scout Agent: Discovers model metadata and ensures chat_template compatibility.
  2. Multimodal Pipeline: Converts base64 images/audio into PIL/NumPy for vLLM rollouts.
  3. LoRA Lifecycle: Train with adapters, save with gathered ZeRO-3 weights, and auto-strip PEFT prefixes for immediate vLLM compatibility.
  4. Verifiable Rewards: Programmatic verification of CoT tags and mathematical correctness.
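Step 3 of the workflow above can be pictured with a small sketch. The key patterns used here (a `base_model.model.` prefix, a `.base_layer.` infix, and `lora_` adapter tensors) are assumptions about how PEFT-wrapped checkpoints are commonly named, and `strip_peft_prefixes` is a hypothetical helper, not oxRL's actual function. It also assumes the adapters were already merged into the base weights before saving.

```python
# Hypothetical sketch: rename PEFT-wrapped state-dict keys back to the
# names a plain HuggingFace / vLLM loader would expect.
PEFT_PREFIX = "base_model.model."  # assumed wrapper prefix


def strip_peft_prefixes(state_dict):
    """Return a new dict with PEFT wrapper names removed.

    Assumes LoRA deltas are already merged into the base weights, so
    adapter tensors (keys containing "lora_") can be dropped safely.
    """
    cleaned = {}
    for key, value in state_dict.items():
        if "lora_" in key:
            continue  # adapter-only tensor, not part of the base model
        if key.startswith(PEFT_PREFIX):
            key = key[len(PEFT_PREFIX):]
        # Wrapped linear layers keep the original weight under ".base_layer."
        key = key.replace(".base_layer.", ".")
        cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    example = {
        "base_model.model.model.layers.0.q_proj.base_layer.weight": "W",
        "base_model.model.model.layers.0.q_proj.lora_A.default.weight": "A",
        "lm_head.weight": "H",
    }
    print(sorted(strip_peft_prefixes(example)))
```

If the adapters have not been merged, dropping the `lora_` tensors would discard the fine-tuning, so a real implementation would merge first.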

Getting Started

Installation

# From source (recommended for development)
git clone https://github.com/warlockee/oxRL.git
cd oxRL
pip install -e .

# Or from PyPI
pip install oxrl

Run Tests

pip install pytest
pytest tests/test_bugs.py -v

Environment Diagnostics

Before starting a long training run, verify your environment (GPUs, CUDA Toolkit, DeepSpeed, Ray) with our diagnostic tool:

oxrl doctor
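
For a rough idea of what such a check involves, here is a minimal standard-library sketch. It is not the implementation of oxrl doctor; the function name `check_environment` and the lists of packages and binaries are assumptions chosen for illustration.

```python
# Hypothetical sketch of an environment check; not oxRL's own code.
import importlib.util
import shutil


def check_environment(
    packages=("torch", "deepspeed", "ray", "vllm"),
    binaries=("nvcc", "nvidia-smi"),
):
    """Report which Python packages are importable and which CLIs are on PATH."""
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[f"python:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        report[f"bin:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'} {item}")
```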

Configuration

oxRL uses YAML config files. See oxrl/configs/rl_args.yaml (RL) and oxrl/configs/sl_args.yaml (SL) for all available options with documentation. Example configs are in examples/.

Key environment variables:

  • OXRL_DATA_DIR — Override default data directory (default: ./data)
  • OXRL_CHECKPOINT_DIR — Override default checkpoint directory (default: ./checkpoints)
  • HF_TOKEN — HuggingFace token for gated models
  • GITHUB_TOKEN — For autonomous bug reporting (optional)
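
The variables above can be inspected from Python before launching a run. This is a small sketch using only the standard library; `resolve_dirs` is a hypothetical helper, and only the variable names and defaults come from the list above.

```python
# Hypothetical helper: resolve directories from the documented variables.
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Apply the documented defaults when a variable is unset."""
    env = os.environ if env is None else env
    return {
        "data_dir": Path(env.get("OXRL_DATA_DIR", "./data")),
        "checkpoint_dir": Path(env.get("OXRL_CHECKPOINT_DIR", "./checkpoints")),
        # Only report presence; never print the token itself
        "hf_token_set": bool(env.get("HF_TOKEN")),
    }


if __name__ == "__main__":
    print(resolve_dirs())
```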

Post-train a Reasoning Model

# config.yaml
model:
  name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
lora:
  enabled: true
reward:
  reward_func: "reasoning_reward_func"
data:
  dataset: "openr1_math"

# Launch RL training with this config
python main_rl.py --config-file config.yaml
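
To show how a config of this shape might be validated before training starts, here is a sketch using standard-library dataclasses. oxRL's own configs are Pydantic models, so this swaps in a simpler technique; the class names and `parse_config` are hypothetical, and only the four sections and their fields are taken from the example above.

```python
# Hypothetical sketch: typed validation of the example config's four sections.
from dataclasses import dataclass, field


@dataclass
class ModelCfg:
    name: str


@dataclass
class LoraCfg:
    enabled: bool = False


@dataclass
class RewardCfg:
    reward_func: str = ""


@dataclass
class DataCfg:
    dataset: str = ""


@dataclass
class RunCfg:
    model: ModelCfg
    lora: LoraCfg = field(default_factory=LoraCfg)
    reward: RewardCfg = field(default_factory=RewardCfg)
    data: DataCfg = field(default_factory=DataCfg)


def parse_config(raw):
    """Build a RunCfg from a parsed-YAML dict, rejecting unknown sections."""
    unknown = set(raw) - {"model", "lora", "reward", "data"}
    if unknown:
        raise ValueError(f"unknown config sections: {sorted(unknown)}")
    if "model" not in raw:
        raise ValueError("config needs a 'model' section")
    return RunCfg(
        model=ModelCfg(**raw["model"]),
        lora=LoraCfg(**raw.get("lora", {})),
        reward=RewardCfg(**raw.get("reward", {})),
        data=DataCfg(**raw.get("data", {})),
    )


if __name__ == "__main__":
    cfg = parse_config({
        "model": {"name": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"},
        "lora": {"enabled": True},
        "reward": {"reward_func": "reasoning_reward_func"},
        "data": {"dataset": "openr1_math"},
    })
    print(cfg)
```

Failing early on a misspelled section is cheaper than discovering it after model loading.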

Algorithms

Reinforcement Learning (via Ray + vLLM rollouts)

Algorithm  File               Description
SGRPO      oxrl/algs/grpo.py  Stable GRPO: clipped surrogate loss with LoRA support and reference-free variants.
CISPO      oxrl/algs/grpo.py  Clipped importance-sampling policy optimization.
PPO        oxrl/algs/ppo.py   Proximal Policy Optimization with GAE, value clipping, and shared-backbone critic.
RLHF       oxrl/algs/grpo.py  Reinforcement Learning from Human Feedback: GRPO with a trained reward model.
RLAIF      oxrl/algs/grpo.py  Reinforcement Learning from AI Feedback: GRPO with AI-generated rewards.
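
As a reference point for the clipped surrogate loss mentioned above, here is a pure-Python sketch of the textbook GRPO objective: rewards are normalized within a group of rollouts for the same prompt, then a PPO-style clipped ratio is applied. This is not oxRL's SGRPO code; it works per sequence rather than per token, and the function names and default `clip_eps` are assumptions.

```python
# Sketch of standard GRPO-style advantages and clipped surrogate loss.
import math


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    losses = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    advs = group_advantages([1.0, 0.0, 1.0, 0.0])
    print(advs)
    print(clipped_surrogate_loss([-1.0] * 4, [-1.2] * 4, advs))
```

The clip limits how far a single update can push the policy away from the one that generated the rollouts.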

Supervised Learning (via DeepSpeed)

Algorithm  File                     Description
SFT        oxrl/algs/sft.py         Supervised Fine-Tuning: cross-entropy loss with masking and normalization.
DPO        oxrl/algs/dpo.py         Direct Preference Optimization: pairwise preference learning with a reference model.
ORPO       oxrl/algs/orpo.py        Odds Ratio Preference Optimization: reference-free preference alignment via log-odds.
KTO        oxrl/algs/kto.py         Kahneman-Tversky Optimization: prospect-theory-inspired alignment with moving-average KL baseline.
CPT        oxrl/algs/cpt.py         Continued Pre-Training: full-sequence language modeling on domain-specific text.
KD         oxrl/algs/kd.py          Knowledge Distillation: teacher-student training with combined CE and KL divergence loss.
RM         oxrl/algs/rm.py          Reward Model Training: Bradley-Terry pairwise ranking with a learned scalar head.
OnlineDPO  oxrl/algs/online_dpo.py  Online DPO: DPO with on-the-fly rejection sampling in the data pipeline.
RFT        oxrl/algs/rft.py         Rejection Sampling Fine-Tuning: SFT on reward-filtered responses above a threshold.
SPIN       oxrl/algs/spin.py        Self-Play Improvement: DPO where rejected samples are the model's own prior outputs.
IPO        oxrl/algs/ipo.py         Identity Preference Optimization: squared-loss variant of DPO for improved stability.
SimPO      oxrl/algs/simpo.py       Simple Preference Optimization: reference-free, length-normalized preference alignment.
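
To make the difference between DPO and its squared-loss variant IPO concrete, here is a pure-Python sketch of both losses for a single preference pair, following their standard published forms. It is not taken from oxrl/algs/; the function names and the default `beta` are assumptions, and inputs are summed sequence log-probabilities.

```python
# Sketch of the standard DPO and IPO losses for one (chosen, rejected) pair.
import math


def _log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """Policy-vs-reference log-ratio of chosen minus that of rejected."""
    return (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = preference_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected)
    return -_log_sigmoid(beta * margin)


def ipo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = preference_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected)
    # Squared distance to a fixed target margin of 1 / (2 * beta)
    return (margin - 1.0 / (2.0 * beta)) ** 2


if __name__ == "__main__":
    args = (-8.0, -12.0, -10.0, -10.0)  # policy prefers the chosen answer
    print(dpo_loss(*args), ipo_loss(*args))
```

DPO keeps rewarding an ever larger margin, while IPO pulls the margin toward a fixed target, which is one reason it is described as more stable.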

Project Structure

oxRL/
├── oxrl/                   # Core Framework Package
│   ├── trainer.py          # High-level Trainer API
│   ├── rewards/            # Verifiable reasoning and coding rewards (math, code, etc.)
│   ├── algs/               # 17 algorithm implementations (see tables above)
│   ├── swarm/              # Autonomous model onboarding (Scout, Bugfixer)
│   ├── preprocessing/      # Reasoning (OpenR1), Multimodal (Vision/Audio) preprocessors
│   ├── rollouts/           # vLLM inference with structured prompt support
│   └── datasets/           # Dataset loaders and samplers
├── main_rl.py              # RL training loop (Ray + DeepSpeed)
├── main_sl.py              # SL training loop (DeepSpeed) — 12 algorithms
├── examples/               # Ready-to-use recipes and training scripts
└── setup.py                # Packaging and Installation

Design Principles

Debuggability over Pipelining. oxRL avoids complex async pipelining so that failures can be reproduced deterministically and logs stay easy to read.

Robust Environment Handling. oxRL is designed to work even in constrained environments. It automatically handles common CUDA/DeepSpeed mismatches by providing actionable warnings instead of fatal crashes.

Autonomous Bug Reporting. On framework failure, oxRL provides structured diagnostic signals for AI agents to automatically generate and submit GitHub issues (requires GITHUB_TOKEN environment variable).

LoRA-first for 7B+. We default to LoRA for larger models to enable high-quality research on consumer-grade and restricted high-end hardware.

Verification-driven RL. We prioritize datasets where the reward is verifiable (Math, Code, Format) to drive logical discovery.
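
A verifiable reward of this kind can be sketched in a few lines. The example below is illustrative only and is not oxRL's `reasoning_reward_func`: the `<think>` tag, the `\boxed{}` answer convention, the name `sketch_reasoning_reward`, and the 0.2 format weight are all assumptions.

```python
# Hypothetical sketch of a format + correctness reward for math answers.
import math
import re

THINK_RE = re.compile(r"<think>(.+?)</think>", re.DOTALL)  # assumed CoT tag
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")  # assumed answer format


def _same_answer(pred, gold):
    """Compare numerically when both parse as floats, else as strings."""
    try:
        return math.isclose(float(pred), float(gold), rel_tol=1e-9, abs_tol=1e-9)
    except ValueError:
        return pred.strip() == gold.strip()


def sketch_reasoning_reward(completion, gold_answer, format_weight=0.2):
    """Return format_weight for a CoT block plus the rest for a correct answer."""
    has_think = THINK_RE.search(completion) is not None
    answers = BOXED_RE.findall(completion)
    # Score only the last boxed answer, so earlier scratch work is ignored
    correct = bool(answers) and _same_answer(answers[-1], gold_answer)
    return format_weight * has_think + (1.0 - format_weight) * correct


if __name__ == "__main__":
    text = r"<think>2 + 2 = 4</think> The answer is \boxed{4}"
    print(sketch_reasoning_reward(text, "4"))
```

Because both checks are programmatic, the reward needs no learned judge and gives the same score on every run.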

LLM Developer Map

This repository is optimized for LLM-assisted development (Claude/Gemini). If you are asking an AI to work on this framework, point it to these "High-Signal" files:

  • Bug Reporting: See BUG_REPORTING.md for instructions on autonomous issue submission.
  • Adding a New Algorithm: See oxrl/algs/base.py (Base Class) and oxrl/algs/grpo.py (Implementation).
  • Adding a Reward Function: Add to oxrl/rewards/ using the signature in oxrl/rewards/base.py.
  • Changing Model Loading: See oxrl/utils/setup.py -> load_model_and_ref.
  • Training Logic: The main loop resides in main_rl.py.
  • Config Validation: Logic is in oxrl/configs/load.py.

Contributing

Contributions are welcome. Please follow the existing architectural patterns and style.

FAQ

Check out the FAQ for details on LoRA merging and Multimodal input formatting.
