A lightweight post-training framework for LLMs

oxRL

Post-train any model in under 10 lines of code.

A lightweight post-training framework for LLMs, VLMs, and VLAs, built to maximize developer speed. It scales to billions of parameters with DeepSpeed, vLLM, and Ray.


🚀 New in v1.1: Reasoning & Multimodal RL

We've expanded oxRL to support recent model architectures and training recipes:

  • Verifiable Reasoning (Open-R1): Native support for reasoning models with <thought> and <answer> tag enforcement and rule-based correctness rewards.
  • Simple Preference Optimization (SimPO): State-of-the-art reference-free alignment that reduces VRAM by 40% and improves logical reasoning.
  • Multimodal RL: Support for Vision-Language (VLM) and Audio-Language models. Seamless base64-to-tensor pipeline for on-policy rollouts.
  • GPQA & ScienceQA: Integrated high-difficulty reasoning and multimodal datasets.
  • Memory-Efficient LoRA: Built-in PEFT integration allows post-training 14B+ models on restricted hardware.
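
To make "reference-free" in the SimPO bullet above concrete, here is a minimal sketch of the SimPO objective for a single preference pair, written in plain Python. It follows the published SimPO formulation, not oxRL's own implementation; the function name `simpo_loss` and the default `beta` and `gamma` values are placeholders chosen for illustration.

```python
import math

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair (illustrative, not oxRL's code).

    chosen_logps / rejected_logps: per-token log-probabilities of the
    preferred and dispreferred responses under the policy being trained.
    No reference model is consulted anywhere.
    """
    # Length-normalized implicit rewards: beta * average token log-prob.
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    # Bradley-Terry style objective with a target reward margin gamma.
    margin = r_chosen - r_rejected - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because the reward is computed from the policy's own log-probabilities, no frozen reference model needs to be held in memory during training, which is where memory savings of this kind of method come from.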

Usage (Python API)

Post-train any model in under 10 lines of code. oxRL auto-detects your hardware, auto-prepares datasets, and scales to multi-GPU automatically.

from oxrl import Trainer

# 1. Initialize with any HuggingFace model
trainer = Trainer(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

# 2. Start reasoning post-training (Open-R1 recipe)
trainer.train(task="reasoning")

Supported Models

The following models have been verified and onboarded using our automated pipeline. You can find ready-to-use scripts in the examples/onboarded_models/ directory.

| Model | Size | Task | Strategy | Status |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B | 8.0B | Reasoning | LoRA | ✅ Verified |
| DeepSeek-R1-Distill-Qwen-7B | 7.0B | Reasoning | LoRA | ✅ Verified |
| Qwen2.5-Coder-7B-Instruct | 7.6B | Coding | LoRA | ✅ Verified |
| Qwen2-Audio-7B-Instruct | 7.0B | Audio | LoRA | ✅ Verified |
| Qwen2-VL-7B-Instruct | 7.0B | Vision | LoRA | ✅ Verified |
| Gemma-3-1b-it | 1.0B | Multimodal | Full-tuning | ✅ Verified |
| Mistral-7B-Instruct-v0.3 | 7.0B | Instruct | LoRA | ✅ Verified |
| Qwen2.5-7B-Instruct | 7.0B | Math | LoRA | ✅ Verified |
| SmolLM2-1.7B-Instruct | 1.7B | Instruct | Full-tuning | ✅ Verified |

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         oxRL Framework                          │
├─────────────────────┬───────────────────┬───────────────────────┤
│   Training Engines  │  Rollout Engines  │    Config + Data      │
│   (Ray + DeepSpeed) │  (Ray + vLLM)     │    (Pydantic + HF)    │
├─────────────────────┼───────────────────┼───────────────────────┤
│                     │                   │                       │
│  algs/grpo.py       │ rollouts/         │ configs/load.py       │
│    SGRPO loss       │   vllm_engine.py  │ configs/*.yaml        │
│    LoRA / PEFT      │   replay_buffer.py│                       │
│  algs/PPO/ppo.py    │                   │ datasets/             │
│  algs/SFT/sft.py    │                   │   prompt_only.py      │
│                     │                   │   (Multimodal Ready)  │
├─────────────────────┼───────────────────┼───────────────────────┤
│  swarm/             │ utils/logging.py  │ rewards/compute_score │
│    orchestrator.py  │ utils/setup.py    │ (Reasoning / Code)    │
└─────────────────────┴───────────────────┴───────────────────────┘

RL Training Workflow

  1. Scout Agent: Discovers model metadata and ensures chat_template compatibility.
  2. Multimodal Pipeline: Converts base64 images/audio into PIL/NumPy for vLLM rollouts.
  3. LoRA Lifecycle: Train with adapters, save with gathered ZeRO-3 weights, and auto-strip PEFT prefixes for immediate vLLM compatibility.
  4. Verifiable Rewards: Programmatic verification of CoT tags and mathematical correctness.
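
As an illustration of step 4, the sketch below shows one way a rule-based reward could check both the tag format and the final answer. It is a hypothetical example, not oxRL's reward function: the name `sketch_reasoning_reward`, the exact tag layout, and the 0.0 / 0.5 / 1.0 scoring scheme are all assumptions.

```python
import re

# Assumed completion layout: <thought>...</thought> then <answer>...</answer>.
_FORMAT_RE = re.compile(
    r"\s*<thought>(.+?)</thought>\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)

def sketch_reasoning_reward(completion: str, reference: str) -> float:
    """Hypothetical rule-based reward.

    Returns 0.0 if the tag format is violated, 0.5 for a well-formed
    completion with a wrong answer, and 1.0 for a well-formed, correct one.
    """
    match = _FORMAT_RE.fullmatch(completion)
    if match is None:
        return 0.0
    answer = match.group(2).strip()
    try:
        # Compare numerically when both sides parse as numbers.
        correct = abs(float(answer) - float(reference)) < 1e-6
    except ValueError:
        # Otherwise fall back to an exact string comparison.
        correct = answer == reference.strip()
    return 1.0 if correct else 0.5
```

Because the check is purely programmatic, the same completion always receives the same score, which keeps reward computation reproducible across runs.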

Quick Start

Installation

pip install oxrl

Post-train a Reasoning Model

# config.yaml
model:
  name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
lora:
  enabled: true
reward:
  reward_func: "reasoning_reward_func"
data:
  dataset: "openr1_math"

python main_rl.py --config-file config.yaml

Algorithms

| Algorithm | File | Description |
|---|---|---|
| SGRPO | algs/grpo.py | Stable GRPO: clipped surrogate loss with LoRA support and reference-free variants. |
| SimPO | algs/simpo.py | Simple Preference Optimization: reference-free, length-normalized alignment. |
| CISPO | algs/grpo.py | Clipped importance-sampling policy optimization. |
| PPO | algs/PPO/ppo.py | Proximal Policy Optimization with GAE and value clipping. |
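
For readers unfamiliar with the "clipped surrogate loss" mentioned in the table, here is a minimal sketch of the standard GRPO-style computation: rewards are normalized within a group of rollouts for the same prompt, then fed into a PPO-style clipped objective. It works on sequence-level log-probabilities for brevity and does not reproduce whatever stabilizing changes oxRL's SGRPO makes; the function names and the `clip_eps` default are assumptions.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss over a group (sequence-level log-probs)."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the pessimistic (minimum) objective, negated to give a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

Clipping caps how much a single update can gain from moving the policy away from the one that generated the rollouts, which is the main source of stability in this family of methods.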

Project Structure

oxRL/
├── main_rl.py              RL training loop (Ray + DeepSpeed)
├── swarm/                  Autonomous model onboarding (Scout, Bugfixer)
├── preprocessing/          Reasoning (OpenR1), Multimodal (Vision/Audio) preprocessors
├── rollouts/               vLLM inference with structured prompt support
└── rewards/                Verifiable reasoning and coding rewards

Design Principles

Debuggability over Pipelining. oxRL avoids complex async pipelining to ensure that failure states are 100% reproducible and logs are clear.

LoRA-first for 7B+. We default to LoRA for larger models to enable high-quality research on consumer-grade and restricted high-end hardware.

Verification-driven RL. We prioritize datasets where the reward is verifiable (Math, Code, Format) to drive logical discovery.

Contributing

Contributions are welcome. Please follow the existing architectural patterns and style.

FAQ

Check out the FAQ for details on LoRA merging and Multimodal input formatting.
