A lightweight post-training framework for LLMs and VLMs

Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

oxRL

An agent-friendly framework for any post-training

A lightweight post-training framework for LLMs and VLMs, built to maximize developer speed. It scales to billions of parameters with DeepSpeed, vLLM, and Ray.

Usage

Post-train any model in under 10 lines of code. oxRL auto-detects your hardware, auto-prepares datasets, and scales to multi-GPU automatically.

from oxrl import Trainer

# 1. Initialize with any HuggingFace model
trainer = Trainer(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

# 2. Start reasoning post-training (Open-R1 recipe)
trainer.train(task="reasoning")

Supported Models

The following models have been verified and onboarded using our automated pipeline. You can find ready-to-use scripts in the examples/recipes/ directory.

Model	Size	Task	Strategy	Status
Qwen3-0.6B	0.6B	Instruct	Full-tuning	✅ Verified
Qwen3.5-35B-A3B	35.0B (3B active)	Reasoning	LoRA	✅ Verified
DeepSeek-R1-Distill-Llama-8B	8.0B	Reasoning	LoRA	✅ Verified
DeepSeek-R1-Distill-Qwen-7B	7.0B	Reasoning	LoRA	✅ Verified
Qwen2.5-Coder-7B-Instruct	7.6B	Coding	LoRA	✅ Verified
Qwen2-Audio-7B-Instruct	7.0B	Audio	LoRA	✅ Verified
Qwen2-VL-7B-Instruct	7.0B	Vision	LoRA	✅ Verified
Gemma-3-1b-it	1.0B	Multimodal	Full-tuning	✅ Verified
Mistral-7B-Instruct-v0.3	7.0B	Instruct	LoRA	✅ Verified
Qwen2.5-7B-Instruct	7.0B	Math	LoRA	✅ Verified
SmolLM2-1.7B-Instruct	1.7B	Instruct	Full-tuning	✅ Verified

System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                          oxRL Framework                          │
├────────────────────────────────┬─────────────────────────────────┤
│     RL Path (main_rl.py)       │     SL Path (main_sl.py)        │
│  SGRPO / CISPO / PPO           │  SFT / DPO / ORPO / KTO         │
│  RLHF / RLAIF                  │  CPT / KD / RM / RFT            │
│  Ray actors + vLLM rollouts    │  OnlineDPO / SPIN / IPO / SimPO │
│                                │  DeepSpeed distributed training │
├────────────────────────────────┴─────────────────────────────────┤
│  oxrl/algs/        Algorithms   │  oxrl/rollouts/   vLLM + Replay│
│  oxrl/configs/     Pydantic cfg │  oxrl/rewards/    Verifiable   │
│  oxrl/datasets/    HF loaders   │  oxrl/utils/      Setup + Logs │
└──────────────────────────────────────────────────────────────────┘

RL Training Workflow

Scout Agent: Discovers model metadata and ensures chat_template compatibility.
Multimodal Pipeline: Converts base64 images/audio into PIL/NumPy for vLLM rollouts.
LoRA Lifecycle: Train with adapters, save with gathered ZeRO-3 weights, and auto-strip PEFT prefixes for immediate vLLM compatibility.
Verifiable Rewards: Programmatic verification of CoT tags and mathematical correctness.
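
The prefix stripping mentioned under LoRA Lifecycle can be sketched roughly as follows. This is a minimal illustration, not oxRL's actual code: the function name `strip_peft_prefixes` is hypothetical, and it assumes the usual PEFT naming scheme (a `base_model.model.` wrapper prefix, `.base_layer.` inside wrapped modules, and `lora_` in adapter tensor names). It also assumes the adapter deltas were already merged into the base weights before the adapter tensors are dropped.

```python
# Assumed PEFT wrapper prefix; adjust if your checkpoint uses a different scheme.
PEFT_PREFIX = "base_model.model."


def strip_peft_prefixes(state_dict):
    """Return a copy of state_dict with PEFT wrapper naming removed.

    Hypothetical helper: assumes LoRA deltas are already merged into the
    base weights, so adapter-only tensors can be dropped safely.
    """
    cleaned = {}
    for key, value in state_dict.items():
        if "lora_" in key:
            # Adapter-only tensor (lora_A / lora_B); not needed after merging.
            continue
        if key.startswith(PEFT_PREFIX):
            key = key[len(PEFT_PREFIX):]
        # Wrapped linear layers keep the original weight under ".base_layer.".
        key = key.replace(".base_layer.", ".")
        cleaned[key] = value
    return cleaned
```

The resulting keys match the plain HuggingFace layout (for example `model.layers.0.self_attn.q_proj.weight`), which is the layout an inference engine would expect when loading the checkpoint directly.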

Getting Started

Installation

# From source (recommended for development)
git clone https://github.com/warlockee/oxRL.git
cd oxRL
pip install -e .

# Or from PyPI
pip install oxrl

Run Tests

pip install pytest
pytest tests/test_bugs.py -v

Environment Diagnostics

Before starting a long training run, verify your environment (GPUs, CUDA Toolkit, DeepSpeed, Ray) with our diagnostic tool:

oxrl doctor
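
If you want a quick check from inside Python before running the CLI, the kind of probing such a tool performs can be approximated with the standard library. This is only a sketch and does not reflect how `oxrl doctor` is implemented; the function name `check_environment` and the default package and tool lists are assumptions.

```python
import importlib.util
import shutil


def check_environment(
    packages=("torch", "deepspeed", "ray", "vllm"),
    tools=("nvcc", "nvidia-smi"),
):
    """Report which Python packages are importable and which CLI tools are on PATH.

    Hypothetical helper: it checks presence only, not versions or
    CUDA/DeepSpeed compatibility.
    """
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[f"package:{name}"] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[f"tool:{tool}"] = shutil.which(tool) is not None
    return report
```

Calling `check_environment()` returns a dict of booleans keyed by `package:<name>` and `tool:<name>`, which is easy to print or log at the start of a run.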

Configuration

oxRL uses YAML config files. See oxrl/configs/rl_args.yaml (RL) and oxrl/configs/sl_args.yaml (SL) for all available options with documentation. Example configs are in examples/.

Key environment variables:

OXRL_DATA_DIR — Override default data directory (default: ./data)
OXRL_CHECKPOINT_DIR — Override default checkpoint directory (default: ./checkpoints)
HF_TOKEN — HuggingFace token for gated models
GITHUB_TOKEN — For autonomous bug reporting (optional)
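
The defaults listed above can be mirrored in your own scripts when you need to locate data or checkpoints. The sketch below only illustrates the override-with-default pattern; `resolve_settings` is a hypothetical name, and oxRL's internal lookup may differ.

```python
import os
from pathlib import Path


def resolve_settings(env=None):
    """Resolve directories and token availability from environment variables.

    Hypothetical helper: uses the variable names and defaults documented
    above, and accepts a plain dict in place of os.environ for testing.
    """
    env = os.environ if env is None else env
    return {
        "data_dir": Path(env.get("OXRL_DATA_DIR", "./data")),
        "checkpoint_dir": Path(env.get("OXRL_CHECKPOINT_DIR", "./checkpoints")),
        # Only record whether tokens are present; never log their values.
        "has_hf_token": bool(env.get("HF_TOKEN")),
        "has_github_token": bool(env.get("GITHUB_TOKEN")),
    }
```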

Post-train a Reasoning Model

# config.yaml
model:
  name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
lora:
  enabled: true
reward:
  reward_func: "reasoning_reward_func"
data:
  dataset: "openr1_math"

python main_rl.py --config-file config.yaml

Algorithms

Reinforcement Learning (via Ray + vLLM rollouts)

Algorithm	File	Description
SGRPO	`oxrl/algs/grpo.py`	Stable GRPO — Clipped surrogate loss with LoRA support and reference-free variants.
CISPO	`oxrl/algs/grpo.py`	Clipped importance-sampling policy optimization.
PPO	`oxrl/algs/ppo.py`	Proximal Policy Optimization with GAE, value clipping, and shared-backbone critic.
RLHF	`oxrl/algs/grpo.py`	Reinforcement Learning from Human Feedback — GRPO with a trained reward model.
RLAIF	`oxrl/algs/grpo.py`	Reinforcement Learning from AI Feedback — GRPO with AI-generated rewards.
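
For readers new to the GRPO family, the two core ideas (group-relative advantages and a clipped surrogate loss) can be sketched in plain Python. This is a generic textbook-style illustration, not the code in `oxrl/algs/grpo.py`: it works on one scalar log-probability per sampled response instead of per-token tensors, and the function names and the `clip_eps=0.2` default are assumptions.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Average PPO-style clipped surrogate loss over a group of responses."""
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) objective, then negate to get a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

Because advantages are computed relative to the group mean, no separate value network is needed, which is the main practical difference from the PPO row in the table.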

Supervised Learning (via DeepSpeed)

Algorithm	File	Description
SFT	`oxrl/algs/sft.py`	Supervised Fine-Tuning — Cross-entropy loss with masking and normalization.
DPO	`oxrl/algs/dpo.py`	Direct Preference Optimization — Pairwise preference learning with a reference model.
ORPO	`oxrl/algs/orpo.py`	Odds Ratio Preference Optimization — Reference-free preference alignment via log-odds.
KTO	`oxrl/algs/kto.py`	Kahneman-Tversky Optimization — Prospect-theory-inspired alignment with moving-average KL baseline.
CPT	`oxrl/algs/cpt.py`	Continued Pre-Training — Full-sequence language modeling on domain-specific text.
KD	`oxrl/algs/kd.py`	Knowledge Distillation — Teacher-student training with combined CE and KL divergence loss.
RM	`oxrl/algs/rm.py`	Reward Model Training — Bradley-Terry pairwise ranking with a learned scalar head.
OnlineDPO	`oxrl/algs/online_dpo.py`	Online DPO — DPO with on-the-fly rejection sampling in the data pipeline.
RFT	`oxrl/algs/rft.py`	Rejection Sampling Fine-Tuning — SFT on reward-filtered responses above a threshold.
SPIN	`oxrl/algs/spin.py`	Self-Play Improvement — DPO where rejected samples are the model's own prior outputs.
IPO	`oxrl/algs/ipo.py`	Identity Preference Optimization — Squared-loss variant of DPO for improved stability.
SimPO	`oxrl/algs/simpo.py`	Simple Preference Optimization — Reference-free, length-normalized preference alignment.
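
Several of the preference methods above share one quantity: the margin between the policy's and the reference model's log-probability gaps on a chosen/rejected pair. The sketch below shows how DPO and IPO differ only in the loss applied to that margin. It follows the standard published formulations on scalar sequence log-probabilities and is not taken from `oxrl/algs/dpo.py` or `oxrl/algs/ipo.py`; the function names and `beta=0.1` default are assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """Policy-vs-reference log-ratio gap between chosen and rejected responses."""
    return (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)


def dpo_loss(margin, beta=0.1):
    """DPO: logistic loss on the scaled margin."""
    return -log_sigmoid(beta * margin)


def ipo_loss(margin, beta=0.1):
    """IPO: squared loss pulling the margin toward 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

The DPO loss keeps decreasing as the margin grows, while the IPO loss has a finite target margin, which is where its stability claim comes from.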

Project Structure

oxRL/
├── oxrl/                   # Core Framework Package
│   ├── trainer.py          # High-level Trainer API
│   ├── rewards/            # Verifiable reasoning and coding rewards (math, code, etc.)
│   ├── algs/               # 17 algorithm implementations (see tables above)
│   ├── swarm/              # Autonomous model onboarding (Scout, Bugfixer)
│   ├── preprocessing/      # Reasoning (OpenR1), Multimodal (Vision/Audio) preprocessors
│   ├── rollouts/           # vLLM inference with structured prompt support
│   └── datasets/           # Dataset loaders and samplers
├── main_rl.py              # RL training loop (Ray + DeepSpeed)
├── main_sl.py              # SL training loop (DeepSpeed) — 12 algorithms
├── examples/               # Ready-to-use recipes and training scripts
└── setup.py                # Packaging and Installation

Design Principles

Debuggability over Pipelining. oxRL avoids complex async pipelining to ensure that failure states are 100% reproducible and logs are clear.

Robust Environment Handling. oxRL is designed to work even in constrained environments. It automatically handles common CUDA/DeepSpeed mismatches by providing actionable warnings instead of fatal crashes.

Autonomous Bug Reporting. On framework failure, oxRL provides structured diagnostic signals for AI agents to automatically generate and submit GitHub issues (requires GITHUB_TOKEN environment variable).

LoRA-first for 7B+. We default to LoRA for larger models to enable high-quality research on consumer-grade and restricted high-end hardware.

Verification-driven RL. We prioritize datasets where the reward is verifiable (Math, Code, Format) to drive logical discovery.
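
To make "verifiable" concrete, here is a minimal sketch of a reward that checks both format and correctness without a learned reward model. It is not the `reasoning_reward_func` shipped with oxRL: the `<think>`/`<answer>` tag layout, the function names, and the `format_weight=0.2` split are all assumptions chosen for illustration.

```python
import re

# Assumed output layout: reasoning inside <think>, final result inside <answer>.
THINK_ANSWER = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(completion):
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion) else 0.0


def correctness_reward(completion, reference):
    """1.0 if the extracted answer matches the reference, else 0.0."""
    match = THINK_ANSWER.match(completion)
    if match is None:
        return 0.0
    predicted = match.group(2).strip()
    try:
        # Compare numerically when both sides parse as numbers.
        return 1.0 if abs(float(predicted) - float(reference)) < 1e-6 else 0.0
    except ValueError:
        # Fall back to exact string comparison for non-numeric answers.
        return 1.0 if predicted == reference.strip() else 0.0


def sketch_reasoning_reward(completion, reference, format_weight=0.2):
    """Weighted sum of the format check and the correctness check."""
    return (
        format_weight * format_reward(completion)
        + (1.0 - format_weight) * correctness_reward(completion, reference)
    )
```

Both checks are deterministic, so the same completion always receives the same reward, which keeps training runs comparable.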

LLM Developer Map

This repository is optimized for LLM-assisted development (Claude/Gemini). If you are asking an AI assistant to work on this framework, point it to these high-signal files:

Bug Reporting: See BUG_REPORTING.md for instructions on autonomous issue submission.
Adding a New Algorithm: See oxrl/algs/base.py (Base Class) and oxrl/algs/grpo.py (Implementation).
Adding a Reward Function: Add to oxrl/rewards/ using the signature in oxrl/rewards/base.py.
Changing Model Loading: See oxrl/utils/setup.py -> load_model_and_ref.
Training Logic: The main loop resides in main_rl.py.
Config Validation: Logic is in oxrl/configs/load.py.

Contributing

Contributions are welcome. Please follow the existing architectural patterns and style.

FAQ

Check out the FAQ for details on LoRA merging and Multimodal input formatting.

Release history

Version	Release date
1.7.2	Mar 4, 2026
1.7.1	Mar 4, 2026
1.7.0	Mar 4, 2026
1.5.0 (this version)	Feb 27, 2026
1.2.0	Feb 25, 2026
0.8.1	Feb 24, 2026

Download files

Source Distribution

oxrl-1.5.0.tar.gz (107.0 kB view details)

Uploaded Feb 27, 2026 Source

Built Distribution

oxrl-1.5.0-py3-none-any.whl (124.8 kB view details)

Uploaded Feb 27, 2026 Python 3

File details

Details for the file oxrl-1.5.0.tar.gz.

File metadata

Download URL: oxrl-1.5.0.tar.gz
Upload date: Feb 27, 2026
Size: 107.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for oxrl-1.5.0.tar.gz
Algorithm	Hash digest
SHA256	`f3bcbd02c2e309aa2192f73bbb7bcef8cfbdda78f024ec2f041a1d600dc26798`
MD5	`0fc57303b1ac2e822565a923a4889d71`
BLAKE2b-256	`42c93b7d1de95ff2ff49bf12c2bdc0447bbd5fc3738f47bb4176d44d9d0cb5cc`

File details

Details for the file oxrl-1.5.0-py3-none-any.whl.

File metadata

Download URL: oxrl-1.5.0-py3-none-any.whl
Upload date: Feb 27, 2026
Size: 124.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for oxrl-1.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c5950d9f78a40c3503820dd3beb785c365521e48526323c14c4041e4e15c4099`
MD5	`9ded17769a280c7174aa403b27251df1`
BLAKE2b-256	`e83164b7ae64c9d935bf270ca419e5c40620a4fdf64eea47f5b3c5ac7127c13b`
