LLM post-training playbook: SFT, GRPO, DPO, eval, and inference
alignrl
From base model to deployed reasoning agent - every LLM post-training technique, implemented and benchmarked.
What is this?
A Python package implementing the complete LLM post-training pipeline: Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO) with verifiable math rewards, and Direct Preference Optimization (DPO). Includes evaluation benchmarks via lm-evaluation-harness and multi-backend inference serving (Unsloth, vLLM, MLX). Built for learning and demonstration, designed to run on free Colab GPUs with QLoRA and Unsloth for memory-efficient training on Qwen2.5-3B.
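For context, the memory-efficient path rests on Unsloth's 4-bit QLoRA loading. A minimal sketch of that pattern is below; the hyperparameters are illustrative, not the package's actual defaults:

```python
# Hedged sketch: load Qwen2.5-3B in 4-bit via Unsloth and attach QLoRA adapters.
# Hyperparameter values here are illustrative, not alignrl's defaults.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit base weights via bitsandbytes
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing
)
```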
Pipeline
```mermaid
graph LR
    A[Qwen2.5-3B<br/>Base Model] --> B[SFT<br/>Instruction Following]
    B --> C[GRPO<br/>Math Reasoning via RL]
    B --> D[DPO<br/>Preference Alignment]
    C --> E[Evaluation<br/>GSM8K, MATH, ARC]
    D --> E
    E --> F[Inference<br/>Unsloth / vLLM / MLX]
```
Quick Start
```bash
# Install
pip install git+https://github.com/sacredvoid/alignrl.git

# Train (SFT as an example)
alignrl train sft -c configs/sft.yaml

# Evaluate
alignrl eval --adapter ./outputs/sft/final --stage sft

# Launch comparison demo
alignrl serve --stages base sft=./outputs/sft/final grpo=./outputs/grpo/final
```
For GPU training, install with the train and unsloth extras:
```bash
pip install "alignrl[train,unsloth] @ git+https://github.com/sacredvoid/alignrl.git"
```
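The CLI wraps the runner classes listed in the Module Reference below. A programmatic equivalent might look roughly like this; beyond the documented `Trainer` protocol (`train()`, `save()`, `load()`), the constructor signature and loader name are assumptions:

```python
# Hypothetical programmatic equivalent of `alignrl train sft -c configs/sft.yaml`.
# The Trainer protocol methods are documented; the constructor signature and
# the from_yaml loader name are assumptions for illustration.
from alignrl.config import BaseTrainConfig
from alignrl.sft import SFTRunner

config = BaseTrainConfig.from_yaml("configs/sft.yaml")  # assumed loader name
runner = SFTRunner(config)                              # assumed constructor
result = runner.train()                                 # returns a TrainResult
runner.save("./outputs/sft/final")
```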
Notebooks
Each notebook is self-contained and runs end-to-end on a free Colab T4 GPU.
Benchmark Results
All evaluations run on Qwen2.5-3B with QLoRA adapters. Best score per benchmark in bold.
| Benchmark | Metric | Base | SFT | GRPO | DPO |
|---|---|---|---|---|---|
| GSM8K | exact_match | 0.31 | 0.45 | **0.62** | 0.43 |
| MATH | exact_match | 0.12 | 0.18 | **0.29** | 0.17 |
| ARC-Challenge | acc_norm | 0.48 | 0.54 | 0.52 | **0.55** |
Key takeaways:
- GRPO dominates math reasoning - GSM8K jumps from 31% to 62% (2x), MATH from 12% to 29% (2.4x)
- DPO edges out on general reasoning - ARC-Challenge best at 55%, suggesting preference alignment improves broad task quality
- SFT is a strong baseline - consistent improvement across all benchmarks before any RL
Module Reference
| Module | Purpose | Key Class |
|---|---|---|
| `alignrl.sft` | Supervised Fine-Tuning with QLoRA | `SFTRunner` |
| `alignrl.grpo` | RL with Verifiable Math Rewards | `GRPORunner` |
| `alignrl.dpo` | Direct Preference Optimization | `DPORunner` |
| `alignrl.eval` | Benchmark evaluation harness | `EvalRunner` |
| `alignrl.inference` | Multi-backend model serving | `ModelServer` |
| `alignrl.rewards` | Math reward verifiers for GRPO | `math_verify_reward` |
| `alignrl.demo` | Gradio comparison UI | `create_demo` |
| `alignrl.cli` | CLI entry point (train, eval, serve) | `main` |
| `alignrl.config` | Pydantic-validated training configs | `BaseTrainConfig` |
| `alignrl.types` | Shared protocols and result types | `Trainer`, `TrainResult`, `EvalResult` |
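For a sense of what "verifiable math rewards" means in the `alignrl.rewards` row above, here is a minimal sketch of a GSM8K-style exact-match verifier. The signature and parsing rules of the actual `math_verify_reward` are assumptions; GRPO frameworks also typically apply such a function across a whole group of sampled completions:

```python
import re


def math_verify_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the completion's final number matches the reference, else 0.0.

    Minimal sketch of a verifiable reward; alignrl's actual parsing rules and
    function signature may differ.
    """
    # Take the last number in the completion as the model's final answer,
    # following the GSM8K convention of ending with a numeric answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return float(numbers[-1] == reference_answer.strip())
```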
Architecture
The codebase follows a few core design decisions (see the sketch after this list):
- Pydantic configs - Every training stage uses a typed config class inheriting from `BaseTrainConfig`, loadable from YAML files. Validation happens at construction time, not at training time.
- Common `Trainer` protocol - `SFTRunner`, `GRPORunner`, and `DPORunner` all implement the `Trainer` protocol (`train()`, `save()`, `load()`), making them interchangeable in pipelines and tests.
- Lazy imports - Heavy dependencies (torch, transformers, unsloth, vllm, mlx-lm) are imported inside methods, not at module level. The base package installs in seconds with just pydantic and pyyaml.
- Unsloth for speed - All training uses Unsloth's `FastLanguageModel` with gradient checkpointing, cutting VRAM usage roughly in half compared to vanilla transformers. Fits Qwen2.5-3B training on a free Colab T4 (16 GB).
- Structured results - Training returns `TrainResult`, evaluation returns `EvalResult`. Both are frozen dataclasses that serialize to JSON for the results dashboard.
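A compact sketch of how the config, protocol, and result pieces might fit together; the field names are illustrative, not alignrl's actual schema:

```python
# Hedged sketch of the config / protocol / result design; field names and
# the from_yaml helper are assumptions, not alignrl's real schema.
from dataclasses import dataclass
from typing import Protocol

import yaml
from pydantic import BaseModel


class BaseTrainConfig(BaseModel):
    """Typed training config; fields here are illustrative."""
    model_name: str = "Qwen/Qwen2.5-3B"
    output_dir: str = "./outputs"

    @classmethod
    def from_yaml(cls, path: str) -> "BaseTrainConfig":
        with open(path) as f:
            data = yaml.safe_load(f)
        # Pydantic validates at construction time, so a bad YAML fails
        # here rather than hours into a training run.
        return cls(**data)


@dataclass(frozen=True)
class TrainResult:
    """Frozen result object that serializes cleanly to JSON."""
    adapter_path: str
    final_loss: float


class Trainer(Protocol):
    """The common protocol implemented by SFTRunner, GRPORunner, and DPORunner."""
    def train(self) -> TrainResult: ...
    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...
```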
Project Structure
```
alignrl/
  configs/         # YAML configs for each training stage
  docs/            # GitHub Pages results dashboard
  notebooks/       # Colab-ready Jupyter notebooks
  results/         # Benchmark JSON (consumed by dashboard)
  src/alignrl/     # Package source
  tests/           # 33 unit tests (pytest)
  pyproject.toml   # Hatchling build, optional dependency groups
```
Tech Stack
| Category | Tools |
|---|---|
| Training | TRL, Unsloth, PEFT, bitsandbytes |
| Evaluation | lm-evaluation-harness |
| Inference | vLLM, MLX-LM, Unsloth |
| Demo | Gradio |
| Config | Pydantic, PyYAML |
| Quality | Ruff, mypy, pytest |
License