
LLM post-training playbook: SFT, GRPO, DPO, eval, and inference

Project description

alignrl

Python 3.10+ License: MIT Tests

From base model to deployed reasoning agent: the core LLM post-training techniques, implemented and benchmarked.

What is this?

A Python package implementing a complete LLM post-training pipeline: Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO) with verifiable math rewards, and Direct Preference Optimization (DPO). It includes benchmark evaluation via lm-evaluation-harness and multi-backend inference serving (Unsloth, vLLM, MLX). Built for learning and demonstration, it runs on free Colab GPUs, using QLoRA and Unsloth for memory-efficient training of Qwen2.5-3B.

Pipeline

graph LR
    A[Qwen2.5-3B<br/>Base Model] --> B[SFT<br/>Instruction Following]
    B --> C[GRPO<br/>Math Reasoning via RL]
    B --> D[DPO<br/>Preference Alignment]
    C --> E[Evaluation<br/>GSM8K, MATH, ARC]
    D --> E
    E --> F[Inference<br/>Unsloth / vLLM / MLX]

Quick Start

# Install
pip install git+https://github.com/sacredvoid/alignrl.git

# Train (SFT as an example)
alignrl train sft -c configs/sft.yaml

# Evaluate
alignrl eval --adapter ./outputs/sft/final --stage sft

# Launch comparison demo
alignrl serve --stages base sft=./outputs/sft/final grpo=./outputs/grpo/final

For GPU training, install with the train and unsloth extras:

pip install "alignrl[train,unsloth] @ git+https://github.com/sacredvoid/alignrl.git"
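The training commands read a YAML file that is validated by the package's Pydantic config classes. As an illustration only, a minimal SFT config might look like the sketch below; the field names here are assumptions for the example, not the actual schema of configs/sft.yaml:

```yaml
# Hypothetical SFT config sketch; real field names are defined by
# the package's BaseTrainConfig subclasses.
model_name: Qwen/Qwen2.5-3B
max_seq_length: 2048
lora_r: 16
lora_alpha: 32
learning_rate: 2.0e-4
num_train_epochs: 1
output_dir: ./outputs/sft
```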

Notebooks

Each notebook is self-contained and runs end-to-end on a free Colab T4 GPU.

| # | Notebook | Technique | Colab |
|---|---|---|---|
| 01 | SFT on OpenHermes-2.5 | Supervised Fine-Tuning with QLoRA | Open in Colab |
| 02 | GRPO on GSM8K | RL with Verifiable Math Rewards | Open in Colab |
| 03 | DPO on UltraFeedback | Direct Preference Optimization | Open in Colab |
| 04 | Benchmark Evaluation | lm-evaluation-harness across stages | Open in Colab |
| 05 | Inference Comparison | Side-by-side Gradio demo | Open in Colab |

Benchmark Results

All evaluations run on Qwen2.5-3B with QLoRA adapters. Best score per benchmark in bold.

| Benchmark | Metric | Base | SFT | GRPO | DPO |
|---|---|---|---|---|---|
| GSM8K | exact_match | 0.31 | 0.45 | **0.62** | 0.43 |
| MATH | exact_match | 0.12 | 0.18 | **0.29** | 0.17 |
| ARC-Challenge | acc_norm | 0.48 | 0.54 | 0.52 | **0.55** |

Key takeaways:

  • GRPO dominates math reasoning - GSM8K jumps from 31% to 62% (2x), MATH from 12% to 29% (2.4x)
  • DPO edges out on general reasoning - ARC-Challenge best at 55%, suggesting preference alignment improves broad task quality
  • SFT is a strong baseline - consistent improvement across all benchmarks before any RL

Module Reference

| Module | Purpose | Key Class |
|---|---|---|
| alignrl.sft | Supervised Fine-Tuning with QLoRA | SFTRunner |
| alignrl.grpo | RL with Verifiable Math Rewards | GRPORunner |
| alignrl.dpo | Direct Preference Optimization | DPORunner |
| alignrl.eval | Benchmark evaluation harness | EvalRunner |
| alignrl.inference | Multi-backend model serving | ModelServer |
| alignrl.rewards | Math reward verifiers for GRPO | math_verify_reward |
| alignrl.demo | Gradio comparison UI | create_demo |
| alignrl.cli | CLI entry point (train, eval, serve) | main |
| alignrl.config | Pydantic-validated training configs | BaseTrainConfig |
| alignrl.types | Shared protocols and result types | Trainer, TrainResult, EvalResult |
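To illustrate the verifiable-reward idea behind alignrl.rewards, here is a minimal self-contained sketch. It is not the package's actual math_verify_reward (whose signature may differ): it simply extracts the last number in a completion and returns 1.0 on an exact match with the reference answer, so the reward is checkable rather than produced by a learned reward model.

```python
import re


def math_verify_reward(completion: str, answer: str) -> float:
    """Return 1.0 if the last number in `completion` equals `answer`, else 0.0.

    Illustrative sketch of a verifiable math reward for GRPO-style RL:
    the score comes from checking the model's final answer directly.
    """
    # Strip thousands separators, then grab every integer or decimal.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return float(float(numbers[-1]) == float(answer))
    except ValueError:
        return 0.0
```

In a GRPO loop, a function like this would score each sampled completion in a group, and the group-relative advantages would be computed from those scores.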

Architecture

The codebase follows a few core design decisions:

  • Pydantic configs - Every training stage uses a typed config class inheriting from BaseTrainConfig, loadable from YAML files. Validation happens at construction time, not at training time.
  • Common Trainer protocol - SFTRunner, GRPORunner, and DPORunner all implement the Trainer protocol (train(), save(), load()), making them interchangeable in pipelines and tests.
  • Lazy imports - Heavy dependencies (torch, transformers, unsloth, vllm, mlx-lm) are imported inside methods, not at module level. The base package installs in seconds with just pydantic and pyyaml.
  • Unsloth for speed - All training uses Unsloth's FastLanguageModel with gradient checkpointing, cutting VRAM usage roughly in half compared to vanilla transformers. Fits Qwen2.5-3B training on a free Colab T4 (16GB).
  • Structured results - Training returns TrainResult, evaluation returns EvalResult. Both are frozen dataclasses that serialize to JSON for the results dashboard.
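The protocol-and-dataclass pattern above can be sketched in a few lines. The Trainer, TrainResult, and SFTRunner names come from the module reference; the field names (final_loss, adapter_path, output_dir) and return values are hypothetical stand-ins, not the package's real API:

```python
import json
from dataclasses import asdict, dataclass
from typing import Protocol


@dataclass(frozen=True)
class TrainResult:
    """Frozen result record that serializes to JSON (hypothetical fields)."""
    stage: str
    final_loss: float
    adapter_path: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class Trainer(Protocol):
    """Common interface implemented by every stage runner."""
    def train(self) -> TrainResult: ...
    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...


class SFTRunner:
    """Toy stand-in for the SFT stage runner."""
    def __init__(self, output_dir: str) -> None:
        self.output_dir = output_dir

    def train(self) -> TrainResult:
        # Lazy-import pattern: heavy deps (torch, transformers, unsloth)
        # would be imported here, inside the method, so the base package
        # stays lightweight to import.
        return TrainResult(stage="sft", final_loss=0.42,
                           adapter_path=self.output_dir)

    def save(self, path: str) -> None:
        pass  # real runners persist adapter weights here

    def load(self, path: str) -> None:
        pass  # real runners restore adapter weights here


def run_stage(trainer: Trainer) -> str:
    """Any Trainer is interchangeable here; pipelines depend only on the protocol."""
    return trainer.train().to_json()
```

Because the runners satisfy a structural Protocol rather than inherit from a base class, tests can swap in lightweight fakes without importing any training dependencies.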

Project Structure

alignrl/
  configs/          # YAML configs for each training stage
  docs/             # GitHub Pages results dashboard
  notebooks/        # Colab-ready Jupyter notebooks
  results/          # Benchmark JSON (consumed by dashboard)
  src/alignrl/      # Package source
  tests/            # 33 unit tests (pytest)
  pyproject.toml    # Hatchling build, optional dependency groups

Tech Stack

| Category | Tools |
|---|---|
| Training | TRL, Unsloth, PEFT, bitsandbytes |
| Evaluation | lm-evaluation-harness |
| Inference | vLLM, MLX-LM, Unsloth |
| Demo | Gradio |
| Config | Pydantic, PyYAML |
| Quality | Ruff, mypy, pytest |

License

MIT

Download files

Download the file for your platform.

Source Distribution

alignrl-0.1.0.tar.gz (35.7 kB)

Uploaded Source

Built Distribution


alignrl-0.1.0-py3-none-any.whl (17.5 kB)

Uploaded Python 3

File details

Details for the file alignrl-0.1.0.tar.gz.

File metadata

  • Download URL: alignrl-0.1.0.tar.gz
  • Upload date:
  • Size: 35.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for alignrl-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c49b5c44f11da830fcd28e331802759cbef8ac0a08389a23d1ff52c587889015
MD5 0970bebc325098807775a638d909998a
BLAKE2b-256 88d4efe948c45c7a3cf5a2482f2b17981232ecdc487102e2874922b32330d9bf


File details

Details for the file alignrl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: alignrl-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 17.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for alignrl-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 00cafc528c4612185e0d534f2f9aeea068bf7f631871ea651618f936286a8904
MD5 f14000f408453d8e5555ed695b008184
BLAKE2b-256 165845c45485503ba6391c0b5f54cfdec3e108b9bf3703e241a4421bc0fe340d

