
LLM post-training playbook: SFT, GRPO, DPO, eval, and inference


alignrl

Python 3.10+ License: MIT Tests

From base model to deployed reasoning agent - every LLM post-training technique, implemented and benchmarked.

What is this?

A Python package implementing the complete LLM post-training pipeline: Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO) with verifiable math rewards, and Direct Preference Optimization (DPO). It includes benchmark evaluation via lm-evaluation-harness and multi-backend inference serving (Unsloth, vLLM, MLX). Built for learning and demonstration, it runs on free Colab GPUs, using QLoRA and Unsloth for memory-efficient training of Qwen2.5-3B.
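
GRPO's defining step fits in a few lines: sample a group of completions per prompt, score each with a verifiable reward, then baseline every reward against the group's own mean and standard deviation instead of a learned value model. A minimal, self-contained sketch of that normalization (an illustration of the idea, not alignrl's internal code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each completion's reward against its own group's statistics."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt: two solved the problem, two did not.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions get a positive advantage, incorrect ones negative.
```

With a binary math reward, this pushes probability mass toward the completions that solved the problem and away from those that did not, with no critic network to train.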

Pipeline

graph LR
    A[Qwen2.5-3B<br/>Base Model] --> B[SFT<br/>Instruction Following]
    B --> C[GRPO<br/>Math Reasoning via RL]
    B --> D[DPO<br/>Preference Alignment]
    C --> E[Evaluation<br/>GSM8K, MATH, ARC]
    D --> E
    E --> F[Inference<br/>Unsloth / vLLM / MLX]

Quick Start

# Install
pip install git+https://github.com/sacredvoid/alignrl.git

# Train (SFT as an example)
alignrl train sft -c configs/sft.yaml

# Evaluate
alignrl eval --adapter ./outputs/sft/final --stage sft

# Launch comparison demo
alignrl serve --stages base sft=./outputs/sft/final grpo=./outputs/grpo/final

For GPU training, install with the train and unsloth extras:

pip install "alignrl[train,unsloth] @ git+https://github.com/sacredvoid/alignrl.git"
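
The `-c configs/sft.yaml` flag used above points at a YAML file that is parsed into a Pydantic config at construction time. The real field names are defined by the config classes in the repo's configs/ directory; a purely hypothetical fragment, just to show the shape:

```yaml
# Hypothetical field names -- see configs/sft.yaml in the repo for the real schema.
model_name: Qwen/Qwen2.5-3B
dataset: teknium/OpenHermes-2.5
max_seq_length: 2048
lora_rank: 16
learning_rate: 2.0e-4
output_dir: ./outputs/sft
```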

Notebooks

Each notebook is self-contained and runs end-to-end on a free Colab T4 GPU.

| #  | Notebook              | Technique                           | Colab         |
|----|-----------------------|-------------------------------------|---------------|
| 01 | SFT on OpenHermes-2.5 | Supervised Fine-Tuning with QLoRA   | Open in Colab |
| 02 | GRPO on GSM8K         | RL with Verifiable Math Rewards     | Open in Colab |
| 03 | DPO on UltraFeedback  | Direct Preference Optimization      | Open in Colab |
| 04 | Benchmark Evaluation  | lm-evaluation-harness across stages | Open in Colab |
| 05 | Inference Comparison  | Side-by-side Gradio demo            | Open in Colab |

Benchmark Results

All evaluations run on Qwen2.5-3B with QLoRA adapters. Best score per benchmark in bold.

| Benchmark     | Metric      | Base | SFT  | GRPO     | DPO      |
|---------------|-------------|------|------|----------|----------|
| GSM8K         | exact_match | 0.31 | 0.45 | **0.62** | 0.43     |
| MATH          | exact_match | 0.12 | 0.18 | **0.29** | 0.17     |
| ARC-Challenge | acc_norm    | 0.48 | 0.54 | 0.52     | **0.55** |

Key takeaways:

  • GRPO dominates math reasoning - GSM8K jumps from 31% to 62% (2x), MATH from 12% to 29% (2.4x)
  • DPO edges out on general reasoning - ARC-Challenge best at 55%, suggesting preference alignment improves broad task quality
  • SFT is a strong baseline - consistent improvement across all benchmarks before any RL
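
The preference-alignment result above comes from DPO's pairwise objective: given summed log-probabilities of a chosen and a rejected response under the policy and a frozen reference model, the loss is -log σ(β·[(log π_chosen − log π_ref_chosen) − (log π_rejected − log π_ref_rejected)]). A self-contained numerical sketch of that formula (not the TRL implementation the package trains with):

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each full response.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): small when the policy already prefers
    # the chosen response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen response relative to the reference,
# so the loss falls below the log(2) it starts at when policy == reference.
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-20.0,
                ref_chosen=-15.0, ref_rejected=-15.0)
```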

Module Reference

| Module              | Purpose                              | Key Class                          |
|---------------------|--------------------------------------|------------------------------------|
| `alignrl.sft`       | Supervised Fine-Tuning with QLoRA    | `SFTRunner`                        |
| `alignrl.grpo`      | RL with Verifiable Math Rewards      | `GRPORunner`                       |
| `alignrl.dpo`       | Direct Preference Optimization       | `DPORunner`                        |
| `alignrl.eval`      | Benchmark evaluation harness         | `EvalRunner`                       |
| `alignrl.inference` | Multi-backend model serving          | `ModelServer`                      |
| `alignrl.rewards`   | Math reward verifiers for GRPO       | `math_verify_reward`               |
| `alignrl.demo`      | Gradio comparison UI                 | `create_demo`                      |
| `alignrl.cli`       | CLI entry point (train, eval, serve) | `main`                             |
| `alignrl.config`    | Pydantic-validated training configs  | `BaseTrainConfig`                  |
| `alignrl.types`     | Shared protocols and result types    | `Trainer`, `TrainResult`, `EvalResult` |
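
A verifiable math reward of the kind `math_verify_reward` provides follows a simple pattern for GSM8K-style data: parse the final `#### <answer>` line and compare it numerically against the gold answer. A sketch of that pattern (not alignrl's actual implementation):

```python
import re

def math_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the final '#### <answer>' matches gold."""
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    if match is None:
        return 0.0  # no parseable final answer
    try:
        predicted = float(match.group(1).replace(",", ""))
        return 1.0 if abs(predicted - float(gold)) < 1e-6 else 0.0
    except ValueError:
        return 0.0

print(math_reward("18 - 3 = 15 eggs sold. 15 * 2 = 30.\n#### 30", "30"))  # 1.0
print(math_reward("The answer is unclear.", "30"))                        # 0.0
```

Because correctness is checked mechanically rather than by a learned reward model, the signal cannot be gamed by fluent-but-wrong reasoning, which is what makes GSM8K and MATH good GRPO targets.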

Architecture

The codebase follows a few core design decisions:

  • Pydantic configs - Every training stage uses a typed config class inheriting from BaseTrainConfig, loadable from YAML files. Validation happens at construction time, not at training time.
  • Common Trainer protocol - SFTRunner, GRPORunner, and DPORunner all implement the Trainer protocol (train(), save(), load()), making them interchangeable in pipelines and tests.
  • Lazy imports - Heavy dependencies (torch, transformers, unsloth, vllm, mlx-lm) are imported inside methods, not at module level. The base package installs in seconds with just pydantic and pyyaml.
  • Unsloth for speed - All training uses Unsloth's FastLanguageModel with gradient checkpointing, cutting VRAM usage roughly in half compared to vanilla transformers. Fits Qwen2.5-3B training on a free Colab T4 (16GB).
  • Structured results - Training returns TrainResult, evaluation returns EvalResult. Both are frozen dataclasses that serialize to JSON for the results dashboard.
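
Two of these decisions, the shared `Trainer` protocol and frozen result dataclasses, can be illustrated with stand-ins. Field names here are hypothetical; the real definitions live in `alignrl.types`:

```python
from dataclasses import dataclass, asdict
from typing import Protocol

@dataclass(frozen=True)
class TrainResult:
    """Frozen result record; asdict() makes it JSON-serializable."""
    stage: str
    final_loss: float
    output_dir: str

class Trainer(Protocol):
    """Structural interface shared by the SFT/GRPO/DPO runners."""
    def train(self) -> TrainResult: ...
    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...

class ToyRunner:
    """Satisfies Trainer by shape alone -- no inheritance required."""
    def __init__(self, stage: str) -> None:
        self.stage = stage
    def train(self) -> TrainResult:
        return TrainResult(stage=self.stage, final_loss=0.42,
                           output_dir=f"./outputs/{self.stage}")
    def save(self, path: str) -> None:
        pass
    def load(self, path: str) -> None:
        pass

runner: Trainer = ToyRunner("sft")  # interchangeable in pipelines and tests
result = runner.train()
```

Because `Protocol` uses structural typing, any object with matching method signatures type-checks as a `Trainer`, which is what lets the three runners swap freely in pipelines and tests.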

Project Structure

alignrl/
  configs/          # YAML configs for each training stage
  docs/             # GitHub Pages results dashboard
  notebooks/        # Colab-ready Jupyter notebooks
  results/          # Benchmark JSON (consumed by dashboard)
  src/alignrl/      # Package source
  tests/            # 49 unit tests (pytest)
  pyproject.toml    # Hatchling build, optional dependency groups

Tech Stack

| Category   | Tools                            |
|------------|----------------------------------|
| Training   | TRL, Unsloth, PEFT, bitsandbytes |
| Evaluation | lm-evaluation-harness            |
| Inference  | vLLM, MLX-LM, Unsloth            |
| Demo       | Gradio                           |
| Config     | Pydantic, PyYAML                 |
| Quality    | Ruff, mypy, pytest               |

License

MIT
