
LLM post-training playbook: SFT, GRPO, DPO, eval, and inference

Project description

alignrl

Python 3.10+ License: MIT Tests

From base model to deployed reasoning agent: the core LLM post-training techniques, implemented and benchmarked.

What is this?

A Python package implementing a complete LLM post-training pipeline: Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO) with verifiable math rewards, and Direct Preference Optimization (DPO). It includes benchmark evaluation via lm-evaluation-harness and multi-backend inference serving (Unsloth, vLLM, MLX). Built for learning and demonstration, it runs on free Colab GPUs, using QLoRA and Unsloth for memory-efficient training of Qwen2.5-3B.

Pipeline

graph LR
    A[Qwen2.5-3B<br/>Base Model] --> B[SFT<br/>Instruction Following]
    B --> C[GRPO<br/>Math Reasoning via RL]
    B --> D[DPO<br/>Preference Alignment]
    C --> E[Evaluation<br/>GSM8K, MATH, ARC]
    D --> E
    E --> F[Inference<br/>Unsloth / vLLM / MLX]

Quick Start

# Install
pip install git+https://github.com/sacredvoid/alignrl.git

# Train (SFT as an example)
alignrl train sft -c configs/sft.yaml

# Evaluate
alignrl eval --adapter ./outputs/sft/final --stage sft

# Launch comparison demo
alignrl serve --stages base sft=./outputs/sft/final grpo=./outputs/grpo/final

For GPU training, install with the train and unsloth extras:

pip install "alignrl[train,unsloth] @ git+https://github.com/sacredvoid/alignrl.git"
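The training commands read a YAML file that is validated by the package's Pydantic config classes. As an illustration only, a minimal SFT config might look like the sketch below; the field names here are assumptions for the example, not the actual schema of configs/sft.yaml:

```yaml
# Hypothetical SFT config sketch; real field names are defined by
# the package's BaseTrainConfig subclasses.
model_name: Qwen/Qwen2.5-3B
max_seq_length: 2048
lora_r: 16
lora_alpha: 32
learning_rate: 2.0e-4
num_train_epochs: 1
output_dir: ./outputs/sft
```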

Notebooks

Each notebook is self-contained and runs end-to-end on a free Colab T4 GPU.

| # | Notebook | Technique | Colab |
|---|---|---|---|
| 01 | SFT on OpenHermes-2.5 | Supervised Fine-Tuning with QLoRA | Open in Colab |
| 02 | GRPO on GSM8K | RL with Verifiable Math Rewards | Open in Colab |
| 03 | DPO on UltraFeedback | Direct Preference Optimization | Open in Colab |
| 04 | Benchmark Evaluation | lm-evaluation-harness across stages | Open in Colab |
| 05 | Inference Comparison | Side-by-side Gradio demo | Open in Colab |

Benchmark Results

All evaluations run on Qwen2.5-3B with QLoRA adapters. Best score per benchmark in bold.

| Benchmark | Metric | Base | SFT | GRPO | DPO |
|---|---|---|---|---|---|
| GSM8K | exact_match | 0.31 | 0.45 | **0.62** | 0.43 |
| MATH | exact_match | 0.12 | 0.18 | **0.29** | 0.17 |
| ARC-Challenge | acc_norm | 0.48 | 0.54 | 0.52 | **0.55** |

Key takeaways:

  • GRPO dominates math reasoning - GSM8K jumps from 31% to 62% (2x), MATH from 12% to 29% (2.4x)
  • DPO edges out on general reasoning - ARC-Challenge best at 55%, suggesting preference alignment improves broad task quality
  • SFT is a strong baseline - consistent improvement across all benchmarks before any RL

Module Reference

| Module | Purpose | Key Class |
|---|---|---|
| alignrl.sft | Supervised Fine-Tuning with QLoRA | SFTRunner |
| alignrl.grpo | RL with Verifiable Math Rewards | GRPORunner |
| alignrl.dpo | Direct Preference Optimization | DPORunner |
| alignrl.eval | Benchmark evaluation harness | EvalRunner |
| alignrl.inference | Multi-backend model serving | ModelServer |
| alignrl.rewards | Math reward verifiers for GRPO | math_verify_reward |
| alignrl.demo | Gradio comparison UI | create_demo |
| alignrl.cli | CLI entry point (train, eval, serve) | main |
| alignrl.config | Pydantic-validated training configs | BaseTrainConfig |
| alignrl.types | Shared protocols and result types | Trainer, TrainResult, EvalResult |
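To illustrate the verifiable-reward idea behind alignrl.rewards, here is a minimal self-contained sketch. It is not the package's actual math_verify_reward (whose signature may differ): it simply extracts the last number in a completion and returns 1.0 on an exact match with the reference answer, so the reward is checkable rather than produced by a learned reward model.

```python
import re


def math_verify_reward(completion: str, answer: str) -> float:
    """Return 1.0 if the last number in `completion` equals `answer`, else 0.0.

    Illustrative sketch of a verifiable math reward for GRPO-style RL:
    the score comes from checking the model's final answer directly.
    """
    # Strip thousands separators, then grab every integer or decimal.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return float(float(numbers[-1]) == float(answer))
    except ValueError:
        return 0.0
```

In a GRPO loop, a function like this would score each sampled completion in a group, and the group-relative advantages would be computed from those scores.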

Architecture

The codebase follows a few core design decisions:

  • Pydantic configs - Every training stage uses a typed config class inheriting from BaseTrainConfig, loadable from YAML files. Validation happens at construction time, not at training time.
  • Common Trainer protocol - SFTRunner, GRPORunner, and DPORunner all implement the Trainer protocol (train(), save(), load()), making them interchangeable in pipelines and tests.
  • Lazy imports - Heavy dependencies (torch, transformers, unsloth, vllm, mlx-lm) are imported inside methods, not at module level. The base package installs in seconds with just pydantic and pyyaml.
  • Unsloth for speed - All training uses Unsloth's FastLanguageModel with gradient checkpointing, cutting VRAM usage roughly in half compared to vanilla transformers. Fits Qwen2.5-3B training on a free Colab T4 (16GB).
  • Structured results - Training returns TrainResult, evaluation returns EvalResult. Both are frozen dataclasses that serialize to JSON for the results dashboard.
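The protocol-and-dataclass pattern above can be sketched in a few lines. The Trainer, TrainResult, and SFTRunner names come from the module reference; the field names (final_loss, adapter_path, output_dir) and return values are hypothetical stand-ins, not the package's real API:

```python
import json
from dataclasses import asdict, dataclass
from typing import Protocol


@dataclass(frozen=True)
class TrainResult:
    """Frozen result record that serializes to JSON (hypothetical fields)."""
    stage: str
    final_loss: float
    adapter_path: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class Trainer(Protocol):
    """Common interface implemented by every stage runner."""
    def train(self) -> TrainResult: ...
    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...


class SFTRunner:
    """Toy stand-in for the SFT stage runner."""
    def __init__(self, output_dir: str) -> None:
        self.output_dir = output_dir

    def train(self) -> TrainResult:
        # Lazy-import pattern: heavy deps (torch, transformers, unsloth)
        # would be imported here, inside the method, so the base package
        # stays lightweight to import.
        return TrainResult(stage="sft", final_loss=0.42,
                           adapter_path=self.output_dir)

    def save(self, path: str) -> None:
        pass  # real runners persist adapter weights here

    def load(self, path: str) -> None:
        pass  # real runners restore adapter weights here


def run_stage(trainer: Trainer) -> str:
    """Any Trainer is interchangeable here; pipelines depend only on the protocol."""
    return trainer.train().to_json()
```

Because the runners satisfy a structural Protocol rather than inherit from a base class, tests can swap in lightweight fakes without importing any training dependencies.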

Project Structure

alignrl/
  configs/          # YAML configs for each training stage
  docs/             # GitHub Pages results dashboard
  notebooks/        # Colab-ready Jupyter notebooks
  results/          # Benchmark JSON (consumed by dashboard)
  src/alignrl/      # Package source
  tests/            # 33 unit tests (pytest)
  pyproject.toml    # Hatchling build, optional dependency groups

Tech Stack

| Category | Tools |
|---|---|
| Training | TRL, Unsloth, PEFT, bitsandbytes |
| Evaluation | lm-evaluation-harness |
| Inference | vLLM, MLX-LM, Unsloth |
| Demo | Gradio |
| Config | Pydantic, PyYAML |
| Quality | Ruff, mypy, pytest |

License

MIT

Download files

Download the file for your platform.

Source Distribution

alignrl-0.1.0.tar.gz (35.7 kB)

Uploaded Source

Built Distribution


alignrl-0.1.0-py3-none-any.whl (17.5 kB)

Uploaded Python 3

File details

Details for the file alignrl-0.1.0.tar.gz.

File metadata

  • Download URL: alignrl-0.1.0.tar.gz
  • Upload date:
  • Size: 35.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for alignrl-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c49b5c44f11da830fcd28e331802759cbef8ac0a08389a23d1ff52c587889015
MD5 0970bebc325098807775a638d909998a
BLAKE2b-256 88d4efe948c45c7a3cf5a2482f2b17981232ecdc487102e2874922b32330d9bf


File details

Details for the file alignrl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: alignrl-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 17.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for alignrl-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 00cafc528c4612185e0d534f2f9aeea068bf7f631871ea651618f936286a8904
MD5 f14000f408453d8e5555ed695b008184
BLAKE2b-256 165845c45485503ba6391c0b5f54cfdec3e108b9bf3703e241a4421bc0fe340d

