Train LLMs on Apple silicon with MLX and the Hugging Face Hub

MLX-LM-LORA

With MLX-LM-LoRA you can train Large Language Models locally on Apple Silicon using MLX. Training works with all models supported by MLX-LM, including:

Llama
Mistral
Qwen
Gemma
OLMo, OLMoE
MiniCPM, MiniCPM3
and more...

Supported Training Methods

Training Types:

LoRA: Low-Rank Adaptation for efficient fine-tuning
DoRA: Weight-Decomposed Low-Rank Adaptation
Full-precision: Train all model parameters
Quantized training: QLoRA with 4-bit, 6-bit, or 8-bit quantization
Quantization Aware Training (QAT): Apply quantization projection during training for SFT, DPO, and ORPO

Training Algorithms:

SFT: Supervised Fine-Tuning
DPO: Direct Preference Optimization
FTPO / Antidoom: Final-token preference optimization for repairing repetition loops
CPO: Contrastive Preference Optimization
ORPO: Odds Ratio Preference Optimization
GRPO: Group Relative Policy Optimization
GSPO: Group Sequence Policy Optimization
Dr. GRPO: Dr. Group Relative Policy Optimization
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
Online DPO: Online Direct Preference Optimization
XPO: Extended Preference Optimization
RLHF Reinforce KL: Reinforced Reinforcement Learning from Human Feedback (with KL regularization)
PPO: Proximal Policy Optimization

New Features

Quantization Aware Training (QAT):

Enable QAT for SFT, DPO, and ORPO with minimal post-update quantization projection.
Supports 4-16 bit, group or per-tensor, and configurable start/interval.
Use QAT to simulate quantization effects during training for better quantized model performance.

Training Your Custom Preference Model:

You can now train a custom preference model for online preference training

📓 Example Notebooks

📦 All example notebooks live in a separate, dedicated repository:

👉 Goekdeniz-Guelmez/mlx-lm-lora-example-notebooks 👈

🔗 Direct link: https://github.com/Goekdeniz-Guelmez/mlx-lm-lora-example-notebooks

Head over to the examples repository for every notebook, YAML config, and walkthrough, including:

🧪 Fine-Tuning (Simple) — LoRA on a standard SFT dataset

🧠 Fine-Tuning (Detailed) — Full model weights for supervised fine-tuning

⚖️ ORPO Training — Monolithic preference optimization

📈 DPO Training — Direct preference optimization

👥 GRPO Training — Group-based reinforcement training

📄 YAML configuration — Example config file

…and more being added over time!

⭐ Star the examples repo to bookmark it: https://github.com/Goekdeniz-Guelmez/mlx-lm-lora-example-notebooks

Install
Quick Start
Training Methods
Other Features
- Examples Repository (moved)
- Training Your Custom Preference Model
Configuration
Dataset Formats
Memory Optimization
Evaluation & Generation
Performance Comparison

Install

pip install -U mlx-lm-lora

Quick Start

The main command is mlx_lm_lora.train. To see all options:

mlx_lm_lora.train --help

Basic training command:

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--data mlx-community/wikisql \
--iters 600

You can specify a YAML config with -c/--config:

mlx_lm_lora.train --config /path/to/config.yaml

Command-line flags will override corresponding values in the config file.
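
The precedence rule can be pictured as a layered dictionary merge. The following is a minimal sketch, assuming built-in defaults are applied first, then the YAML file, then any flag actually given on the command line. `resolve_settings` and its inputs are hypothetical names used for illustration, not part of MLX-LM-LoRA:

```python
def resolve_settings(defaults, config_file, cli_flags):
    """Merge settings so CLI flags beat the config file, which beats defaults.

    Hypothetical helper for illustration only. A CLI value of None means
    "flag not given on the command line" and therefore overrides nothing.
    """
    merged = dict(defaults)
    merged.update(config_file)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged


settings = resolve_settings(
    defaults={"batch_size": 4, "iters": 1000, "learning_rate": 1e-5},
    config_file={"iters": 500, "learning_rate": 5e-6},
    cli_flags={"iters": 600, "learning_rate": None},
)
print(settings)
```

Here `iters` comes from the command line, `learning_rate` from the config file, and `batch_size` from the defaults.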

Training Methods

Quantization Aware Training (QAT)

QAT projects trainable weights onto a quantized grid after each optimizer update, simulating quantization effects during training. This improves quantized model performance and robustness.

Supported for: SFT, DPO, ORPO

QAT Flags:

--qat-enable Enable QAT projection during training
--qat-bits Bit-width for QAT (default: 8)
--qat-group-size Group size for QAT (default: 64, 0=per-tensor)
--qat-mode QAT mode (default: affine)
--qat-start-step Start QAT after this optimizer step (default: 1)
--qat-interval Apply QAT every N optimizer steps (default: 1)
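
To make the idea of "projecting onto a quantized grid" concrete, here is a small pure-Python sketch of an affine projection over groups of weights. It is an illustration under stated assumptions, not the library's implementation: `project_affine` and `should_project` are hypothetical names, the grid is a plain min/max affine grid, and the schedule assumes projection happens at the start step and every `interval` steps after it.

```python
def project_affine(weights, bits=8, group_size=64):
    """Snap each weight to the nearest point of a per-group affine grid."""
    if not weights:
        return []
    if group_size <= 0:  # 0 is treated as per-tensor: one group for everything
        group_size = len(weights)
    levels = 2 ** bits - 1
    projected = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels
        if scale == 0.0:  # constant group: nothing to quantize
            projected.extend(group)
            continue
        for w in group:
            q = round((w - lo) / scale)       # integer code in [0, levels]
            projected.append(lo + q * scale)  # back to a float on the grid
    return projected


def should_project(step, start_step=1, interval=1):
    """Assumed schedule: project at start_step and every `interval` steps after."""
    return step >= start_step and (step - start_step) % interval == 0


weights = [0.0, 0.1, 0.5, 1.0, -2.0, -1.9, 0.0, 2.0]
snapped = project_affine(weights, bits=2, group_size=4)
print(snapped)
```

With 2 bits each group of four weights can only take four distinct values, so the projected weights differ from the originals by at most half a grid step.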

Example (SFT):

mlx_lm_lora.train \
  --model <model> \
  --train \
  --train-mode sft \
  --data <data> \
  --qat-enable \
  --qat-bits 4 \
  --qat-group-size 64 \
  --qat-start-step 1 \
  --qat-interval 1

Example (DPO):

mlx_lm_lora.train \
  --model <model> \
  --train \
  --train-mode dpo \
  --data <data> \
  --qat-enable \
  --qat-bits 4

Example (ORPO):

mlx_lm_lora.train \
  --model <model> \
  --train \
  --train-mode orpo \
  --data <data> \
  --qat-enable \
  --qat-bits 8 \
  --qat-group-size 32

Supervised Fine-Tuning (SFT)

Standard instruction tuning using prompt-completion pairs.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode sft \
--sft-loss-type dft \
--data mlx-community/hermes-3 \
--batch-size 4 \
--learning-rate 1e-5 \
--iters 1000

Key Parameters:

--train-type: Choose lora (default), dora, or full
--mask-prompt: Apply loss only to assistant responses
--sft-loss-type: SFT loss function - nll (default), memory-bounded chunked_nll, or dynamic fine-tuning loss dft
--max-seq-length: Maximum sequence length (default: 2048)
--gradient-accumulation-steps: Accumulate gradients over multiple steps
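
As an illustration of what `--mask-prompt` means for the loss, the sketch below averages the negative log-likelihood over response tokens only. It is a simplified, framework-free example; `masked_nll` is a hypothetical helper and the library's own masking may differ in detail.

```python
def masked_nll(token_logprobs, prompt_length):
    """Mean negative log-likelihood over completion tokens only.

    token_logprobs[i] is the log-probability the model gave to token i.
    The first `prompt_length` tokens belong to the prompt and are masked out.
    """
    mask = [0.0 if i < prompt_length else 1.0 for i in range(len(token_logprobs))]
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return -sum(lp * m for lp, m in zip(token_logprobs, mask)) / kept


logprobs = [-2.0, -1.0, -0.5, -0.25]          # two prompt tokens, two response tokens
print(masked_nll(logprobs, prompt_length=2))  # response tokens only
print(masked_nll(logprobs, prompt_length=0))  # every token contributes
```

Without masking, the poorly predicted prompt tokens dominate the average; with masking, only the assistant's tokens drive the gradient.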

Dataset Format:

{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"prompt": "Explain quantum computing", "completion": "Quantum computing uses..."}
{"text": "Complete text for language modeling"}

Direct Preference Optimization (DPO)

Train models using preference pairs without a separate reward model.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode dpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid \
--reference-model-path Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1

Key Parameters:

--beta: KL penalty strength (default: 0.1)
--dpo-cpo-loss-type: Loss function - sigmoid, hinge, ipo, or dpop
--delta: Margin for hinge loss (default: 50.0)
--reference-model-path: Reference model path (uses main model if not specified)
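
For intuition, the `sigmoid` loss type can be written in a few lines for a single preference pair. This is a sketch of the standard DPO sigmoid objective, assuming summed sequence log-probabilities as inputs; `dpo_sigmoid_loss` is a hypothetical name and the other loss types (`hinge`, `ipo`, `dpop`) are not covered here.

```python
import math


def dpo_sigmoid_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one pair of summed sequence log-probabilities."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is log(2).
print(dpo_sigmoid_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen answer more than the reference does: the loss drops.
print(dpo_sigmoid_loss(-5.0, -15.0, -10.0, -10.0))
```

A larger `beta` makes the loss react more sharply to the same log-probability margin, which is why it acts as the strength of the implicit KL penalty.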

Dataset Format:

{"prompt": "User question", "chosen": "Good response", "rejected": "Bad response"}
{"system": "You are helpful", "prompt": "Question", "chosen": "Good", "rejected": "Bad"}

Final Token Preference Optimization (FTPO / Antidoom)

Repair reasoning doom loops using preference rows generated by Liquid AI's Antidoom pipeline:

mlx_lm_lora.train \
--model your/model \
--train \
--train-mode ftpo \
--data ./antidoom_data \
--learning-rate 1e-5 \
--lambda-mse-target 0.05 \
--tau-mse-target 1.0 \
--lambda-mse 0.4 \
--clip-epsilon-logits 2.0

Place Antidoom rows in train.jsonl (and optionally valid.jsonl and test.jsonl). Each row must contain context_with_chat_template, rejected_decoded, and multi_chosen_decoded. FTPO updates only the next-token distribution after the supplied context and uses a frozen copy of --model as the reference unless --reference-model-path is provided.

Contrastive Preference Optimization (CPO)

Variant of DPO designed for machine translation and other structured tasks.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode cpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid

Key Parameters: Same as DPO. Uses identical dataset format to DPO.

Odds Ratio Preference Optimization (ORPO)

Monolithic preference optimization without requiring a reference model.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode orpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--reward-scaling 1.0

Key Parameters:

--beta: Temperature for logistic function (default: 0.1)
--reward-scaling: Reward scaling factor (default: 1.0)
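
The odds-ratio idea behind ORPO can be sketched without a reference model. The example below follows the formulation from the ORPO paper, using length-averaged log-probabilities; how `--beta` and `--reward-scaling` enter the library's loss is not shown, so treat the function names and the exact form as assumptions.

```python
import math


def log_odds(mean_logprob):
    """log(p / (1 - p)) for p = exp(mean_logprob); requires mean_logprob < 0."""
    return mean_logprob - math.log1p(-math.exp(mean_logprob))


def orpo_odds_ratio_term(chosen_mean_logprob, rejected_mean_logprob):
    """-log(sigmoid(log-odds(chosen) - log-odds(rejected)))."""
    x = log_odds(chosen_mean_logprob) - log_odds(rejected_mean_logprob)
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The chosen answer has average token probability 0.5, the rejected one 0.2.
print(orpo_odds_ratio_term(math.log(0.5), math.log(0.2)))
```

In the paper this term is added to the ordinary SFT loss on the chosen response, which is what makes the method "monolithic": one model, one pass, no reference copy.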

Dataset Format:

{"prompt": "Question", "chosen": "Good response", "rejected": "Bad response"}
{"prompt": "Question", "chosen": "Good", "rejected": "Bad", "preference_score": 8.0}
{"prompt": "Question", "chosen": {"messages": [...]}, "rejected": {"messages": [...]}}

Group Relative Policy Optimization (GRPO)

Generate multiple responses per prompt and learn from their relative quality.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--data mlx-community/gsm8k \
--group-size 4 \
--epsilon 1e-4 \
--max-completion-length 512 \
--temperature 0.8 \
--reward-functions "accuracy_reward,format_reward" \
--reward-weights "[0.7, 0.3]"

Key Parameters:

--group-size: Number of generations per prompt (default: 4)
--epsilon: Numerical stability constant (default: 1e-4)
--max-completion-length: Max generation length (default: 512)
--temperature: Sampling temperature (default: 0.8)
--reward-functions: Comma-separated reward function names
--reward-functions-file: Path to custom reward functions file
--reward-weights: JSON list of weights for each reward function
--grpo-loss-type: Loss variant - grpo, bnpo, or dr_grpo
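
The "relative quality" part of GRPO can be illustrated as follows: the scores of several reward functions are combined with the given weights, then standardized within the group of completions for one prompt. This is a minimal sketch, assuming a population standard deviation and an epsilon added to the denominator for stability; the helper names are hypothetical.

```python
def combine_rewards(per_function_scores, weights):
    """Weighted sum of reward functions for each completion in a group."""
    return [
        sum(w * s for w, s in zip(weights, scores))
        for scores in zip(*per_function_scores)
    ]


def group_relative_advantages(rewards, epsilon=1e-4):
    """Standardize rewards within one prompt's group: (r - mean) / (std + epsilon)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + epsilon) for r in rewards]


# Four completions for one prompt (--group-size 4), two reward functions.
accuracy = [1.0, 0.0, 0.0, 1.0]
formatting = [1.0, 1.0, 0.0, 0.0]
rewards = combine_rewards([accuracy, formatting], weights=[0.7, 0.3])
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are pushed down. If every completion gets the same reward, all advantages are zero and the group carries no learning signal.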

Dataset Format:

{"prompt": "Math problem", "answer": "42"}
{"prompt": "Question", "answer": "Response", "system": "You are helpful"}
{"prompt": "Question", "answer": "Response", "type": "math"}

Custom Reward Functions: Create a Python file with reward functions:

# my_rewards.py
from mlx_lm_lora.reward_functions import register_reward_function

@register_reward_function()
def my_custom_reward(prompt, completion, reference_answer, **kwargs):
    """Custom reward function"""
    # Your logic here; this placeholder gives every completion zero reward
    score = 0.0
    return score  # float between 0 and 1

Then use: --reward-functions-file ./my_rewards.py --reward-functions "my_custom_reward"
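
For illustration, the body of such a function might look like the sketch below. It is written as a plain function, without the decorator, so it can be tried without the library installed. The scoring rule (compare the last number in the completion with the reference answer) and the name `last_number_reward` are hypothetical, and the per-example signature simply mirrors the template above.

```python
import re


def last_number_reward(prompt, completion, reference_answer, **kwargs):
    """Return 1.0 if the last number in the completion equals the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0
    except ValueError:  # the reference answer is not numeric
        return 0.0


print(last_number_reward("What is 40 + 2?", "40 + 2 = 42", "42"))
print(last_number_reward("What is 40 + 2?", "I think it is 41", "42"))
```

Keeping the score in the range 0 to 1 makes it easier to mix several reward functions with `--reward-weights`.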

Group Sequence Policy Optimization (GSPO)

GSPO extends GRPO with importance sampling at token or sequence level for improved sample efficiency.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--data mlx-community/gsm8k \
--grpo-loss-type grpo \
--importance-sampling-level token \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8

Key Parameters:

--importance-sampling-level: Choose token (default) or sequence
All other GRPO parameters apply
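
The difference between the two importance sampling levels can be shown with plain arithmetic. In this sketch, token level gives one ratio per token, while sequence level gives a single length-normalized ratio per completion, as in the GSPO paper. The function names are hypothetical and the library's internals may differ.

```python
import math


def token_level_ratios(new_logprobs, old_logprobs):
    """One importance ratio per token: exp(new - old)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]


def sequence_level_ratio(new_logprobs, old_logprobs):
    """One ratio per sequence: exp of the mean per-token log-ratio."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


new = [-1.0, -2.0]   # log-probs under the current policy
old = [-1.5, -1.5]   # log-probs under the policy that generated the sample
print(token_level_ratios(new, old))
print(sequence_level_ratio(new, old))
```

The sequence-level ratio is the geometric mean of the token-level ratios, so a single outlier token moves it far less than it moves that token's own ratio.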

Dataset Format: Same as GRPO

Decoupled Reward Group Relative Policy Optimization (Dr. GRPO)

Dr. GRPO decouples the reward computation from the policy optimization for more stable training.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--data mlx-community/gsm8k \
--grpo-loss-type dr_grpo \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8

Key Parameters:

--grpo-loss-type dr_grpo: Enables Dr. GRPO variant
All other GRPO parameters apply

Dataset Format: Same as GRPO

Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

DAPO uses dual epsilon values for more flexible clipping in policy optimization.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--data mlx-community/gsm8k \
--epsilon 1e-4 \
--epsilon-high 1e-2 \
--group-size 4 \
--temperature 0.8

Key Parameters:

--epsilon: Lower bound for clipping (default: 1e-4)
--epsilon-high: Upper bound for clipping (uses epsilon value if not specified)
All other GRPO parameters apply
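
Asymmetric clipping with two epsilon values can be sketched as below. The function is a generic clipped surrogate for a single token or sequence, with illustrative epsilon values rather than the defaults listed above; `clipped_surrogate` is a hypothetical name.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2, epsilon_high=None):
    """Clipped policy-gradient objective with separate lower and upper bounds."""
    if epsilon_high is None:  # fall back to symmetric clipping
        epsilon_high = epsilon
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon_high)
    # Take the more pessimistic of the clipped and unclipped objectives.
    return min(ratio * advantage, clipped * advantage)


# A good completion whose probability rose by 50%: the gain is capped at 1 + epsilon_high.
print(clipped_surrogate(1.5, advantage=1.0, epsilon=0.2, epsilon_high=0.3))
# A bad completion whose probability halved: the objective is capped at 1 - epsilon.
print(clipped_surrogate(0.5, advantage=-1.0, epsilon=0.2, epsilon_high=0.3))
```

A larger upper bound lets low-probability but well-rewarded tokens grow faster, while the lower bound still limits how quickly probabilities can be pushed down.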

Dataset Format: Same as GRPO

Online DPO

Online preference optimization using a judge model or human feedback.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode online_dpo \
--data ./online_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5

Key Parameters:

--judge: Judge model ID or "human" for human feedback
--alpha: Learning rate for online updates (default: 1e-5)
--judge-config: Additional configuration for judge model

Dataset Format:

{"prompt": [{"role": "user", "content": "Question"}]}
{"messages": [{"role": "user", "content": "Question"}]}

eXtended Preference Optimization (XPO)

XPO extends online DPO with additional preference learning mechanisms.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode xpo \
--data ./xpo_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5 \
--beta 0.1

Key Parameters:

--judge: Judge model ID or "human"
--alpha: Online learning rate (default: 1e-5)
--beta: KL penalty strength (default: 0.1)
--judge-config: Additional judge configuration

Dataset Format: Same as Online DPO

Reinforced Reinforcement Learning from Human Feedback with KL

Full RLHF REINFORCE pipeline with a reward model and Ziegler-style policy optimization.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode rlhf-reinforce \
--data Goekdeniz-Guelmez/ultrafeedback-prompt-flat \
--judge mlx-community/reward-model \
--alpha 1e-5 \
--beta 0.1

Key Parameters:

--judge: Reward model ID
--alpha: Policy learning rate (default: 1e-5)
--beta: KL penalty strength (default: 0.1)
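
The role of `--beta` can be illustrated with the KL-shaped reward commonly used in this style of RLHF: the reward model's score is reduced by `beta` times a sampled estimate of how far the policy has drifted from the reference model. This is a sketch under that assumption, with a hypothetical function name, not the library's code.

```python
def kl_shaped_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times a sampled KL(policy || reference) estimate."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate


# The policy assigns higher probability to its own sample than the reference does,
# so part of the reward is taken away.
print(kl_shaped_reward(1.0, [-1.0, -1.0], [-2.0, -2.0], beta=0.1))
```

With `beta` at zero the policy is free to chase the reward model; larger values keep it closer to the reference model's behaviour.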

Dataset Format: Same as Online DPO

Proximal Policy Optimization

Full PPO pipeline with reward model and policy optimization.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode ppo \
--data Goekdeniz-Guelmez/ultrafeedback-prompt-flat \
--judge mlx-community/reward-model \
--epsilon 0.2

Key Parameters:

--judge: Reward model ID
--epsilon: The Epsilon for numerical stability (default: 0.2)

Dataset Format: Same as Online DPO

Other Features

Training Your Custom Preference Model

This feature adds a second training stage on top of the judge (preference) stage. A reward model scores the policy's generations, and the policy is updated with a KL-penalised PPO-style loss.

Collect preference data → judge‑mode (online DPO) → reward model
Run RLHF (policy optimisation) using the reward model → final policy

python -m mlx_lm_lora.train_judge \
--model Goekdeniz-Guelmez/Josiefied-Qwen3-0.6B-abliterated-v1 \
--train-type full \
--optimizer adamw \
--steps-per-report 1 \
--iters 50 \
--max-seq-length 1024 \
--adapter-path <adapter_path> \
--data mlx-community/Human-Like-DPO \
--gradient-accumulation-steps 1

Dataset Format: Same as DPO (with prompt, chosen, and rejected pairs).

Configuration

Core Training Parameters

# Model and data
--model <model_path>              # Model path or HF repo
--data <data_path>                # Dataset path or HF dataset name
--train-type lora                 # lora, dora, or full
--train-mode sft                  # sft, dpo, cpo, orpo, grpo, etc.

# Training schedule
--batch-size 4                    # Batch size
--iters 1000                      # Training iterations
--epochs 3                        # Training epochs (ignored if iters set)
--learning-rate 1e-5              # Learning rate
--gradient-accumulation-steps 1   # Gradient accumulation

# Model architecture
--num-layers 16                   # Layers to fine-tune (-1 for all)
--max-seq-length 2048            # Maximum sequence length

# LoRA parameters
--lora-parameters '{"rank": 8, "dropout": 0.0, "scale": 10.0}'

# Optimization
--optimizer adam                  # adam, adamw, qhadam, muon
--lr-schedule cosine             # Learning rate schedule
--grad-checkpoint                # Enable gradient checkpointing

# Quantization
--load-in-4bits                  # 4-bit quantization
--load-in-6bits                  # 6-bit quantization  
--load-in-8bits                  # 8-bit quantization

# Quantization Aware Training (QAT)
--qat-enable                      # Enable QAT projection during training
--qat-bits 4                      # Bit-width for QAT (default: 8)
--qat-group-size 64               # Group size for QAT (default: 64, 0=per-tensor)
--qat-mode affine                 # QAT mode (default: affine)
--qat-start-step 1                # Start QAT after this optimizer step (default: 1)
--qat-interval 1                  # Apply QAT every N optimizer steps (default: 1)

# Monitoring
--steps-per-report 10            # Steps between loss reports
--steps-per-eval 200             # Steps between validation
--val-batches 25                 # Validation batches (-1 for all)
--wandb project_name             # WandB logging

# Checkpointing
--adapter-path ./adapters        # Save/load path for adapters
--save-every 100                 # Save frequency
--resume-adapter-file <path>     # Resume from checkpoint
--fuse                           # Fuse and save trained model

Algorithm-Specific Parameters

Preference Optimization Methods:

DPO/CPO:

--beta 0.1                        # KL penalty strength
--dpo-cpo-loss-type sigmoid       # sigmoid, hinge, ipo, dpop
--delta 50.0                      # Margin for hinge loss
--reference-model-path <path>     # Reference model path

ORPO:

--beta 0.1                        # Temperature parameter
--reward-scaling 1.0              # Reward scaling factor

Group-Based Methods:

GRPO (Base):

--group-size 4                    # Generations per prompt
--epsilon 1e-4                    # Numerical stability constant
--temperature 0.8                 # Sampling temperature
--max-completion-length 512       # Max generation length
--reward-functions "func1,func2"  # Comma-separated reward functions
--reward-functions-file <path>    # Custom reward functions file
--reward-weights "[0.5, 0.5]"    # JSON list of reward weights
--grpo-loss-type grpo             # grpo, bnpo, dr_grpo

GSPO (GRPO + Importance Sampling):

--importance-sampling-level token # token (default) or sequence
# Plus all GRPO parameters

Dr. GRPO (Decoupled Rewards):

--grpo-loss-type dr_grpo         # Enable Dr. GRPO variant
# Plus all GRPO parameters

DAPO (Dynamic Clipping):

--epsilon 1e-4                   # Lower bound for clipping
--epsilon-high 1e-2              # Upper bound for clipping
# Plus all GRPO parameters

Online Methods:

Online DPO:

--judge <model_id>               # Judge model or "human"
--alpha 1e-5                     # Online learning rate
--beta 0.1                       # KL penalty strength
--judge-config '{}'              # Additional judge configuration

XPO (Extended Preference Optimization):

--judge <model_id>               # Judge model or "human"
--alpha 1e-5                     # Online learning rate
--beta 0.1                       # KL penalty strength
--judge-config '{}'              # Judge configuration
# Plus additional XPO-specific parameters

RLHF Reinforce:

--judge <reward_model_id>        # Reward model
--alpha 1e-5                     # Policy learning rate
--beta 0.1                       # KL penalty strength
--group-size 4                   # Samples for policy optimization
--judge-config '{}'              # Reward model configuration

PPO:

--judge <reward_model_id>        # Reward model
--alpha 1e-5                     # Policy learning rate
--epsilon 0.2                    # Numerical stability value
--group-size 4                   # Samples for policy optimization
--judge-config '{}'              # Reward model configuration

Dataset Formats

Local Datasets

Place JSONL files in a directory:

data/
├── train.jsonl
├── valid.jsonl
└── test.jsonl
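
A directory with this layout can be produced with a few lines of Python. The sketch below writes prompt-completion rows into the three files; the split fractions, the helper name `write_jsonl_splits`, and the toy rows are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_splits(rows, out_dir, valid_fraction=0.1, test_fraction=0.1):
    """Write rows into train.jsonl, valid.jsonl, and test.jsonl inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    n_valid = max(1, int(len(rows) * valid_fraction))
    n_test = max(1, int(len(rows) * test_fraction))
    splits = {
        "valid.jsonl": rows[:n_valid],
        "test.jsonl": rows[n_valid:n_valid + n_test],
        "train.jsonl": rows[n_valid + n_test:],
    }
    for name, split_rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as f:
            for row in split_rows:
                # One JSON object per line, as JSONL requires.
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(split_rows) for name, split_rows in splits.items()}


rows = [{"prompt": f"What is {i}+{i}?", "completion": str(2 * i)} for i in range(20)]
data_dir = Path(tempfile.mkdtemp()) / "data"
counts = write_jsonl_splits(rows, data_dir)
print(counts)
```

The resulting directory can then be passed to `--data`.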

Hugging Face Datasets

mlx_lm_lora.train --model <model> --data mlx-community/wikisql --train

Custom Dataset Keys

Configure custom field names:

--text-feature "content"          # For text datasets
--chat-feature "conversation"     # For chat datasets
--prompt-feature "question"       # For prompt-completion
--completion-feature "answer"     # For prompt-completion
--chosen-feature "preferred"      # For preference datasets
--rejected-feature "dispreferred" # For preference datasets
--system-feature "instruction"    # For system messages

Dataset Examples by Training Mode

SFT - Chat Format:

{"messages": [
  {"role": "system", "content": "You are helpful"},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}

SFT - Completion Format:

{"prompt": "What is 2+2?", "completion": "2+2 equals 4"}

SFT - Text Format:

{"text": "The complete text for language modeling"}

DPO/CPO Format:

{"prompt": "Explain AI", "chosen": "AI is artificial intelligence", "rejected": "AI is magic"}

ORPO Format:

{"prompt": "What is AI?", "chosen": "Good explanation", "rejected": "Bad explanation", "preference_score": 0.8}

GRPO Format:

{"prompt": "Solve: 2+2=?", "answer": "4", "system": "You are a math tutor"}

RLHF (Online DPO, XPO, RLHF Reinforce, PPO) Format:

{"prompt": [{"role": "user", "content": "Question"}]}

or:

{"prompt": "Question"}

Memory Optimization

Quantization (QLoRA)

Use quantized models to reduce memory usage:

# 4-bit quantization (most memory efficient)
mlx_lm_lora.train --model <model> --load-in-4bits --train

# 6-bit quantization (balanced)
mlx_lm_lora.train --model <model> --load-in-6bits --train

# 8-bit quantization (higher quality)
mlx_lm_lora.train --model <model> --load-in-8bits --train

Other Memory Reduction Techniques

# Reduce batch size
--batch-size 1

# Train fewer layers
--num-layers 8

# Enable gradient checkpointing
--grad-checkpoint

# Reduce sequence length
--max-seq-length 1024

# Use gradient accumulation
--gradient-accumulation-steps 4 --batch-size 1

LoRA Configuration for Memory

# Smaller LoRA rank
--lora-parameters '{"rank": 4, "dropout": 0.1, "scale": 10.0}'

# Train specific layers only
--num-layers 8
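
To see why a smaller rank saves memory, it helps to count the trainable parameters of one adapted linear layer. The sketch below uses the usual LoRA factorization into two low-rank matrices; the 4096 by 4096 layer size is an arbitrary example and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is (in x rank), B is (rank x out)."""
    return rank * (in_features + out_features)


full = 4096 * 4096
for rank in (4, 8, 16):
    adapter = lora_trainable_params(4096, 4096, rank)
    print(rank, adapter, f"{100 * adapter / full:.2f}% of the full weight matrix")
```

Halving the rank halves the adapter parameters, and with them the gradients and optimizer state that have to be kept in memory for that layer.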

Evaluation & Generation

Evaluation

Evaluate on test set:

mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--data <data_path> \
--test \
--test-batches 500

Generation

Use mlx-lm for generation with trained adapters:

mlx_lm.generate \
--model <model_path> \
--adapter-path <adapter_path> \
--prompt "Your prompt here" \
--max-tokens 100 \
--temperature 0.7

Fusing Adapters

Merge LoRA weights into base model:

mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--fuse

Advanced Features

Learning Rate Schedules

--lr-schedule cosine              # Cosine annealing
--lr-schedule linear              # Linear decay
--lr-schedule constant            # Constant rate
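
The shapes of the cosine and linear schedules can be sketched with their textbook formulas. The functions below are hypothetical stand-ins; details such as warmup steps or the final learning rate used by the library are not shown here.

```python
import math


def cosine_lr(step, total_steps, base_lr, end_lr=0.0):
    """Cosine annealing from base_lr at step 0 to end_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


def linear_lr(step, total_steps, base_lr, end_lr=0.0):
    """Straight-line decay from base_lr to end_lr."""
    progress = min(step, total_steps) / total_steps
    return base_lr + (end_lr - base_lr) * progress


for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, 1000, 1e-5), linear_lr(step, 1000, 1e-5))
```

Both schedules agree at the start, the midpoint, and the end; cosine stays higher during the first half and drops faster in the second.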

Multiple Optimizers

--optimizer adam                  # Adam optimizer
--optimizer adamw                 # AdamW with weight decay
--optimizer qhadam               # Quasi-hyperbolic Adam
--optimizer muon                 # Muon optimizer

Reward Function System (GRPO)

List available reward functions:

mlx_lm_lora.train --list-reward-functions

Use multiple reward functions:

--reward-functions "accuracy_reward,format_reward,length_reward" \
--reward-weights "[0.5, 0.3, 0.2]"

WandB Integration

--wandb my_project_name

Training Method Comparison

Method	Type	Reference Model	Judge Model	Multiple Generations	Key Benefit
SFT	Supervised	❌	❌	❌	Simple, fast training
DPO	Preference	✅	❌	❌	No reward model needed
CPO	Preference	✅	❌	❌	Better for structured tasks
ORPO	Preference	❌	❌	❌	Monolithic optimization
GRPO	Policy	❌	❌	✅	Group-based learning
GSPO	Policy	❌	❌	✅	Importance sampling
Dr. GRPO	Policy	❌	❌	✅	Decoupled rewards
DAPO	Policy	❌	❌	✅	Dynamic clipping
Online DPO	Online RL	✅	✅	✅	Real-time feedback
XPO	Online RL	✅	✅	✅	Extended preferences
RLHF Reinforce	Online RL	✅	✅	✅	Full RL pipeline
PPO	Online RL	✅	✅	✅	Full RL pipeline

Example Commands for All Methods

Basic Methods

# SFT
mlx_lm_lora.train --model <model> --train-mode sft --data <data>

# DPO
mlx_lm_lora.train --model <model> --train-mode dpo --data <data> --beta 0.1

# CPO
mlx_lm_lora.train --model <model> --train-mode cpo --data <data> --beta 0.1

# ORPO
mlx_lm_lora.train --model <model> --train-mode orpo --data <data> --beta 0.1

Group-Based Methods

# GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> --group-size 4

# GSPO (GRPO with importance sampling)
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--importance-sampling-level token --group-size 4

# Dr. GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--grpo-loss-type dr_grpo --group-size 4

# DAPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--epsilon 1e-4 --epsilon-high 1e-2 --group-size 4

Online Methods

# Online DPO
mlx_lm_lora.train --model <model> --train-mode online_dpo --data <data> \
--judge <judge_model> --alpha 1e-5

# XPO
mlx_lm_lora.train --model <model> --train-mode xpo --data <data> \
--judge <judge_model> --alpha 1e-5

# RLHF Reinforce
mlx_lm_lora.train --model <model> --train-mode rlhf-reinforce --data <data> \
--judge <reward_model> --alpha 1e-5 --group-size 4

# PPO
mlx_lm_lora.train --model <model> --train-mode ppo --data <data> \
--judge <reward_model> --epsilon 0.2 --group-size 4

Troubleshooting

Common Issues

Out of Memory: Reduce batch size, use quantization, enable gradient checkpointing
Slow Training: Increase batch size, reduce validation frequency
Poor Quality: Increase LoRA rank, train more layers, check data quality
Convergence Issues: Adjust learning rate, try different optimizers

Memory Usage Guidelines

Model Size	Recommended Settings
1-3B	`--batch-size 4 --num-layers 16`
7B	`--batch-size 2 --num-layers 8 --load-in-8bits`
13B+	`--batch-size 1 --num-layers 4 --load-in-4bits --grad-checkpoint`

Example Configurations

Basic LoRA Fine-tuning

model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./my_data
train_type: lora
train_mode: sft
batch_size: 4
learning_rate: 1e-5
iters: 1000
lora_parameters:
  rank: 8
  dropout: 0.0
  scale: 10.0

DPO Training

model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./preference_data
train_mode: dpo
beta: 0.1
dpo_cpo_loss_type: sigmoid
batch_size: 2
learning_rate: 5e-6
iters: 500

GRPO with Custom Rewards

model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./grpo_data
train_mode: grpo
group_size: 4
temperature: 0.8
reward_functions: "accuracy_reward,format_reward"
reward_weights: [0.7, 0.3]
max_completion_length: 512

Benchmarking Your Setup

To measure performance on your hardware with MLX-LM-LoRA:

# SFT with speed/memory reporting
mlx_lm_lora.train \
  --model Goekdeniz-Guelmez/JOSIE-1.1-4B-Instruct \
  --data mlx-community/wikisql \
  --train --train-mode sft \
  --batch-size 4 --iters 100 \
  --steps-per-report 10

Monitor output for:

it/s (iterations per second)
peak_memory (in GB)
tokens/sec (throughput)

Performance Comparison

Below is a comparison of iteration speed and memory usage across three training libraries: MLX-LM-LoRA, Unsloth, and mlx-tune. Benchmarks are approximate and depend on hardware, model size, and configuration.

Test Configuration:

Hardware: M4 Pro (24GB unified memory) vs. NVIDIA A100 (80GB VRAM)
Settings: All LoRA layers trained, batch size of 1, max context length of 4096, 100 training steps
Quantization: No quantization for Qwen/Qwen3-0.6B, 4-bit quantization for Qwen/Qwen3-8B

Model	Training Mode	MLX-LM-LoRA (Apple Silicon) Speed / Memory	Unsloth (NVIDIA GPU) Speed / Memory	mlx-tune (Apple Silicon) Speed / Memory
Qwen/Qwen3-0.6B	SFT	~4.7 it/s ~2-2 GB	~2.7 it/s ~1-2 GB VRAM	~0.6 it/s ~4-6 GB
Qwen/Qwen3-0.6B	ORPO	~4.5 it/s ~2-4 GB	~2.4 it/s ~2-8 GB VRAM	OOM
Qwen/Qwen3-0.6B	GRPO	~0.02 it/s ~9-20 GB	~0.04 it/s ~76-80 GB VRAM	OOM
Qwen/Qwen3-8B	SFT	~4.1 it/s ~6-10 GB	~1.3 it/s ~10-16 GB VRAM	~0.07 it/s ~8-18 GB

Key Differences

MLX-LM-LoRA (Apple Silicon - Native MLX)

✅ Comprehensive: 12 training algorithms (SFT, DPO, CPO, ORPO, GRPO, GSPO, Dr. GRPO, DAPO, Online DPO, XPO, RLHF, PPO)
✅ Custom Preference Models: Built-in judge training for online preference workflows
✅ Unified Memory: Access to full system RAM (up to 512GB on Ultra)
✅ Moderate Speed: Optimized MLX implementation with native Apple Silicon support
✅ CLI-First: Simple command-line and notebook interface with YAML config support
⚠️ Apple Only: Requires Apple Silicon (M1/M2/M3/M4)

Unsloth (NVIDIA GPU - CUDA/Triton)

✅ Fastest: Highly optimized Triton kernels for NVIDIA GPUs
✅ Production Ready: Battle-tested, widely used in industry
✅ Memory Efficient: Custom CUDA kernels minimize VRAM usage
✅ Rich Ecosystem: Seamless integration with Hugging Face, TRL, PEFT
⚠️ NVIDIA Only: Requires CUDA-compatible GPU (doesn't work on Apple Silicon)
⚠️ VRAM Limited: Constrained by GPU VRAM (24-80GB typical)

mlx-tune (Apple Silicon - MLX with Unsloth API)

✅ API Compatible: Drop-in replacement for Unsloth code on Apple Silicon
✅ Unified Memory: Same memory advantages as MLX-LM-LoRA
✅ Portability Focus: Write once on Mac, deploy on CUDA
✅ Vision Models: VLM fine-tuning support (Qwen3.5, etc.)
⚠️ Limited Methods: Fewer training algorithms than MLX-LM-LoRA
⚠️ Wrapper Library: Built on top of MLX, adds abstraction layer
⚠️ Moderate Speed: Similar to MLX-LM-LoRA (both use MLX backend)

MLX-LM-LoRA is trusted by teams and industry leaders such as:

MLX-LM-LoRA is also being used by researchers, engineers, and other professionals at Apple, IBM, Bosch, Red Hat, Daimler Truck, and Mercedes-Benz Group.

Are you or your team using MLX-LM-LoRA? I'd love to hear from you! Feel free to reach out and I'll add your logo here too. 🚀

Citing MLX-LM-LoRA

@software{gülmez2025mlxlmlora,
  author = {Gökdeniz Gülmez},
  title = {{MLX-LM-LoRA}: Train LLMs on Apple silicon with MLX and the Hugging Face Hub},
  url = {https://github.com/Goekdeniz-Guelmez/mlx-lm-lora},
  version = {0.1.0},
  year = {2025},
}

Project details

Release history

3.0.0 (this version): Jul 14, 2026
2.1.0: Apr 22, 2026
1.1.10: Mar 11, 2026
1.1.9: Mar 9, 2026
1.1.8: Mar 6, 2026
1.0.8: Mar 5, 2026
1.0.6: Mar 4, 2026
1.0.5: Mar 1, 2026
1.0.4: Feb 11, 2026
1.0.3: Feb 11, 2026
1.0.2: Feb 10, 2026
1.0.1: Feb 9, 2026
1.0.0: Dec 9, 2025
0.9.10: Dec 9, 2025
0.9.9: Dec 4, 2025
0.9.8: Nov 29, 2025
0.9.7: Nov 21, 2025
0.8.5: Nov 19, 2025
0.8.4: Nov 5, 2025
0.8.1: Sep 28, 2025
0.7.0: Jun 23, 2025
0.6.92: Jun 18, 2025
0.6.91: Jun 18, 2025
0.6.9: Jun 17, 2025
0.6.8: Jun 17, 2025
0.4.7: Jun 4, 2025
0.4.6: May 29, 2025
0.4.5: May 28, 2025
0.3.5: May 19, 2025
0.3.3: May 18, 2025
0.3.2: May 15, 2025
0.2.2: May 15, 2025
0.2.0: May 13, 2025
0.1.9: May 13, 2025
0.1.8: May 13, 2025
0.1.7: May 13, 2025
0.1.6: May 11, 2025
0.1.5: May 11, 2025
0.1.4: May 11, 2025
0.1.3: May 10, 2025
0.1.1: May 10, 2025