✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model that can be exported to GGUF, Ollama, or vLLM, or uploaded to Hugging Face.
🦥 Unsloth.ai News
🔗 Links and Resources
⭐ Key Features
- All kernels written in OpenAI's Triton language. Manual backprop engine.
- 0% loss in accuracy - no approximation methods - all exact.
- No change of hardware required. Supports NVIDIA GPUs from 2018 onwards, with a minimum CUDA Capability of 7.0 (V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40, etc.). Check your GPU! GTX 1070 and 1080 work, but are slow; see the capability check after this list.
- Works on Linux and Windows via WSL.
- Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
- Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
- If you trained a model with 🦥Unsloth, you can use this cool sticker!
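If you're unsure whether your GPU meets the CUDA Capability 7.0 requirement mentioned above, a quick check with PyTorch looks like the minimal sketch below (it only assumes `torch` is installed and a CUDA GPU is visible):

```python
import torch

# Minimal sketch: report the CUDA Compute Capability of GPU 0.
# Unsloth wants capability >= 7.0 (V100, T4, RTX 20-series or newer);
# older cards like GTX 1070/1080 (capability 6.x) may run, but slowly.
assert torch.cuda.is_available(), "No CUDA GPU detected"
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
if (major, minor) < (7, 0):
    print("Below CUDA Capability 7.0: supported only slowly, if at all.")
```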
🥇 Performance Benchmarking
| 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥Unsloth Pro |
|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |
| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |
💾 Installation Instructions
Conda Installation
Select either `pytorch-cuda=11.8` for CUDA 11.8 or `pytorch-cuda=12.1` for CUDA 12.1. If you have `mamba`, use `mamba` instead of `conda` for faster solving. See this GitHub issue for help on debugging Conda installs.
conda create --name unsloth_env \
python=3.10 \
pytorch-cuda=<11.8/12.1> \
pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
-y
conda activate unsloth_env
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes
Pip Installation
Do NOT use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.
- Find your CUDA version via `import torch; torch.version.cuda`
- For Pytorch 2.1.0: You can update Pytorch via pip (interchange `cu121` / `cu118`). Go to https://pytorch.org/ to learn more. Select either `cu118` for CUDA 11.8 or `cu121` for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the `"ampere"` path. For Pytorch 2.1.1, go to step 3. For Pytorch 2.2.0, go to step 4.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
--index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"
- For Pytorch 2.1.1: Use the `"ampere"` path for newer RTX 30xx GPUs or higher.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
--index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
- For Pytorch 2.2.0: Use the `"ampere"` path for newer RTX 30xx GPUs or higher.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.2.0 triton \
--index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
- If you get errors, try the below first, then go back to step 1:
pip install --upgrade pip
- For Pytorch 2.2.1:
# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
# Pre Ampere RTX 2080, T4, GTX 1080 GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
- For Pytorch 2.3.0: Use the `"ampere"` path for newer RTX 30xx GPUs or higher.
pip install "unsloth[cu118-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
- To troubleshoot installs, try the commands below (all must succeed). Xformers should mostly all be available.
nvcc
python -m xformers.info
python -m bitsandbytes
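If you prefer checking from Python, the short sketch below complements the CLI commands above; it only assumes the packages installed in the previous steps and verifies that CUDA, xformers, bitsandbytes, and unsloth are importable:

```python
# Minimal install sanity check: every import below should succeed,
# and CUDA should be visible to PyTorch.
import torch
print("torch", torch.__version__, "| CUDA", torch.version.cuda,
      "| available:", torch.cuda.is_available())

import xformers
print("xformers", xformers.__version__)

import bitsandbytes
print("bitsandbytes", bitsandbytes.__version__)

import unsloth  # should import cleanly (typically prints a short patch/banner message)
```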
📜 Documentation
- Go to our Wiki page for saving to GGUF, checkpointing, evaluation and more!
- We support Hugging Face's TRL, Trainer, Seq2SeqTrainer, or even plain Pytorch code!
- We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/llama-3-8b-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
tokenizer = tokenizer,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
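As a concrete example of point (1) above, saving typically looks like the sketch below, reusing the `model` and `tokenizer` objects from the example. The `save_pretrained_merged` / `save_pretrained_gguf` / `push_to_hub_gguf` helpers follow the Unsloth wiki; treat the exact keyword arguments and quantization names as assumptions and confirm against the wiki for your version.

```python
# Sketch only, not a definitive API reference: saving options after trainer.train().

# Save just the LoRA adapters (standard PEFT-style saving).
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Merge the LoRA weights into 16-bit and save, e.g. for vLLM (helper from the Unsloth wiki).
model.save_pretrained_merged("merged_16bit_model", tokenizer, save_method = "merged_16bit")

# Export to GGUF for llama.cpp / Ollama; "q4_k_m" is a commonly used quantization.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")

# Or push straight to Hugging Face (assumes you are logged in / have a token configured).
# model.push_to_hub_gguf("your-username/your-model-gguf", tokenizer, quantization_method = "q4_k_m")
```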
DPO Support
DPO (Direct Preference Optimization), PPO, and reward modelling all seem to work, as per independent third-party testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: notebook.
We're in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs!
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # Used below; not defined elsewhere in this snippet
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/zephyr-sft-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 64,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 64,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
)
dpo_trainer = DPOTrainer(
model = model,
ref_model = None,
args = TrainingArguments(
per_device_train_batch_size = 4,
gradient_accumulation_steps = 8,
warmup_ratio = 0.1,
num_train_epochs = 3,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
seed = 42,
output_dir = "outputs",
),
beta = 0.1,
train_dataset = YOUR_DATASET_HERE,
# eval_dataset = YOUR_DATASET_HERE,
tokenizer = tokenizer,
max_length = 1024,
max_prompt_length = 512,
)
dpo_trainer.train()
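The `YOUR_DATASET_HERE` placeholder above is left as-is on purpose: TRL's `DPOTrainer` expects a dataset with `prompt`, `chosen`, and `rejected` text columns. A minimal toy dataset (purely illustrative values, not a real preference corpus) could be built like this:

```python
from datasets import Dataset

# Toy preference data for illustration only: each row pairs a prompt with a
# preferred ("chosen") and a dispreferred ("rejected") completion, the column
# format DPOTrainer expects.
dpo_dataset = Dataset.from_dict({
    "prompt":   ["What is the capital of France?"],
    "chosen":   ["The capital of France is Paris."],
    "rejected": ["France does not have a capital."],
})

# Then pass it in place of the placeholder:
# dpo_trainer = DPOTrainer(..., train_dataset = dpo_dataset, ...)
```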
🥇 Detailed Benchmarking Tables
- Click "Code" for fully reproducible examples
- "Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
- For the full list of benchmarking tables, go to our website
| 1 A100 40GB | 🤗Hugging Face | Flash Attention 2 | 🦥Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| % saved | | 15.74 | 47.18 | 53.25 | | |
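To make the table concrete, the speedup multipliers and the "% saved" row can be reproduced from the raw seconds and memory rows (a small worked example; numbers are taken directly from the table, and the published multipliers involve minor rounding):

```python
# Worked example using the Alpaca / 1x A100 row above.
hf_seconds, unsloth_open_seconds = 1040, 525
speedup = hf_seconds / unsloth_open_seconds
print(f"Unsloth Open speedup: {speedup:.2f}x")   # ~1.98x, matching the table

hf_mem_mb, unsloth_open_mem_mb = 18235, 9631
saved = (hf_mem_mb - unsloth_open_mem_mb) / hf_mem_mb * 100
print(f"Memory saved: {saved:.2f}%")             # ~47.18%, matching the table
```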
Llama-Factory 3rd party benchmarking
- Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
| Method | Bits | TGS | GRAM | Speed |
|---|---|---|---|---|
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | 160% |
Performance comparisons between popular models
Click for specific model benchmarking tables (Mistral 7b, CodeLlama 34b etc.)
Mistral 7b
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| % saved | | 40.99 | 62.06 | 68.74 | | |
CodeLlama 34b
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | ▶️ Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| % saved | | 16.96 | 31.47 | 44.60 | | |
1 Tesla T4
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | ▶️ Code | Code | Code | Code | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| % saved | | 1.94 | 10.28 | 24.39 | | |
2 Tesla T4s via DDP
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | ▶️ Code | Code | Code | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| % saved | | 0.52 | 24.76 | 26.09 | | |
Performance comparisons on 1 Tesla T4 GPU:
Click for Time taken for 1 epoch
One Tesla T4 on Google Colab
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
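For reference, that shorthand maps roughly onto the following `TrainingArguments` (a sketch; in particular, interpreting `schedule_steps = 10` as `warmup_steps = 10` is an assumption):

```python
from transformers import TrainingArguments

# Approximate translation of the benchmark shorthand above into TrainingArguments.
args = TrainingArguments(
    per_device_train_batch_size = 2,   # bsz = 2
    gradient_accumulation_steps = 4,   # ga = 4
    max_grad_norm = 0.3,
    num_train_epochs = 1,
    seed = 3047,
    learning_rate = 2e-4,              # lr = 2e-4
    weight_decay = 0.01,               # wd = 0.01
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",      # schedule = "linear"
    warmup_steps = 10,                 # assumption: schedule_steps = 10
    output_dir = "outputs",
)
```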
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |
Peak Memory Usage
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
Click for Performance Comparisons on 2 Tesla T4 GPUs via DDP:
**Time taken for 1 epoch**
Two Tesla T4s on Kaggle
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |
Peak Memory Usage on a Multi GPU System (2 GPUs)
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 8.4GB \| 6GB | 7.2GB \| 5.3GB | 14.3GB \| 6.6GB | 10.9GB \| 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB \| 4.9GB | 7.5GB \| 4.9GB | 8.5GB \| 4.9GB | 6.2GB \| 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB \| 5GB | 10.6GB \| 5GB | 10.6GB \| 5GB | 10.5GB \| 5GB * |
- \* Slim Orca uses `bsz=1` for all benchmarks since `bsz=2` OOMs. We can handle `bsz=2`, but we benchmark it with `bsz=1` for consistency.
Thank You to