HyperSloth
A high-performance framework for fine-tuning large language models.
Overview
HyperSloth is built on top of Unsloth, extending Unsloth's excellent foundation with multi-GPU support and optimized batching strategies.
What HyperSloth adds:
- Multi-GPU training via NCCL: Scale your Unsloth workflows across multiple GPUs
- Adaptive batching optimizations: Sequence sorting, round-robin load balancing, and minimal padding strategies to reduce computational waste and improve GPU utilization
Inherited from Unsloth:
- 2x faster than standard transformers training: Built on Unsloth's optimized kernels and memory management
- Up to 75% VRAM savings: Inherits Unsloth's memory efficiency optimizations
- Quality preserved: Same training quality as standard approaches with significantly better performance
The multiplier effect: Because HyperSloth builds on Unsloth's foundation, you keep Unsloth's 2x speed and up to 75% memory savings, then scale that performance across however many GPUs you have. Thanks to the batching optimizations, the observed speedup can even exceed linear scaling, because the single-GPU baseline wastes compute on padding that HyperSloth eliminates.
⚡ Performance Benchmarks
HyperSloth vs Unsloth Direct Comparison
We conducted a controlled comparison using identical configurations:
- Model: Qwen3-8B-bnb-4bit
- Training Steps: 100 steps
- Global Batch Size: 32
- Dataset: Fixed data sampler ensures identical training data
Results:
- HyperSloth (2 GPUs): 8m 28s ⚡
- Unsloth (1 GPU): 19m 34s
- Performance Gain: ~2.3x faster
Why 2.3x Speedup on 2 GPUs?
Theoretical maximum speedup with 2 GPUs would be 2x, but communication overhead typically reduces this to ~1.7x in practice. HyperSloth achieves 2.3x speedup through several optimizations:
🔄 Standard Multi-GPU: ~1.7x speedup
├─ GPU communication overhead
└─ Load balancing inefficiencies
⚡ HyperSloth: 2.3x speedup
├─ ✅ Sequence length sorting: reduces padding waste
├─ ✅ Adaptive batching: improves memory efficiency
├─ ✅ Round-robin load balancing: better GPU utilization
└─ ✅ NCCL gradient optimization: reduced communication overhead
This demonstrates how algorithmic optimizations can exceed theoretical hardware limits by reducing computational waste.
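As a rough back-of-the-envelope illustration of why super-linear speedup is possible, consider the arithmetic below. The padding-waste fraction w and parallel efficiency e are assumed, illustrative values, not measurements:

# Illustrative arithmetic only: w and e are assumed values, not measurements.
n_gpus = 2
w = 0.26   # fraction of baseline compute spent on padding, removed by sorting
e = 0.85   # parallel efficiency after communication overhead

# The baseline pays for all tokens on one GPU; the optimized run pays for
# only the useful (1 - w) fraction, split across n_gpus at efficiency e.
speedup = n_gpus * e / (1 - w)
print(f"estimated speedup: {speedup:.2f}x")  # ~2.3x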
Key Performance Features
- Sequence length sorting: Groups similar-length sequences to minimize padding waste, up to 40% token savings (illustrated in the sketch after this list)
- GPU load balancing: Distributes work evenly across all available GPUs using round-robin batch assignment
- NCCL optimization: Uses PyTorch's native distributed training with efficient all-reduce gradient synchronization
- Memory efficiency: Adaptive batching reduces VRAM usage compared to naive padding approaches
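To make the padding-savings mechanism concrete, here is a minimal, self-contained sketch (not HyperSloth's actual code) comparing padded-token counts for arbitrary-order batching versus length-sorted batching:

import random

# Hypothetical sequence lengths for one training shard.
random.seed(0)
lengths = [random.randint(10, 512) for _ in range(256)]
batch_size = 8

def padded_tokens(seq_lengths, batch_size):
    """Total tokens paid for after padding each batch to its longest sequence."""
    total = 0
    for i in range(0, len(seq_lengths), batch_size):
        batch = seq_lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

naive = padded_tokens(lengths, batch_size)                # arbitrary order
sorted_run = padded_tokens(sorted(lengths), batch_size)   # length-sorted
print(f"padding saved: {1 - sorted_run / naive:.1%}")

The exact savings depend on the length distribution of your dataset; on uniformly random lengths like these, sorting recovers a large share of the padding that naive batching pays for.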
Additional Benchmarks
For detailed training time comparisons across different hardware configurations and loss curve analysis, see our 📊 Auxiliary Speed Benchmarks.
💾 Installation
pip install git+https://github.com/anhvth/HyperSloth.git
⚡ Quickstart
Get up and running with HyperSloth in 3 simple steps:
Step 1: Build Your Dataset
First, prepare your training data using any Hugging Face dataset:
hypersloth-build-dataset --hf_dataset mlabonne/FineTome-100k -n 1000 --split train --name finetom-1k --tokenizer Qwen/Qwen3-8B --print_samples
What this does:
- Downloads 1000 samples from mlabonne/FineTome-100k
- Tokenizes using the Qwen/Qwen3-8B tokenizer
- Saves the result as the finetom-1k dataset
- Prints sample conversations because of --print_samples
Expected output:
Loading 1000 samples from mlabonne/FineTome-100k...
================================================================================
SAMPLE TEXTS FROM PROCESSED DATASET:
================================================================================
--- Sample 1 ---
<|im_start|>user
[Sample conversation]
<|im_end|>
<|im_start|>assistant
[Sample response]
<|im_end|>
Dataset saved to: data/built_dataset/finetom-1k
Registry updated: data/data_config.json
Dataset "finetom-1k" has been successfully built and saved!
Step 2: Initialize Training Configuration
Generate a configuration template:
hypersloth-init
This creates example_training_config.py with default settings. Edit it to use your dataset:
# Update the data section to use your built dataset
hyper_config_model = HyperConfig(
    data=DataConfig.from_dataset_name("finetom-1k"),  # Your dataset name
    training=TrainingConfig(
        gpus=[0, 1],  # Adjust to your available GPUs
        loss_type="response_only",  # Calculate loss only on assistant responses
    ),
    fast_model_args=FastModelArgs(
        model_name="unsloth/Qwen3-0.6b-bnb-4bit",  # Smaller model for quick testing
        max_seq_length=2048,
    ),
    lora_args=LoraArgs(
        r=16,
        lora_alpha=16,
    ),
)
Step 3: Start Multi-GPU Training
Launch training across your GPUs:
hypersloth-train ./example_training_config.py
Expected output:
21:32:54 | INFO | 🔧 GPU 0 (Rank 0/1) | Model: unsloth/Qwen3-0.6b-bnb-4bit
21:32:54 | INFO | 🔧 GPU 1 (Rank 1/1) | Model: unsloth/Qwen3-0.6b-bnb-4bit
21:32:54 | INFO | 🚀 Starting total training timer
[Training progress with adaptive batching and NCCL synchronization]
Optional: Monitor with tmux
hypersloth-train ./example_training_config.py --tmux train
# Then attach to sessions: tmux a -t train_gpu_0
Quick Tips
For faster iteration:
- Start with smaller models: unsloth/Qwen3-0.6b-bnb-4bit
- Use fewer samples: -n 1000 for quick testing
- Test single GPU first: gpus=[0] in config
For production:
- Scale up dataset size: -n 50000 or more
- Use larger models: unsloth/Qwen3-8B-bnb-4bit
- Add more GPUs: gpus=[0, 1, 2, 3]
Memory management:
- Reduce per_device_train_batch_size if you hit OOM
- Increase gradient_accumulation_steps to keep the effective batch size constant (see the quick check below)
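As a quick check, the effective (global) batch size is the product of the three knobs, so halving one and doubling another leaves training dynamics unchanged (illustrative values below):

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 2

# Effective batch size stays constant if you trade one factor for another.
effective = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective)  # 32, the global batch size used in the benchmark above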
That's it! You now have HyperSloth running multi-GPU training with optimized batching. Check the logs for padding savings and performance metrics.
🛠 Command-Line Tools
- hypersloth-train: Main training launcher with multi-GPU and tmux support
- hypersloth-init: Generate configuration templates for new projects
📓 Demo Notebook
For interactive training and experimentation, check out our demo training notebooks:
- notebooks/train.ipynb: Complete training example, equivalent to hypersloth-train examples/example_sharegpt_lora_2gpus.py
- Kaggle: Qwen3 Unsloth 2GPUs: Live training example with HyperSloth on Kaggle's GPU environment
📊 How to Prepare Data
To prepare your dataset for training, use the build_dataset.py script:
python scripts/build_dataset.py mlabonne/FineTome-100k -n 50000 --seed 3407 --split train --name finetom --tokenizer Qwen/Qwen3-8B
After running the script, use the built dataset in your configuration:
hyper_config_model = HyperConfig(
    data=DataConfig.from_dataset_name("finetom"),  # Use the dataset name you created
    training=TrainingConfig(
        gpus=[0, 1],  # Adjust to the GPUs you have available
        loss_type="response_only",  # "all" or "response_only"; response_only computes the loss only on the assistant's responses
    ),
    fast_model_args=FastModelArgs(
        model_name="unsloth/gemma-3-1b-it",
        max_seq_length=2048,
    ),
    lora_args=LoraArgs(
        r=16,
        lora_alpha=16,
    ),
)
🏗 How It Works
Adaptive Batch Partitioning
HyperSloth patches the trainer's inner training loop with adaptive_partition_batches() (see the simplified sketch after this list), which:
- Sorts sequences by length: Groups similar-length sequences together within each batch slice
- Round-robin GPU distribution: Distributes batch slices across GPUs in round-robin fashion for load balancing
- Minimizes padding: Reduces wasted computation from padding tokens by up to 40%
- Tracks efficiency: Logs padding savings and token statistics in real-time during training
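Here is a simplified, illustrative sketch of the idea; the real adaptive_partition_batches() signature and internals may differ:

def partition_batch(examples, num_gpus, slice_size):
    """Toy version: sort by length, slice, distribute slices round-robin."""
    # 1. Sort so each slice holds similar-length sequences (less padding).
    ordered = sorted(examples, key=len)
    # 2. Carve the sorted batch into contiguous slices.
    slices = [ordered[i:i + slice_size] for i in range(0, len(ordered), slice_size)]
    # 3. Round-robin assignment balances long and short slices across GPUs.
    per_gpu = [[] for _ in range(num_gpus)]
    for idx, s in enumerate(slices):
        per_gpu[idx % num_gpus].append(s)
    return per_gpu

# Example: 12 token-ID lists, 2 GPUs, 2 sequences per slice.
batch = [[0] * n for n in (5, 90, 17, 64, 33, 8, 120, 44, 71, 12, 56, 25)]
for gpu, work in enumerate(partition_batch(batch, num_gpus=2, slice_size=2)):
    print(f"GPU {gpu}: slice lengths {[[len(x) for x in s] for s in work]}")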
Distributed Training with NCCL
For multi-GPU setups, HyperSloth uses:
- Standard PyTorch DDP: Each GPU runs a separate process with torch.distributed (see the skeleton after this list)
- NCCL gradient synchronization: Automatic all-reduce operations for gradient averaging
- Process spawning: hypersloth-train launches one process per GPU using spawn_training_process()
- Tmux integration: Optional --tmux flag creates separate terminal sessions for monitoring each GPU
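For reference, the per-GPU processes follow the standard PyTorch DDP pattern shown below; this is a generic skeleton, not HyperSloth's actual source:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    # One process per GPU; NCCL performs the all-reduce gradient averaging.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(16, 16).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])  # grads sync on backward()

    out = ddp_model(torch.randn(4, 16, device=rank))
    out.sum().backward()  # triggers NCCL all-reduce across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(worker, args=(world_size,), nprocs=world_size)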
🔧 Troubleshooting
Common Issues:
- Process Spawning Errors:
  nvidia-smi  # Check GPU availability
  python -c "import torch; print(torch.cuda.is_available())"  # Verify CUDA
- Memory Issues:
  - Reduce per_device_train_batch_size in your config
  - Increase gradient_accumulation_steps to maintain effective batch size
- Performance Optimization:
- Monitor tmux sessions to check individual GPU utilization
- Experiment with batch sizes to find optimal memory/speed trade-off
Debugging Tips:
# Test single GPU first (modify your config to use gpus=[0])
hypersloth-train configs/your_config.py
# Monitor individual GPU processes
hypersloth-train configs/your_config.py --tmux train
# Then attach to sessions: tmux a -t train_gpu_0