HyperSloth
A high-performance framework for fine-tuning large language models.
Overview
HyperSloth is built on top of Unsloth, extending Unsloth's excellent foundation with multi-GPU support and optimized batching strategies.
What HyperSloth adds:
- Multi-GPU training via NCCL: Scale your Unsloth workflows across multiple GPUs
- Adaptive batching optimizations: Sequence sorting, round-robin load balancing, and minimal padding strategies to reduce computational waste and improve GPU utilization
Inherited from Unsloth:
- 2x faster than standard transformers training: Built on Unsloth's optimized kernels and memory management
- Up to 75% VRAM savings: Inherits Unsloth's memory efficiency optimizations
- Quality preserved: Same training quality as standard approaches with significantly better performance
The multiplier effect: Because HyperSloth builds on Unsloth's foundation, you keep Unsloth's 2x speed and up to 75% memory savings, then scale that performance across however many GPUs you have. Thanks to the batching optimizations, the observed speedup can even exceed linear scaling, because the single-GPU baseline wastes compute on padding that HyperSloth eliminates.
⚡ Performance Benchmarks
HyperSloth vs Unsloth Direct Comparison
We conducted a controlled comparison using identical configurations:
- Model: Qwen3-8B-bnb-4bit
- Training Steps: 100 steps
- Global Batch Size: 32
- Dataset: Fixed data sampler ensures identical training data
Results:
- HyperSloth (2 GPUs): 8m 28s ⚡
- Unsloth (1 GPU): 19m 34s
- Performance Gain: ~2.3x faster
Why 2.3x Speedup on 2 GPUs?
Theoretical maximum speedup with 2 GPUs would be 2x, but communication overhead typically reduces this to ~1.7x in practice. HyperSloth achieves 2.3x speedup through several optimizations:
🔄 Standard Multi-GPU: ~1.7x speedup
├─ GPU communication overhead
└─ Load balancing inefficiencies
⚡ HyperSloth: 2.3x speedup
├─ ✅ Sequence length sorting: reduces padding waste
├─ ✅ Adaptive batching: improves memory efficiency
├─ ✅ Round-robin load balancing: better GPU utilization
└─ ✅ NCCL gradient optimization: reduced communication overhead
This demonstrates how algorithmic optimizations can exceed theoretical hardware limits by reducing computational waste.
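As a rough back-of-the-envelope illustration of why super-linear speedup is possible, consider the arithmetic below. The padding-waste fraction w and parallel efficiency e are assumed, illustrative values, not measurements:

# Illustrative arithmetic only: w and e are assumed values, not measurements.
n_gpus = 2
w = 0.26   # fraction of baseline compute spent on padding, removed by sorting
e = 0.85   # parallel efficiency after communication overhead

# The baseline pays for all tokens on one GPU; the optimized run pays for
# only the useful (1 - w) fraction, split across n_gpus at efficiency e.
speedup = n_gpus * e / (1 - w)
print(f"estimated speedup: {speedup:.2f}x")  # ~2.3x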
Key Performance Features
- Sequence length sorting: Groups similar-length sequences to minimize padding waste, up to 40% token savings (illustrated in the sketch after this list)
- GPU load balancing: Distributes work evenly across all available GPUs using round-robin batch assignment
- NCCL optimization: Uses PyTorch's native distributed training with efficient all-reduce gradient synchronization
- Memory efficiency: Adaptive batching reduces VRAM usage compared to naive padding approaches
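To make the padding-savings mechanism concrete, here is a minimal, self-contained sketch (not HyperSloth's actual code) comparing padded-token counts for arbitrary-order batching versus length-sorted batching:

import random

# Hypothetical sequence lengths for one training shard.
random.seed(0)
lengths = [random.randint(10, 512) for _ in range(256)]
batch_size = 8

def padded_tokens(seq_lengths, batch_size):
    """Total tokens paid for after padding each batch to its longest sequence."""
    total = 0
    for i in range(0, len(seq_lengths), batch_size):
        batch = seq_lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

naive = padded_tokens(lengths, batch_size)                # arbitrary order
sorted_run = padded_tokens(sorted(lengths), batch_size)   # length-sorted
print(f"padding saved: {1 - sorted_run / naive:.1%}")

The exact savings depend on the length distribution of your dataset; on uniformly random lengths like these, sorting recovers a large share of the padding that naive batching pays for.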
Additional Benchmarks
For detailed training time comparisons across different hardware configurations and loss curve analysis, see our 📊 Auxiliary Speed Benchmarks.
💾 Installation
pip install git+https://github.com/anhvth/HyperSloth.git
⚡ Quickstart
Get up and running with HyperSloth in 3 simple steps:
Step 1: Build Your Dataset
First, prepare your training data using any Hugging Face dataset:
hypersloth-build-dataset --hf_dataset mlabonne/FineTome-100k -n 1000 --split train --name finetom-1k --tokenizer Qwen/Qwen3-8B --print_samples
What this does:
- Downloads 1000 samples from mlabonne/FineTome-100k
- Tokenizes using the Qwen/Qwen3-8B tokenizer
- Saves the result as the finetom-1k dataset
- Prints sample conversations because of --print_samples
Expected output:
Loading 1000 samples from mlabonne/FineTome-100k...
================================================================================
SAMPLE TEXTS FROM PROCESSED DATASET:
================================================================================
--- Sample 1 ---
<|im_start|>user
[Sample conversation]
<|im_end|>
<|im_start|>assistant
[Sample response]
<|im_end|>
Dataset saved to: data/built_dataset/finetom-1k
Registry updated: data/data_config.json
Dataset "finetom-1k" has been successfully built and saved!
Step 2: Initialize Training Configuration
Generate a configuration template:
hypersloth-init
This creates example_training_config.py with default settings. Edit it to use your dataset:
# Update the data section to use your built dataset
hyper_config_model = HyperConfig(
    data=DataConfig.from_dataset_name("finetom-1k"),  # Your dataset name
    training=TrainingConfig(
        gpus=[0, 1],  # Adjust to your available GPUs
        loss_type="response_only",  # Calculate loss only on assistant responses
    ),
    fast_model_args=FastModelArgs(
        model_name="unsloth/Qwen3-0.6b-bnb-4bit",  # Smaller model for quick testing
        max_seq_length=2048,
    ),
    lora_args=LoraArgs(
        r=16,
        lora_alpha=16,
    ),
)
Step 3: Start Multi-GPU Training
Launch training across your GPUs:
hypersloth-train ./example_training_config.py
Expected output:
21:32:54 | INFO | 🔧 GPU 0 (Rank 0/1) | Model: unsloth/Qwen3-0.6b-bnb-4bit
21:32:54 | INFO | 🔧 GPU 1 (Rank 1/1) | Model: unsloth/Qwen3-0.6b-bnb-4bit
21:32:54 | INFO | 🚀 Starting total training timer
[Training progress with adaptive batching and NCCL synchronization]
Optional: Monitor with tmux
hypersloth-train ./example_training_config.py --tmux train
# Then attach to sessions: tmux a -t train_gpu_0
Quick Tips
For faster iteration:
- Start with smaller models: unsloth/Qwen3-0.6b-bnb-4bit
- Use fewer samples: -n 1000 for quick testing
- Test single GPU first: gpus=[0] in config
For production:
- Scale up dataset size: -n 50000 or more
- Use larger models: unsloth/Qwen3-8B-bnb-4bit
- Add more GPUs: gpus=[0, 1, 2, 3]
Memory management:
- Reduce per_device_train_batch_size if you hit OOM
- Increase gradient_accumulation_steps to keep the effective batch size constant (see the quick check below)
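As a quick check, the effective (global) batch size is the product of the three knobs, so halving one and doubling another leaves training dynamics unchanged (illustrative values below):

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 2

# Effective batch size stays constant if you trade one factor for another.
effective = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective)  # 32, the global batch size used in the benchmark above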
That's it! You now have HyperSloth running multi-GPU training with optimized batching. Check the logs for padding savings and performance metrics.
🛠 Command-Line Tools
- hypersloth-train: Main training launcher with multi-GPU and tmux support
- hypersloth-init: Generate configuration templates for new projects
📓 Demo Notebook
For interactive training and experimentation, check out our demo training notebooks:
- notebooks/train.ipynb: Complete training example, equivalent to hypersloth-train examples/example_sharegpt_lora_2gpus.py
- Kaggle: Qwen3 Unsloth 2GPUs: Live training example with HyperSloth on Kaggle's GPU environment
📊 How to Prepare Data
To prepare your dataset for training, use the build_dataset.py script:
python scripts/build_dataset.py mlabonne/FineTome-100k -n 50000 --seed 3407 --split train --name finetom --tokenizer Qwen/Qwen3-8B
After running the script, use the built dataset in your configuration:
hyper_config_model = HyperConfig(
    data=DataConfig.from_dataset_name("finetom"),  # Use the dataset name you created
    training=TrainingConfig(
        gpus=[0, 1],  # Adjust to the GPUs you have available
        loss_type="response_only",  # "all" or "response_only"; response_only computes the loss only on the assistant's responses
    ),
    fast_model_args=FastModelArgs(
        model_name="unsloth/gemma-3-1b-it",
        max_seq_length=2048,
    ),
    lora_args=LoraArgs(
        r=16,
        lora_alpha=16,
    ),
)
🏗 How It Works
Adaptive Batch Partitioning
HyperSloth patches the trainer's inner training loop with adaptive_partition_batches() (see the simplified sketch after this list), which:
- Sorts sequences by length: Groups similar-length sequences together within each batch slice
- Round-robin GPU distribution: Distributes batch slices across GPUs in round-robin fashion for load balancing
- Minimizes padding: Reduces wasted computation from padding tokens by up to 40%
- Tracks efficiency: Logs padding savings and token statistics in real-time during training
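Here is a simplified, illustrative sketch of the idea; the real adaptive_partition_batches() signature and internals may differ:

def partition_batch(examples, num_gpus, slice_size):
    """Toy version: sort by length, slice, distribute slices round-robin."""
    # 1. Sort so each slice holds similar-length sequences (less padding).
    ordered = sorted(examples, key=len)
    # 2. Carve the sorted batch into contiguous slices.
    slices = [ordered[i:i + slice_size] for i in range(0, len(ordered), slice_size)]
    # 3. Round-robin assignment balances long and short slices across GPUs.
    per_gpu = [[] for _ in range(num_gpus)]
    for idx, s in enumerate(slices):
        per_gpu[idx % num_gpus].append(s)
    return per_gpu

# Example: 12 token-ID lists, 2 GPUs, 2 sequences per slice.
batch = [[0] * n for n in (5, 90, 17, 64, 33, 8, 120, 44, 71, 12, 56, 25)]
for gpu, work in enumerate(partition_batch(batch, num_gpus=2, slice_size=2)):
    print(f"GPU {gpu}: slice lengths {[[len(x) for x in s] for s in work]}")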
Distributed Training with NCCL
For multi-GPU setups, HyperSloth uses:
- Standard PyTorch DDP: Each GPU runs a separate process with torch.distributed (see the skeleton after this list)
- NCCL gradient synchronization: Automatic all-reduce operations for gradient averaging
- Process spawning: hypersloth-train launches one process per GPU using spawn_training_process()
- Tmux integration: Optional --tmux flag creates separate terminal sessions for monitoring each GPU
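For reference, the per-GPU processes follow the standard PyTorch DDP pattern shown below; this is a generic skeleton, not HyperSloth's actual source:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    # One process per GPU; NCCL performs the all-reduce gradient averaging.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(16, 16).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])  # grads sync on backward()

    out = ddp_model(torch.randn(4, 16, device=rank))
    out.sum().backward()  # triggers NCCL all-reduce across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(worker, args=(world_size,), nprocs=world_size)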
🔧 Troubleshooting
Common Issues:
- Process Spawning Errors:
  nvidia-smi  # Check GPU availability
  python -c "import torch; print(torch.cuda.is_available())"  # Verify CUDA
- Memory Issues:
  - Reduce per_device_train_batch_size in your config
  - Increase gradient_accumulation_steps to maintain effective batch size
- Performance Optimization:
- Monitor tmux sessions to check individual GPU utilization
- Experiment with batch sizes to find optimal memory/speed trade-off
Debugging Tips:
# Test single GPU first (modify your config to use gpus=[0])
hypersloth-train configs/your_config.py
# Monitor individual GPU processes
hypersloth-train configs/your_config.py --tmux train
# Then attach to sessions: tmux a -t train_gpu_0