
A high-performance framework for fine-tuning large language models with multi-GPU support

OpenSloth 🦥⚡

Scale Unsloth to multiple GPUs with just torchrun. No configuration files, no custom frameworks - pure PyTorch DDP.

  • 🚀 2-4x faster than single GPU
  • 🎯 Zero configuration - works out of the box
  • 💾 Same VRAM per GPU as single GPU Unsloth
  • 🔧 Any Unsloth model - Qwen, Llama, Gemma, etc.

Installation

# Install dependencies
uv add torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
uv add unsloth datasets transformers trl
uv add git+https://github.com/anhvth/opensloth.git
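
After installing, it can help to confirm that every package is importable from the active environment before launching a multi-GPU job. The sketch below is not part of OpenSloth; `check_packages` is a hypothetical helper that only uses the standard library, and the package list simply mirrors the install commands above.

```python
from importlib.util import find_spec


def check_packages(names):
    """Map each top-level package name to True if Python can locate it."""
    # find_spec returns None when a top-level package cannot be found.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Names taken from the install commands above.
    required = ["torch", "unsloth", "datasets", "transformers", "trl", "opensloth"]
    for name, found in check_packages(required).items():
        print(f"{name:<14}{'ok' if found else 'MISSING'}")
```

Run it with the same interpreter you will pass to torchrun, since a package installed into a different environment will show up as MISSING.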

Quick Start

Replace python with torchrun:

# Single GPU
python train_scripts/train_ddp.py

# Multi-GPU 
torchrun --nproc_per_node=2 train_scripts/train_ddp.py  # 2 GPUs
torchrun --nproc_per_node=4 train_scripts/train_ddp.py  # 4 GPUs

OpenSloth automatically handles GPU distribution, gradient sync, and batch sizing.
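
How OpenSloth adjusts batch sizes internally is not described here. As a rough guide, under plain PyTorch DDP each process consumes its own per-device batch, so the number of samples contributing to one optimizer step grows with the number of GPUs. The helper below is a hypothetical illustration of that arithmetic, not an OpenSloth function.

```python
def global_batch_size(per_device_batch, grad_accum_steps, world_size):
    """Samples contributing to one optimizer step under plain DDP.

    Assumes every process uses the same per-device batch size and the same
    number of gradient accumulation steps.
    """
    if min(per_device_batch, grad_accum_steps, world_size) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * grad_accum_steps * world_size


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        print(gpus, "GPU(s):", global_batch_size(2, 4, gpus), "samples per step")
```

If you compare a single-GPU run with a multi-GPU run, keep this product in mind: a larger global batch can change the number of optimizer steps per epoch and may call for a different learning rate.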

Performance

Setup     Time       Speedup
1 GPU     19m 34s    1.0x
2 GPUs    8m 28s     2.3x

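Expected scaling: 2 GPUs = ~2.3x, 4 GPUs = ~4.5x, 8 GPUs = ~9x. Only the 2-GPU figure is backed by the measurement above; the 4- and 8-GPU figures are projections.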
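
The speedup column can be reproduced from the two wall-clock times in the table. The sketch below uses hypothetical helper names and also reports parallel efficiency (speedup divided by GPU count), where a value above 1.0 means the run scaled better than linearly.

```python
import re


def to_seconds(duration):
    """Convert a string such as '19m 34s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)m\s+(\d+)s", duration.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def speedup(baseline, parallel):
    """Ratio of the single-GPU time to the multi-GPU time."""
    return to_seconds(baseline) / to_seconds(parallel)


def efficiency(baseline, parallel, n_gpus):
    """Speedup per GPU; 1.0 corresponds to perfectly linear scaling."""
    return speedup(baseline, parallel) / n_gpus


if __name__ == "__main__":
    # Times copied from the table above.
    print(f"speedup:    {speedup('19m 34s', '8m 28s'):.2f}x")
    print(f"efficiency: {efficiency('19m 34s', '8m 28s', 2):.2f}")
```

To check scaling on your own hardware, time the same script with different `--nproc_per_node` values and feed the results into these functions.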

Usage

import os  # needed for os.environ below

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()  # Enable DDP compatibility

# Standard Unsloth setup
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",
    device_map={"": local_rank},
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(model, r=16)
trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()

Run: torchrun --nproc_per_node=4 your_script.py
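
The script above reads LOCAL_RANK from the environment. torchrun sets this variable for every process it launches, along with RANK and WORLD_SIZE. The sketch below collects them in one place and adds an `is_main_process` flag, which is handy for doing things like logging or saving from a single process. `DistEnv` and `read_dist_env` are hypothetical names, not part of OpenSloth; the defaults let the same script run under plain `python`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    local_rank: int  # index of this process on its node, used to pick the GPU
    rank: int        # index of this process across all nodes
    world_size: int  # total number of processes in the job

    @property
    def is_main_process(self):
        return self.rank == 0


def read_dist_env(env=None):
    """Read the torchrun variables, falling back to single-process defaults."""
    env = os.environ if env is None else env
    return DistEnv(
        local_rank=int(env.get("LOCAL_RANK", 0)),
        rank=int(env.get("RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    dist = read_dist_env()
    if dist.is_main_process:
        print(f"running with {dist.world_size} process(es)")
```

Passing a plain dictionary as `env` makes the function easy to test without launching torchrun.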

Migration from Old Approach

Current (Recommended): Simple torchrun + DDP patch

from opensloth.patching.ddp_patch import ddp_patch
ddp_patch()
# ... standard Unsloth code

Old Approach (v0.1.8): If you still need the configuration-file workflow, check out the v0.1.8 tag from inside a clone of the repository:

git checkout v0.1.8

Links

  • Unsloth - 2x faster training library
  • TRL - Transformer Reinforcement Learning
  • PyTorch DDP - Distributed training

To try the bundled example from a fresh clone:

git clone https://github.com/anhvth/opensloth.git
cd opensloth
torchrun --nproc_per_node=4 train_scripts/train_ddp.py

Happy training! 🦥⚡
