Blazing-fast LLM fine-tuning with minimal VRAM — multi-GPU, manual LoRA gradients, flash attention, 4-bit quant
amazingvmsloth
\ / | amazingvmsloth - Fast LLM Fine-Tuning
O^O / \_/ \ Minimal VRAM. Maximum Speed.
\ /
"-____-"
Train 14B models on a 4GB GPU. Multi-GPU, 4-bit quantization, LoRA, CPU offloading, gradient checkpointing, and sequence packing — all built for speed on consumer hardware.
Install
pip install amazingvmsloth
Or from source:
git clone https://github.com/CollabVMgamez/amazingvmsloth.git
cd amazingvmsloth
pip install -e .
Requirements: Python 3.9+, PyTorch 2.1+, CUDA 11.8+ (optional, CPU training supported)
Quick Start
1. Wizard — let it pick settings for your hardware
amazingvmsloth wizard --model Qwen/Qwen2.5-0.5B
Analyzes your GPU/CPU and prints a ready-to-run command.
2. Train
amazingvmsloth train \
--model Qwen/Qwen2.5-0.5B \
--dataset tatsu-lab/alpaca \
--epochs 3 \
--batch-size 2 \
--grad-accum 4 \
--lora-r 16 \
--output-dir ./output
Supports chat-format datasets too:
amazingvmsloth train \
--model Qwen/Qwen2.5-0.5B \
--dataset angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k \
--dataset-format chat \
--max-samples 1000 \
--output-dir ./thinking_lora
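The exact record schema behind `--dataset-format` is not spelled out here, so the following is only a sketch of how the two formats typically relate. It assumes Alpaca-style records carry `instruction`, `input`, and `output` fields (as in tatsu-lab/alpaca) and that chat-style records are lists of `role`/`content` messages, the same shape passed to `apply_chat_template` in step 4. The helper name `alpaca_to_messages` is hypothetical and not part of amazingvmsloth.

```python
def alpaca_to_messages(record: dict) -> list:
    """Convert one Alpaca-style record into a chat-style message list.

    Assumed field names: "instruction", optional "input", and "output".
    """
    instruction = record["instruction"].strip()
    extra = record.get("input", "").strip()
    # Alpaca keeps optional context in "input"; fold it into the user turn.
    user_text = f"{instruction}\n\n{extra}" if extra else instruction
    return [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": record["output"].strip()},
    ]


example = {"instruction": "Add the numbers.", "input": "2 and 2", "output": "4"}
messages = alpaca_to_messages(example)
print(messages)
```

Check a few rows of your dataset against whichever format you pass on the command line before starting a long run.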
3. Merge LoRA weights into the base model
amazingvmsloth merge \
--model Qwen/Qwen2.5-0.5B \
--lora ./output \
--output ./merged_model
4. Run inference
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./merged_model", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
CLI Commands
| Command | Description |
|---|---|
| wizard | Interactive config generator based on your hardware |
| train | Fine-tune a model with LoRA |
| merge | Merge LoRA weights into base model |
| convert | Merge + convert to GGUF (requires llama.cpp) |
| info | Show model info and VRAM estimates |
| bench | Benchmark vs unsloth |
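The info command is described as showing VRAM estimates. How it computes them is not documented here, but the dominant term, the memory for the weights themselves, is simple arithmetic. The sketch below is not the tool's estimator: it counts weights only and ignores quantization metadata, LoRA parameters, optimizer state, and activations. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# A nominal 0.5B-parameter model at common precisions.
for bits in (16, 8, 4):
    print(f"0.5B params at {bits}-bit: {weight_memory_gb(0.5e9, bits):.2f} GB")

# A nominal 14B-parameter model in 4-bit.
print(f"14B params at 4-bit: {weight_memory_gb(14e9, 4):.2f} GB")
```

By this estimate a 14B model in 4-bit needs about 7 GB for weights alone, so fitting one on a 4 GB card would presumably depend on the offloading features rather than quantization by itself.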
Hardware Tiers
| GPU VRAM | Strategy |
|---|---|
| 4-6 GB | 4-bit quant, batch=1, grad accum=8, seq=512, tiny LoRA |
| 6-12 GB | 4-bit quant, batch=1-2, grad accum=4, seq=1024 |
| 12-24 GB | 4-bit or full precision, batch=2-4, torch.compile |
| 24+ GB | Full precision, no grad checkpointing, large batch |
| CPU only | fp32/bf16, torch.compile, physical-core threading |
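The tier table can be read as a simple lookup. The sketch below is not the wizard's actual logic. It assumes half-open ranges (a 6 GB card lands in the 6-12 GB tier) and treats anything under 6 GB, including cards below the table's 4 GB floor, as the lowest GPU tier. `pick_strategy` is a hypothetical name.

```python
from typing import Optional


def pick_strategy(vram_gb: Optional[float]) -> str:
    """Return the strategy string from the tier table for a given VRAM size.

    Pass None when no GPU is available.
    """
    if vram_gb is None:
        return "fp32/bf16, torch.compile, physical-core threading"
    if vram_gb < 6:  # assumption: cards under 4 GB are also handled here
        return "4-bit quant, batch=1, grad accum=8, seq=512, tiny LoRA"
    if vram_gb < 12:
        return "4-bit quant, batch=1-2, grad accum=4, seq=1024"
    if vram_gb < 24:
        return "4-bit or full precision, batch=2-4, torch.compile"
    return "Full precision, no grad checkpointing, large batch"


for size in (None, 4, 8, 16, 24):
    print(size, "->", pick_strategy(size))
```

With PyTorch installed, `torch.cuda.get_device_properties(0).total_memory / 1e9` gives the first GPU's VRAM in GB, which can be passed straight in.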
Key Features
- rsLoRA scaling for stable training at any rank
- 4-bit/8-bit quantization via bitsandbytes
- XFormers/SDPA attention patching (Flash Attention on Linux)
- Sequence packing for 2-3x throughput
- Gradient checkpointing with selective layer skipping
- Multi-GPU: DDP, FSDP, DeepSpeed, pipeline parallelism
- Layer offloading via accelerate.dispatch_model
- CPU training with IPEX, pre-packing, torch.compile
- PagedAdamW8bit optimizer for low-VRAM training
- Checkpoint resume with full RNG/optimizer state
- Tqdm progress bar with live loss + VRAM display
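To make the rsLoRA item concrete: standard LoRA scales the low-rank update by alpha / r, while rank-stabilized LoRA scales it by alpha / sqrt(r), so the update does not shrink as quickly when the rank grows. The sketch below only compares the two factors. It does not show how lora.py applies them, and alpha = 16 is an arbitrary value chosen for illustration.

```python
import math


def lora_scale(alpha: float, r: int, rslora: bool = False) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if rslora else alpha / r


alpha = 16  # illustrative value, not a documented default
for r in (8, 16, 64, 256):
    standard = lora_scale(alpha, r)
    stabilized = lora_scale(alpha, r, rslora=True)
    print(f"r={r:>3}  alpha/r={standard:.4f}  alpha/sqrt(r)={stabilized:.4f}")
```

Going from r=16 to r=256 cuts the standard factor by 16x but the rsLoRA factor by only 4x, which is the intuition behind "stable training at any rank".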
Example: 500 Steps on Dolly
amazingvmsloth train \
--model Qwen/Qwen2.5-0.5B \
--dataset databricks/databricks-dolly-15k \
--dataset-format alpaca \
--epochs 1 --batch-size 2 --grad-accum 2 \
--max-samples 1000 --max-seq-length 512 \
--lora-r 16 --output-dir ./dolly_lora --packing
This runs ~500 steps in ~10 minutes on a 4GB RTX 3050.
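The ~500 figure matches counting one step per micro-batch: 1000 samples at batch size 2. How amazingvmsloth actually counts steps, and how much `--packing` reduces the number of sequences, is not stated here, so the following is only the arithmetic with packing ignored. `step_counts` is a hypothetical helper.

```python
import math


def step_counts(samples: int, batch_size: int, grad_accum: int, epochs: int = 1):
    """Return (micro_batches, optimizer_updates), ignoring sequence packing."""
    micro_batches = math.ceil(samples / batch_size) * epochs
    optimizer_updates = math.ceil(micro_batches / grad_accum)
    return micro_batches, optimizer_updates


micro, updates = step_counts(samples=1000, batch_size=2, grad_accum=2)
effective_batch = 2 * 2  # batch size * grad accum = sequences per optimizer update
print(micro, updates, effective_batch)
```

Under these assumptions the run makes 500 forward/backward passes and 250 optimizer updates, each averaging gradients over 4 sequences.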
Project Structure
amazingvmsloth/
├── lora.py # LoRA with rsLoRA, device-aware init
├── quantization.py # 4-bit/8-bit quant, kbit training prep
├── attention.py # SDPA/XFormers patching
├── trainer.py # AmazingTrainer with tqdm, packing, offloading
├── cpu_trainer.py # CpuTrainer for CPU-only training
├── packing.py # Sequence packing collators
├── gradient.py # GradientAccumulator
├── optimizer.py # PagedAdamW8bit, CpuOffloadedAdamW
├── offload.py # Layer offloading via accelerate
├── cli.py # CLI entrypoint
├── wizard.py # Hardware-aware config generator
├── bench.py # Benchmark vs unsloth
└── utils/
├── banner.py # Startup banner with GPU info
├── memory.py # VRAM estimation
├── patching.py # LoRA save/load helpers
└── save_load.py # Model save/merge
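packing.py is listed as holding sequence packing collators. The idea behind packing is to place several short examples in one row of up to `--max-seq-length` tokens so that less of each batch is padding. The sketch below uses greedy first-fit-decreasing on sequence lengths. It is a generic illustration, not the collator in packing.py, and the name `pack_sequences` is hypothetical.

```python
def pack_sequences(lengths: list, max_len: int) -> list:
    """Group sequence indices into rows whose total length is <= max_len.

    Greedy first-fit-decreasing: longest sequences are placed first, each
    into the first row that still has room.
    """
    rows = []       # rows[k] is the list of sequence indices packed into row k
    remaining = []  # remaining[k] is the free token budget left in row k
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = min(lengths[i], max_len)  # assumption: over-long sequences are truncated
        for k, free in enumerate(remaining):
            if n <= free:
                rows[k].append(i)
                remaining[k] -= n
                break
        else:
            # No existing row has room, so open a new one.
            rows.append([i])
            remaining.append(max_len - n)
    return rows


lengths = [500, 120, 300, 60, 200, 90]
packed = pack_sequences(lengths, max_len=512)
print(packed)
```

Here six sequences fit into three rows of 512 tokens. A real collator also has to keep packed examples from attending to each other, for example with per-example attention masks or position IDs that restart at each example.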
Benchmarks
On RTX 3050 4GB Laptop GPU:
| Library | Time (1 epoch, 500 samples) | Peak VRAM |
|---|---|---|
| amazingvmsloth | 5.3s | 1.07 GB |
| unsloth | 10.1s | 0.96 GB |
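In this run amazingvmsloth finished about 1.91x faster than unsloth (5.3s vs 10.1s) with slightly higher peak VRAM (1.07 GB vs 0.96 GB). The result is for a small run with pre-quantized models.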
License
MIT
File details
Details for the file amazingvmsloth-0.2.10.tar.gz.
File metadata
- Download URL: amazingvmsloth-0.2.10.tar.gz
- Upload date:
- Size: 53.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f799bb056acbc00cf271d576bee1bbe1eb69ddda8bf670c5049131ffc97fba7b |
| MD5 | 6ff1351014625bfc9a40e7075214d80d |
| BLAKE2b-256 | eb32f1c05793adad9e19bb6af16f3fe04c4bf7340606b6657b727c85f65c0218 |
File details
Details for the file amazingvmsloth-0.2.10-py3-none-any.whl.
File metadata
- Download URL: amazingvmsloth-0.2.10-py3-none-any.whl
- Upload date:
- Size: 58.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d1c092aba41cd2671d50fdc31247d7ae0f885a004ee46c40318257bb0af74121 |
| MD5 | 7c4c9bc84cc9492cc772b76824894a14 |
| BLAKE2b-256 | 80b06986ed834995b3280a02853952de0014daf39ca9e4cda6a6796f6bee5864 |
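The SHA256 digests listed for each file can be checked against a manually downloaded copy. The sketch below uses only the standard library. The helper names `sha256_of` and `matches` are illustrative, and the file path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """True if the file's SHA-256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the wheel was downloaded to the current directory):
# matches(Path("amazingvmsloth-0.2.10-py3-none-any.whl"),
#         "d1c092aba41cd2671d50fdc31247d7ae0f885a004ee46c40318257bb0af74121")
```

pip can also enforce digests at install time through `--require-hashes` with a requirements file that pins the hash.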