
Blazing-fast LLM fine-tuning with minimal VRAM — multi-GPU, manual LoRA gradients, flash attention, 4-bit quant


amazingvmsloth

Blazing-fast LLM fine-tuning with minimal VRAM.

  \   / |    amazingvmsloth - Fast LLM Fine-Tuning
   O^O / \_/ \   Minimal VRAM. Maximum Speed.
  \        /
   "-____-"

Train 14B models on a 4GB GPU. Multi-GPU, 4-bit quantization, LoRA, CPU offloading, gradient checkpointing, and sequence packing — all built for speed on consumer hardware.


Install

pip install amazingvmsloth

Or from source:

git clone https://github.com/CollabVMgamez/amazingvmsloth.git
cd amazingvmsloth
pip install -e .

Requirements: Python 3.9+, PyTorch 2.1+, CUDA 11.8+ (optional, CPU training supported)


Quick Start

1. Wizard — let it pick settings for your hardware

amazingvmsloth wizard --model Qwen/Qwen2.5-0.5B

Analyzes your GPU/CPU and prints a ready-to-run command.

2. Train

amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset tatsu-lab/alpaca \
  --epochs 3 \
  --batch-size 2 \
  --grad-accum 4 \
  --lora-r 16 \
  --output-dir ./output
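
The `--lora-r 16` flag sets the LoRA rank. As a rough illustration of what the rank controls, the sketch below counts the parameters LoRA adds to one weight matrix and compares the rsLoRA scaling factor (alpha / sqrt(r)) with the classic one (alpha / r). The width 896 and alpha 32 are placeholder values chosen for the example; they are not read from any model config or from amazingvmsloth's defaults, and the function names are hypothetical.

```python
import math


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (r x d_in) and B (d_out x r), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int, rslora: bool = True) -> float:
    """Factor applied to the B @ A update: alpha/sqrt(r) for rsLoRA, alpha/r otherwise."""
    return alpha / math.sqrt(r) if rslora else alpha / r


# Placeholder dimensions for illustration only.
d, r, alpha = 896, 16, 32
added = lora_param_count(d, d, r)
full = d * d
print(f"LoRA adds {added} params vs {full} in the full matrix ({added / full:.2%})")
print("rsLoRA scale:", lora_scale(alpha, r, rslora=True))
print("classic scale:", lora_scale(alpha, r, rslora=False))
```

With classic scaling the update shrinks linearly as the rank grows, while the rsLoRA factor shrinks only with the square root of the rank.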

Supports chat-format datasets too:

amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k \
  --dataset-format chat \
  --max-samples 1000 \
  --output-dir ./thinking_lora

3. Merge LoRA into the base model

amazingvmsloth merge \
  --model Qwen/Qwen2.5-0.5B \
  --lora ./output \
  --output ./merged_model
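
Conceptually, merging folds the low-rank update into the base weight: W' = W + scale * (B @ A). The sketch below demonstrates that identity on a tiny matrix in plain Python. It is not the code behind `amazingvmsloth merge`; the helper names and values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, scale):
    """Return W + scale * (B @ A), the merged weight matrix."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# W is 2x3 and the rank is 1, so B is 2x1 and A is 1x3.
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]
b = [[0.5], [-1.0]]
scale = 2.0

merged = merge_lora(w, a, b, scale)
x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Adapter path: W x + scale * B (A x)
base_out = matmul(w, x)
lora_out = matmul(b, matmul(a, x))
adapter_out = [[bo[0] + scale * lo[0]] for bo, lo in zip(base_out, lora_out)]

# The merged matrix gives the same output as base plus adapter.
assert matmul(merged, x) == adapter_out
print(merged)
```

Because the merged matrix has the same shape as the original, the result loads like an ordinary model with no adapter code at inference time.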

4. Run inference

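from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged weights; the tokenizer comes from the base model.
model = AutoModelForCausalLM.from_pretrained("./merged_model", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Build a chat-formatted prompt and generate a reply.
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, without special tokens.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))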

CLI Commands

Command   Description
wizard    Interactive config generator based on your hardware
train     Fine-tune a model with LoRA
merge     Merge LoRA weights into base model
convert   Merge + convert to GGUF (requires llama.cpp)
info      Show model info and VRAM estimates
bench     Benchmark vs unsloth
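
For a sense of the numbers a VRAM estimate involves, here is a back-of-the-envelope calculation of the memory taken by model weights alone at different precisions. It is a simplified sketch, not the method used by `amazingvmsloth info`: it ignores activations, LoRA parameters, optimizer state, and quantization overhead, and the function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"14B params @ {bits}-bit: {weight_memory_gb(14e9, bits):.1f} GB")
```

Under these assumptions, 4-bit weights for a 14B model still come to roughly 7 GB, so fitting such a model on a 4 GB card would presumably depend on the CPU and layer offloading features as well as quantization.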

Hardware Tiers

GPU VRAM   Strategy
4-6 GB     4-bit quant, batch=1, grad accum=8, seq=512, tiny LoRA
6-12 GB    4-bit quant, batch=1-2, grad accum=4, seq=1024
12-24 GB   4-bit or full precision, batch=2-4, torch.compile
24+ GB     Full precision, no grad checkpointing, large batch
CPU only   fp32/bf16, torch.compile, physical-core threading
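
The tiers above can be read as a simple lookup on detected VRAM. The sketch below is one possible reading, not the wizard's actual logic: `pick_tier` is a hypothetical helper, boundary values (6, 12, 24 GB) are assigned to the higher tier, and anything under 6 GB is assumed to use the smallest tier.

```python
from typing import Optional


def pick_tier(vram_gb: Optional[float]) -> str:
    """Map detected VRAM in GB to a tier label from the table; None means no GPU."""
    if vram_gb is None:
        return "CPU only"
    if vram_gb >= 24:
        return "24+ GB"
    if vram_gb >= 12:
        return "12-24 GB"
    if vram_gb >= 6:
        return "6-12 GB"
    # Assumption: cards below 6 GB, including those under 4 GB, use the smallest tier.
    return "4-6 GB"


for vram in (4, 8, 16, 24, None):
    print(vram, "->", pick_tier(vram))
```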

Key Features

  • rsLoRA scaling for stable training at any rank
  • 4-bit/8-bit quantization via bitsandbytes
  • XFormers/SDPA attention patching (Flash Attention on Linux)
  • Sequence packing for 2-3x throughput
  • Gradient checkpointing with selective layer skipping
  • Multi-GPU: DDP, FSDP, DeepSpeed, pipeline parallelism
  • Layer offloading via accelerate.dispatch_model
  • CPU training with IPEX, pre-packing, torch.compile
  • PagedAdamW8bit optimizer for low-VRAM training
  • Checkpoint resume with full RNG/optimizer state
  • Tqdm progress bar with live loss + VRAM display
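
To illustrate why sequence packing raises throughput, the sketch below groups short sequences into fixed-length rows with a first-fit-decreasing heuristic, so fewer rows are spent on padding. This is a generic illustration, not the collator in packing.py. A real packer also has to adjust attention masks or position IDs so packed sequences do not attend to each other, which is omitted here.

```python
def pack_first_fit(lengths, max_len):
    """Group sequence lengths into bins whose totals do not exceed max_len.

    Sorts longest-first, then places each sequence in the first bin with room.
    """
    bins = []  # each bin is a list of sequence lengths sharing one row
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of length {n} exceeds max_len={max_len}")
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])  # no existing bin had room
    return bins


lengths = [400, 120, 300, 90, 200, 60, 500, 30]  # made-up token counts
bins = pack_first_fit(lengths, max_len=512)
print(f"{len(lengths)} padded rows -> {len(bins)} packed rows")
for b in bins:
    print(b, "total:", sum(b))
```

The actual gain depends on the length distribution of the dataset: many short samples pack well, while samples already close to the maximum length leave little to reclaim.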

Example: 500 Steps on Dolly

amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset databricks/databricks-dolly-15k \
  --dataset-format alpaca \
  --epochs 1 --batch-size 2 --grad-accum 2 \
  --max-samples 1000 --max-seq-length 512 \
  --lora-r 16 --output-dir ./dolly_lora --packing

This runs ~500 steps in ~10 minutes on a 4GB RTX 3050.
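
The step count follows from the flags. The sketch below works through the arithmetic under stated assumptions: a single GPU, no packing, and "step" meaning one micro-batch. With `--packing`, several samples share a row, so the real count can be lower. `step_counts` is a hypothetical helper, not part of the CLI.

```python
import math


def step_counts(num_samples, batch_size, grad_accum, epochs=1):
    """Return (micro-batch steps, optimizer updates, effective batch size)."""
    micro_per_epoch = math.ceil(num_samples / batch_size)
    updates_per_epoch = math.ceil(micro_per_epoch / grad_accum)
    return (micro_per_epoch * epochs,
            updates_per_epoch * epochs,
            batch_size * grad_accum)


# Flags from the Dolly example: --max-samples 1000 --batch-size 2 --grad-accum 2 --epochs 1
micro, updates, effective = step_counts(1000, batch_size=2, grad_accum=2, epochs=1)
print(f"micro-batch steps: {micro}")
print(f"optimizer updates: {updates}")
print(f"effective batch size: {effective}")
```

Under these assumptions the run has 500 micro-batch steps and 250 optimizer updates, each update averaging gradients over 4 samples.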


Project Structure

amazingvmsloth/
├── lora.py              # LoRA with rsLoRA, device-aware init
├── quantization.py      # 4-bit/8-bit quant, kbit training prep
├── attention.py         # SDPA/XFormers patching
├── trainer.py           # AmazingTrainer with tqdm, packing, offloading
├── cpu_trainer.py       # CpuTrainer for CPU-only training
├── packing.py           # Sequence packing collators
├── gradient.py          # GradientAccumulator
├── optimizer.py         # PagedAdamW8bit, CpuOffloadedAdamW
├── offload.py           # Layer offloading via accelerate
├── cli.py               # CLI entrypoint
├── wizard.py            # Hardware-aware config generator
├── bench.py             # Benchmark vs unsloth
└── utils/
    ├── banner.py        # Startup banner with GPU info
    ├── memory.py        # VRAM estimation
    ├── patching.py      # LoRA save/load helpers
    └── save_load.py     # Model save/merge

Benchmarks

On RTX 3050 4GB Laptop GPU:

Library          Time (1 epoch, 500 samples)   Peak VRAM
amazingvmsloth   5.3s                          1.07 GB
unsloth          10.1s                         0.96 GB

amazingvmsloth is 1.91x faster (5.3s vs 10.1s) on this small run with pre-quantized models, at about 0.11 GB higher peak VRAM.


License

MIT
