Blazing-fast LLM fine-tuning with minimal VRAM — multi-GPU, manual LoRA gradients, flash attention, 4-bit quant

amazingvmsloth

Blazing-fast LLM fine-tuning with minimal VRAM.

  \   / |    amazingvmsloth - Fast LLM Fine-Tuning
   O^O / \_/ \   Minimal VRAM. Maximum Speed.
  \        /
   "-____-"

Train 14B models on a 4GB GPU. Multi-GPU, 4-bit quantization, LoRA, CPU offloading, gradient checkpointing, and sequence packing — all built for speed on consumer hardware.
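
As a rough sanity check on the memory figures above, the weight footprint of a model can be estimated from its parameter count and bit width. The sketch below is an illustration only, not amazingvmsloth's own estimator; `weight_memory_gb` is a hypothetical helper that counts weights alone and ignores quantization metadata, activations, LoRA parameters, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


# A 14B-parameter model at a few common precisions.
for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(14e9, bits):.1f} GB")
```

Under this estimate the 4-bit weights of a 14B model come to about 7 GB, so fitting such a model on a 4GB GPU would presumably lean on the CPU offloading listed above rather than on quantization alone.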

Install

pip install amazingvmsloth

Or from source:

git clone https://github.com/CollabVMgamez/amazingvmsloth.git
cd amazingvmsloth
pip install -e .

Requirements: Python 3.9+, PyTorch 2.1+, and optionally CUDA 11.8+ (CPU-only training is supported).
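
To see how a machine compares with these requirements before installing, something like the following can help. It is a minimal sketch: `check_environment` is a hypothetical helper, not part of amazingvmsloth, and it only reports what it finds rather than enforcing the version minimums.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report the Python version and, if PyTorch is installed, its version and CUDA status."""
    report = {
        "python": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info >= (3, 9),
        "torch": None,
        "cuda_available": False,
    }
    # PyTorch may not be installed yet, so probe for it before importing.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_environment())
```

If `cuda_available` is false, training would fall back to the CPU path described under Hardware Tiers.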

Quick Start

1. Wizard — let it pick settings for your hardware

amazingvmsloth wizard --model Qwen/Qwen2.5-0.5B

Analyzes your GPU/CPU and prints a ready-to-run command.

2. Train

amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset tatsu-lab/alpaca \
  --epochs 3 \
  --batch-size 2 \
  --grad-accum 4 \
  --lora-r 16 \
  --output-dir ./output

Supports chat-format datasets too:

amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k \
  --dataset-format chat \
  --max-samples 1000 \
  --output-dir ./thinking_lora
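
The two commands above use an Alpaca-style dataset and a chat-format dataset. The exact schemas amazingvmsloth accepts are not spelled out here, so the sketch below rests on assumptions: Alpaca-style records carry `instruction`, `input`, and `output` fields (as in tatsu-lab/alpaca), and chat records hold a `messages` list of role/content dicts, which is a common convention. `alpaca_to_chat` is a hypothetical helper for illustration.

```python
def alpaca_to_chat(record: dict) -> dict:
    """Convert one Alpaca-style record into a chat-style record (assumed schemas)."""
    prompt = record["instruction"]
    # Alpaca's optional "input" field carries extra context for the instruction.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"]},
        ]
    }


example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
print(alpaca_to_chat(example))
```

Check a few rows of your dataset against whichever layout `--dataset-format` expects before starting a long run.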

3. Merge LoRA into the base model

amazingvmsloth merge \
  --model Qwen/Qwen2.5-0.5B \
  --lora ./output \
  --output ./merged_model

4. Run inference

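from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged weights; the tokenizer still comes from the base model.
model = AutoModelForCausalLM.from_pretrained("./merged_model", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Build a chat-formatted prompt that ends where the assistant reply begins.
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=100)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))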

CLI Commands

Command	Description
`wizard`	Interactive config generator based on your hardware
`train`	Fine-tune a model with LoRA
`merge`	Merge LoRA weights into base model
`convert`	Merge + convert to GGUF (requires llama.cpp)
`info`	Show model info and VRAM estimates
`bench`	Benchmark vs unsloth

Hardware Tiers

GPU VRAM	Strategy
4-6 GB	4-bit quant, batch=1, grad accum=8, seq=512, tiny LoRA
6-12 GB	4-bit quant, batch=1-2, grad accum=4, seq=1024
12-24 GB	4-bit or full precision, batch=2-4, torch.compile
24+ GB	Full precision, no grad checkpointing, large batch
CPU only	fp32/bf16, torch.compile, physical-core threading
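
The table can be read as a simple lookup on available VRAM. The sketch below encodes it directly; it is not the wizard's actual logic. `pick_tier` is a hypothetical helper, and the handling of boundary values (upper bounds treated as exclusive) and of GPUs under 4 GB is an assumption, since the table does not specify either.

```python
from typing import Optional


def pick_tier(vram_gb: Optional[float]) -> str:
    """Return the strategy row from the table for a given amount of GPU VRAM."""
    if vram_gb is None or vram_gb <= 0:
        return "CPU only: fp32/bf16, torch.compile, physical-core threading"
    if vram_gb < 6:
        # Assumption: GPUs below 4 GB are treated like the smallest tier.
        return "4-6 GB: 4-bit quant, batch=1, grad accum=8, seq=512, tiny LoRA"
    if vram_gb < 12:
        return "6-12 GB: 4-bit quant, batch=1-2, grad accum=4, seq=1024"
    if vram_gb < 24:
        return "12-24 GB: 4-bit or full precision, batch=2-4, torch.compile"
    return "24+ GB: Full precision, no grad checkpointing, large batch"


for vram in (None, 4, 8, 16, 24):
    print(vram, "->", pick_tier(vram))
```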

Key Features

rsLoRA scaling for stable training at any rank
4-bit/8-bit quantization via bitsandbytes
xFormers/SDPA attention patching (Flash Attention on Linux)
Sequence packing for 2-3x throughput
Gradient checkpointing with selective layer skipping
Multi-GPU: DDP, FSDP, DeepSpeed, pipeline parallelism
Layer offloading via accelerate.dispatch_model
CPU training with IPEX, pre-packing, torch.compile
PagedAdamW8bit optimizer for low-VRAM training
Checkpoint resume with full RNG/optimizer state
tqdm progress bar with live loss + VRAM display
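
To illustrate the rsLoRA item in the list above: standard LoRA scales the low-rank update by alpha / r, while rsLoRA scales it by alpha / sqrt(r), so the update shrinks more slowly as the rank grows. The sketch below only compares the two factors; `lora_scale` is a hypothetical helper and the alpha value is illustrative, since how amazingvmsloth sets it is not stated here.

```python
import math


def lora_scale(alpha: float, r: int, rslora: bool = False) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if rslora else alpha / r


alpha = 16  # illustrative value, not a library default
for r in (4, 16, 64, 256):
    standard = lora_scale(alpha, r)
    stabilized = lora_scale(alpha, r, rslora=True)
    print(f"r={r:>3}  standard={standard:.4f}  rsLoRA={stabilized:.4f}")
```

At high ranks the standard factor becomes very small, which is the instability rsLoRA is meant to avoid.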

Example: 500 Steps on Dolly

amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset databricks/databricks-dolly-15k \
  --dataset-format alpaca \
  --epochs 1 --batch-size 2 --grad-accum 2 \
  --max-samples 1000 --max-seq-length 512 \
  --lora-r 16 --output-dir ./dolly_lora --packing

This runs ~500 steps in ~10 minutes on a 4GB RTX 3050.
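
The step count can be worked out from the flags. The sketch below is plain arithmetic under stated assumptions: `count_steps` is a hypothetical helper, it does not model `--packing` (which merges short samples and would lower both counts), and whether the progress bar counts micro-batches or optimizer updates is not stated here.

```python
import math


def count_steps(num_samples: int, batch_size: int, grad_accum: int, epochs: int = 1) -> dict:
    """Count forward/backward passes and optimizer updates for a run without packing."""
    micro_per_epoch = math.ceil(num_samples / batch_size)
    updates_per_epoch = math.ceil(micro_per_epoch / grad_accum)
    return {
        "effective_batch": batch_size * grad_accum,
        "micro_batches": micro_per_epoch * epochs,
        "optimizer_steps": updates_per_epoch * epochs,
    }


# Flags from the Dolly example: --max-samples 1000 --batch-size 2 --grad-accum 2 --epochs 1
print(count_steps(num_samples=1000, batch_size=2, grad_accum=2, epochs=1))
```

With these flags there are 500 micro-batches and 250 optimizer updates, so the ~500 figure lines up with the micro-batch count.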

Project Structure

amazingvmsloth/
├── lora.py              # LoRA with rsLoRA, device-aware init
├── quantization.py      # 4-bit/8-bit quant, kbit training prep
├── attention.py         # SDPA/xFormers patching
├── trainer.py           # AmazingTrainer with tqdm, packing, offloading
├── cpu_trainer.py       # CpuTrainer for CPU-only training
├── packing.py           # Sequence packing collators
├── gradient.py          # GradientAccumulator
├── optimizer.py         # PagedAdamW8bit, CpuOffloadedAdamW
├── offload.py           # Layer offloading via accelerate
├── cli.py               # CLI entrypoint
├── wizard.py            # Hardware-aware config generator
├── bench.py             # Benchmark vs unsloth
└── utils/
    ├── banner.py        # Startup banner with GPU info
    ├── memory.py        # VRAM estimation
    ├── patching.py      # LoRA save/load helpers
    └── save_load.py     # Model save/merge
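
The `packing.py` entry above refers to sequence packing collators. As a rough picture of the idea, the sketch below groups samples into rows of at most `max_seq_length` tokens using greedy first-fit-decreasing. This is a generic technique chosen for illustration, not necessarily what `packing.py` does; `pack_sequences` is a hypothetical helper, and a real collator would also need attention masks or position IDs that keep packed samples from attending to each other.

```python
def pack_sequences(lengths: list[int], max_seq_length: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing; returns rows of sample indices."""
    rows: list[list[int]] = []
    free: list[int] = []  # tokens still available in each row
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = min(lengths[i], max_seq_length)  # over-long samples are assumed truncated
        for row, space in enumerate(free):
            if n <= space:
                rows[row].append(i)
                free[row] -= n
                break
        else:
            # No existing row has room, so start a new one.
            rows.append([i])
            free.append(max_seq_length - n)
    return rows


lengths = [500, 120, 300, 80, 200, 60]
packed = pack_sequences(lengths, max_seq_length=512)
print(len(lengths), "samples ->", len(packed), "packed rows:", packed)
```

Fewer, fuller rows mean less padding per batch, which is where the throughput gain from packing comes from.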

Benchmarks

On an RTX 3050 4GB Laptop GPU:

Library	Time (1 epoch, 500 samples)	Peak VRAM
amazingvmsloth	5.3s	1.07 GB
unsloth	10.1s	0.96 GB

amazingvmsloth is 1.91x faster (5.3s vs 10.1s) on this small run with pre-quantized models, at slightly higher peak VRAM (1.07 GB vs 0.96 GB).

License

MIT

Project details

Release history

Version	Release date
0.3.2	May 22, 2026
0.3.1	May 22, 2026
0.3.0	May 22, 2026
0.2.10	May 19, 2026
0.2.9	May 19, 2026
0.2.8 (this version)	May 18, 2026
0.2.7	May 18, 2026
0.2.6	May 18, 2026
0.2.5	May 18, 2026
0.2.4	May 18, 2026
0.2.3	May 17, 2026
0.2.2	May 17, 2026
0.2.1	May 17, 2026
0.2.0	May 17, 2026
0.1.9	May 17, 2026
0.1.8	May 17, 2026
0.1.7	May 17, 2026
0.1.6	May 17, 2026
0.1.5	May 17, 2026
0.1.4	May 17, 2026
0.1.3	May 17, 2026
0.1.2	May 17, 2026
0.1.1	May 17, 2026
0.1.0	May 16, 2026

Download files

Source Distribution

amazingvmsloth-0.2.8.tar.gz (53.0 kB)

Uploaded May 18, 2026 Source

Built Distribution

amazingvmsloth-0.2.8-py3-none-any.whl (58.5 kB)

Uploaded May 18, 2026 Python 3

File details

Details for the file amazingvmsloth-0.2.8.tar.gz.

File metadata

Download URL: amazingvmsloth-0.2.8.tar.gz
Upload date: May 18, 2026
Size: 53.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for amazingvmsloth-0.2.8.tar.gz
Algorithm	Hash digest
SHA256	`fc4f2a1b74718c264ffc1ae52eafe1309ed974c0cc57dea00f1395e30fe232c6`
MD5	`3da12703a5c1eff0683956dc1bac4d25`
BLAKE2b-256	`74a7ddebecb6de7057a539e3129cffced743028a854c90114b817cc7f379ccaa`
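
A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the file is assumed to sit in the current directory under its published name.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for amazingvmsloth-0.2.8.tar.gz
EXPECTED_SHA256 = "fc4f2a1b74718c264ffc1ae52eafe1309ed974c0cc57dea00f1395e30fe232c6"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("amazingvmsloth-0.2.8.tar.gz")  # assumed download location
if archive.exists():
    print("match" if sha256_of(str(archive)) == EXPECTED_SHA256 else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

pip can perform the same check at install time via its hash-checking mode, if you pin the digest in a requirements file.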

File details

Details for the file amazingvmsloth-0.2.8-py3-none-any.whl.

File metadata

Download URL: amazingvmsloth-0.2.8-py3-none-any.whl
Upload date: May 18, 2026
Size: 58.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for amazingvmsloth-0.2.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5cdf88ad8a68218e6e2ae7a3409c3411bd93de09919b473949e8a45fa8b90464`
MD5	`5c5f041f5fa7fd4af5aa110c2b8fc378`
BLAKE2b-256	`60ed4cdb550884c7f8118e5ebb29be48735abe67629c1195a95d84a9b75be1fe`
