Drop-in Unsloth alternative — zero errors, maximum speed LoRA/QLoRA fine-tuning
FastLoRA ⚡🛡️
The drop-in Unsloth alternative that never crashes.
Maximum speed LoRA/QLoRA fine-tuning with automatic safety guards — every feature is a simple True/False toggle.
pip install fastlora
pip install "fastlora[full]" # recommended
Why FastLoRA?
| Feature | Unsloth | FastLoRA |
|---|---|---|
| Installation errors | Frequent | None |
| VRAM Guard (auto OOM prevention) | ✗ | ✓ |
| Fallback chain on failure | ✗ | ✓ |
| Compiled kernel cache | ✗ | ✓ |
| Adapter hot-swap (ms) | ✗ | ✓ |
| Every feature True/False toggle | Partial | ✓ |
| 0.0–1.0 power control per feature | ✗ | ✓ |
| Multi-GPU: DDP / FSDP / DeepSpeed | Partial | ✓ |
| Works on CPU / MPS / CUDA | Partial | ✓ |
Installation
# Minimal (core dependencies only)
pip install fastlora
# Recommended (full features)
pip install "fastlora[full]"
# With Flash Attention 2 (requires CUDA + compilation)
pip install "fastlora[full,flash]"
# With DeepSpeed multi-GPU
pip install "fastlora[full,distributed]"
# Everything
pip install "fastlora[all]"
Quick Start
from fastlora import FastLoRA
fl = FastLoRA(
"meta-llama/Llama-3.2-3B",
lora=True,
quantization="4bit",
flash_attention=True,
vram_guard=True,
)
model, tokenizer = fl.load()
Settings Panel
Copy this into your training script and adjust what you need.
Everything else runs automatically.
from fastlora import FastLoRA
fl = FastLoRA(
"meta-llama/Llama-3.2-3B",
# LORA On/Off Power (0.0–1.0)
lora = True, # lora_power = 1.0,
lora_r = 16, # rank: 8 / 16 / 32 / 64
lora_alpha = 32, # scaling (usually r×2)
# QUANTIZATION On/Off Power
quantization = "4bit", # "4bit" / "8bit" / "none"
quantization_power = 1.0, # <0.7 → falls back to 8bit
# SPEED On/Off Power
flash_attention = True, # attention_power = 1.0,
torch_compile = True, # compile_power = 0.8,
fused_ops = True,
batch_packing = True, # packing_power = 1.0,
cuda_optimizations = True,
compile_cache = True,
pin_memory = True,
auto_batch_size = True,
# VRAM GUARD Threshold (0.0–1.0 → triggers at X% usage)
vram_guard = True, # vram_guard_power = 0.85,
# TRAINING
precision = "auto", # "auto" / "bf16" / "fp16" / "fp32"
gradient_checkpointing = True,
learning_rate = 2e-4,
)
model, tokenizer = fl.load()
Remove the # from any power parameter to activate it.
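For orientation on how lora_r and lora_alpha interact: in the standard LoRA formulation (the default in peft), the frozen weight is augmented by a low-rank product scaled by alpha / r. Whether FastLoRA changes this scaling internally is not stated here, so read this as the textbook form, not a description of FastLoRA's internals.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
```

With the defaults above (r = 16, alpha = 32) the scale factor is 2. Keeping lora_alpha at twice the rank holds that factor constant when the rank changes.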
Full Training Pipeline
from fastlora import FastLoRA, format_alpaca
from fastlora import CheckpointManager, LRFinder, EarlyStopping, ExperimentLogger
from datasets import load_dataset
# 1. Initialize
fl = FastLoRA("meta-llama/Llama-3.2-3B", lora=True, quantization="4bit")
model, tokenizer = fl.load()
# 2. Load dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:2000]")
# 3. Find optimal learning rate automatically (the dataset must exist first)
lr = LRFinder(fl).find(dataset)
# 4. Get trainer
trainer = fl.get_trainer(dataset, formatting_func=format_alpaca)
# 5. Add logging (Wandb + TensorBoard + CSV — all at once, no conflicts)
ExperimentLogger(fl, wandb=True, tensorboard=True, csv=True).patch_trainer(trainer)
# 6. Add early stopping
EarlyStopping(patience=3).patch_trainer(trainer)
# 7. Resume from checkpoint if exists
resume = CheckpointManager(fl, "./checkpoints").patch_trainer(trainer)
# 8. Train
trainer.train(resume_from_checkpoint=resume)
# 9. Save
fl.save("./my_model")
Feature Reference
Core Features
| Parameter | Type | Default | Description |
|---|---|---|---|
| lora | bool | True | Enable LoRA fine-tuning |
| lora_r | int | 16 | LoRA rank (8 / 16 / 32 / 64) |
| lora_alpha | int | 32 | LoRA scaling factor |
| lora_dropout | float | 0.05 | LoRA dropout rate |
| quantization | str | "4bit" | "4bit" / "8bit" / "none" |
| flash_attention | bool | True | Flash Attention 2 (falls back to SDPA if unavailable) |
| gradient_checkpointing | bool | True | Gradient checkpointing |
| precision | str | "auto" | "auto" / "bf16" / "fp16" / "fp32" |
Speed Features
| Parameter | Type | Default | Description |
|---|---|---|---|
| torch_compile | bool | True | torch.compile(): slow first step, ~2x faster after |
| compile_cache | bool | True | Cache compiled kernels (5s startup instead of 3min) |
| fused_ops | bool | True | Fused RMSNorm (Triton if available, PyTorch fallback) |
| cuda_optimizations | bool | True | TF32 + cuDNN benchmark mode |
| batch_packing | bool | True | Sequence packing: zero padding, ~1.4x throughput |
| pin_memory | bool | True | Async CPU→GPU transfer |
| auto_batch_size | bool | True | Auto-detect max batch size for available VRAM |
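To illustrate what sequence packing means for the batch_packing row, here is a minimal sketch of greedy first-fit packing. It is not FastLoRA's implementation; `pack_sequences` and its arguments are hypothetical names used only for this example.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing (illustrative sketch).

    Returns bins of sequence indices; the token count of each bin
    never exceeds max_len.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each bin
    # Placing the longest sequences first tends to leave fewer gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} ({n} tokens) exceeds max_len={max_len}")
        for b, capacity in enumerate(free):
            if n <= capacity:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins


lengths = [700, 300, 512, 512, 100, 900]
bins = pack_sequences(lengths, max_len=1024)
print(bins)
```

Padding each of these six sequences to 1024 tokens would take six rows; packed, they fit into fewer rows with far less padding.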
Power Controls (0.0 – 1.0)
| Parameter | Effect |
|---|---|
| lora_power | 0.5 → rank is halved. 1.0 = full |
| compile_power | 1.0 = max-autotune, 0.8 = reduce-overhead, 0.5 = default |
| quantization_power | < 0.7 → automatically falls back to 8bit |
| attention_power | 1.0 = Flash Att 2, 0.5 = SDPA, 0.0 = eager |
| packing_power | < 0.1 → packing disabled |
| vram_guard_power | 0.85 = triggers at 85% VRAM usage |
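The table can be read as a mapping from power values to concrete settings. The sketch below encodes one such reading. The function name `resolve_power` is hypothetical, and the behavior between the listed points (for example compile_power = 0.9) is an assumption; only the listed values come from the table.

```python
def resolve_power(
    lora_r: int = 16,
    lora_power: float = 1.0,
    compile_power: float = 0.8,
    quantization: str = "4bit",
    quantization_power: float = 1.0,
    attention_power: float = 1.0,
    packing_power: float = 1.0,
) -> dict:
    """Map 0.0-1.0 power values to concrete settings (illustrative sketch)."""
    # Assumption: rank scales linearly with lora_power (0.5 halves it).
    rank = max(1, int(lora_r * lora_power))

    # Assumption: values between the listed points round down to the
    # nearest listed mode. The mode names are torch.compile's own.
    if compile_power >= 1.0:
        compile_mode = "max-autotune"
    elif compile_power >= 0.8:
        compile_mode = "reduce-overhead"
    else:
        compile_mode = "default"

    # From the table: below 0.7, 4bit falls back to 8bit.
    if quantization == "4bit" and quantization_power < 0.7:
        quantization = "8bit"

    # Assumption: same round-down rule for the attention backend.
    if attention_power >= 1.0:
        attention = "flash_attention_2"
    elif attention_power >= 0.5:
        attention = "sdpa"
    else:
        attention = "eager"

    return {
        "lora_r": rank,
        "compile_mode": compile_mode,
        "quantization": quantization,
        "attention": attention,
        "packing": packing_power >= 0.1,  # from the table: < 0.1 disables packing
    }


settings = resolve_power(
    lora_power=0.5, quantization_power=0.6, attention_power=0.5, packing_power=0.05
)
print(settings)
```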
Safety Features
| Parameter | Type | Default | Description |
|---|---|---|---|
| vram_guard | bool | True | Auto OOM prevention: halves batch size, doubles gradient accumulation |
| vram_guard_power | float | 0.85 | Intervention threshold (0.85 = 85% VRAM) |
| oom_retry | bool | True | On OOM: retry with 4bit→8bit→fp16→cpu fallback chain |
| allow_remote_code | bool | False | Execute remote model code (dangerous, keep False) |
| strict_mode | bool | False | True = crash on error, False = fallback and continue |
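The oom_retry and strict_mode rows describe a fallback chain. A minimal sketch of that pattern follows, with a simulated loader standing in for real model loading. `load_with_fallback` and `fake_loader` are hypothetical names, and this is not FastLoRA's code.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Order taken from the oom_retry row of the table.
FALLBACK_CHAIN = ("4bit", "8bit", "fp16", "cpu")


def load_with_fallback(
    loader: Callable[[str], T],
    chain: Sequence[str] = FALLBACK_CHAIN,
    strict: bool = False,
) -> Tuple[str, T]:
    """Try each mode in order and return the first one that loads.

    torch.cuda.OutOfMemoryError subclasses RuntimeError, so catching
    RuntimeError would also cover real CUDA OOMs.
    """
    errors = []
    for mode in chain:
        try:
            return mode, loader(mode)
        except (MemoryError, RuntimeError) as exc:
            if strict:
                raise  # strict: crash on the first error
            errors.append(f"{mode}: {exc}")
    raise RuntimeError("all fallbacks failed: " + "; ".join(errors))


def fake_loader(mode: str) -> str:
    """Simulated loader: pretends the quantized modes run out of memory."""
    if mode in ("4bit", "8bit"):
        raise MemoryError(f"simulated OOM in {mode}")
    return f"model-{mode}"


mode, model = load_with_fallback(fake_loader)
print(mode, model)
```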
Distributed Training
| Parameter | Type | Default | Description |
|---|---|---|---|
| multi_gpu | bool | False | Enable multi-GPU training |
| distributed_backend | str | "ddp" | "ddp" / "fsdp" / "deepspeed" |
| deepspeed_stage | int | 2 | DeepSpeed ZeRO stage: 1 / 2 / 3 |
Long Context
| Parameter | Type | Default | Description |
|---|---|---|---|
| sliding_window | bool | False | Sliding Window Attention |
| sliding_window_size | int | 4096 | Window size in tokens |
| ring_attention | bool | False | Ring Attention for 500K+ tokens |
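As a rough picture of what sliding_window_size controls, the sketch below builds a causal sliding-window mask for a tiny sequence. It assumes the causal variant used by decoder-only models; how FastLoRA applies the window internally is not documented here, and `sliding_window_mask` is a hypothetical helper.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    A key is visible if it is not in the future (j <= i) and lies within
    the last `window` positions, counting the query position itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Each query attends to at most `window` keys, so attention cost grows linearly with sequence length instead of quadratically.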
Add-on Classes
CheckpointManager — Resume Training
ckpt = CheckpointManager(fl, "./checkpoints", save_total_limit=3)
resume = ckpt.patch_trainer(trainer)
trainer.train(resume_from_checkpoint=resume)
LRFinder — Auto Learning Rate
lr = LRFinder(fl, num_iter=100).find(train_dataset)
# fl.cfg.learning_rate is updated automatically
# saves lr_finder.png plot
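For readers unfamiliar with learning-rate finders, the sketch below shows a common LR range test: sweep the learning rate exponentially, smooth the loss, and pick the rate where the loss falls fastest. Whether LRFinder works exactly this way is an assumption. `lr_range_test` and `fake_loss` are hypothetical, and the loss curve is synthetic, not a measured result.

```python
import math
from typing import Callable


def lr_range_test(
    loss_at: Callable[[float], float],
    min_lr: float = 1e-7,
    max_lr: float = 1.0,
    num_iter: int = 100,
    smooth: float = 0.9,
) -> float:
    """Return the learning rate at the steepest drop of the smoothed loss."""
    ratio = (max_lr / min_lr) ** (1.0 / (num_iter - 1))
    lrs, losses = [], []
    avg, best = None, math.inf
    for step in range(num_iter):
        lr = min_lr * ratio**step
        loss = loss_at(lr)  # in real training: one optimizer step at this lr
        avg = loss if avg is None else smooth * avg + (1 - smooth) * loss
        if avg > 4 * best:
            break  # loss has diverged; stop the sweep
        best = min(best, avg)
        lrs.append(lr)
        losses.append(avg)
    # Slope of the smoothed loss with respect to log(lr); most negative wins.
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log(lrs[i + 1]) - math.log(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    steepest = min(range(len(slopes)), key=slopes.__getitem__)
    return lrs[steepest]


def fake_loss(lr: float) -> float:
    """Synthetic curve: flat, then a drop near 1e-3, then a blow-up past 1e-1."""
    x = math.log10(lr)
    return 2.0 - 1.0 / (1.0 + math.exp(-3.0 * (x + 3.0))) + math.exp(4.0 * (x + 1.0))


suggested = lr_range_test(fake_loss)
print(f"suggested lr: {suggested:.2e}")
```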
EarlyStopping
es = EarlyStopping(patience=3, min_delta=0.001, metric="eval_loss")
es.patch_trainer(trainer)
# es.stopped → True if training was stopped early
# es.best_value → best metric value seen
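The patience and min_delta arguments follow the usual early-stopping logic, sketched below for a metric where lower is better. `EarlyStopSketch` is a hypothetical stand-in written for this example, not the EarlyStopping class itself.

```python
class EarlyStopSketch:
    """Stop after `patience` evaluations without sufficient improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_value = float("inf")
        self.bad_evals = 0
        self.stopped = False

    def update(self, value: float) -> bool:
        """Record one evaluation result; return True once training should stop."""
        if value < self.best_value - self.min_delta:
            # Improvement larger than min_delta: reset the counter.
            self.best_value = value
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.stopped = True
        return self.stopped


es_sketch = EarlyStopSketch(patience=3, min_delta=0.001)
stop_index = None
for index, eval_loss in enumerate([1.0, 0.8, 0.7995, 0.81, 0.80, 0.79]):
    if es_sketch.update(eval_loss):
        stop_index = index
        break
print(stop_index, es_sketch.best_value)
```

The 0.7995 evaluation does not count as an improvement because it beats 0.8 by less than min_delta.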
ExperimentLogger — Wandb + TensorBoard + CSV
log = ExperimentLogger(
fl,
wandb=True, # pip install wandb
tensorboard=True, # pip install tensorboard
csv=True,
project="my-project",
run_name="run-001",
)
log.patch_trainer(trainer)
trainer.train()
log.finish()
AdapterManager — Hot-swap LoRA Adapters
fl.adapter_manager.register("math", "./adapters/math")
fl.adapter_manager.register("coding", "./adapters/coding")
model = fl.adapter_manager.swap(model, "math") # milliseconds
model = fl.adapter_manager.swap(model, "coding") # milliseconds
Utility Functions
from fastlora import check_environment, format_alpaca, format_chatml
# Check installed packages and versions
check_environment()
# Format dataset samples
text = format_alpaca({"instruction": "...", "input": "", "output": "..."})
text = format_chatml({"messages": [{"role": "user", "content": "..."}]})
# CLI environment check
fastlora-check
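To show the kind of text the formatting helpers above produce, the sketch below writes out the widely used Alpaca prompt template and the ChatML turn layout. It assumes format_alpaca and format_chatml follow these standard layouts, which is not confirmed here. `alpaca_text` and `chatml_text` are hypothetical stand-ins.

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def alpaca_text(sample: dict) -> str:
    """Render one Alpaca-style sample (prompt followed by the target output)."""
    if sample.get("input"):
        prompt = ALPACA_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    else:
        # Samples with an empty input use the shorter template.
        prompt = ALPACA_NO_INPUT.format(instruction=sample["instruction"])
    return prompt + sample["output"]


def chatml_text(sample: dict) -> str:
    """Render a list of chat messages in ChatML turn markers."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in sample["messages"]
    )


alpaca = alpaca_text({"instruction": "Add 2 and 3.", "input": "", "output": "5"})
chatml = chatml_text({"messages": [{"role": "user", "content": "hi"}]})
print(alpaca)
print(chatml)
```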
VRAM Guard — How It Works
When VRAM usage exceeds the threshold (default 85%):
- Batch size halved — e.g. 4 → 2
- Gradient accumulation doubled — effective batch size stays the same
- Cache cleared: gc.collect() + torch.cuda.empty_cache()
- Runs in a background thread, so training is not interrupted
# Check VRAM status anytime
print(fl.vram_status())
# {'used_gb': 7.2, 'reserved_gb': 8.0, 'total_gb': 24.0, 'free_gb': 16.0, 'ratio': 0.30}
# Manual check
fl.vram_guard.check_and_adapt()
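The adjustment described above can be sketched in a few lines: when the usage ratio crosses the threshold, the batch size is halved and gradient accumulation is doubled, which keeps the effective batch size fixed for even batch sizes. `BatchPlan` and `adapt` are hypothetical names for this example, not part of the FastLoRA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchPlan:
    batch_size: int
    grad_accum: int

    @property
    def effective(self) -> int:
        """Samples contributing to each optimizer step."""
        return self.batch_size * self.grad_accum


def adapt(
    plan: BatchPlan, used_gb: float, total_gb: float, threshold: float = 0.85
) -> BatchPlan:
    """Halve the batch and double accumulation once usage exceeds the threshold."""
    ratio = used_gb / total_gb
    if ratio <= threshold or plan.batch_size <= 1:
        return plan  # under the threshold, or nothing left to halve
    # Note: an odd batch size rounds down, so the effective size shrinks slightly.
    return BatchPlan(plan.batch_size // 2, plan.grad_accum * 2)


plan = BatchPlan(batch_size=4, grad_accum=4)
relaxed = adapt(plan, used_gb=7.2, total_gb=24.0)    # 30% usage: unchanged
squeezed = adapt(plan, used_gb=21.0, total_gb=24.0)  # 87.5% usage: adapted
print(relaxed, squeezed, squeezed.effective)
```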
Benchmark
tok_per_sec = fl.profile(n_steps=10)
Device Support
| Device | LoRA | Quantization | Flash Attention | torch.compile |
|---|---|---|---|---|
| CUDA (RTX / A100 / H100) | ✓ | ✓ | ✓ | ✓ |
| Apple Silicon (MPS) | ✓ | ✗ | ✗ | ✗ |
| CPU | ✓ | ✗ | ✗ | ✓ |
Requirements
Required:
- Python ≥ 3.9
- torch ≥ 2.1.0
- transformers ≥ 4.40.0
- accelerate ≥ 0.27.0
Optional (install with [full]):
- peft ≥ 0.10.0 — LoRA/QLoRA
- bitsandbytes ≥ 0.43.0 — 4bit/8bit quantization
- trl ≥ 0.8.0 — SFTTrainer
- datasets ≥ 2.18.0 — dataset loading
Optional extras:
- flash-attn — Flash Attention 2
- triton — fused kernels
- deepspeed — ZeRO training
- wandb — experiment tracking
- tensorboard — training visualization
License
MIT