Drop-in Unsloth alternative — zero errors, maximum speed LoRA/QLoRA fine-tuning
FastLoRA ⚡🛡️🧠
The drop-in Unsloth alternative that never crashes.
Maximum speed LoRA/QLoRA fine-tuning — unlimited model size, automatic hardware detection, unstoppable training.
pip install fastlora
pip install "fastlora[full]" # recommended
Why FastLoRA?
| | Unsloth | FastLoRA |
|---|---|---|
| Installation errors | Frequent | None |
| Crashes during training | Common | Never |
| VRAM Guard (auto OOM prevention) | ✗ | ✓ |
| Auto hardware detection | ✗ | ✓ |
| Unlimited model size (1B → 1T+) | ✗ | ✓ |
| Unstoppable training | ✗ | ✓ |
| True/False toggle for every feature | Partial | ✓ |
| 0.0–1.0 power control per feature | ✗ | ✓ |
| Compiled kernel cache | ✗ | ✓ |
| Adapter hot-swap (ms) | ✗ | ✓ |
| Multi-GPU: DDP / FSDP / DeepSpeed | Partial | ✓ |
Installation
# Minimal
pip install fastlora
# Recommended (full features)
pip install "fastlora[full]"
# With Flash Attention 2 (requires CUDA + compilation)
pip install "fastlora[full,flash]"
# With DeepSpeed multi-GPU
pip install "fastlora[full,distributed]"
# Everything
pip install "fastlora[all]"
Quick Start
from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",
    lora=True,
    quantization="4bit",
    flash_attention=True,
    vram_guard=True,
)
model, tokenizer = fl.load()
Settings Panel
from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",

    # LORA On/Off Power (0.0–1.0)
    lora = True,                    # lora_power = 1.0,
    lora_r = 16,                    # rank: 8 / 16 / 32 / 64
    lora_alpha = 32,                # scaling (usually r×2)

    # QUANTIZATION On/Off Power
    quantization = "4bit",          # "4bit" / "8bit" / "none"
    quantization_power = 1.0,       # <0.7 → falls back to 8bit

    # SPEED On/Off Power
    flash_attention = True,         # attention_power = 1.0,
    torch_compile = True,           # compile_power = 0.8,
    fused_ops = True,
    batch_packing = True,           # packing_power = 1.0,
    cuda_optimizations = True,
    compile_cache = True,
    pin_memory = True,
    auto_batch_size = True,

    # VRAM GUARD Threshold (0.0–1.0)
    vram_guard = True,              # vram_guard_power = 0.85,

    # TRAINING
    precision = "auto",
    gradient_checkpointing = True,
    learning_rate = 2e-4,
)
model, tokenizer = fl.load()
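For readers wondering how `lora_r` and `lora_alpha` interact: in standard LoRA the low-rank update is scaled by `alpha / r`, so `lora_r = 16` with `lora_alpha = 32` gives a scale of 2.0. The plain-Python sketch below only illustrates that arithmetic. The function names are hypothetical and are not part of the FastLoRA API.

```python
def lora_scale(r: int, alpha: float) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    if r <= 0:
        raise ValueError("rank must be positive")
    return alpha / r


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen weight (d_out x d_in), A is r x d_in, B is d_out x r.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_scale(r, alpha)
    return [b + scale * u for b, u in zip(base, update)]
```

Doubling `lora_alpha` at a fixed rank doubles the contribution of the adapter, which is why the two values are usually tuned together.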
Full Training Pipeline
from fastlora import FastLoRA, format_alpaca
from fastlora import CheckpointManager, LRFinder, EarlyStopping, ExperimentLogger
from datasets import load_dataset
fl = FastLoRA("meta-llama/Llama-3.2-3B", lora=True, quantization="4bit")
model, tokenizer = fl.load()
dataset = load_dataset("tatsu-lab/alpaca", split="train[:2000]")
trainer = fl.get_trainer(dataset, formatting_func=format_alpaca)
ExperimentLogger(fl, tensorboard=True, csv=True).patch_trainer(trainer)
EarlyStopping(patience=3).patch_trainer(trainer)
resume = CheckpointManager(fl, "./checkpoints").patch_trainer(trainer)
fl.train(trainer, resume_path=resume)
fl.save("./my_model")
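The pipeline above passes `patience=3` to `EarlyStopping`, but this page does not say how the stop decision is made. A common patience-based rule is sketched below as an assumption about the general technique, not as FastLoRA's implementation. `PatienceStopper` and `min_delta` are hypothetical names.

```python
class PatienceStopper:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, eval_loss: float) -> bool:
        """Record one evaluation loss; return True when training should stop."""
        if eval_loss < self.best - self.min_delta:
            # New best loss: remember it and reset the counter.
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

With `patience=3`, the losses 1.0, 0.9, 0.95, 0.92, 0.91 trigger a stop on the fifth evaluation, because none of the last three beat 0.9.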
v4.2 New Features
Auto Hardware Scanner
Scans the GPU on startup and applies the best-matching settings automatically. Manual settings always take priority.
CPU → compile=False, flash=False, batch=1
Low-end GPU → 4bit, grad_ckpt, cpu_offload
Mid-range → 4bit, flash attention, batch=2
High-end → 4bit, flash att 2, batch=4, bf16
Datacenter → no quant, fullgraph compile, batch=8
Flagship → no quant, fullgraph compile, batch=16
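The tier list above, together with the rule that manual settings win, can be sketched as a lookup followed by an override. The profile values are transcribed from the list. The dictionary keys, the function names, and above all the VRAM cut-offs are assumptions made for illustration, since the page does not state how tiers are detected.

```python
# Profile defaults transcribed from the tier list; key names are illustrative.
PROFILES = {
    "cpu": {"torch_compile": False, "flash_attention": False, "batch_size": 1},
    "low_end": {"quantization": "4bit", "gradient_checkpointing": True, "cpu_offload": True},
    "mid_range": {"quantization": "4bit", "flash_attention": True, "batch_size": 2},
    "high_end": {"quantization": "4bit", "flash_attention": True, "batch_size": 4, "precision": "bf16"},
    "datacenter": {"quantization": "none", "torch_compile": "fullgraph", "batch_size": 8},
    "flagship": {"quantization": "none", "torch_compile": "fullgraph", "batch_size": 16},
}

# ASSUMPTION: these VRAM upper bounds (GiB) are invented for this sketch.
VRAM_TIERS = [(8, "low_end"), (16, "mid_range"), (24, "high_end"), (80, "datacenter")]


def pick_profile(vram_gib):
    """Return a tier name; pass None when no GPU is available."""
    if vram_gib is None:
        return "cpu"
    for upper, name in VRAM_TIERS:
        if vram_gib <= upper:
            return name
    return "flagship"


def resolve_settings(vram_gib, manual=None):
    """Start from the profile defaults, then let manual settings override them."""
    settings = dict(PROFILES[pick_profile(vram_gib)])
    settings.update(manual or {})
    return settings
```

For example, `resolve_settings(16, {"batch_size": 1})` keeps the mid-range defaults but uses the manually chosen batch size.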
Unlimited Parameter Support
Picks a loading strategy automatically based on model size:
0–3B → normal mode
3–10B → 4bit + gradient checkpointing
10–30B → 4bit + CPU offload + layer offload
30–100B → aggressive offload + batch=1
100B+ → streaming mode (1 layer on GPU at a time)
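The size bands above amount to a threshold lookup. A minimal sketch follows, assuming a model that sits exactly on a boundary is placed in the larger band (the list does not say which side owns the boundary). `pick_strategy` is a hypothetical name, not a FastLoRA function.

```python
import bisect

# Upper bounds, in billions of parameters, of the first four bands above.
BOUNDS = [3, 10, 30, 100]
STRATEGIES = [
    "normal mode",
    "4bit + gradient checkpointing",
    "4bit + CPU offload + layer offload",
    "aggressive offload + batch=1",
    "streaming mode (1 layer on GPU at a time)",
]


def pick_strategy(params_billion: float) -> str:
    """Map a parameter count (in billions) to the strategy for its size band."""
    if params_billion < 0:
        raise ValueError("parameter count cannot be negative")
    # ASSUMPTION: bisect_right sends boundary values (3, 10, ...) to the larger band.
    return STRATEGIES[bisect.bisect_right(BOUNDS, params_billion)]
```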
Unstoppable Training
Only KeyboardInterrupt can stop training:
OOM → clean memory, reduce batch, continue
CUDA error → reset device, continue
NaN/Inf loss → skip step, reduce LR if persistent
Data error → skip sample, continue
Unknown → activate safe mode, continue
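The recovery rules above can be pictured as a training loop that catches failures instead of propagating them. The sketch below is framework-free and makes several assumptions: `FakeOOM` stands in for a real out-of-memory error, a failed batch is skipped rather than retried, and the halving factors are arbitrary. None of these names come from FastLoRA.

```python
import math


class FakeOOM(RuntimeError):
    """Stand-in for an out-of-memory error raised by the framework."""


def resilient_loop(step_fn, batches, batch_size, lr, nan_patience=3):
    """Run step_fn over batches, recovering from failures instead of aborting.

    step_fn(batch, batch_size, lr) returns a loss or raises.
    KeyboardInterrupt is not an Exception subclass, so Ctrl+C still stops the loop.
    """
    log, bad_losses = [], 0
    for batch in batches:
        try:
            loss = step_fn(batch, batch_size, lr)
        except FakeOOM:
            batch_size = max(1, batch_size // 2)  # reduce batch, continue
            log.append("oom")
            continue
        except Exception:
            log.append("skipped")  # data error or unknown failure: skip, continue
            continue
        if math.isnan(loss) or math.isinf(loss):
            bad_losses += 1
            if bad_losses >= nan_patience:
                lr *= 0.5  # persistent NaN/Inf: reduce the learning rate
                bad_losses = 0
            log.append("nan")
            continue
        bad_losses = 0
        log.append("ok")
    return log, batch_size, lr
```

Swallowing every exception keeps a run alive but can also hide real bugs, so a loop like this is usually paired with logging of each recovered failure.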
Feature Reference
Speed
| Parameter | Default | Description |
|---|---|---|
| `torch_compile` | `True` | ~2x faster after warmup |
| `compile_cache` | `True` | 5s startup instead of 3min |
| `fused_ops` | `True` | Fused RMSNorm (Triton) |
| `cuda_optimizations` | `True` | TF32 + cuDNN benchmark |
| `batch_packing` | `True` | Zero padding, ~1.4x throughput |
| `pin_memory` | `True` | Async CPU→GPU |
| `auto_batch_size` | `True` | Max batch for available VRAM |
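To show why `batch_packing` removes padding, here is a sketch of one common approach: greedy first-fit-decreasing packing of sequences into fixed-length rows. This page does not describe FastLoRA's actual packing algorithm, so treat the function names and the strategy as assumptions.

```python
def pack_sequences(lengths, max_len):
    """Pack sequences into rows of at most max_len tokens (first-fit decreasing).

    Returns a list of rows, each a list of indices into `lengths`.
    """
    bins, free = [], []
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} is longer than max_len")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing row has room, so open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins


def padding_fraction(lengths, max_len, bins):
    """Share of token slots that hold padding when every row is padded to max_len."""
    return 1 - sum(lengths) / (len(bins) * max_len)
```

With lengths `[6, 4, 3, 3, 2, 2]` and `max_len=10`, one sequence per row wastes two thirds of the slots, while packing fits everything into two full rows.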
Safety
| Parameter | Default | Description |
|---|---|---|
| `vram_guard` | `True` | Auto OOM prevention |
| `vram_guard_power` | `0.85` | Intervenes at 85% VRAM |
| `unstoppable` | `True` | Nothing stops training |
| `allow_remote_code` | `False` | Remote model code (keep False) |
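The `vram_guard_power` threshold can be read as a simple ratio check. The sketch below assumes the guard triggers when used memory reaches or exceeds the threshold; whether FastLoRA uses `>=` or `>`, and what it does once triggered, is not stated here. The function name is hypothetical.

```python
def vram_guard_should_intervene(used_bytes: int, total_bytes: int, power: float = 0.85) -> bool:
    """Return True when the used share of VRAM reaches the guard threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be between 0.0 and 1.0")
    # ASSUMPTION: the guard fires at or above the threshold.
    return used_bytes / total_bytes >= power
```

In PyTorch, the inputs could be derived from `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device, so used memory is `total - free`.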
v4.2 Systems
| Parameter | Default | Description |
|---|---|---|
| `auto_hardware_scan` | `True` | Auto GPU profile |
| `unlimited_params` | `True` | Auto strategy for any model size |
| `loss_spike_detection` | `False` | Detect loss spikes |
| `dynamic_batch_scaling` | `False` | Real-time batch adjustment |
| `gradient_noise_monitor` | `False` | Gradient health monitoring |
| `smart_checkpoint` | `False` | Save only on improvement |
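As an illustration of what `loss_spike_detection` might involve, the sketch below flags a loss that exceeds a multiple of the recent average. The window size, the factor, and the class name are all assumptions, because the table gives no details of the actual detection rule.

```python
from collections import deque


class LossSpikeDetector:
    """Flag a loss that exceeds `factor` times the mean of recent normal losses."""

    def __init__(self, window: int = 20, factor: float = 2.0):
        self.history = deque(maxlen=window)
        self.factor = factor

    def update(self, loss: float) -> bool:
        """Record one loss value; return True if it looks like a spike."""
        is_spike = False
        if self.history:
            mean = sum(self.history) / len(self.history)
            is_spike = loss > self.factor * mean
        if not is_spike:
            # Spikes are kept out of the history so they do not raise the baseline.
            self.history.append(loss)
        return is_spike
```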
Benchmark Results
Tested on NVIDIA Tesla T4 (Google Colab):
| Version | Model | Steps | Time | Throughput |
|---|---|---|---|---|
| FastLoRA v3 | TinyLlama-1.1B | 50 | 192s | 2.07 samples/s |
| FastLoRA v4 Beta | Qwen2.5-1.5B | 50 | 21s | 3.40 samples/s |
| FastLoRA v4.1 | TinyLlama-1.1B | 50 | 28.45s | 3.516 samples/s |
| FastLoRA v4.2 | Qwen2.5-1.5B | 200 | 470s | 3.404 samples/s |
Unsloth was also included in the benchmark but did not run, so no Unsloth figures are reported.
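When reading the table, it helps to check how many samples each step processed. Assuming Throughput means total samples divided by Time (the table does not define it), the implied samples per step can be recovered from the three numeric columns. The helper name is hypothetical.

```python
# (version, steps, seconds, samples per second), copied from the benchmark table.
ROWS = [
    ("FastLoRA v3", 50, 192.0, 2.07),
    ("FastLoRA v4 Beta", 50, 21.0, 3.40),
    ("FastLoRA v4.1", 50, 28.45, 3.516),
    ("FastLoRA v4.2", 200, 470.0, 3.404),
]


def implied_samples_per_step(steps: int, seconds: float, throughput: float) -> float:
    """Total samples (throughput x time) divided by the number of steps."""
    return throughput * seconds / steps


IMPLIED = {name: implied_samples_per_step(s, t, thr) for name, s, t, thr in ROWS}

for name, value in IMPLIED.items():
    print(f"{name}: about {value:.2f} samples per step")
```

Under that assumption the rows differ in samples per step as well as in model and step count, so they are best read as separate measurements rather than a like-for-like comparison.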
Requirements
Required: torch ≥ 2.1.0, transformers ≥ 4.40.0, accelerate ≥ 0.27.0
Optional ([full]): peft, bitsandbytes, trl, datasets
Optional extras: flash-attn, triton, deepspeed, optuna, wandb, tensorboard
License
MIT © 2025 Ömür Bera Işık
Download files
File details
Details for the file fastlora-4.2.1.tar.gz.
File metadata
- Download URL: fastlora-4.2.1.tar.gz
- Upload date:
- Size: 38.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3077ac7b6848f00d49b94e692aeab02a989b86bf08b1cb49034f9c9d58463bd8 |
| MD5 | 4df5f0b59f12ff92ff4c2af0ebbead2e |
| BLAKE2b-256 | 3131b255e585ee3862f7e1c3f4e7e03b2ad6524ba52a8f1c7054fbc409351b65 |
File details
Details for the file fastlora-4.2.1-py3-none-any.whl.
File metadata
- Download URL: fastlora-4.2.1-py3-none-any.whl
- Upload date:
- Size: 38.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 697281739e8977fa1c0aca010a7e4604a7153b0b3ddf0350653736993b2ea636 |
| MD5 | b27eb1f009e836db312c010fc171d962 |
| BLAKE2b-256 | 4657de60b010737844862b2cd36d81e3fed12ea689967d0a3c3705f62859f9b2 |