Drop-in Unsloth alternative — zero errors, maximum speed LoRA/QLoRA fine-tuning

FastLoRA ⚡🛡️🧠

The drop-in Unsloth alternative that never crashes.
Maximum speed LoRA/QLoRA fine-tuning — unlimited model size, automatic hardware detection, unstoppable training.

pip install fastlora
pip install "fastlora[full]"   # recommended

Why FastLoRA?

Feature | Unsloth | FastLoRA
Installation errors | Frequent | None
Crashes during training | Common | Never
VRAM Guard (auto OOM prevention) | |
Auto hardware detection | |
Unlimited model size (1B → 1T+) | |
Unstoppable training | |
Every feature True/False toggle | Partial |
0.0–1.0 power control per feature | |
Compiled kernel cache | |
Adapter hot-swap (ms) | |
Multi-GPU: DDP / FSDP / DeepSpeed | Partial |

Installation

# Minimal
pip install fastlora

# Recommended (full features)
pip install "fastlora[full]"

# With Flash Attention 2 (requires CUDA + compilation)
pip install "fastlora[full,flash]"

# With DeepSpeed multi-GPU
pip install "fastlora[full,distributed]"

# Everything
pip install "fastlora[all]"

Quick Start

from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",
    lora=True,
    quantization="4bit",
    flash_attention=True,
    vram_guard=True,
)
model, tokenizer = fl.load()

Settings Panel

from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",

    # LORA              On/Off     Power (0.0–1.0)
    lora              = True,    # lora_power          = 1.0,
    lora_r            = 16,      # rank: 8 / 16 / 32 / 64
    lora_alpha        = 32,      # scaling (usually r×2)

    # QUANTIZATION      On/Off     Power
    quantization      = "4bit",  # "4bit" / "8bit" / "none"
    quantization_power= 1.0,     # <0.7 → falls back to 8bit

    # SPEED             On/Off     Power
    flash_attention   = True,    # attention_power     = 1.0,
    torch_compile     = True,    # compile_power       = 0.8,
    fused_ops         = True,
    batch_packing     = True,    # packing_power       = 1.0,
    cuda_optimizations= True,
    compile_cache     = True,
    pin_memory        = True,
    auto_batch_size   = True,

    # VRAM GUARD        Threshold (0.0–1.0)
    vram_guard        = True,    # vram_guard_power    = 0.85,

    # TRAINING
    precision         = "auto",
    gradient_checkpointing = True,
    learning_rate     = 2e-4,
)

model, tokenizer = fl.load()

Full Training Pipeline

from fastlora import FastLoRA, format_alpaca
from fastlora import CheckpointManager, LRFinder, EarlyStopping, ExperimentLogger
from datasets import load_dataset

fl = FastLoRA("meta-llama/Llama-3.2-3B", lora=True, quantization="4bit")
model, tokenizer = fl.load()

dataset = load_dataset("tatsu-lab/alpaca", split="train[:2000]")
trainer = fl.get_trainer(dataset, formatting_func=format_alpaca)

ExperimentLogger(fl, tensorboard=True, csv=True).patch_trainer(trainer)
EarlyStopping(patience=3).patch_trainer(trainer)
resume = CheckpointManager(fl, "./checkpoints").patch_trainer(trainer)

fl.train(trainer, resume_path=resume)
fl.save("./my_model")

v4.2 New Features

Auto Hardware Scanner

Scans the GPU at startup and applies the best-fitting settings automatically. Settings you pass manually always take priority over the scanned profile.

CPU          → compile=False, flash=False, batch=1
Low-end GPU  → 4bit, grad_ckpt, cpu_offload
Mid-range    → 4bit, flash attention, batch=2
High-end     → 4bit, Flash Attention 2, batch=4, bf16
Datacenter   → no quant, fullgraph compile, batch=8
Flagship     → no quant, fullgraph compile, batch=16
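The tier table above can be sketched as a lookup plus a merge in which manual settings win. This is only an illustration: the VRAM cut-offs, the setting keys (`cpu_offload`, `batch_size`, `compile_mode`) and the function names are assumptions, not FastLoRA's actual scanner.

```python
from typing import Any, Dict, Optional

# Hypothetical VRAM cut-offs in GB. The README names the tiers but not
# the thresholds, so these numbers are placeholders for illustration.
TIERS = [
    (8.0, {"quantization": "4bit", "gradient_checkpointing": True, "cpu_offload": True}),   # low-end
    (16.0, {"quantization": "4bit", "flash_attention": True, "batch_size": 2}),             # mid-range
    (32.0, {"quantization": "4bit", "flash_attention": True, "batch_size": 4,
            "precision": "bf16"}),                                                          # high-end
    (80.0, {"quantization": "none", "compile_mode": "fullgraph", "batch_size": 8}),         # datacenter
]
FLAGSHIP = {"quantization": "none", "compile_mode": "fullgraph", "batch_size": 16}
CPU_ONLY = {"torch_compile": False, "flash_attention": False, "batch_size": 1}


def pick_profile(vram_gb: Optional[float]) -> Dict[str, Any]:
    """Return default settings for a device; None means no GPU was found."""
    if vram_gb is None:
        return dict(CPU_ONLY)
    for limit, settings in TIERS:
        if vram_gb < limit:
            return dict(settings)
    return dict(FLAGSHIP)


def resolve_settings(vram_gb: Optional[float], manual: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the scanned profile with user settings; user settings win."""
    return {**pick_profile(vram_gb), **manual}
```

For example, `resolve_settings(15.0, {"batch_size": 1})` keeps the mid-range defaults under these placeholder thresholds but uses the manually chosen batch size.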

Unlimited Parameter Support

A loading strategy is chosen automatically from the model's parameter count:

0–3B    → normal mode
3–10B   → 4bit + gradient checkpointing
10–30B  → 4bit + CPU offload + layer offload
30–100B → aggressive offload + batch=1
100B+   → streaming mode (1 layer on GPU at a time)
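A minimal sketch of that size-to-strategy mapping follows. The function names are hypothetical, and treating each upper bound as exclusive is an assumption, since the ranges above share their boundary values. The helper `approx_weight_gb` only estimates weight storage (bits per parameter divided by 8 gives bytes per parameter) and ignores activations, gradients and optimizer state.

```python
def pick_strategy(params_billion: float) -> str:
    """Map a parameter count (in billions) to one of the strategies listed above."""
    if params_billion < 0:
        raise ValueError("parameter count must be non-negative")
    if params_billion < 3:
        return "normal mode"
    if params_billion < 10:
        return "4bit + gradient checkpointing"
    if params_billion < 30:
        return "4bit + CPU offload + layer offload"
    if params_billion < 100:
        return "aggressive offload + batch=1"
    return "streaming mode"


def approx_weight_gb(params_billion: float, bits: int) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8
```

Under this estimate a 3B model quantized to 4 bits needs about 1.5 GB for its weights, while the same model in 16-bit precision needs about 6 GB.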

Unstoppable Training

Only KeyboardInterrupt can stop training:

OOM          → clean memory, reduce batch, continue
CUDA error   → reset device, continue
NaN/Inf loss → skip step, reduce LR if persistent
Data error   → skip sample, continue
Unknown      → activate safe mode, continue
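The recovery rules above can be pictured as a loop that catches each failure class and keeps going. The sketch below is not FastLoRA's implementation: `resilient_train` and `step_fn` are hypothetical names, and Python's built-in `MemoryError` stands in for a GPU out-of-memory error so the example runs without a GPU. `KeyboardInterrupt` derives from `BaseException`, so `except Exception` lets it pass through and stop the run.

```python
import math
from typing import Callable, Dict


def resilient_train(step_fn: Callable[[int, int], float],
                    total_steps: int,
                    batch_size: int) -> Dict[str, int]:
    """Run step_fn(step, batch_size) for total_steps, recovering from failures."""
    stats = {"completed": 0, "skipped": 0, "batch_size": batch_size}
    step = 0
    while step < total_steps:
        try:
            loss = step_fn(step, stats["batch_size"])
        except MemoryError:
            # Stand-in for OOM: halve the batch and retry the same step.
            if stats["batch_size"] > 1:
                stats["batch_size"] //= 2
                continue
            # The batch cannot shrink further, so give up on this step.
            stats["skipped"] += 1
            step += 1
            continue
        except Exception:
            # Data errors and unknown failures: skip the step and move on.
            stats["skipped"] += 1
            step += 1
            continue
        if math.isnan(loss) or math.isinf(loss):
            stats["skipped"] += 1  # do not apply an update for a NaN/Inf loss
        else:
            stats["completed"] += 1
        step += 1
    return stats
```

A real trainer would also free cached GPU memory after an OOM and lower the learning rate when NaN losses persist; those steps are left out to keep the control flow visible.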

Feature Reference

Speed

Parameter | Default | Description
torch_compile | True | ~2x faster after warmup
compile_cache | True | 5s startup instead of 3min
fused_ops | True | Fused RMSNorm (Triton)
cuda_optimizations | True | TF32 + cuDNN benchmark
batch_packing | True | Zero padding, ~1.4x throughput
pin_memory | True | Async CPU→GPU
auto_batch_size | True | Max batch for available VRAM
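To show what the "zero padding" claim for `batch_packing` refers to, here is a sketch of sequence packing with a greedy first-fit-decreasing heuristic. Whether FastLoRA uses this heuristic is not stated in this README; the function name and the algorithm choice are assumptions.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sequence indices into bins whose total length fits in max_len.

    Sequences are placed longest first into the first bin with enough room.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []   # sequence indices per packed row
    free: List[int] = []         # remaining token slots per packed row
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} is longer than max_len")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing row has room, so open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins
```

With lengths `[600, 424, 512, 512]` and `max_len=1024`, the four sequences fit into two full rows instead of four padded ones. A real implementation also has to mask attention so that packed sequences do not attend to each other.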

Safety

Parameter | Default | Description
vram_guard | True | Auto OOM prevention
vram_guard_power | 0.85 | Intervenes at 85% VRAM
unstoppable | True | Nothing stops training
allow_remote_code | False | Remote model code (keep False)
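The `vram_guard_power` threshold can be read as a simple usage-fraction check. The sketch below is an assumption about how such a guard might decide to step in, not FastLoRA's code; the function names are hypothetical. In PyTorch the two byte counts could come from `torch.cuda.mem_get_info()`, which returns `(free, total)`.

```python
def vram_usage_fraction(free_bytes: int, total_bytes: int) -> float:
    """Fraction of device memory currently in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - free_bytes / total_bytes


def should_intervene(free_bytes: int, total_bytes: int, threshold: float = 0.85) -> bool:
    """True once usage reaches the guard threshold (0.85 means 85% of VRAM)."""
    return vram_usage_fraction(free_bytes, total_bytes) >= threshold
```

On a 16 GiB card with 2 GiB free, usage is 87.5%, so a guard at the default 0.85 would intervene; with 4 GiB free (75% used) it would not.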

v4.2 Systems

Parameter | Default | Description
auto_hardware_scan | True | Auto GPU profile
unlimited_params | True | Auto strategy for any model size
loss_spike_detection | False | Detect loss spikes
dynamic_batch_scaling | False | Real-time batch adjustment
gradient_noise_monitor | False | Gradient health monitoring
smart_checkpoint | False | Save only on improvement
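As an illustration of what `loss_spike_detection` could mean, the sketch below flags a loss that exceeds a multiple of the recent rolling mean. The class name, window size and factor are all assumptions; the README does not describe the actual detection rule.

```python
from collections import deque


class LossSpikeDetector:
    """Flag a loss value that jumps well above the recent average."""

    def __init__(self, window: int = 20, factor: float = 2.0) -> None:
        self.history = deque(maxlen=window)  # recent non-spike losses
        self.factor = factor

    def update(self, loss: float) -> bool:
        """Record a loss and return True if it looks like a spike."""
        window_full = len(self.history) == self.history.maxlen
        if window_full:
            mean = sum(self.history) / len(self.history)
            if loss > self.factor * mean:
                # Spikes are reported but kept out of the baseline.
                return True
        self.history.append(loss)
        return False
```

Keeping spikes out of the history stops a single outlier from raising the baseline and hiding the next one. No spike is reported until the window has filled.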

Benchmark Results

Tested on NVIDIA Tesla T4 (Google Colab):

Version | Model | Steps | Time | Throughput
FastLoRA v3 | TinyLlama-1.1B | 50 | 192s | 2.07 samples/s
FastLoRA v4 Beta | Qwen2.5-1.5B | 50 | 21s | 3.40 samples/s
FastLoRA v4.1 | TinyLlama-1.1B | 50 | 28.45s | 3.516 samples/s
FastLoRA v4.2 | Qwen2.5-1.5B | 200 | 470s | 3.404 samples/s

Unsloth was also benchmarked, but it did not run, so no comparison figures are reported.


Requirements

Required: torch ≥ 2.1.0, transformers ≥ 4.40.0, accelerate ≥ 0.27.0

Optional ([full]): peft, bitsandbytes, trl, datasets

Optional extras: flash-attn, triton, deepspeed, optuna, wandb, tensorboard


License

MIT © 2025 Ömür Bera Işık
