Drop-in Unsloth alternative — zero errors, maximum speed LoRA/QLoRA fine-tuning

FastLoRA ⚡🛡️

The drop-in Unsloth alternative that never crashes.
Maximum speed LoRA/QLoRA fine-tuning with automatic safety guards — every feature is a simple True/False toggle.

pip install fastlora
pip install "fastlora[full]"   # recommended

Why FastLoRA?

Feature                              Unsloth    FastLoRA
Installation errors                  Frequent   None
VRAM Guard (auto OOM prevention)
Fallback chain on failure
Compiled kernel cache
Adapter hot-swap (ms)
Every feature True/False toggle      Partial
0.0–1.0 power control per feature
Multi-GPU: DDP / FSDP / DeepSpeed    Partial
Works on CPU / MPS / CUDA            Partial

Installation

# Minimal (core dependencies only)
pip install fastlora

# Recommended (full features)
pip install "fastlora[full]"

# With Flash Attention 2 (requires CUDA + compilation)
pip install "fastlora[full,flash]"

# With DeepSpeed multi-GPU
pip install "fastlora[full,distributed]"

# Everything
pip install "fastlora[all]"

Quick Start

from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",
    lora=True,
    quantization="4bit",
    flash_attention=True,
    vram_guard=True,
)
model, tokenizer = fl.load()

Settings Panel

Copy this into your training script and adjust what you need.
Everything else runs automatically.

from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",

    # LORA            On/Off     Power (0.0–1.0)
    lora            = True,    # lora_power          = 1.0,
    lora_r          = 16,      # rank: 8 / 16 / 32 / 64
    lora_alpha      = 32,      # scaling (usually r×2)

    # QUANTIZATION    On/Off     Power
    quantization    = "4bit",  # "4bit" / "8bit" / "none"
    quantization_power = 1.0,  # <0.7 → falls back to 8bit

    # SPEED           On/Off     Power
    flash_attention = True,    # attention_power     = 1.0,
    torch_compile   = True,    # compile_power       = 0.8,
    fused_ops       = True,
    batch_packing   = True,    # packing_power       = 1.0,
    cuda_optimizations = True,
    compile_cache   = True,
    pin_memory      = True,
    auto_batch_size = True,

    # VRAM GUARD      Threshold (0.0–1.0 → triggers at X% usage)
    vram_guard      = True,    # vram_guard_power    = 0.85,

    # TRAINING
    precision       = "auto",  # "auto" / "bf16" / "fp16" / "fp32"
    gradient_checkpointing = True,
    learning_rate   = 2e-4,
)

model, tokenizer = fl.load()

Remove the # from any power parameter to activate it.


Full Training Pipeline

from fastlora import FastLoRA, format_alpaca
from fastlora import CheckpointManager, LRFinder, EarlyStopping, ExperimentLogger
from datasets import load_dataset

# 1. Initialize
fl = FastLoRA("meta-llama/Llama-3.2-3B", lora=True, quantization="4bit")
model, tokenizer = fl.load()

# 2. Load dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:2000]")

# 3. Find optimal learning rate automatically (fl.cfg.learning_rate is updated)
lr = LRFinder(fl).find(dataset)

# 4. Get trainer
trainer = fl.get_trainer(dataset, formatting_func=format_alpaca)

# 5. Add logging (Wandb + TensorBoard + CSV — all at once, no conflicts)
ExperimentLogger(fl, wandb=True, tensorboard=True, csv=True).patch_trainer(trainer)

# 6. Add early stopping
EarlyStopping(patience=3).patch_trainer(trainer)

# 7. Resume from the latest checkpoint if one exists
resume = CheckpointManager(fl, "./checkpoints").patch_trainer(trainer)

# 8. Train
trainer.train(resume_from_checkpoint=resume)

# 9. Save
fl.save("./my_model")

Feature Reference

Core Features

Parameter               Type   Default  Description
lora                    bool   True     Enable LoRA fine-tuning
lora_r                  int    16       LoRA rank (8 / 16 / 32 / 64)
lora_alpha              int    32       LoRA scaling factor
lora_dropout            float  0.05     LoRA dropout rate
quantization            str    "4bit"   "4bit" / "8bit" / "none"
flash_attention         bool   True     Flash Attention 2 (falls back to SDPA if unavailable)
gradient_checkpointing  bool   True     Gradient checkpointing
precision               str    "auto"   "auto" / "bf16" / "fp16" / "fp32"

Speed Features

Parameter           Type  Default  Description
torch_compile       bool  True     torch.compile(): slow first step, ~2x faster after
compile_cache       bool  True     Cache compiled kernels (5s startup instead of 3min)
fused_ops           bool  True     Fused RMSNorm (Triton if available, PyTorch fallback)
cuda_optimizations  bool  True     TF32 + cuDNN benchmark mode
batch_packing       bool  True     Sequence packing: zero padding, ~1.4x throughput
pin_memory          bool  True     Async CPU→GPU transfer
auto_batch_size     bool  True     Auto-detect max batch size for available VRAM
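
To illustrate what sequence packing does, here is a minimal sketch that groups tokenized samples into fixed-length rows so that little space is left for padding. The function name `pack_sequences` and the first-fit-decreasing strategy are assumptions made for this example; FastLoRA's own packing may work differently.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing (illustrative, not FastLoRA's code).

    lengths: token count of each sample.
    Returns a list of bins; each bin is a list of sample indices whose
    combined length fits into max_len tokens.
    """
    bins = []  # each entry: [remaining_capacity, [sample indices]]
    # Place the longest samples first; short ones then fill the gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        for b in bins:
            if b[0] >= n:  # first bin with enough room
                b[0] -= n
                b[1].append(i)
                break
        else:  # no existing bin fits, so open a new one
            bins.append([max_len - n, [i]])
    return [indices for _, indices in bins]


lengths = [300, 700, 512, 200, 500, 12]
packed = pack_sequences(lengths, max_len=1024)
print(packed)  # [[1, 0, 5], [2, 4], [3]]
```

Six samples that would each occupy a 1024-token row when padded individually fit into three rows here.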

Power Controls (0.0 – 1.0)

Parameter           Effect
lora_power          0.5 → rank is halved. 1.0 = full
compile_power       1.0 = max-autotune, 0.8 = reduce-overhead, 0.5 = default
quantization_power  < 0.7 → automatically falls back to 8bit
attention_power     1.0 = Flash Att 2, 0.5 = SDPA, 0.0 = eager
packing_power       < 0.1 → packing disabled
vram_guard_power    0.85 = triggers at 85% VRAM usage
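
As a rough illustration of how a 0.0–1.0 power value could be turned into concrete settings, the sketch below mirrors the table above. The function `resolve_power` is hypothetical, and the cut-off points between levels (for example, what happens at `compile_power = 0.9`) are assumptions; FastLoRA's internal thresholds may differ.

```python
def resolve_power(
    lora_r=16,
    lora_power=1.0,
    compile_power=0.8,
    quantization="4bit",
    quantization_power=1.0,
    attention_power=1.0,
    packing_power=1.0,
):
    """Hypothetical mapping from power values to concrete settings."""
    # lora_power scales the rank: 0.5 halves it (never below 1).
    rank = max(1, round(lora_r * lora_power))

    # compile_power selects a torch.compile mode (cut-offs are assumed).
    if compile_power >= 1.0:
        compile_mode = "max-autotune"
    elif compile_power >= 0.8:
        compile_mode = "reduce-overhead"
    else:
        compile_mode = "default"

    # Below 0.7, 4-bit quantization falls back to 8-bit.
    if quantization == "4bit" and quantization_power < 0.7:
        quantization = "8bit"

    # attention_power selects the attention backend (cut-offs are assumed).
    if attention_power >= 1.0:
        attention = "flash_attention_2"
    elif attention_power >= 0.5:
        attention = "sdpa"
    else:
        attention = "eager"

    return {
        "lora_r": rank,
        "compile_mode": compile_mode,
        "quantization": quantization,
        "attention": attention,
        "packing": packing_power >= 0.1,  # below 0.1, packing is disabled
    }


low = resolve_power(lora_power=0.5, compile_power=0.5, quantization_power=0.6,
                    attention_power=0.5, packing_power=0.05)
print(low)
```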

Safety Features

Parameter          Type   Default  Description
vram_guard         bool   True     Auto OOM prevention: halves batch, doubles accumulation
vram_guard_power   float  0.85     Intervention threshold (0.85 = 85% VRAM)
oom_retry          bool   True     On OOM: retry with 4bit→8bit→fp16→cpu fallback chain
allow_remote_code  bool   False    Execute remote model code (dangerous; keep False)
strict_mode        bool   False    True = crash on error, False = fallback and continue

Distributed Training

Parameter            Type  Default  Description
multi_gpu            bool  False    Enable multi-GPU training
distributed_backend  str   "ddp"    "ddp" / "fsdp" / "deepspeed"
deepspeed_stage      int   2        DeepSpeed ZeRO stage: 1 / 2 / 3
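
For orientation, `deepspeed_stage` corresponds to the ZeRO stage in a DeepSpeed configuration. The sketch below builds a minimal config dictionary of the kind the Hugging Face Trainer accepts, where "auto" values are filled in from the training arguments. The helper `make_deepspeed_config` is hypothetical, and whether FastLoRA generates a config like this is an assumption.

```python
def make_deepspeed_config(stage=2):
    """Build a minimal DeepSpeed ZeRO config dict (illustrative only)."""
    if stage not in (1, 2, 3):
        raise ValueError("ZeRO stage must be 1, 2, or 3")
    zero = {
        "stage": stage,
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
    if stage == 3:
        # Stage 3 shards the parameters themselves, so the full 16-bit
        # weights have to be gathered when the model is saved.
        zero["stage3_gather_16bit_weights_on_model_save"] = True
    return {
        "zero_optimization": zero,
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
    }


config = make_deepspeed_config(stage=3)
print(config["zero_optimization"])
```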

Long Context

Parameter            Type  Default  Description
sliding_window       bool  False    Sliding Window Attention
sliding_window_size  int   4096     Window size in tokens
ring_attention       bool  False    Ring Attention for 500K+ tokens
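
To make `sliding_window_size` concrete, the sketch below builds the boolean mask that a causal sliding-window attention layer would apply. The helper name `sliding_window_mask` is hypothetical, and the convention that the window includes the current token is an assumption; implementations differ on that detail.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    A position sees itself and the (window - 1) positions before it,
    and never anything in the future.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, which is why attention cost grows linearly with sequence length instead of quadratically.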

Add-on Classes

CheckpointManager — Resume Training

ckpt    = CheckpointManager(fl, "./checkpoints", save_total_limit=3)
resume  = ckpt.patch_trainer(trainer)
trainer.train(resume_from_checkpoint=resume)
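
The resume logic can be sketched with the standard library alone, assuming checkpoints follow the Hugging Face Trainer layout of `checkpoint-<step>` directories. The helpers below (`list_checkpoints`, `latest_checkpoint`, `prune_checkpoints`) are hypothetical and are not part of FastLoRA's documented API.

```python
import re
import shutil
from pathlib import Path

_CKPT = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(root):
    """Return checkpoint directories under root, sorted by step number."""
    root = Path(root)
    if not root.is_dir():
        return []
    found = []
    for path in root.iterdir():
        match = _CKPT.match(path.name)
        if match and path.is_dir():
            found.append((int(match.group(1)), path))
    # Sort numerically so that checkpoint-1000 comes after checkpoint-500.
    return [path for _, path in sorted(found)]


def latest_checkpoint(root):
    """Path of the newest checkpoint as a string, or None if there is none."""
    checkpoints = list_checkpoints(root)
    return str(checkpoints[-1]) if checkpoints else None


def prune_checkpoints(root, save_total_limit):
    """Delete the oldest checkpoints so at most save_total_limit remain."""
    checkpoints = list_checkpoints(root)
    excess = len(checkpoints) - save_total_limit
    for old in checkpoints[:max(excess, 0)]:
        shutil.rmtree(old)
```

Returning `None` when nothing is found matters because `trainer.train(resume_from_checkpoint=None)` simply starts a fresh run.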

LRFinder — Auto Learning Rate

lr = LRFinder(fl, num_iter=100).find(train_dataset)
# fl.cfg.learning_rate is updated automatically
# saves lr_finder.png plot
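
A learning rate finder of this kind is usually a range test: the learning rate is increased exponentially over `num_iter` steps while the loss is recorded, and a value is picked where the loss falls fastest. The sketch below shows only that selection step on hand-written loss values. The helper names are hypothetical, and whether `LRFinder` uses this exact rule is an assumption.

```python
import math


def lr_schedule(lr_min, lr_max, num_iter):
    """Exponentially spaced learning rates from lr_min to lr_max."""
    ratio = (lr_max / lr_min) ** (1 / (num_iter - 1))
    return [lr_min * ratio**i for i in range(num_iter)]


def suggest_lr(lrs, losses):
    """Pick the learning rate where loss drops fastest per unit of log(lr)."""
    best_index, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log(lrs[i]) - math.log(lrs[i - 1]))
        if slope < best_slope:
            best_index, best_slope = i, slope
    if best_index is None:
        raise ValueError("loss never decreased during the sweep")
    return lrs[best_index]


lrs = lr_schedule(1e-6, 1e-1, num_iter=6)
losses = [2.5, 2.4, 2.0, 1.2, 1.1, 5.0]  # made-up values; the last one diverges
suggested = suggest_lr(lrs, losses)
print(f"{suggested:.0e}")
```

Practical finders also smooth the loss curve and stop the sweep once the loss diverges; both are omitted here for brevity.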

EarlyStopping

es = EarlyStopping(patience=3, min_delta=0.001, metric="eval_loss")
es.patch_trainer(trainer)
# es.stopped      → True if training was stopped early
# es.best_value   → best metric value seen
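
The meaning of `patience` and `min_delta` can be shown with a small stand-alone class. `EarlyStopper` below is a hypothetical sketch, not the library's `EarlyStopping`; it assumes a metric where lower is better, such as `eval_loss`.

```python
class EarlyStopper:
    """Stop after `patience` evaluations without a meaningful improvement."""

    def __init__(self, patience=3, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_value = float("inf")
        self.bad_evals = 0
        self.stopped = False

    def update(self, value):
        """Record one evaluation result; return True when training should stop."""
        if value < self.best_value - self.min_delta:
            # Improvement larger than min_delta: remember it and reset the counter.
            self.best_value = value
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.stopped = True
        return self.stopped


stopper = EarlyStopper(patience=3, min_delta=0.001)
history = [1.0, 0.8, 0.7995, 0.81, 0.80, 0.5]
stop_step = None
for step, loss in enumerate(history):
    if stopper.update(loss):
        stop_step = step
        break
print(stop_step, stopper.best_value)  # 4 0.8
```

The drop from 0.8 to 0.7995 is smaller than `min_delta`, so it counts as no improvement, and training stops before the 0.5 value is ever seen.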

ExperimentLogger — Wandb + TensorBoard + CSV

log = ExperimentLogger(
    fl,
    wandb=True,        # pip install wandb
    tensorboard=True,  # pip install tensorboard
    csv=True,
    project="my-project",
    run_name="run-001",
)
log.patch_trainer(trainer)
trainer.train()
log.finish()

AdapterManager — Hot-swap LoRA Adapters

fl.adapter_manager.register("math",   "./adapters/math")
fl.adapter_manager.register("coding", "./adapters/coding")

model = fl.adapter_manager.swap(model, "math")    # milliseconds
model = fl.adapter_manager.swap(model, "coding")  # milliseconds

Utility Functions

from fastlora import check_environment, format_alpaca, format_chatml

# Check installed packages and versions
check_environment()

# Format dataset samples
text = format_alpaca({"instruction": "...", "input": "", "output": "..."})
text = format_chatml({"messages": [{"role": "user", "content": "..."}]})

# CLI environment check (run from a shell, not from Python)
fastlora-check
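
For readers unfamiliar with the Alpaca layout, the sketch below builds a training string from the prompt template published with the Stanford Alpaca dataset. `alpaca_prompt` is a hypothetical stand-in; the exact string returned by `format_alpaca`, including the newline placed before the output here, may differ.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def alpaca_prompt(sample):
    """Format one {"instruction", "input", "output"} record as a single string."""
    # Samples with an empty input use the shorter template.
    template = PROMPT_WITH_INPUT if sample.get("input") else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=sample["instruction"], input=sample.get("input", "")
    )
    return f"{prompt}\n{sample['output']}"


text = alpaca_prompt({"instruction": "Add 2 and 3.", "input": "", "output": "5"})
print(text)
```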

VRAM Guard — How It Works

When VRAM usage exceeds the threshold (default 85%):

  1. Batch size halved, e.g. 4 → 2
  2. Gradient accumulation doubled, so the effective batch size stays the same
  3. Cache cleared with gc.collect() + torch.cuda.empty_cache()
  4. Runs in a background thread, so training is not interrupted

# Check VRAM status anytime
print(fl.vram_status())
# {'used_gb': 7.2, 'reserved_gb': 8.0, 'total_gb': 24.0, 'free_gb': 16.0, 'ratio': 0.30}

# Manual check
fl.vram_guard.check_and_adapt()
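
The first two steps of the guard amount to a small piece of arithmetic. The sketch below is a hypothetical illustration (`BatchPlan` and `adapt` are not FastLoRA names) of how halving the batch and doubling accumulation keeps the effective batch size constant.

```python
from dataclasses import dataclass


@dataclass
class BatchPlan:
    batch_size: int
    grad_accum_steps: int

    @property
    def effective_batch_size(self):
        return self.batch_size * self.grad_accum_steps


def adapt(plan, used_gb, total_gb, threshold=0.85):
    """Return a smaller plan when the VRAM usage ratio exceeds the threshold."""
    ratio = used_gb / total_gb
    if ratio <= threshold or plan.batch_size <= 1:
        return plan  # under the threshold, or nothing left to halve
    # Note: the effective batch size is preserved exactly only when
    # batch_size is even, because of the integer division.
    return BatchPlan(plan.batch_size // 2, plan.grad_accum_steps * 2)


plan = BatchPlan(batch_size=4, grad_accum_steps=2)
relaxed = adapt(plan, used_gb=7.2, total_gb=24.0)    # ratio 0.30: unchanged
reduced = adapt(plan, used_gb=21.0, total_gb=24.0)   # ratio 0.875: adapted
print(relaxed, reduced, reduced.effective_batch_size)
```

On CUDA, the usage ratio could be derived from `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device.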

Benchmark

tok_per_sec = fl.profile(n_steps=10)
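
A throughput figure like this is typically measured by timing a fixed number of training steps after a short warm-up. The sketch below shows that pattern with a dummy step function; `tokens_per_second` and its parameters are hypothetical, and how `fl.profile` measures internally is an assumption.

```python
import time


def tokens_per_second(step_fn, tokens_per_step, n_steps=10, warmup=2):
    """Time n_steps calls of step_fn and return tokens processed per second."""
    # Warm-up steps are excluded so one-off costs (compilation, caches)
    # do not distort the measurement.
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_steps * tokens_per_step / elapsed


def fake_step():
    time.sleep(0.001)  # stand-in for one forward/backward pass


rate = tokens_per_second(fake_step, tokens_per_step=4 * 1024, n_steps=10)
print(f"{rate:,.0f} tokens/s")
```

When timing real GPU work, `torch.cuda.synchronize()` should be called before reading the clock, since CUDA kernels run asynchronously.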

Device Support

Device LoRA Quantization Flash Attention torch.compile
CUDA (RTX / A100 / H100)
Apple Silicon (MPS)
CPU

Requirements

Required:

  • Python ≥ 3.9
  • torch ≥ 2.1.0
  • transformers ≥ 4.40.0
  • accelerate ≥ 0.27.0

Optional (install with [full]):

  • peft ≥ 0.10.0 — LoRA/QLoRA
  • bitsandbytes ≥ 0.43.0 — 4bit/8bit quantization
  • trl ≥ 0.8.0 — SFTTrainer
  • datasets ≥ 2.18.0 — dataset loading

Optional extras:

  • flash-attn — Flash Attention 2
  • triton — fused kernels
  • deepspeed — ZeRO training
  • wandb — experiment tracking
  • tensorboard — training visualization

License

MIT

