Drop-in Unsloth alternative — zero errors, maximum speed LoRA/QLoRA fine-tuning

FastLoRA ⚡🛡️

The drop-in Unsloth alternative that never crashes.
Maximum speed LoRA/QLoRA fine-tuning with automatic safety guards — every feature is a simple True/False toggle.

pip install fastlora
pip install "fastlora[full]"   # recommended

Why FastLoRA?

Feature	Unsloth	FastLoRA
Installation errors	Frequent	None
VRAM Guard (auto OOM prevention)	✗	✓
Fallback chain on failure	✗	✓
Compiled kernel cache	✗	✓
Adapter hot-swap (ms)	✗	✓
Every feature True/False toggle	Partial	✓
0.0–1.0 power control per feature	✗	✓
Multi-GPU: DDP / FSDP / DeepSpeed	Partial	✓
Works on CPU / MPS / CUDA	Partial	✓

Installation

# Minimal (core dependencies only)
pip install fastlora

# Recommended (full features)
pip install "fastlora[full]"

# With Flash Attention 2 (requires CUDA + compilation)
pip install "fastlora[full,flash]"

# With DeepSpeed multi-GPU
pip install "fastlora[full,distributed]"

# Everything
pip install "fastlora[all]"

Quick Start

from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",
    lora=True,
    quantization="4bit",
    flash_attention=True,
    vram_guard=True,
)
model, tokenizer = fl.load()

Settings Panel

Copy this into your training script and adjust what you need.
Everything else runs automatically.

from fastlora import FastLoRA

fl = FastLoRA(
    "meta-llama/Llama-3.2-3B",

    # LORA            On/Off     Power (0.0–1.0)
    lora            = True,    # lora_power          = 1.0,
    lora_r          = 16,      # rank: 8 / 16 / 32 / 64
    lora_alpha      = 32,      # scaling (usually r×2)

    # QUANTIZATION    On/Off     Power
    quantization    = "4bit",  # "4bit" / "8bit" / "none"
    quantization_power = 1.0,  # <0.7 → falls back to 8bit

    # SPEED           On/Off     Power
    flash_attention = True,    # attention_power     = 1.0,
    torch_compile   = True,    # compile_power       = 0.8,
    fused_ops       = True,
    batch_packing   = True,    # packing_power       = 1.0,
    cuda_optimizations = True,
    compile_cache   = True,
    pin_memory      = True,
    auto_batch_size = True,

    # VRAM GUARD      Threshold (0.0–1.0 → triggers at X% usage)
    vram_guard      = True,    # vram_guard_power    = 0.85,

    # TRAINING
    precision       = "auto",  # "auto" / "bf16" / "fp16" / "fp32"
    gradient_checkpointing = True,
    learning_rate   = 2e-4,
)

model, tokenizer = fl.load()

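To activate a power parameter, delete the # in front of it so it becomes a regular keyword argument. Pass each parameter only once, since Python rejects a repeated keyword argument.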

Full Training Pipeline

from fastlora import FastLoRA, format_alpaca
from fastlora import CheckpointManager, LRFinder, EarlyStopping, ExperimentLogger
from datasets import load_dataset

# 1. Initialize
fl = FastLoRA("meta-llama/Llama-3.2-3B", lora=True, quantization="4bit")
model, tokenizer = fl.load()

# 2. Load dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:2000]")

# 3. Find optimal learning rate automatically (uses the dataset loaded above)
lr = LRFinder(fl).find(dataset)

# 4. Get trainer
trainer = fl.get_trainer(dataset, formatting_func=format_alpaca)

# 5. Add logging (Wandb + TensorBoard + CSV — all at once, no conflicts)
ExperimentLogger(fl, wandb=True, tensorboard=True, csv=True).patch_trainer(trainer)

# 6. Add early stopping
EarlyStopping(patience=3).patch_trainer(trainer)

# 7. Resume from checkpoint if exists
resume = CheckpointManager(fl, "./checkpoints").patch_trainer(trainer)

# 8. Train
trainer.train(resume_from_checkpoint=resume)

# 9. Save
fl.save("./my_model")

Feature Reference

Core Features

Parameter	Type	Default	Description
`lora`	bool	`True`	Enable LoRA fine-tuning
`lora_r`	int	`16`	LoRA rank (8 / 16 / 32 / 64)
`lora_alpha`	int	`32`	LoRA scaling factor
`lora_dropout`	float	`0.05`	LoRA dropout rate
`quantization`	str	`"4bit"`	`"4bit"` / `"8bit"` / `"none"`
`flash_attention`	bool	`True`	Flash Attention 2 (falls back to SDPA if unavailable)
`gradient_checkpointing`	bool	`True`	Gradient checkpointing
`precision`	str	`"auto"`	`"auto"` / `"bf16"` / `"fp16"` / `"fp32"`

Speed Features

Parameter	Type	Default	Description
`torch_compile`	bool	`True`	`torch.compile()` — slow first step, ~2x faster after
`compile_cache`	bool	`True`	Cache compiled kernels (5s startup instead of 3min)
`fused_ops`	bool	`True`	Fused RMSNorm (Triton if available, PyTorch fallback)
`cuda_optimizations`	bool	`True`	TF32 + cuDNN benchmark mode
`batch_packing`	bool	`True`	Sequence packing — zero padding, ~1.4x throughput
`pin_memory`	bool	`True`	Async CPU→GPU transfer
`auto_batch_size`	bool	`True`	Auto-detect max batch size for available VRAM
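
The idea behind `batch_packing` can be illustrated with a minimal sketch. It is not FastLoRA's implementation: the function names `pack_sequences` and `padding_fraction` are hypothetical, and the first-fit strategy is an assumption chosen for brevity. Real packing also has to adjust attention masks and position IDs so packed samples do not attend to each other, which is omitted here.

```python
from typing import List


def pack_sequences(sequences: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedily concatenate token sequences into bins of at most max_len tokens."""
    bins: List[List[int]] = []
    for seq in sequences:
        seq = seq[:max_len]  # truncate anything longer than one bin
        for b in bins:
            if len(b) + len(seq) <= max_len:  # first bin with enough room
                b.extend(seq)
                break
        else:
            bins.append(list(seq))  # no bin had room, so open a new one
    return bins


def padding_fraction(bins: List[List[int]], max_len: int) -> float:
    """Share of token slots that would be padding if every bin is padded to max_len."""
    if not bins:
        return 0.0
    used = sum(len(b) for b in bins)
    return 1 - used / (len(bins) * max_len)


# Four samples of length 3, 5, 2 and 6 fit into two full rows of 8 tokens.
samples = [[1] * 3, [2] * 5, [3] * 2, [4] * 6]
packed = pack_sequences(samples, max_len=8)
print(len(packed), padding_fraction(packed, 8))
```

Without packing, the same four samples padded to 8 tokens each would occupy 32 slots for 16 real tokens, so half of the compute would go to padding.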

Power Controls (0.0 – 1.0)

Parameter	Effect
`lora_power`	`0.5` → rank is halved. `1.0` = full
`compile_power`	`1.0` = max-autotune, `0.8` = reduce-overhead, `0.5` = default
`quantization_power`	`< 0.7` → automatically falls back to 8bit
`attention_power`	`1.0` = Flash Att 2, `0.5` = SDPA, `0.0` = eager
`packing_power`	`< 0.1` → packing disabled
`vram_guard_power`	`0.85` = triggers at 85% VRAM usage
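
One way to read the table above is as a mapping from power values to concrete settings. The sketch below is a hypothetical illustration, not FastLoRA's code: `resolve_power_settings` is an invented name, and the behavior between the documented points (for example, `compile_power = 0.9`) is an assumption, since the table only lists individual values.

```python
def resolve_power_settings(
    lora_r: int = 16,
    lora_power: float = 1.0,
    compile_power: float = 0.8,
    quantization: str = "4bit",
    quantization_power: float = 1.0,
    attention_power: float = 1.0,
    packing_power: float = 1.0,
) -> dict:
    """Translate 0.0-1.0 power values into concrete settings (assumed thresholds)."""
    # lora_power scales the rank: 0.5 halves it, 1.0 keeps it unchanged.
    effective_r = max(1, round(lora_r * lora_power))

    # compile_power selects a torch.compile mode name.
    if compile_power >= 1.0:
        compile_mode = "max-autotune"
    elif compile_power >= 0.8:
        compile_mode = "reduce-overhead"
    else:
        compile_mode = "default"

    # quantization_power below 0.7 falls back from 4bit to 8bit.
    if quantization == "4bit" and quantization_power < 0.7:
        quantization = "8bit"

    # attention_power selects the attention implementation.
    if attention_power >= 1.0:
        attention = "flash_attention_2"
    elif attention_power >= 0.5:
        attention = "sdpa"
    else:
        attention = "eager"

    return {
        "lora_r": effective_r,
        "compile_mode": compile_mode,
        "quantization": quantization,
        "attention": attention,
        "packing": packing_power >= 0.1,  # below 0.1 packing is disabled
    }


print(resolve_power_settings(lora_power=0.5, attention_power=0.5))
```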

Safety Features

Parameter	Type	Default	Description
`vram_guard`	bool	`True`	Auto OOM prevention — halves batch, doubles accumulation
`vram_guard_power`	float	`0.85`	Intervention threshold (0.85 = 85% VRAM)
`oom_retry`	bool	`True`	On OOM: retry with 4bit→8bit→fp16→cpu fallback chain
`allow_remote_code`	bool	`False`	Execute remote model code (dangerous — keep False)
`strict_mode`	bool	`False`	`True` = crash on error, `False` = fallback and continue
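
The fallback behavior described for `oom_retry` and `strict_mode` follows a common pattern: try a list of loaders in order and move to the next one when memory runs out. The sketch below shows that pattern in plain Python. `load_with_fallback` and the loader names are hypothetical, and the stand-in loaders only simulate failures; FastLoRA's internals may differ.

```python
from typing import Any, Callable, List, Sequence, Tuple


def load_with_fallback(
    loaders: Sequence[Tuple[str, Callable[[], Any]]],
    strict: bool = False,
) -> Tuple[str, Any]:
    """Try each (name, loader) pair in order and return the first that succeeds."""
    errors: List[str] = []
    for name, loader in loaders:
        try:
            return name, loader()
        # PyTorch reports CUDA out-of-memory as a RuntimeError subclass.
        except (MemoryError, RuntimeError) as exc:
            if strict:
                raise  # strict mode: surface the first failure unchanged
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all fallbacks failed: " + "; ".join(errors))


def simulated_oom() -> Any:
    raise MemoryError("simulated out-of-memory")


# The chain order mirrors the one listed in the table above.
chain = [
    ("4bit", simulated_oom),
    ("8bit", simulated_oom),
    ("fp16", lambda: "model-fp16"),
    ("cpu", lambda: "model-cpu"),
]
print(load_with_fallback(chain))
```

With `strict=True` the first `MemoryError` propagates to the caller instead of triggering the next loader.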

Distributed Training

Parameter	Type	Default	Description
`multi_gpu`	bool	`False`	Enable multi-GPU training
`distributed_backend`	str	`"ddp"`	`"ddp"` / `"fsdp"` / `"deepspeed"`
`deepspeed_stage`	int	`2`	DeepSpeed ZeRO stage: `1` / `2` / `3`

Long Context

Parameter	Type	Default	Description
`sliding_window`	bool	`False`	Sliding Window Attention
`sliding_window_size`	int	`4096`	Window size in tokens
`ring_attention`	bool	`False`	Ring Attention for 500K+ tokens
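
To make `sliding_window_size` concrete, here is a minimal sketch of a causal sliding-window attention mask in plain Python. It assumes the window is causal (each token sees itself and the tokens before it); whether FastLoRA builds its mask this way is not stated in the source, and `sliding_window_mask` is a hypothetical name.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# With a window of 2, each token attends to itself and one previous token.
for row in sliding_window_mask(5, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost grows with `seq_len * window` rather than `seq_len ** 2`.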

Add-on Classes

CheckpointManager — Resume Training

ckpt    = CheckpointManager(fl, "./checkpoints", save_total_limit=3)
resume  = ckpt.patch_trainer(trainer)
trainer.train(resume_from_checkpoint=resume)

LRFinder — Auto Learning Rate

lr = LRFinder(fl, num_iter=100).find(train_dataset)
# fl.cfg.learning_rate is updated automatically
# saves lr_finder.png plot
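
The source does not say how `LRFinder` chooses a value. A common approach is the learning-rate range test: sweep the learning rate exponentially over `num_iter` steps, record the loss at each step, and pick the rate where the loss falls fastest. The sketch below shows only that selection logic on synthetic losses. `lr_sweep` and `suggest_lr` are hypothetical names, and this may not match what FastLoRA does.

```python
import math
from typing import List, Sequence


def lr_sweep(start_lr: float, end_lr: float, num_iter: int) -> List[float]:
    """Exponentially spaced learning rates from start_lr to end_lr."""
    if num_iter < 2:
        raise ValueError("num_iter must be at least 2")
    ratio = (end_lr / start_lr) ** (1 / (num_iter - 1))
    return [start_lr * ratio**i for i in range(num_iter)]


def suggest_lr(lrs: Sequence[float], losses: Sequence[float]) -> float:
    """Return the learning rate at which loss drops fastest per unit of log(lr)."""
    best_index, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log(lrs[i]) - math.log(lrs[i - 1]))
        if slope < best_slope:
            best_index, best_slope = i, slope
    if best_index is None:
        raise ValueError("loss never decreased during the sweep")
    return lrs[best_index]


lrs = lr_sweep(1e-6, 1e-1, 6)
losses = [2.0, 1.9, 1.5, 1.2, 1.3, 3.0]  # synthetic: improves, then diverges
print(suggest_lr(lrs, losses))
```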

EarlyStopping

es = EarlyStopping(patience=3, min_delta=0.001, metric="eval_loss")
es.patch_trainer(trainer)
# es.stopped      → True if training was stopped early
# es.best_value   → best metric value seen
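
The roles of `patience` and `min_delta` can be shown with a small standalone tracker. This is a sketch under assumptions, not the `EarlyStopping` class itself: `PatienceTracker` is a hypothetical name, and it assumes a lower metric is better, as with `eval_loss`. Only the `stopped` and `best_value` attribute names are taken from the snippet above.

```python
from typing import Optional


class PatienceTracker:
    """Stop after `patience` consecutive evaluations without sufficient improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.001) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_value: Optional[float] = None
        self.bad_evals = 0
        self.stopped = False

    def update(self, value: float) -> bool:
        """Record one evaluation result and return True if training should stop."""
        if self.best_value is None or value < self.best_value - self.min_delta:
            self.best_value = value  # improved by more than min_delta
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.stopped = True
        return self.stopped


tracker = PatienceTracker(patience=3, min_delta=0.001)
for step, loss in enumerate([1.0, 0.8, 0.7995, 0.81, 0.80, 0.5], start=1):
    if tracker.update(loss):
        print(f"stopped at evaluation {step}, best value {tracker.best_value}")
        break
```

Note that 0.7995 does not count as an improvement over 0.8 here, because the gain is smaller than `min_delta`.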

ExperimentLogger — Wandb + TensorBoard + CSV

log = ExperimentLogger(
    fl,
    wandb=True,        # pip install wandb
    tensorboard=True,  # pip install tensorboard
    csv=True,
    project="my-project",
    run_name="run-001",
)
log.patch_trainer(trainer)
trainer.train()
log.finish()

AdapterManager — Hot-swap LoRA Adapters

fl.adapter_manager.register("math",   "./adapters/math")
fl.adapter_manager.register("coding", "./adapters/coding")

model = fl.adapter_manager.swap(model, "math")    # milliseconds
model = fl.adapter_manager.swap(model, "coding")  # milliseconds

Utility Functions

from fastlora import check_environment, format_alpaca, format_chatml

# Check installed packages and versions
check_environment()

# Format dataset samples
text = format_alpaca({"instruction": "...", "input": "", "output": "..."})
text = format_chatml({"messages": [{"role": "user", "content": "..."}]})

# CLI environment check (run this in a shell, not in Python)
fastlora-check
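
For readers unfamiliar with the two formats, the sketch below builds prompts in the widely used Alpaca and ChatML layouts. The exact strings that `format_alpaca` and `format_chatml` return are not shown in the source, so these are assumptions based on the common conventions. `alpaca_prompt` and `chatml_prompt` are hypothetical names chosen to avoid confusion with the library functions.

```python
from typing import Dict, List


def alpaca_prompt(sample: Dict[str, str]) -> str:
    """Render a sample in the Stanford Alpaca layout (assumed template)."""
    if sample.get("input"):
        header = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
        )
        body = (
            f"### Instruction:\n{sample['instruction']}\n\n"
            f"### Input:\n{sample['input']}\n\n"
        )
    else:
        header = (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
        )
        body = f"### Instruction:\n{sample['instruction']}\n\n"
    return header + body + f"### Response:\n{sample['output']}"


def chatml_prompt(messages: List[Dict[str, str]]) -> str:
    """Render a message list with ChatML start and end markers."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


print(alpaca_prompt({"instruction": "Add 2 and 3.", "input": "", "output": "5"}))
print(chatml_prompt([{"role": "user", "content": "Hi"}]))
```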

VRAM Guard — How It Works

When VRAM usage exceeds the threshold (default 85%):

Batch size halved — e.g. 4 → 2
Gradient accumulation doubled — effective batch size stays the same
Cache cleared — gc.collect() + cuda.empty_cache()
Runs in a background thread — no interruption to training

# Check VRAM status anytime
print(fl.vram_status())
# {'used_gb': 7.2, 'reserved_gb': 8.0, 'total_gb': 24.0, 'free_gb': 16.0, 'ratio': 0.30}

# Manual check
fl.vram_guard.check_and_adapt()
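
The batch adjustment described above comes down to simple arithmetic, sketched here in plain Python. `adapt_batch` is a hypothetical helper, not part of FastLoRA, and the handling of a batch size of 1 is an assumption. The effective batch size is preserved exactly only when the batch size is even.

```python
from typing import Tuple


def adapt_batch(
    batch_size: int,
    grad_accum: int,
    used_ratio: float,
    threshold: float = 0.85,
) -> Tuple[int, int]:
    """Halve the batch and double accumulation once VRAM usage exceeds the threshold."""
    if used_ratio <= threshold or batch_size <= 1:
        return batch_size, grad_accum  # nothing to do, or nothing left to halve
    return batch_size // 2, grad_accum * 2


# At 90% usage: batch 4 x accumulation 2 becomes batch 2 x accumulation 4.
before = (4, 2)
after = adapt_batch(*before, used_ratio=0.90)
print(before, "->", after, "effective batch:", after[0] * after[1])
```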

Benchmark

tok_per_sec = fl.profile(n_steps=10)

Device Support

Device	LoRA	Quantization	Flash Attention	torch.compile
CUDA (RTX / A100 / H100)	✓	✓	✓	✓
Apple Silicon (MPS)	✓	✗	✗	✗
CPU	✓	✗	✗	✓

Requirements

Required:

Python ≥ 3.9
torch ≥ 2.1.0
transformers ≥ 4.40.0
accelerate ≥ 0.27.0

Optional (install with [full]):

peft ≥ 0.10.0 — LoRA/QLoRA
bitsandbytes ≥ 0.43.0 — 4bit/8bit quantization
trl ≥ 0.8.0 — SFTTrainer
datasets ≥ 2.18.0 — dataset loading

Optional extras:

flash-attn — Flash Attention 2
triton — fused kernels
deepspeed — ZeRO training
wandb — experiment tracking
tensorboard — training visualization

License

MIT

Release history

4.2.1 (Mar 20, 2026)
1.0.0 (Mar 19, 2026), this version

Download files

Source Distribution

fastlora-1.0.0.tar.gz (8.1 kB)

Uploaded Mar 19, 2026 Source

Built Distribution

fastlora-1.0.0-py3-none-any.whl (6.9 kB)

Uploaded Mar 19, 2026 Python 3

File details

Details for the file fastlora-1.0.0.tar.gz.

File metadata

Download URL: fastlora-1.0.0.tar.gz
Upload date: Mar 19, 2026
Size: 8.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for fastlora-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c2032e3d44d96f14a3f4a02571c0ff5aef314b0fe4af643236a6bc8e7b925b85`
MD5	`46362ddbae491a2ae7fc5d2bcae87331`
BLAKE2b-256	`a3e3c06a19a9b941e207014ea4b8947af7472d46ce3b9c3bb0f2acfe48b90571`
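
A downloaded file can be checked against the published digest with Python's standard `hashlib` module. The sketch below is a generic verification helper; `sha256_of_file` is a hypothetical name, and the commented-out path assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 published above for fastlora-1.0.0.tar.gz
EXPECTED_SHA256 = "c2032e3d44d96f14a3f4a02571c0ff5aef314b0fe4af643236a6bc8e7b925b85"

# Uncomment after downloading the archive (assumed to be in the working directory):
# assert sha256_of_file("fastlora-1.0.0.tar.gz") == EXPECTED_SHA256
```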

File details

Details for the file fastlora-1.0.0-py3-none-any.whl.

File metadata

Download URL: fastlora-1.0.0-py3-none-any.whl
Upload date: Mar 19, 2026
Size: 6.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for fastlora-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c63a026093f472a3fa6330b02b725eae8b165ca5ad05a00bd31961666b30dbdc`
MD5	`5dee012fee990f584fd8568debfc6236`
BLAKE2b-256	`7ab633f67e9bc2d7817aa96cc264da51a6decf18081e45690be46bcd7e266a2c`
