QLoRA fine-tuning of fused 4-bit Mixture-of-Experts on a single small GPU, on stock bitsandbytes.

experts4bit-qlora

QLoRA fine-tuning of fused Mixture-of-Experts weights on a single small GPU — the part that doesn't fit anywhere else yet.

The problem

transformers v5 stores MoE experts as one fused 3-D nn.Parameter per layer (OlmoeExperts, Qwen3MoeExperts, …). bitsandbytes' 4-bit walker only replaces nn.Linear modules, so it silently skips the experts — which are the overwhelming majority of a MoE's weights. load_in_4bit "shrinks" the model but the experts stay in full precision (bitsandbytes#1849).
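A toy illustration of why a type-based walker misses fused experts. Everything here (`Module`, `Linear`, `FusedExperts`, `Quantized4bit`, `quantize_linears`) is a hypothetical stand-in written in plain Python; none of it is the torch, transformers, or bitsandbytes API.

```python
class Module:
    """Toy stand-in for a module tree node (not torch.nn.Module)."""
    def __init__(self, **children):
        self.children = dict(children)


class Linear(Module):
    def __init__(self, n_weights):
        super().__init__()
        self.n_weights = n_weights


class FusedExperts(Module):
    """All experts held as one fused parameter, with no Linear children."""
    def __init__(self, num_experts, n_weights_per_expert):
        super().__init__()
        self.n_weights = num_experts * n_weights_per_expert


class Quantized4bit(Module):
    def __init__(self, n_weights):
        super().__init__()
        self.n_weights = n_weights


def quantize_linears(module):
    """Replace every Linear child with a 4-bit stand-in; return weights converted."""
    converted = 0
    for name, child in list(module.children.items()):
        if isinstance(child, Linear):
            module.children[name] = Quantized4bit(child.n_weights)
            converted += child.n_weights
        else:
            # FusedExperts has no Linear children, so nothing is converted here.
            converted += quantize_linears(child)
    return converted


layer = Module(
    q_proj=Linear(128 * 128),
    o_proj=Linear(128 * 128),
    experts=FusedExperts(num_experts=8, n_weights_per_expert=3 * 128 * 256),
)
converted = quantize_linears(layer)
total = 2 * 128 * 128 + layer.children["experts"].n_weights
skipped_fraction = 1 - converted / total
```

In this toy layer the walker reports success, yet the fused stack (most of the weights) is left untouched.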

Experts4bit is the primitive that 4-bit-quantizes exactly that fused stack. As of v0.2.0 it is the 4-bit face of ExpertsNbit, which stores the same stack at selectable precision — nf4 / fp4 (4-bit packed), int8 / fp8 (8-bit blockwise), or bf16 / fp16 (passthrough) — with a test-pinned fidelity ordering (bf16 < int8 < nf4 reconstruction error) so the precision knob is a measured trade, not a vibe. This package pairs the primitive with a streaming loader and per-expert LoRA, so you can actually fine-tune a real sparse-MoE on reasonable hardware.
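To make the precision knob concrete, here is a minimal sketch of blockwise absmax quantization at two bit widths. It assumes uniform symmetric levels and a block size of 64; it does not use the NF4 or FP4 codebooks, and the function names are illustrative rather than anything from this package or bitsandbytes.

```python
import random


def quantize_blockwise(values, bits, blocksize=64):
    """Symmetric uniform absmax quantization with one scale per block.

    Assumption: uniform integer levels, not the NF4/FP4 codebooks.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    codes, scales = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * qmax) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, bits, blocksize=64):
    qmax = 2 ** (bits - 1) - 1
    return [c / qmax * scales[i // blocksize] for i, c in enumerate(codes)]


def mean_abs_error(values, bits):
    codes, scales = quantize_blockwise(values, bits)
    restored = dequantize_blockwise(codes, scales, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]
err4 = mean_abs_error(weights, bits=4)
err8 = mean_abs_error(weights, bits=8)
```

On these synthetic Gaussian weights the 8-bit reconstruction error comes out below the 4-bit one, which is the same direction as the ordering the package pins in its tests.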

What it buys you (measured on an RTX A2000 12 GB — in a NAS's PCIe 3.0 x8 slot; see METHODOLOGY "Test host")

  • It fits at all. Full bf16 OLMoE-1B-7B is ~13.9 GB — it OOMs on a 12 GB card. In 4-bit it loads at 4.70 GB and trains in <8 GB. The streaming loader never materializes the bf16 model in CPU or GPU RAM (verified under a 3 GB container RAM cap).
  • It trains. QLoRA on the frozen NF4 experts improves a held-out Alpaca eval from 1.4813 → 1.0290 (see docs/METHODOLOGY.md).
  • It scales past VRAM (OFFLOAD_EXPERTS=1). The frozen experts stream from pinned CPU RAM one layer at a time, so a fused-MoE whose 4-bit experts exceed the card can QLoRA-train on 12 GB: Qwen3-30B-A3B peaks at 7.16 GB, Gemma-4-26B-A4B at 8.47 GB — both OOM without offload. Mechanics and cost under Training + expert offload.
  • It serves the fine-tune it made (python -m experts4bit_qlora.infer). The adapters run over the exact NF4 base they were trained against — no GGUF/AWQ re-quantization shifting the error surface. OLMoE decodes at 1.44 tok/s in 1.68 GB with prefetched offload (resident: 3.08 tok/s at 4.86 GB); the same path decodes Gemma-4-26B at 0.43 tok/s (6.2 GB) and Qwen3-30B-A3B at 0.22 tok/s (4.4 GB) — models whose resident decode simply OOMs. See Inference.
  • Honest caveat: this is a memory technology, not an energy one. On a GPU that already fits the model, 4-bit is a 1.2–2.3× energy penalty (NF4 is storage-only; the GEMM runs in bf16 either way, plus dequant). The energy win only shows up when memory is the binding constraint. Then it's the difference between running and not, with up to 4.4× lower energy/token from the larger batch that the freed memory unlocks. Numbers and method in the docs.
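A back-of-envelope sketch of the storage arithmetic behind the memory numbers. The parameter count (derived from ~13.9 GB at 2 bytes each), the block size of 64, the 4-byte scale per block, and GB = 1e9 bytes are all assumptions for illustration; `stack_bytes` is a hypothetical helper, not part of the package.

```python
def stack_bytes(n_params, bits, blocksize=None, scale_bytes=4):
    """Packed weight bytes plus one scale per block (if blockwise)."""
    packed = n_params * bits / 8
    scales = (n_params / blocksize) * scale_bytes if blocksize else 0
    return packed + scales


GB = 1e9
n_params = 6.95e9  # assumption: ~13.9 GB of bf16 weights / 2 bytes per weight

bf16_gb = stack_bytes(n_params, bits=16) / GB
four_bit_gb = stack_bytes(n_params, bits=4, blocksize=64) / GB
```

Under these assumptions, quantizing every weight would take roughly 3.9 GB. The measured 4.70 GB load is higher; which tensors stay at higher precision is not modeled in this sketch.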

Install

pip install experts4bit-qlora           # primitive + adapters + benchmarks (torch + bitsandbytes)
pip install "experts4bit-qlora[train]"  # + the streaming MoE trainer (transformers>=5.0, datasets, ...)

Runs on a stock pip install bitsandbytes today — see "Relationship to bitsandbytes" below.

Quickstart

import torch
from experts4bit_qlora import Experts4bit, ExpertsNbit, ExpertsLoRA

# Freeze a fused expert stack in 4-bit, attach trainable per-expert LoRA.
gate_up = torch.randn(8, 2 * 256, 128)          # [num_experts, 2*intermediate, hidden]
down    = torch.randn(8, 128, 256)              # [num_experts, hidden, intermediate]
base    = Experts4bit.from_float(gate_up, down, quant_type="nf4", compute_dtype=torch.float32)
model   = ExpertsLoRA(base, r=8, alpha=16)      # only the LoRA adapters train

# Same stack at other storage precisions (8-bit blockwise / 16-bit passthrough):
base8   = ExpertsNbit.from_float(gate_up, down, quant_type="int8", compute_dtype=torch.float32)

End-to-end OLMoE QLoRA fine-tune (needs a CUDA GPU + [train] extras):

STEPS=150 R=8 TRAIN_EXPERTS=1 TRAIN_ATTENTION=0 OUT=./out \
  python -m experts4bit_qlora.train

Training + expert offload

Training holds no dequantized-expert activations: the frozen base projections re-dequantize from the packed weights inside backward (ExpertsNbit._project), so activation memory stays flat in the number of experts — on any released bitsandbytes, for every storage scheme. Two knobs:

  • QUANT_TYPE=nf4|fp4|int8|fp8|bf16|fp16 selects the frozen base's storage precision end-to-end (loader → training → serving). Default nf4; serve with the same value you trained with.
  • OFFLOAD_EXPERTS=1 keeps the frozen experts in pinned CPU RAM (set OFFLOAD_PIN=0 to skip pinning) and streams one layer to the GPU at a time — GPU-resident only for that layer's forward and its gradient-checkpoint recompute, evicted after. Peak GPU drops by roughly (experts footprint − one layer) at the cost of one PCIe transfer per layer per pass (+11 % s/step on the OLMoE A/B). A memory optimization, not a speedup: it changes what fits, not how fast. Offloading changes tensor location, not math — unit-test-verified, including the gradient-checkpoint recompute path. Offloaded training requires gradient checkpointing (the shipped trainer always enables it); the unsupported non-checkpointed combination fails loudly rather than mis-training. Details in docs/METHODOLOGY.md §11.
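The recompute-in-backward idea described before the two knobs can be sketched in plain Python. This is a toy with int8 codes and per-row scales, written for illustration only; `FrozenQuantLinear` is a hypothetical class and is not how `ExpertsNbit._project` is implemented.

```python
class FrozenQuantLinear:
    """Frozen projection y = W @ x whose W lives only as int8 codes + row scales."""

    def __init__(self, weight_rows):
        self.scales = [max(abs(w) for w in row) or 1.0 for row in weight_rows]
        self.codes = [[round(w / s * 127) for w in row]
                      for row, s in zip(weight_rows, self.scales)]

    def _dequant_row(self, i):
        s = self.scales[i]
        return [c / 127 * s for c in self.codes[i]]

    def forward(self, x):
        # Dequantized rows are temporaries; no float copy of W is cached.
        return [sum(w * xi for w, xi in zip(self._dequant_row(i), x))
                for i in range(len(self.codes))]

    def backward(self, grad_y):
        # Re-dequantize from the packed codes to form grad_x = W^T @ grad_y.
        # W is frozen, so no weight gradient is computed.
        grad_x = [0.0] * len(self.codes[0])
        for i, g in enumerate(grad_y):
            for j, w in enumerate(self._dequant_row(i)):
                grad_x[j] += g * w
        return grad_x


layer = FrozenQuantLinear([[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]])
x = [1.0, 2.0, 3.0]
grad_y = [1.0, -1.0]
y = layer.forward(x)
grad_x = layer.backward(grad_y)
```

The trade is extra dequantize work in backward in exchange for never holding a dequantized copy of the weights between forward and backward.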

Transfer diagnostics (default off): E4B_OFFLOAD_STATS=1 prints per-layer H2D bandwidth, prefetch stall/slack, and a one-shot PCIe-link + ceiling report; E4B_OFFLOAD_ARENA=1 consolidates each layer's four expert tensors into two per-dtype copies. What they measured on the reference host — and why offload is PCIe-bound there — is in docs/OFFLOAD-TRANSFER-NOTES.md.
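A small accounting model of what layer-at-a-time streaming does to peak residency. It is a sketch under simplifying assumptions (one pass, byte counts as plain integers, no allocator overhead); `run_pass` is a hypothetical helper, not the package's offload code.

```python
def run_pass(layer_expert_bytes, other_bytes, offload):
    """Return (peak resident bytes, host-to-device transfers) for one pass."""
    resident = other_bytes + (0 if offload else sum(layer_expert_bytes))
    peak = resident
    transfers = 0
    for nbytes in layer_expert_bytes:
        if offload:
            resident += nbytes  # copy this layer's experts to the GPU
            transfers += 1
        peak = max(peak, resident)  # the layer computes here
        if offload:
            resident -= nbytes  # evict once the layer is done
    return peak, transfers


layers = [100] * 16  # illustrative per-layer expert sizes
resident_peak, resident_transfers = run_pass(layers, other_bytes=300, offload=False)
offload_peak, offload_transfers = run_pass(layers, other_bytes=300, offload=True)
saved = resident_peak - offload_peak
```

In this model the saving equals the experts' total footprint minus one layer, paid for with one transfer per layer per pass.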

Scope

The ExpertsNbit primitive and ExpertsLoRA adapters are model-agnostic. The streaming loader / trainer (python -m experts4bit_qlora.train) supports SwiGLU fused-MoE architectures — experts stored either per-expert or already-fused on disk:

  • OLMoE (OLMoE-1B-7B) — convergence-tested end-to-end; fits a 12 GB card at ~4.7 GB.
  • Qwen3-MoE / Qwen3.5-MoE — same checkpoint + module layout as OLMoE (verified byte-identical); structurally tested.
  • Gemma-4 (text tower) — different internally (experts at layers.{i}.experts beside a parallel dense MLP + a custom router; experts fused on disk) — handled and structurally tested.

All three are covered by tests/test_loader_architectures.py. Real Qwen3/Gemma weights (26–35B) need a ≥24 GB card, or the expert-offload path above to fit in 12 GB. Unsupported architectures fail fast with a clear error; PRs for more welcome.

Inference: serve the fine-tune you just made

The adapters were trained against this exact NF4 base (same codebook, same per-expert absmax). python -m experts4bit_qlora.infer serves them over that same base — no re-quantization to GGUF/AWQ, so the quantization error at serving time is identical to what training saw:

ADAPTER=./out/adapter_best.pt python -m experts4bit_qlora.infer            # generate
OFFLOAD_EXPERTS=1 BENCH_TOKENS=128 python -m experts4bit_qlora.infer       # timed decode bench

What inference mode adds (all no_grad-only; training paths are untouched):

  • Decode fast-path — a single-token forward skips the one-hot expert-mask machinery and its per-expert host syncs, looping the token's top_k experts with 0-d device indices.
  • Fused 4-bit GEMV — single-row base projections go through bnb.matmul_4bit's GEMV kernel, which reads the packed NF4 weight directly instead of materializing the dequantized expert. Gated by a per-configuration correctness probe — and the probe passes on stock bitsandbytes 0.49.x. (4-bit only; the 8/16-bit schemes decode via the dequantize path.)
  • Prefetched expert offload (OFFLOAD_EXPERTS=1, default PREFETCH=1) — decode with experts that exceed VRAM: layer L+1's NF4 experts copy on a side CUDA stream while layer L computes. Staging is layer-granular, so the schedule is deterministic — no expert-prediction needed — and residency is bounded at two layers.
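The decode fast-path idea from the first bullet can be sketched with a scalar "token" and toy experts. Assumptions: top-k weights are renormalized to sum to 1 (models differ on this), and the expert functions and names below are hypothetical, not the package's kernels.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_routing(logits, k):
    """Return [(expert_index, weight)] for the k most probable experts."""
    probs = softmax(logits)
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in idx)  # assumption: renormalize over the top k
    return [(i, probs[i] / norm) for i in idx]


def decode_masked(x, logits, experts, k):
    """General path: consult a per-expert mask for every expert."""
    chosen = dict(top_k_routing(logits, k))
    out = 0.0
    for e, expert in enumerate(experts):
        weight = chosen.get(e, 0.0)  # mask entry for expert e
        if weight:
            out += weight * expert(x)
    return out


def decode_fast(x, logits, experts, k):
    """Single-token path: visit only the k selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_routing(logits, k))


experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]
slow = decode_masked(0.5, logits, experts, k=2)
fast = decode_fast(0.5, logits, experts, k=2)
```

Both paths produce the same output; the fast one simply never touches the experts that were not selected.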

Measured on the RTX A2000 (OLMoE + the r16 adapter, 128 greedy tokens; big models: base model, 96 tokens; full grids + analysis in docs/METHODOLOGY.md §12):

| model | config | tok/s | peak GPU |
| --- | --- | --- | --- |
| OLMoE-1B-7B | resident (experts on GPU) | 3.08 | 4.86 GB |
| OLMoE-1B-7B | offload, serial | 0.40 | 1.45 GB |
| OLMoE-1B-7B | offload + prefetch | 1.44 | 1.68 GB |
| Gemma-4-26B-A4B | resident | OOM | - |
| Gemma-4-26B-A4B | offload + prefetch | 0.43 | 6.16 GB |
| Qwen3-30B-A3B | resident | OOM | - |
| Qwen3-30B-A3B | offload + prefetch | 0.22 | 4.41 GB |

Same honest framing as training: capability, not throughput. The levers are shape-dependent, and measured. At OLMoE scale, prefetch delivers the gain (3.65× over serial) and the GEMV route is neutral. At 26–30B scale, decode is so transfer-bound that prefetch's ratio shrinks (1.36× / 1.08×), while GEMV swings from +46 % on Gemma-4 (big per-expert stacks, where the avoided dequantize traffic dominates) to −8 % on Qwen3-30B (thin experts, where it doesn't; prefetch + dequantize is Qwen3's best config at 0.238 tok/s). §12c scores the prediction this falsified. Measure your model with the kill-switches; don't extrapolate across shapes.
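Why the prefetch ratio shrinks when decode is transfer-bound can be seen in an idealized two-stream timing model. This is a sketch, not a fit to the measurements: it assumes each layer's copy is issued when the previous layer starts computing, ignores every other cost, and uses hypothetical function names. It also cannot exceed 2×, so the larger measured ratio at OLMoE scale means real serial runs carry costs this model omits.

```python
def serial_time(copy_s, compute_s):
    """Each layer waits for its own copy, then computes."""
    return sum(c + k for c, k in zip(copy_s, compute_s))


def prefetch_time(copy_s, compute_s):
    """Layer i+1's copy overlaps layer i's compute (two layers resident)."""
    if not compute_s:
        raise ValueError("need at least one layer")
    compute_start = copy_s[0]  # the first copy cannot be hidden
    for i in range(len(compute_s)):
        compute_end = compute_start + compute_s[i]
        if i + 1 < len(compute_s):
            next_copy_end = compute_start + copy_s[i + 1]
            compute_start = max(compute_end, next_copy_end)
    return compute_end


balanced = serial_time([2] * 4, [2] * 4) / prefetch_time([2] * 4, [2] * 4)
transfer_bound = serial_time([3] * 4, [1] * 4) / prefetch_time([3] * 4, [1] * 4)
```

In the model, overlap helps most when copy and compute times are comparable; once the copy dominates, there is little compute left to hide it behind.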

Library users: enable_inference_prefetch(handles) links the offload handles the loader (or offload_model_experts) returns; load_moe_4bit_streaming(..., offload=True, prefetch=True) does it for you. Serve with the training run's QUANT_TYPE. Kill-switches for A/B: E4B_DECODE_FASTPATH=0, E4B_INFER_GEMV=0.
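One way to organize an A/B over the two kill-switches is to enumerate the environments up front. The variable names come from the text above; the helper `ab_environments` is hypothetical, and leaving a variable unset to mean "enabled" is an assumption.

```python
import itertools
import os

KILL_SWITCHES = ("E4B_DECODE_FASTPATH", "E4B_INFER_GEMV")


def ab_environments(base_env=None):
    """Yield one environment dict per on/off combination of the kill-switches."""
    base = dict(os.environ if base_env is None else base_env)
    for flags in itertools.product((False, True), repeat=len(KILL_SWITCHES)):
        env = {k: v for k, v in base.items() if k not in KILL_SWITCHES}
        for name, disabled in zip(KILL_SWITCHES, flags):
            if disabled:
                env[name] = "0"  # documented kill value
        yield env


envs = list(ab_environments({"OFFLOAD_EXPERTS": "1", "BENCH_TOKENS": "128"}))
```

Each dict can then be passed as `env=` to `subprocess.run([sys.executable, "-m", "experts4bit_qlora.infer"], ...)` to collect one timing per configuration.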

Benchmarks

# Runs on stock bitsandbytes:
python bench/bench_energy_excluded.py                    # memory wall + tokens-per-joule vs batch

# Require bitsandbytes >= 0.50 — measure the upstream matmul_4bit routing (#1965):
python bench/_upstream/bench_matmul4bit.py --mode both   # equivalence + latency/memory
python bench/_upstream/bench_energy.py                   # joules/op: bf16 vs dequant vs matmul_4bit

The LoRA-placement ablation (which of experts / attention / router to train) and full energy analysis are written up in docs/METHODOLOGY.md. Short version: on Alpaca the placements are largely redundant, attention-only is the efficiency pick, and training the router hurts.

Relationship to bitsandbytes

ExpertsNbit / Experts4bit are bitsandbytes primitives, proposed upstream in bitsandbytes#1965. Until that ships in a release, this package vendors a copy (experts4bit_qlora/_vendor/experts.py) so it runs on stock bitsandbytes today. The import shim prefers the upstream classes when they are present and still expose the internals ExpertsLoRA builds on, and falls back to the vendored copy otherwise. Both names must resolve to the same implementation, never a mix:

try:
    from bitsandbytes.nn import Experts4bit, ExpertsNbit   # once bitsandbytes#1965 releases (if compatible)
except ImportError:
    from ._vendor.experts import Experts4bit, ExpertsNbit  # vendored fallback (stock bnb)

Nothing in training depends on the bitsandbytes version: the recompute-in-backward projection delivers the activation-memory win on any release. The only bnb.matmul_4bit use left in the package is the inference decode GEMV, which is probe-gated per configuration and passes on stock 0.49.x. When #1965 lands upstream: bump the bitsandbytes floor and delete _vendor/ — no API change.
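The "same implementation, never a mix" rule with a compatibility check can be sketched as a small selector. The loader callables and `choose_experts_impl` are hypothetical; `_project` is used as the required internal only because it is the one named earlier in this README, and the real shim may check something else.

```python
def choose_experts_impl(load_upstream, load_vendored, required_attrs=("_project",)):
    """Return (Experts4bit, ExpertsNbit) from a single source, never a mix."""
    try:
        four_bit, n_bit = load_upstream()
    except ImportError:
        return load_vendored()
    if all(hasattr(n_bit, name) for name in required_attrs):
        return four_bit, n_bit
    # Upstream exists but lacks the internals: take both names from the vendored copy.
    return load_vendored()


# Stand-in classes so the sketch runs without bitsandbytes installed.
class VendoredNbit:
    def _project(self):
        pass


class Vendored4bit(VendoredNbit):
    pass


class UpstreamNbitWithoutInternals:
    pass


class Upstream4bitWithoutInternals(UpstreamNbitWithoutInternals):
    pass


def missing_upstream():
    raise ImportError("no upstream classes")


vendored = lambda: (Vendored4bit, VendoredNbit)
incompatible = lambda: (Upstream4bitWithoutInternals, UpstreamNbitWithoutInternals)

picked_when_missing = choose_experts_impl(missing_upstream, vendored)
picked_when_incompatible = choose_experts_impl(incompatible, vendored)
```

Returning both classes from one loader call is what keeps the pair consistent.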

Provenance & audits

Every measured number above traces to a committed script/test, an exact environment, and a repo commit in PROVENANCE.md — and that file is OpenTimestamps-anchored: ots verify PROVENANCE.md.ots PROVENANCE.md checks the on-disk bytes against the calendar proof, the footer carries the hash-chain of prior revisions, and superseded proofs are retained in .ots-history/. Falsification work lives under audits/ — most recently the audit of unsloth-zoo's MoE-4bit fix that produced unsloth-zoo#849/#850 (audits/unsloth-zoo-4032/REPORT.md).
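For readers unfamiliar with hash chains, here is a generic construction in which each link commits to every earlier revision. It is an illustration only: how PROVENANCE.md actually builds its chain is not specified here, and the all-zero genesis value and function names are assumptions.

```python
import hashlib


def chain_digest(previous_hex, revision_bytes):
    """Hash this revision together with the previous link's digest."""
    h = hashlib.sha256()
    h.update(bytes.fromhex(previous_hex))
    h.update(revision_bytes)
    return h.hexdigest()


def build_chain(revisions):
    digest = "00" * 32  # assumed genesis value
    links = []
    for revision in revisions:
        digest = chain_digest(digest, revision)
        links.append(digest)
    return links


links = build_chain([b"rev-1", b"rev-2", b"rev-3"])
tampered = build_chain([b"rev-1", b"rev-X", b"rev-3"])
```

Changing any earlier revision changes every later link, which is what makes a retained chain useful for auditing.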

License

MIT (see LICENSE). experts4bit_qlora/_vendor/experts.py is vendored from bitsandbytes (also MIT) pending upstream merge.
