QLoRA fine-tuning of fused 4-bit Mixture-of-Experts on a single small GPU, on stock bitsandbytes.
experts4bit-qlora
QLoRA fine-tuning of fused Mixture-of-Experts weights on a single small GPU — the part that doesn't fit anywhere else yet.
The problem
transformers v5 stores MoE experts as one fused 3-D nn.Parameter per layer
(OlmoeExperts, Qwen3MoeExperts, …). bitsandbytes' 4-bit walker only replaces nn.Linear
modules, so it silently skips the experts — which are the overwhelming majority of a MoE's
weights. load_in_4bit "shrinks" the model but the experts stay in full precision
(bitsandbytes#1849).
Experts4bit is the primitive that 4-bit-quantizes exactly that fused stack. As of v0.2.0 it is
the 4-bit face of ExpertsNbit, which stores the same stack at selectable precision — nf4
/ fp4 (4-bit packed), int8 / fp8 (8-bit blockwise), or bf16 / fp16 (passthrough) — with
a test-pinned fidelity ordering (bf16 < int8 < nf4 reconstruction error) so the precision
knob is a measured trade, not a vibe. This package pairs the primitive with a streaming
loader and per-expert LoRA, so you can actually fine-tune a real sparse-MoE on
reasonable hardware.
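The fidelity ordering above can be illustrated with a toy blockwise absmax quantizer. This is a pure-Python sketch, not the package's kernels: `quantize_blockwise`, the block size of 64, and the uniform signed grid (a stand-in for the real NF4 codebook, which is non-uniform) are all assumptions made for illustration.

```python
import random


def quantize_blockwise(values, bits, block_size=64):
    """Quantize then dequantize `values`, one absmax scale per block.

    Hypothetical helper: uses a uniform signed grid with 2**(bits-1) - 1
    positive levels, which is simpler than the real NF4 codebook.
    """
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # all-zero block: any scale works
        for v in block:
            code = round(v / absmax * qmax)         # integer code in [-qmax, qmax]
            out.append(code * absmax / qmax)        # reconstructed value
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]
err8 = mean_abs_error(weights, quantize_blockwise(weights, bits=8))
err4 = mean_abs_error(weights, quantize_blockwise(weights, bits=4))
print(f"8-bit error: {err8:.5f}   4-bit error: {err4:.5f}")
```

With the same per-block scale, the 8-bit grid is finer than the 4-bit one, so its reconstruction error comes out lower. That is the direction of the ordering the package pins in its tests.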
What it buys you (measured on an RTX A2000 12 GB — in a NAS's PCIe 3.0 x8 slot; see METHODOLOGY "Test host")
- It fits at all. Full bf16 OLMoE-1B-7B is ~13.9 GB — it OOMs on a 12 GB card. In 4-bit it loads at 4.70 GB and trains in <8 GB. The streaming loader never materializes the bf16 model in CPU or GPU RAM (verified under a 3 GB container RAM cap).
- It trains. QLoRA on the frozen NF4 experts improves a held-out Alpaca eval from 1.4813 → 1.0290 (see docs/METHODOLOGY.md).
- It scales past VRAM (OFFLOAD_EXPERTS=1). The frozen experts stream from pinned CPU RAM one layer at a time, so a fused-MoE whose 4-bit experts exceed the card can QLoRA-train on 12 GB: Qwen3-30B-A3B peaks at 7.16 GB, Gemma-4-26B-A4B at 8.47 GB; both OOM without offload. Mechanics and cost under Training + expert offload.
- It serves the fine-tune it made (python -m experts4bit_qlora.infer). The adapters run over the exact NF4 base they were trained against, with no GGUF/AWQ re-quantization shifting the error surface. OLMoE decodes at 1.44 tok/s in 1.68 GB with prefetched offload (resident: 3.08 tok/s at 4.86 GB); the same path decodes Gemma-4-26B at 0.43 tok/s (6.2 GB) and Qwen3-30B-A3B at 0.22 tok/s (4.4 GB), models whose resident decode simply OOMs. See Inference.
- Honest caveat: this is a memory technology, not an energy one. On a GPU that already fits the model, 4-bit is a 1.2–2.3× energy penalty (NF4 is storage-only; the GEMM runs in bf16 either way, plus dequant). The energy win only shows up when memory is the binding constraint. Then it's the difference between running and not, and up to 4.4× lower energy/token from the batch that freed memory unlocks. Numbers and method in the docs.
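A back-of-envelope view of why the 4-bit model fits where bf16 does not. The sketch below assumes a packed layout of 4 bits per weight plus one fp32 absmax per 64-weight block (an assumption about the storage layout, not something stated above), and infers the parameter count from the ~13.9 GB bf16 figure.

```python
def bytes_per_param_4bit(block_size=64, absmax_bytes=4):
    """4 bits of payload plus one absmax scale shared by each block (assumed layout)."""
    return 0.5 + absmax_bytes / block_size


BF16_FOOTPRINT_GB = 13.9                      # bf16 OLMoE-1B-7B, from the list above
params_billion = BF16_FOOTPRINT_GB / 2.0      # bf16 stores 2 bytes per parameter
all_4bit_gb = params_billion * bytes_per_param_4bit()

print(f"{bytes_per_param_4bit():.4f} bytes/param in 4-bit vs 2 in bf16")
print(f"~{all_4bit_gb:.2f} GB if every weight were stored in 4-bit")
```

This is only a floor: it sits below the measured 4.70 GB load, which is what one would expect if some tensors stay at higher precision. The exact split is not derived here.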
Install
pip install experts4bit-qlora # primitive + adapters + benchmarks (torch + bitsandbytes)
pip install "experts4bit-qlora[train]" # + the streaming MoE trainer (transformers>=5.0, datasets, ...)
Runs on a stock pip install bitsandbytes today — see "Relationship to bitsandbytes" below.
Quickstart
import torch
from experts4bit_qlora import Experts4bit, ExpertsNbit, ExpertsLoRA
# Freeze a fused expert stack in 4-bit, attach trainable per-expert LoRA.
gate_up = torch.randn(8, 2 * 256, 128) # [num_experts, 2*intermediate, hidden]
down = torch.randn(8, 128, 256) # [num_experts, hidden, intermediate]
base = Experts4bit.from_float(gate_up, down, quant_type="nf4", compute_dtype=torch.float32)
model = ExpertsLoRA(base, r=8, alpha=16) # only the LoRA adapters train
# Same stack at other storage precisions (8-bit blockwise / 16-bit passthrough):
base8 = ExpertsNbit.from_float(gate_up, down, quant_type="int8", compute_dtype=torch.float32)
End-to-end OLMoE QLoRA fine-tune (needs a CUDA GPU + [train] extras):
STEPS=150 R=8 TRAIN_EXPERTS=1 TRAIN_ATTENTION=0 OUT=./out \
python -m experts4bit_qlora.train
Training + expert offload
Training holds no dequantized-expert activations: the frozen base projections re-dequantize from
the packed weights inside backward (ExpertsNbit._project), so activation memory stays flat in
the number of experts — on any released bitsandbytes, for every storage scheme. Two knobs:
- QUANT_TYPE=nf4|fp4|int8|fp8|bf16|fp16 selects the frozen base's storage precision end-to-end (loader → training → serving). Default nf4; serve with the same value you trained with.
- OFFLOAD_EXPERTS=1 keeps the frozen experts in pinned CPU RAM (set OFFLOAD_PIN=0 to skip pinning) and streams one layer to the GPU at a time: GPU-resident only for that layer's forward and its gradient-checkpoint recompute, evicted after. Peak GPU drops by roughly (experts footprint − one layer) at the cost of one PCIe transfer per layer per pass (+11 % s/step on the OLMoE A/B). A memory optimization, not a speedup: it changes what fits, not how fast. Offloading changes tensor location, not math; this is unit-test-verified, including the gradient-checkpoint recompute path. Offloaded training requires gradient checkpointing (the shipped trainer always enables it); the unsupported non-checkpointed combination fails loudly rather than mis-training. Details in docs/METHODOLOGY.md §11.
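The "experts footprint minus one layer" rule of thumb can be written out as arithmetic. This is a rough sketch with hypothetical numbers: `peak_gpu_gb` and `transfers_per_step` are illustrative helpers, layers are assumed to be equally sized, and a training step is assumed to make two passes over each layer (forward, then the gradient-checkpoint recompute).

```python
def peak_gpu_gb(other_gb, experts_gb, num_layers, offload):
    """Rough peak: non-expert state plus whatever experts sit on the GPU at once."""
    per_layer_gb = experts_gb / num_layers            # assumes equally sized layers
    resident_experts = per_layer_gb if offload else experts_gb
    return other_gb + resident_experts


def transfers_per_step(num_layers, passes=2):
    """One host-to-device copy per layer per pass (forward + recompute assumed)."""
    return num_layers * passes


# Hypothetical model: 3 GB of non-expert state, 16 GB of 4-bit experts, 32 layers.
resident = peak_gpu_gb(3.0, 16.0, 32, offload=False)
offloaded = peak_gpu_gb(3.0, 16.0, 32, offload=True)
print(f"resident {resident:.1f} GB, offloaded {offloaded:.1f} GB, "
      f"saved {resident - offloaded:.1f} GB, "
      f"{transfers_per_step(32)} transfers per step")
```

The saving equals the full expert footprint minus the one layer that must be resident, and it is paid for in transfers rather than compute.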
Transfer diagnostics (default off): E4B_OFFLOAD_STATS=1 prints per-layer H2D bandwidth, prefetch
stall/slack, and a one-shot PCIe-link + ceiling report; E4B_OFFLOAD_ARENA=1 consolidates each
layer's four expert tensors into two per-dtype copies. What they measured on the reference host —
and why offload is PCIe-bound there — is in
docs/OFFLOAD-TRANSFER-NOTES.md.
Scope
The ExpertsNbit primitive and ExpertsLoRA adapters are model-agnostic. The streaming
loader / trainer (python -m experts4bit_qlora.train) supports SwiGLU fused-MoE architectures —
experts stored either per-expert or already-fused on disk:
- OLMoE (OLMoE-1B-7B) — convergence-tested end-to-end; fits a 12 GB card at ~4.7 GB.
- Qwen3-MoE / Qwen3.5-MoE — same checkpoint + module layout as OLMoE (verified byte-identical); structurally tested.
- Gemma-4 (text tower): different internally (experts at layers.{i}.experts beside a parallel dense MLP + a custom router; experts fused on disk), handled and structurally tested.
All three are covered by tests/test_loader_architectures.py. Real Qwen3/Gemma weights (26–35B)
need a ≥24 GB card, or the expert-offload path above to fit in 12 GB. Unsupported architectures
fail fast with a clear error; PRs for more welcome.
Inference: serve the fine-tune you just made
The adapters were trained against this exact NF4 base (same codebook, same per-expert absmax).
python -m experts4bit_qlora.infer serves them over that same base — no re-quantization to
GGUF/AWQ, so the quantization error at serving time is identical to what training saw:
ADAPTER=./out/adapter_best.pt python -m experts4bit_qlora.infer # generate
OFFLOAD_EXPERTS=1 BENCH_TOKENS=128 python -m experts4bit_qlora.infer # timed decode bench
What inference mode adds (all no_grad-only; training paths are untouched):
- Decode fast-path: a single-token forward skips the one-hot expert-mask machinery and its per-expert host syncs, looping the token's top_k experts with 0-d device indices.
- Fused 4-bit GEMV: single-row base projections go through bnb.matmul_4bit's GEMV kernel, which reads the packed NF4 weight directly instead of materializing the dequantized expert. Gated by a per-configuration correctness probe, and the probe passes on stock bitsandbytes 0.49.x. (4-bit only; the 8/16-bit schemes decode via the dequantize path.)
- Prefetched expert offload (OFFLOAD_EXPERTS=1, default PREFETCH=1): decode with experts that exceed VRAM. Layer L+1's NF4 experts copy on a side CUDA stream while layer L computes. Staging is layer-granular, so the schedule is deterministic (no expert-prediction needed) and residency is bounded at two layers.
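The two-layer residency bound of the prefetched offload can be checked with a small simulation. This sketch models only the schedule (which layers are on the GPU at each step), not CUDA streams or real copies; `simulate_prefetch` is a hypothetical helper, not part of the package.

```python
def simulate_prefetch(num_layers):
    """Walk one decode step; return (load order, peak number of resident layers)."""
    resident = set()
    loads = []
    peak = 0

    def load(layer):
        resident.add(layer)
        loads.append(layer)

    load(0)                                  # first layer has nothing to overlap with
    for layer in range(num_layers):
        if layer + 1 < num_layers:
            load(layer + 1)                  # copy of L+1 issued before L computes
        peak = max(peak, len(resident))      # L computes while L+1 is in flight
        resident.discard(layer)              # evict L once its compute is done
    return loads, peak


loads, peak = simulate_prefetch(4)
print(loads, peak)  # [0, 1, 2, 3] 2
```

Because the load order is simply the layer order, the schedule needs no prediction of which experts a token will route to, and the peak stays at two layers regardless of model depth.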
Measured on the RTX A2000 (OLMoE + the r16 adapter, 128 greedy tokens; big models: base model,
96 tokens; full grids + analysis in docs/METHODOLOGY.md §12):
| model | config | tok/s | peak GPU |
|---|---|---|---|
| OLMoE-1B-7B | resident (experts on GPU) | 3.08 | 4.86 GB |
| OLMoE-1B-7B | offload, serial | 0.40 | 1.45 GB |
| OLMoE-1B-7B | offload + prefetch | 1.44 | 1.68 GB |
| Gemma-4-26B-A4B | resident | OOM | — |
| Gemma-4-26B-A4B | offload + prefetch | 0.43 | 6.16 GB |
| Qwen3-30B-A3B | resident | OOM | — |
| Qwen3-30B-A3B | offload + prefetch | 0.22 | 4.41 GB |
Same honest framing as training — capability, not throughput — and the levers are shape-dependent, measured: at OLMoE scale prefetch is the result (3.65× over serial) and the GEMV route is neutral; at 26–30B scale decode is so transfer-bound that prefetch's ratio shrinks (1.36× / 1.08×), while GEMV swings from +46 % on Gemma-4 (big per-expert stacks — avoided dequantize traffic dominates) to −8 % on Qwen3-30B (thin experts — it doesn't; prefetch + dequantize is Qwen3's best config at 0.238 tok/s). §12c scores the prediction this falsified. Measure your model with the kill-switches; don't extrapolate across shapes.
Library users: enable_inference_prefetch(handles) links the offload handles the loader (or
offload_model_experts) returns; load_moe_4bit_streaming(..., offload=True, prefetch=True) does
it for you. Serve with the training run's QUANT_TYPE. Kill-switches for A/B:
E4B_DECODE_FASTPATH=0, E4B_INFER_GEMV=0.
Benchmarks
# Runs on stock bitsandbytes:
python bench/bench_energy_excluded.py # memory wall + tokens-per-joule vs batch
# Require bitsandbytes >= 0.50 — measure the upstream matmul_4bit routing (#1965):
python bench/_upstream/bench_matmul4bit.py --mode both # equivalence + latency/memory
python bench/_upstream/bench_energy.py # joules/op: bf16 vs dequant vs matmul_4bit
The LoRA-placement ablation (which of experts / attention / router to train) and full energy
analysis are written up in docs/METHODOLOGY.md. Short version: on Alpaca
the placements are largely redundant, attention-only is the efficiency pick, and training the
router hurts.
Relationship to bitsandbytes
ExpertsNbit / Experts4bit are bitsandbytes primitives, proposed upstream in
bitsandbytes#1965. Until that
ships in a release, this package vendors a copy (experts4bit_qlora/_vendor/experts.py) so it
runs on stock bitsandbytes today. The import shim prefers the upstream classes when they are
present and still expose the internals ExpertsLoRA builds on, and falls back to the vendored
copy otherwise. Both names must resolve to the same implementation, never a mix:
try:
    from bitsandbytes.nn import Experts4bit, ExpertsNbit  # once bitsandbytes#1965 releases (if compatible)
except ImportError:
    from ._vendor.experts import Experts4bit, ExpertsNbit  # vendored fallback (stock bnb)
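The "never a mix" rule can be made explicit with an all-or-nothing resolver. The following is a minimal sketch under stated assumptions, not the package's actual shim: `resolve_experts` is a hypothetical helper, and treating `_project` as the required internal is an assumption made for illustration.

```python
import importlib


def resolve_experts(upstream="bitsandbytes.nn", required=("_project",)):
    """Return ("upstream", classes) only if both classes import and expose `required`.

    Otherwise return ("vendored", None) so the caller imports its vendored copy.
    The `required` attribute names are assumptions made for illustration.
    """
    try:
        module = importlib.import_module(upstream)
        classes = (getattr(module, "Experts4bit"), getattr(module, "ExpertsNbit"))
    except (ImportError, AttributeError):
        return "vendored", None              # module or either class is missing
    if all(hasattr(cls, attr) for cls in classes for attr in required):
        return "upstream", classes           # both names come from the same module
    return "vendored", None                  # incompatible upstream: take neither class


print(resolve_experts("module_that_does_not_exist")[0])  # vendored
```

Returning both classes or neither keeps the two names from ever resolving to different implementations, which a plain per-name import fallback would not guarantee.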
Nothing in training depends on the bitsandbytes version: the recompute-in-backward projection
delivers the activation-memory win on any release. The only bnb.matmul_4bit use left in the
package is the inference decode GEMV, which is probe-gated per configuration and passes on stock
0.49.x. When #1965 lands upstream: bump the bitsandbytes floor and delete _vendor/ — no API
change.
Provenance & audits
Every measured number above traces to a committed script/test, an exact environment, and a repo
commit in PROVENANCE.md — and that file is OpenTimestamps-anchored: ots verify PROVENANCE.md.ots PROVENANCE.md checks the on-disk bytes against the calendar proof, the footer
carries the hash-chain of prior revisions, and superseded proofs are retained in
.ots-history/. Falsification work lives under audits/ — most
recently the audit of unsloth-zoo's MoE-4bit fix that produced unsloth-zoo#849/#850
(audits/unsloth-zoo-4032/REPORT.md).
License
MIT (see LICENSE). experts4bit_qlora/_vendor/experts.py is vendored from
bitsandbytes (also MIT) pending upstream merge.
File details
Details for the file experts4bit_qlora-0.2.0.tar.gz.
File metadata
- Download URL: experts4bit_qlora-0.2.0.tar.gz
- Size: 69.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f08a2e7403967e10a4f283c7a612a2fc0a7dbeaedb5deb58966fb65c755d8f44 |
| MD5 | 933c04991b9ee5c9f36baa6996345e5f |
| BLAKE2b-256 | f62d95077ae789ab8b0901e39eefd8c2420f3030be01d2cb0da348e8e8946cfc |
Provenance
The following attestation bundles were made for experts4bit_qlora-0.2.0.tar.gz:
Publisher: release.yml on pjordanandrsn/experts4bit-qlora
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: experts4bit_qlora-0.2.0.tar.gz
- Subject digest: f08a2e7403967e10a4f283c7a612a2fc0a7dbeaedb5deb58966fb65c755d8f44
- Sigstore transparency entry: 2066391859
- Permalink: pjordanandrsn/experts4bit-qlora@e67117d7198d7e147de96f7865fd7c892d1cbadf
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/pjordanandrsn
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e67117d7198d7e147de96f7865fd7c892d1cbadf
- Trigger Event: push
File details
Details for the file experts4bit_qlora-0.2.0-py3-none-any.whl.
File metadata
- Download URL: experts4bit_qlora-0.2.0-py3-none-any.whl
- Size: 46.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fe15f8638049559a0121d61563546696606e05a1c806953efe90348a1d6d210e |
| MD5 | 4d7ac929b70d7b4cb04e16d11e1045f9 |
| BLAKE2b-256 | 732987d834f4581a9bf14b5864bba4e9c2b5f00d56bee7191cd7d4e87afeb831 |
Provenance
The following attestation bundles were made for experts4bit_qlora-0.2.0-py3-none-any.whl:
Publisher: release.yml on pjordanandrsn/experts4bit-qlora
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: experts4bit_qlora-0.2.0-py3-none-any.whl
- Subject digest: fe15f8638049559a0121d61563546696606e05a1c806953efe90348a1d6d210e
- Sigstore transparency entry: 2066391949
- Permalink: pjordanandrsn/experts4bit-qlora@e67117d7198d7e147de96f7865fd7c892d1cbadf
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/pjordanandrsn
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e67117d7198d7e147de96f7865fd7c892d1cbadf
- Trigger Event: push