QLoRA fine-tuning of fused 4-bit Mixture-of-Experts on a single small GPU, on stock bitsandbytes.
Project description
experts4bit-qlora
QLoRA fine-tuning of fused Mixture-of-Experts weights on a single small GPU — the part that doesn't fit anywhere else yet.
The problem
transformers v5 stores MoE experts as one fused 3-D nn.Parameter per layer
(OlmoeExperts, Qwen3MoeExperts, …). bitsandbytes' 4-bit walker only replaces nn.Linear
modules, so it silently skips the experts — which are the overwhelming majority of a MoE's
weights. load_in_4bit "shrinks" the model but the experts stay in full precision
(bitsandbytes#1849).
Experts4bit is the primitive that 4-bit-quantizes exactly that fused stack. This package pairs
it with a streaming loader and per-expert LoRA, so you can actually fine-tune a real
sparse-MoE on reasonable hardware.
What it buys you (measured on an RTX A2000 12 GB)
- It fits at all. Full bf16 OLMoE-1B-7B is ~13.9 GB — it OOMs on a 12 GB card. In 4-bit it loads at 4.69 GB and trains in <8 GB. The streaming loader never materializes the bf16 model in CPU or GPU RAM (verified under a 3 GB container RAM cap).
- It trains. QLoRA on the frozen NF4 experts improves a held-out Alpaca eval from
1.4813 → 1.0290 (see
docs/METHODOLOGY.md). - Honest caveat — this is a memory technology, not an energy one. On a GPU that already fits the model, 4-bit is a 1.2–2.3× energy penalty (NF4 is storage-only; the GEMM runs in bf16 either way, plus dequant). The energy win only shows up when memory is the binding constraint — then it's the difference between running and not, and up to 4.4× lower energy/token from the batch that freed memory unlocks. Numbers and method in the docs.
Install
pip install experts4bit-qlora # primitive + adapters + benchmarks (torch + bitsandbytes)
pip install "experts4bit-qlora[train]" # + the streaming MoE trainer (transformers>=5.0, datasets, ...)
Runs on a stock pip install bitsandbytes today — see "Relationship to bitsandbytes" below.
Quickstart
import torch
from experts4bit_qlora import Experts4bit, ExpertsLoRA
# Freeze a fused expert stack in 4-bit, attach trainable per-expert LoRA.
gate_up = torch.randn(8, 2 * 256, 128) # [num_experts, 2*intermediate, hidden]
down = torch.randn(8, 128, 256) # [num_experts, hidden, intermediate]
base = Experts4bit.from_float(gate_up, down, quant_type="nf4", compute_dtype=torch.float32)
model = ExpertsLoRA(base, r=8, alpha=16) # only the LoRA adapters train
End-to-end OLMoE QLoRA fine-tune (needs a CUDA GPU + [train] extras):
STEPS=150 R=8 TRAIN_EXPERTS=1 TRAIN_ATTENTION=0 OUT=./out \
python -m experts4bit_qlora.train
Scope
The Experts4bit primitive and ExpertsLoRA adapters are model-agnostic. The streaming loader /
trainer (python -m experts4bit_qlora.train) supports SwiGLU fused-MoE architectures — experts stored
either per-expert or already-fused on disk:
- OLMoE (OLMoE-1B-7B) — convergence-tested end-to-end; fits a 12 GB card at ~4.7 GB.
- Qwen3-MoE / Qwen3.5-MoE — same checkpoint + module layout as OLMoE (verified byte-identical); structurally tested.
- Gemma-4 (text tower) — different internally (experts at
layers.{i}.expertsbeside a parallel dense MLP + a custom router; experts fused on disk) — handled and structurally tested.
All three are covered by tests/test_loader_architectures.py. Real Qwen3/Gemma weights (26–35B) need a
≥24 GB card — or the CPU-offloading path (tracked separately) — to fit 12 GB. Unsupported architectures
fail fast with a clear error; PRs for more welcome.
Benchmarks
# Runs on stock bitsandbytes (uses the portable dequantize forward):
python bench/bench_energy_excluded.py # memory wall + tokens-per-joule vs batch
# Require bitsandbytes >= 0.50 — measure the upstream matmul_4bit optimization (#1965):
python bench/_upstream/bench_matmul4bit.py --mode both # equivalence + latency/memory
python bench/_upstream/bench_energy.py # joules/op: bf16 vs dequant vs matmul_4bit
The LoRA-placement ablation (which of experts / attention / router to train) and full energy
analysis are written up in docs/METHODOLOGY.md. Short version: on Alpaca
the placements are largely redundant, attention-only is the efficiency pick, and training the
router hurts.
Relationship to bitsandbytes
Experts4bit is a bitsandbytes primitive, proposed upstream in
bitsandbytes#1965. Until it
ships in a release, this package vendors a copy (experts4bit_qlora/_vendor/experts.py) so it
runs on stock bitsandbytes today. The import shim prefers the upstream class when present:
try:
from bitsandbytes.nn import Experts4bit # once bitsandbytes#1965 releases
except ImportError:
from ._vendor.experts import Experts4bit # vendored fallback (stock bnb)
The vendored forward also auto-detects whether matmul_4bit is correct on your installed
bitsandbytes — it only handles this weight layout correctly on bnb ≥ 0.50, so on older releases
the primitive uses the portable dequantize path, and the matmul_4bit memory optimization engages
automatically once you upgrade. Results are correct on any supported bnb either way.
When it lands upstream: bump the bitsandbytes floor and delete _vendor/ — no API change.
License
MIT (see LICENSE). experts4bit_qlora/_vendor/experts.py is vendored from
bitsandbytes (also MIT) pending upstream merge.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file experts4bit_qlora-0.1.1.tar.gz.
File metadata
- Download URL: experts4bit_qlora-0.1.1.tar.gz
- Upload date:
- Size: 21.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cbac651eef08ba5db41dbf7f7ce429f77b699dccdb5b61bc9499f9575fe7bdd1
|
|
| MD5 |
3c4aefdaa7f38f6badc76d39ac25d8c4
|
|
| BLAKE2b-256 |
d62ecf000141b8cb1eedab7f52d4365cd42aec339e0487ed3679cc01a8c10d26
|
Provenance
The following attestation bundles were made for experts4bit_qlora-0.1.1.tar.gz:
Publisher:
release.yml on pjordanandrsn/experts4bit-qlora
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
experts4bit_qlora-0.1.1.tar.gz -
Subject digest:
cbac651eef08ba5db41dbf7f7ce429f77b699dccdb5b61bc9499f9575fe7bdd1 - Sigstore transparency entry: 2038654433
- Sigstore integration time:
-
Permalink:
pjordanandrsn/experts4bit-qlora@142d384b0f3a76fbb586d260e24014b125d81848 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/pjordanandrsn
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@142d384b0f3a76fbb586d260e24014b125d81848 -
Trigger Event:
push
-
Statement type:
File details
Details for the file experts4bit_qlora-0.1.1-py3-none-any.whl.
File metadata
- Download URL: experts4bit_qlora-0.1.1-py3-none-any.whl
- Upload date:
- Size: 20.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f098507accdbacbefe87d49696695d60fe5e1a84abd8accbeb7a368df852249e
|
|
| MD5 |
8b1808dba0702b67a7810ecce4ae9f73
|
|
| BLAKE2b-256 |
e80cfb94dba87b15b9e3327412286d6db70b2af6198417f286cbae4eaf9501e7
|
Provenance
The following attestation bundles were made for experts4bit_qlora-0.1.1-py3-none-any.whl:
Publisher:
release.yml on pjordanandrsn/experts4bit-qlora
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
experts4bit_qlora-0.1.1-py3-none-any.whl -
Subject digest:
f098507accdbacbefe87d49696695d60fe5e1a84abd8accbeb7a368df852249e - Sigstore transparency entry: 2038654772
- Sigstore integration time:
-
Permalink:
pjordanandrsn/experts4bit-qlora@142d384b0f3a76fbb586d260e24014b125d81848 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/pjordanandrsn
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@142d384b0f3a76fbb586d260e24014b125d81848 -
Trigger Event:
push
-
Statement type: