
QLoRA fine-tuning of fused 4-bit Mixture-of-Experts on a single small GPU, on stock bitsandbytes.


experts4bit-qlora


QLoRA fine-tuning of fused Mixture-of-Experts weights on a single small GPU — the part that doesn't fit anywhere else yet.

The problem

transformers v5 stores MoE experts as fused 3-D nn.Parameter tensors, one stack per projection per layer (OlmoeExperts, Qwen3MoeExperts, …), rather than as per-expert nn.Linear modules. The bitsandbytes 4-bit walker only replaces nn.Linear modules, so it silently skips the experts, which hold the overwhelming majority of a MoE's weights. load_in_4bit "shrinks" the model, but the experts stay in full precision (bitsandbytes#1849).
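
To make the gap concrete, here is a minimal sketch of what a walker that matches only nn.Linear can and cannot reach. ToyFusedExperts and ToyMoELayer are hypothetical stand-ins written for this example, not transformers classes, and the sizes are arbitrary.

```python
import torch
from torch import nn


class ToyFusedExperts(nn.Module):
    """Hypothetical fused expert stack: 3-D parameters, no nn.Linear inside."""

    def __init__(self, num_experts=8, hidden=64, intermediate=128):
        super().__init__()
        # Same layout as the Quickstart tensors: [experts, out_features, in_features].
        self.gate_up_proj = nn.Parameter(torch.empty(num_experts, 2 * intermediate, hidden))
        self.down_proj = nn.Parameter(torch.empty(num_experts, hidden, intermediate))


class ToyMoELayer(nn.Module):
    """Hypothetical layer: two ordinary nn.Linear modules plus the fused experts."""

    def __init__(self, hidden=64, num_experts=8):
        super().__init__()
        self.q_proj = nn.Linear(hidden, hidden, bias=False)
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = ToyFusedExperts(num_experts=num_experts, hidden=hidden)


layer = ToyMoELayer()

# What a walker that only matches nn.Linear would be able to quantize.
linear_params = sum(m.weight.numel() for m in layer.modules() if isinstance(m, nn.Linear))
total_params = sum(p.numel() for p in layer.parameters())
skipped_params = total_params - linear_params

print(f"reachable via nn.Linear: {linear_params}")
print(f"skipped (fused experts): {skipped_params} ({skipped_params / total_params:.1%})")
```

Even in this toy layer, almost all of the parameters sit in the fused stack that the isinstance check never sees.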

Experts4bit is the primitive that 4-bit-quantizes exactly that fused stack. This package pairs it with a streaming loader and per-expert LoRA, so you can fine-tune a real sparse MoE on a single small GPU (measured below on a 12 GB card).

What it buys you (measured on an RTX A2000 12 GB)

  • It fits at all. Full bf16 OLMoE-1B-7B is ~13.9 GB — it OOMs on a 12 GB card. In 4-bit it loads at 4.69 GB and trains in <8 GB. The streaming loader never materializes the bf16 model in CPU or GPU RAM (verified under a 3 GB container RAM cap).
  • It trains. QLoRA on the frozen NF4 experts improves a held-out Alpaca eval from 1.4813 → 1.0290 (see docs/METHODOLOGY.md).
  • Honest caveat: this is a memory technology, not an energy one. On a GPU that already fits the model, 4-bit carries a 1.2–2.3× energy penalty (NF4 is storage-only; the GEMM runs in bf16 either way, plus the dequantization step). The energy win only shows up when memory is the binding constraint. Then it is the difference between running and not running, with up to 4.4× lower energy/token from the larger batch that the freed memory unlocks. Numbers and method are in the docs.
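
A back-of-envelope estimate shows why quantizing the experts is what makes the model fit. The layer count, expert count, dimensions, and total parameter count below are assumptions about OLMoE-1B-7B that this README does not state, and the block size and fp32 scale per block are assumed quantization settings; check them against the model config and your bitsandbytes defaults. The result is a rough estimate, not a reproduction of the measured figures above.

```python
# Assumed OLMoE-1B-7B shape values (not taken from this README).
NUM_LAYERS = 16
NUM_EXPERTS = 64
HIDDEN = 2048
INTERMEDIATE = 1024
TOTAL_PARAMS = 6.92e9  # assumed total parameter count

# Assumed 4-bit storage settings: one fp32 scale per 64-value block,
# no double quantization of the scales.
BLOCKSIZE = 64
SCALE_BYTES = 4


def gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


# Three projections per expert (gate, up, down), each HIDDEN x INTERMEDIATE.
expert_params = NUM_LAYERS * NUM_EXPERTS * 3 * HIDDEN * INTERMEDIATE
other_params = TOTAL_PARAMS - expert_params

bf16_total_gb = gb(TOTAL_PARAMS * 2)
experts_4bit_gb = gb(expert_params * 0.5 + expert_params / BLOCKSIZE * SCALE_BYTES)
mixed_total_gb = experts_4bit_gb + gb(other_params * 2)  # everything else left in bf16

print(f"expert share of parameters: {expert_params / TOTAL_PARAMS:.1%}")
print(f"all bf16:                   {bf16_total_gb:.2f} GB")
print(f"4-bit experts + bf16 rest:  {mixed_total_gb:.2f} GB")
```

Under these assumptions the experts dominate the parameter count, so leaving them in bf16 keeps the model near its full size no matter what happens to the attention layers.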

Install

pip install experts4bit-qlora           # primitive + adapters + benchmarks (torch + bitsandbytes)
pip install "experts4bit-qlora[train]"  # + the streaming MoE trainer (transformers>=5.0, datasets, ...)

Runs on a stock pip install bitsandbytes today — see "Relationship to bitsandbytes" below.

Quickstart

import torch
from experts4bit_qlora import Experts4bit, ExpertsLoRA

# Freeze a fused expert stack in 4-bit, attach trainable per-expert LoRA.
gate_up = torch.randn(8, 2 * 256, 128)          # [num_experts, 2*intermediate, hidden]
down    = torch.randn(8, 128, 256)              # [num_experts, hidden, intermediate]
base    = Experts4bit.from_float(gate_up, down, quant_type="nf4", compute_dtype=torch.float32)
model   = ExpertsLoRA(base, r=8, alpha=16)      # only the LoRA adapters train
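
For readers new to per-expert LoRA, the following is a minimal sketch of the idea in plain PyTorch. It is not the ExpertsLoRA implementation: the tensor names, the alpha / r scaling, the zero initialization of B, and the einsum routing are assumptions based on the standard LoRA formulation, and the base weights stay in float instead of NF4 to keep the example small.

```python
import torch

torch.manual_seed(0)
num_experts, hidden, inter, r, alpha = 4, 16, 32, 2, 4

# Frozen base weights. In QLoRA these would be dequantized from 4-bit storage.
W = torch.randn(num_experts, inter, hidden)
# Trainable low-rank factors, one pair per expert.
# B starts at zero, so the adapter changes nothing at initialization.
A = (torch.randn(num_experts, r, hidden) * 0.01).requires_grad_()
B = torch.zeros(num_experts, inter, r, requires_grad=True)


def expert_forward(x, expert_idx):
    """For each token t routed to expert e: W[e] x_t + (alpha / r) * B[e] A[e] x_t."""
    base = torch.einsum("toh,th->to", W[expert_idx], x)
    low = torch.einsum("trh,th->tr", A[expert_idx], x)      # project down to rank r
    delta = torch.einsum("tor,tr->to", B[expert_idx], low)  # project back up
    return base + (alpha / r) * delta, base


x = torch.randn(5, hidden)                  # five tokens
expert_idx = torch.tensor([0, 1, 1, 3, 0])  # one expert per token, for simplicity
out, base = expert_forward(x, expert_idx)

# Gradients reach only the adapters; the base stack stays frozen.
out.sum().backward()
print("W has grad:", W.grad is not None)
print("A has grad:", A.grad is not None, "| B has grad:", B.grad is not None)
```

Because only A and B require gradients, optimizer state scales with the adapter rank rather than with the size of the expert stack.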

End-to-end OLMoE QLoRA fine-tune (needs a CUDA GPU + [train] extras):

STEPS=150 R=8 TRAIN_EXPERTS=1 TRAIN_ATTENTION=0 OUT=./out \
  python -m experts4bit_qlora.train

Scope

The Experts4bit primitive and ExpertsLoRA adapters are model-agnostic. The streaming loader / trainer (python -m experts4bit_qlora.train) supports SwiGLU fused-MoE architectures — experts stored either per-expert or already-fused on disk:

  • OLMoE (OLMoE-1B-7B) — convergence-tested end-to-end; fits a 12 GB card at ~4.7 GB.
  • Qwen3-MoE / Qwen3.5-MoE — same checkpoint + module layout as OLMoE (verified byte-identical); structurally tested.
  • Gemma-4 (text tower) — different internally (experts at layers.{i}.experts beside a parallel dense MLP + a custom router; experts fused on disk) — handled and structurally tested.

All three are covered by tests/test_loader_architectures.py. Real Qwen3/Gemma weights (26–35B) need a ≥24 GB card; fitting them in 12 GB requires the CPU-offloading path (tracked separately). Unsupported architectures fail fast with a clear error; PRs adding more are welcome.

Benchmarks

# Runs on stock bitsandbytes (uses the portable dequantize forward):
python bench/bench_energy_excluded.py                    # memory wall + tokens-per-joule vs batch

# Require bitsandbytes >= 0.50 — measure the upstream matmul_4bit optimization (#1965):
python bench/_upstream/bench_matmul4bit.py --mode both   # equivalence + latency/memory
python bench/_upstream/bench_energy.py                   # joules/op: bf16 vs dequant vs matmul_4bit

The LoRA-placement ablation (which of experts / attention / router to train) and full energy analysis are written up in docs/METHODOLOGY.md. Short version: on Alpaca the placements are largely redundant, attention-only is the efficiency pick, and training the router hurts.
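
For orientation, tokens-per-joule can be computed from sampled power readings as sketched below. This is the generic definition, not necessarily how the bench scripts measure it (docs/METHODOLOGY.md describes the actual method), and the sample readings are made up for illustration.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (watts) over time (seconds) with the trapezoid rule."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def tokens_per_joule(num_tokens, timestamps_s, power_w):
    """Throughput per unit energy: higher is better."""
    return num_tokens / energy_joules(timestamps_s, power_w)


# Made-up readings: a constant 100 W for 2 seconds while processing 400 tokens.
timestamps = [0.0, 1.0, 2.0]
power = [100.0, 100.0, 100.0]
print(energy_joules(timestamps, power))          # 200.0 joules
print(tokens_per_joule(400, timestamps, power))  # 2.0 tokens per joule
```

A larger batch raises this figure when it adds tokens faster than it adds energy, which is the effect the memory-wall benchmark is looking for.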

Relationship to bitsandbytes

Experts4bit is a bitsandbytes primitive, proposed upstream in bitsandbytes#1965. Until it ships in a release, this package vendors a copy (experts4bit_qlora/_vendor/experts.py) so it runs on stock bitsandbytes today. The import shim prefers the upstream class when present:

try:
    from bitsandbytes.nn import Experts4bit      # once bitsandbytes#1965 releases
except ImportError:
    from ._vendor.experts import Experts4bit     # vendored fallback (stock bnb)

The vendored forward also auto-detects whether matmul_4bit is correct on your installed bitsandbytes. matmul_4bit only handles this weight layout correctly on bnb ≥ 0.50, so on older releases the primitive uses the portable dequantize path, and the matmul_4bit memory optimization engages automatically once you upgrade. Results are correct on any supported bnb either way.
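
One simple way to pick between the two paths is a version gate, sketched below. This is an assumption for illustration only: the vendored code may detect correctness differently (for example with a numerical check), and choose_forward_path and installed_bnb_version are hypothetical helpers, not part of this package or of bitsandbytes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Threshold taken from the text above: matmul_4bit is trusted from bnb 0.50 on.
MIN_MATMUL_4BIT = (0, 50)


def parse_major_minor(version_string):
    """Extract (major, minor) from strings such as '0.49.2' or '0.50.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def choose_forward_path(bnb_version):
    """Return 'matmul_4bit' on new enough releases, else the portable 'dequantize'."""
    if parse_major_minor(bnb_version) >= MIN_MATMUL_4BIT:
        return "matmul_4bit"
    return "dequantize"


def installed_bnb_version():
    """Return the installed bitsandbytes version string, or None if it is absent."""
    try:
        return version("bitsandbytes")
    except PackageNotFoundError:
        return None


print(choose_forward_path("0.49.2"))
print(choose_forward_path("0.50.0"))
print("installed bitsandbytes:", installed_bnb_version())
```

A version gate is cheap but trusts the release notes; a numerical comparison against the dequantize path is slower at import time and more robust to backports or patched builds.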

When it lands upstream: bump the bitsandbytes floor and delete _vendor/ — no API change.

License

MIT (see LICENSE). experts4bit_qlora/_vendor/experts.py is vendored from bitsandbytes (also MIT) pending upstream merge.
