QLoRA fine-tuning of fused 4-bit Mixture-of-Experts on a single small GPU, on stock bitsandbytes.

Maintainers

pjordanandrsn

experts4bit-qlora

QLoRA fine-tuning of fused Mixture-of-Experts weights on a single small GPU — the part that doesn't fit anywhere else yet.

The problem

transformers v5 stores MoE experts as one fused 3-D nn.Parameter per layer (OlmoeExperts, Qwen3MoeExperts, …). bitsandbytes' 4-bit walker only replaces nn.Linear modules, so it silently skips the experts — which are the overwhelming majority of a MoE's weights. load_in_4bit "shrinks" the model but the experts stay in full precision (bitsandbytes#1849).
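To put a number on "overwhelming majority", here is a back-of-envelope count for OLMoE-1B-7B. The config values (16 layers, 64 experts, hidden size 2048, intermediate size 1024) are assumptions taken from the published model config, not from this README; the total parameter count is inferred from the ~13.9 GB bf16 footprint quoted below.

```python
# Assumed OLMoE-1B-7B config values (not stated in this README).
NUM_LAYERS = 16
NUM_EXPERTS = 64
HIDDEN = 2048
INTERMEDIATE = 1024

# A SwiGLU expert has gate, up and down projections: three HIDDEN x INTERMEDIATE matrices.
params_per_expert = 3 * HIDDEN * INTERMEDIATE
expert_params = NUM_LAYERS * NUM_EXPERTS * params_per_expert

# Total parameters inferred from the ~13.9 GB bf16 footprint (2 bytes per parameter).
total_params = 13.9e9 / 2

expert_fraction = expert_params / total_params
print(f"expert parameters: {expert_params / 1e9:.2f} B")
print(f"share of all weights: {expert_fraction:.0%}")
```

Under these assumptions, roughly nine in ten weights sit in the fused expert stacks, which is why skipping them leaves most of the model unquantized.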

Experts4bit is the primitive that 4-bit-quantizes exactly that fused stack. This package pairs it with a streaming loader and per-expert LoRA, so you can actually fine-tune a real sparse-MoE on reasonable hardware.

What it buys you (measured on an RTX A2000 12 GB)

It fits at all. Full bf16 OLMoE-1B-7B is ~13.9 GB — it OOMs on a 12 GB card. In 4-bit it loads at 4.69 GB and trains in <8 GB. The streaming loader never materializes the bf16 model in CPU or GPU RAM (verified under a 3 GB container RAM cap).
It trains. QLoRA on the frozen NF4 experts improves a held-out Alpaca eval from 1.4813 → 1.0290 (see docs/METHODOLOGY.md).
Honest caveat: this is a memory technology, not an energy one. On a GPU that already fits the model, 4-bit carries a 1.2–2.3× energy penalty (NF4 is storage-only; the GEMM runs in bf16 either way, plus dequantization). The energy win only shows up when memory is the binding constraint. Then it is the difference between running and not, with up to 4.4× lower energy per token from the larger batch sizes that the freed memory unlocks. Numbers and method are in the docs.
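As a rough sanity check on the 4.69 GB figure, the sketch below estimates the loaded size from first principles. It assumes the OLMoE config values used for the expert count (not stated in this README), NF4 with a block size of 64 and one fp32 absmax scale per block, and that all non-expert weights stay in bf16. The package's actual storage layout may differ, so treat the result as an estimate only.

```python
def nf4_bytes(num_params: float, blocksize: int = 64, absmax_bytes: int = 4) -> float:
    """Bytes for NF4 storage: 4 bits per weight plus one absmax scale per block."""
    return num_params * 0.5 + (num_params / blocksize) * absmax_bytes

# Assumed OLMoE-1B-7B shape: 16 layers x 64 experts x 3 matrices of 2048 x 1024.
expert_params = 16 * 64 * 3 * 2048 * 1024

# Total parameters inferred from the ~13.9 GB bf16 footprint (2 bytes each).
total_params = 13.9e9 / 2
other_params = total_params - expert_params

# Experts in NF4, everything else assumed to remain in bf16.
estimate_gb = (nf4_bytes(expert_params) + other_params * 2) / 1e9
print(f"estimated 4-bit footprint: {estimate_gb:.2f} GB")
```

The estimate lands close to the measured load size, which is consistent with the experts being the part that actually shrinks.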

Install

# primitive + adapters + benchmarks (torch + bitsandbytes):
pip install "git+https://github.com/pjordanandrsn/experts4bit-qlora"
# + the streaming MoE trainer (transformers>=5.0, datasets, ...):
pip install "experts4bit-qlora[train] @ git+https://github.com/pjordanandrsn/experts4bit-qlora"

Runs on a stock pip install bitsandbytes today — see "Relationship to bitsandbytes" below.

Quickstart

import torch
from experts4bit_qlora import Experts4bit, ExpertsLoRA

# Freeze a fused expert stack in 4-bit, attach trainable per-expert LoRA.
gate_up = torch.randn(8, 2 * 256, 128)          # [num_experts, 2*intermediate, hidden]
down    = torch.randn(8, 128, 256)              # [num_experts, hidden, intermediate]
base    = Experts4bit.from_float(gate_up, down, quant_type="nf4", compute_dtype=torch.float32)
model   = ExpertsLoRA(base, r=8, alpha=16)      # only the LoRA adapters train
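To see how small the trainable part is for the shapes above, the following sketch counts parameters. It assumes one independent low-rank pair per expert and per projection (A of shape [r, in], B of shape [out, r]); how ExpertsLoRA actually lays out its adapters is not confirmed here, and lora_params is a hypothetical helper.

```python
def lora_params(out_features: int, in_features: int, r: int) -> int:
    """Parameters in one LoRA pair: A is [r, in_features], B is [out_features, r]."""
    return r * (in_features + out_features)

# Shapes from the quickstart above.
num_experts, hidden, intermediate, r = 8, 128, 256, 8

# Frozen 4-bit base: gate_up is [2*intermediate, hidden], down is [hidden, intermediate].
frozen = num_experts * (2 * intermediate * hidden + hidden * intermediate)

# Trainable adapters, assuming one (A, B) pair per expert and projection.
trainable = num_experts * (
    lora_params(2 * intermediate, hidden, r) + lora_params(hidden, intermediate, r)
)

print(f"frozen base weights: {frozen}")
print(f"trainable LoRA weights: {trainable} ({trainable / frozen:.1%} of base)")
```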

End-to-end OLMoE QLoRA fine-tune (needs a CUDA GPU + [train] extras):

STEPS=150 R=8 TRAIN_EXPERTS=1 TRAIN_ATTENTION=0 OUT=./out \
  python -m experts4bit_qlora.train

Scope

The Experts4bit primitive and ExpertsLoRA adapters are model-agnostic. The streaming loader / trainer (python -m experts4bit_qlora.train) supports fused-MoE architectures that store experts per-expert on disk under model.layers.{i}.mlp.experts.{e}.{gate,up,down}_proj.weight with a SwiGLU gate:

OLMoE (OLMoE-1B-7B): convergence-tested end-to-end; fits a 12 GB card at ~4.7 GB.
Qwen3-MoE / Qwen3.5-MoE: same checkpoint + module layout (verified byte-for-byte identical to OLMoE's on-disk format); structurally tested in tests/test_loader_architectures.py. The real weights (30–35B) need a ≥24 GB card; fitting them in 12 GB requires the CPU-offloading path (tracked separately).

Anything else fails fast with a clear error. Gemma 4 is a genuinely different design (experts at layers.{i}.experts beside a parallel dense MLP, with a custom router) and needs its own loader adaptation — not yet supported. PRs welcome.
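The layout check described in this section can be sketched as a key-matching pass over the checkpoint's tensor names. This is an illustration only, not the package's loader: group_expert_keys and its error message are hypothetical, and only the key pattern itself comes from the text above.

```python
import re
from collections import defaultdict

# Per-expert key pattern quoted in this section.
EXPERT_KEY = re.compile(
    r"^model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(gate|up|down)_proj\.weight$"
)

def group_expert_keys(keys):
    """Map (layer, projection) to that layer's keys, ordered by expert index."""
    grouped = defaultdict(dict)
    for key in keys:
        match = EXPERT_KEY.match(key)
        if match is None:
            continue  # not a per-expert tensor (attention, norms, router, ...)
        layer, expert, proj = int(match.group(1)), int(match.group(2)), match.group(3)
        grouped[(layer, proj)][expert] = key
    if not grouped:
        # Fail fast instead of silently loading an unsupported layout.
        raise ValueError(
            "no keys match model.layers.{i}.mlp.experts.{e}.{gate,up,down}_proj.weight"
        )
    return {
        slot: [by_expert[e] for e in sorted(by_expert)]
        for slot, by_expert in grouped.items()
    }

example_keys = [
    "model.layers.0.mlp.experts.1.gate_proj.weight",
    "model.layers.0.mlp.experts.0.gate_proj.weight",
    "model.layers.0.self_attn.q_proj.weight",
]
print(group_expert_keys(example_keys)[(0, "gate")])
```

Grouping by (layer, projection) mirrors how a streaming loader could stack one layer's experts into a fused tensor at a time, rather than holding the whole model in memory.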

Benchmarks

# Runs on stock bitsandbytes (uses the portable dequantize forward):
python bench/bench_energy_excluded.py                    # memory wall + tokens-per-joule vs batch

# Require bitsandbytes >= 0.50 — measure the upstream matmul_4bit optimization (#1965):
python bench/_upstream/bench_matmul4bit.py --mode both   # equivalence + latency/memory
python bench/_upstream/bench_energy.py                   # joules/op: bf16 vs dequant vs matmul_4bit

The LoRA-placement ablation (which of experts / attention / router to train) and full energy analysis are written up in docs/METHODOLOGY.md. Short version: on Alpaca the placements are largely redundant, attention-only is the efficiency pick, and training the router hurts.

Relationship to bitsandbytes

Experts4bit is a bitsandbytes primitive, proposed upstream in bitsandbytes#1965. Until it ships in a release, this package vendors a copy (experts4bit_qlora/_vendor/experts.py) so it runs on stock bitsandbytes today. The import shim prefers the upstream class when present:

try:
    from bitsandbytes.nn import Experts4bit      # once bitsandbytes#1965 releases
except ImportError:
    from ._vendor.experts import Experts4bit     # vendored fallback (stock bnb)

The vendored forward also auto-detects whether matmul_4bit is correct on your installed bitsandbytes. matmul_4bit only handles this weight layout correctly on bnb ≥ 0.50, so on older releases the primitive uses the portable dequantize path, and the matmul_4bit memory optimization engages automatically once you upgrade. Results are correct on any supported bnb either way.
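One plausible way to implement such a switch is a plain version gate, sketched below. This is an assumption for illustration: the vendored code may detect correctness differently (for example with a numerical check), and version_tuple and pick_forward_path are hypothetical names. Only the 0.50 floor comes from the text above.

```python
import re

MIN_MATMUL_4BIT = (0, 50)  # floor quoted in the text above

def version_tuple(version: str) -> tuple:
    """Leading numeric components of a version string, e.g. '0.49.1' -> (0, 49, 1)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)

def pick_forward_path(bnb_version: str) -> str:
    """Return which forward path a version gate would select."""
    if version_tuple(bnb_version) >= MIN_MATMUL_4BIT:
        return "matmul_4bit"
    return "dequantize"

for candidate in ("0.49.1", "0.50.0"):
    print(candidate, "->", pick_forward_path(candidate))
```

In practice the installed version string could come from importlib.metadata.version("bitsandbytes"). Note that this simple parser treats pre-releases such as 0.50.0rc1 as satisfying the floor.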

When it lands upstream: bump the bitsandbytes floor and delete _vendor/ — no API change.

License

MIT (see LICENSE). experts4bit_qlora/_vendor/experts.py is vendored from bitsandbytes (also MIT) pending upstream merge.

Release history

0.2.0 (Jul 4, 2026)
0.1.1 (Jul 1, 2026)
0.1.0 (Jul 1, 2026), this version

Download files

Source Distribution

experts4bit_qlora-0.1.0.tar.gz (21.4 kB)

Uploaded Jul 1, 2026 Source

Built Distribution

experts4bit_qlora-0.1.0-py3-none-any.whl (20.2 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file experts4bit_qlora-0.1.0.tar.gz.

File metadata

Download URL: experts4bit_qlora-0.1.0.tar.gz
Upload date: Jul 1, 2026
Size: 21.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for experts4bit_qlora-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`dca5afb268a47d8964e2d3fc926a4e69c622e3faaf139417b812d97bdbfb6e85`
MD5	`8c8d5b3b02325a49e33cd5b4368eeeeb`
BLAKE2b-256	`538c71037641b31ac45301654b6eb3fad84f8b60ad9b283fd671d6325c89007f`
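A downloaded file can be checked against the SHA256 digest listed above with the standard library. The sketch below is a generic verification helper, not part of this package; sha256_of and matches_digest are hypothetical names, and the file path in the usage note is whatever location you downloaded to.

```python
import hashlib

# SHA256 digest of the source distribution, copied from the table above.
EXPECTED_SDIST_SHA256 = "dca5afb268a47d8964e2d3fc926a4e69c622e3faaf139417b812d97bdbfb6e85"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 of a file, read in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_digest(path: str, expected_hex: str) -> bool:
    """True when the file's SHA256 equals the expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_digest("experts4bit_qlora-0.1.0.tar.gz", EXPECTED_SDIST_SHA256) should return True for an intact download of the source distribution.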

Provenance

The following attestation bundles were made for experts4bit_qlora-0.1.0.tar.gz:

Publisher: release.yml on pjordanandrsn/experts4bit-qlora

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: experts4bit_qlora-0.1.0.tar.gz
- Subject digest: dca5afb268a47d8964e2d3fc926a4e69c622e3faaf139417b812d97bdbfb6e85
- Sigstore transparency entry: 2038509939
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: pjordanandrsn/experts4bit-qlora@a0cea710850b1554faa3d1a3adccc161fe8cee73
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/pjordanandrsn
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a0cea710850b1554faa3d1a3adccc161fe8cee73
- Trigger Event: push

File details

Details for the file experts4bit_qlora-0.1.0-py3-none-any.whl.

File metadata

Download URL: experts4bit_qlora-0.1.0-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 20.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for experts4bit_qlora-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`97e1d1f59fe2a2cff0ac8efa6a9816c03bd1af4e4822fc7f48b4464b7e10dd8c`
MD5	`1b2fa7124865546225f5b441c43a4679`
BLAKE2b-256	`e707bee5b680ff7394d804a1552c1745c09438fdb8276c79c6be540c811bf6fb`

Provenance

The following attestation bundles were made for experts4bit_qlora-0.1.0-py3-none-any.whl:

Publisher: release.yml on pjordanandrsn/experts4bit-qlora

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: experts4bit_qlora-0.1.0-py3-none-any.whl
- Subject digest: 97e1d1f59fe2a2cff0ac8efa6a9816c03bd1af4e4822fc7f48b4464b7e10dd8c
- Sigstore transparency entry: 2038510088
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: pjordanandrsn/experts4bit-qlora@a0cea710850b1554faa3d1a3adccc161fe8cee73
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/pjordanandrsn
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a0cea710850b1554faa3d1a3adccc161fe8cee73
- Trigger Event: push
