Extreme weight and KV cache compression for LLMs on Apple Silicon (MLX implementation of Google's TurboQuant)
Project description
TurboQuant-MLX
Extreme weight and KV cache compression for LLMs on Apple Silicon. MLX implementation of Google's TurboQuant (Zandieh et al., 2025) — Hadamard rotation + Lloyd-Max codebooks applied both to weights (compile time) and the KV cache (run time).
Supports dense models (LLaMA, Qwen, Mistral) and Mixture-of-Experts (Qwen-MoE, GPT-OSS, Qwen3.5-MoE). Compatible with hybrid attention architectures, attention sinks, sliding-window attention, and linear attention layers.
With both weight and KV cache compression at 3-bit, GPT-OSS-120B fits its full 131K context window in 50 GB on a 64 GB MacBook — and KV cache compression actually makes generation faster on the 120B (8.7 vs 6.4 tok/s) because the smaller cache cuts memory bandwidth more than dequant costs.
Key Results — Weight Compression
| Model | Method | Bits | PPL | Size | Gen Speed (M4 Max) |
|---|---|---|---|---|---|
| Qwen2.5-7B | TurboQuant | 3 | 8.92 | 3.5 GB | — |
| Qwen2.5-7B | Affine | 3 | 13.37 | 3.3 GB | — |
| GPT-OSS-20B | Affine (mlx-lm) | 4 | — | 11.2 GB | 148 tok/s |
| GPT-OSS-20B | MXFP4 (original) | 4 | 83.04 | 12.8 GB | — |
| GPT-OSS-20B | TurboQuant | 4 | 72.63 | 11.2 GB | — |
| GPT-OSS-20B | TurboQuant | 3 | 78.60 | 9.3 GB | 73 tok/s |
| GPT-OSS-120B | Affine 4-bit (mlx-community) | 4 | — | 65.8 GB | Doesn't fit 64GB |
| GPT-OSS-120B | MXFP4 (original) | 4 | — | 63.5 GB | Doesn't fit 64GB |
| GPT-OSS-120B | TurboQuant | 3 | — | 48 GB | 44 tok/s |
| GPT-OSS-120B | TurboQuant | 2 | — | 32 GB | 51 tok/s (poor quality) |
| Qwen3.5-122B-A10B | BF16 (original) | 16 | — | ~240 GB | Doesn't fit 64GB |
| Qwen3.5-122B-A10B | TurboQuant | 3 | — | ~50 GB | 26.5 tok/s |
Key Results — KV Cache Compression
| Model | KV cache config | KV size | Speed | Notes |
|---|---|---|---|---|
| GPT-OSS-20B (FP16 weights) | FP16 KV | 27.0 MB | 90.6 tok/s | baseline |
| GPT-OSS-20B (FP16 weights) | TQ 3-bit KV | 7.79 MB | 29.9 tok/s | 3.5x cache savings |
| GPT-OSS-120B (TQ 3-bit weights) | FP16 KV | 45.0 MB | 6.4 tok/s | baseline |
| GPT-OSS-120B (TQ 3-bit weights) | TQ 3-bit KV | 11.83 MB | 8.7 tok/s | 3.8x cache savings — and faster than FP16 |
| GPT-OSS-120B (TQ 3-bit weights) | TQ 4-bit KV | 12.21 MB | 16.0 tok/s | also clean |
| Qwen3.5-122B (TQ 3-bit weights) | FP16 KV | 161.06 MB | 5.4 tok/s | baseline |
| Qwen3.5-122B (TQ 3-bit weights) | TQ 3-bit KV | 150.17 MB | 5.7 tok/s | output identical to FP16 |
KV cache compression projects to ~7 GB RAM saved at 131K context on GPT-OSS-120B and ~5 GB at 262K on Qwen3.5-122B. Roundtrip cosine similarity vs FP16: 0.983 at 3-bit, 0.995 at 4-bit.
Install
pip install turboquant-mlx-full
The package is published as turboquant-mlx-full on PyPI, but importable as
turboquant_mlx (without the -full suffix) — this matches the original
project name and the examples in the Medium articles.
import turboquant_mlx
from turboquant_mlx.layers import TurboQuantKVCache, convert_cache_to_turboquant
Requirements
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- 64 GB unified memory recommended for 20B+ models
- Xcode Command Line Tools and CMake 3.27+ (the package builds a small Metal extension on install)
Install from source (for development)
git clone https://github.com/manjunathshiva/turboquant-mlx.git
cd turboquant-mlx
pip install -e .
For evaluation utilities (perplexity benchmarking), also install the optional dependencies:
pip install "turboquant-mlx-full[eval]"
Quick Start
1. Convert a model to TurboQuant format
# Dense model (e.g., LLaMA 3.2 1B at 3-bit)
python -m turboquant_mlx.convert \
--hf-path meta-llama/Llama-3.2-1B \
--mlx-path ./llama-3.2-1b-tq3 \
--bits 3 --group-size 64
# MoE model (e.g., GPT-OSS-20B at 2-bit)
python -m turboquant_mlx.convert \
--hf-path openai/gpt-oss-20b \
--mlx-path ./gpt-oss-20b-tq2 \
--bits 2 --group-size 64
2. Generate text
python -m turboquant_mlx.generate \
--model ./gpt-oss-20b-tq2 \
--prompt "Why is the sky blue? Explain in simple terms." \
--max-tokens 200
3. Evaluate perplexity
python -m turboquant_mlx.evaluate \
--hf-path openai/gpt-oss-20b \
--bits 2 3 4 \
--num-samples 256 --seq-len 512
4. Generate with KV cache compression
# Standard model + KV cache compression
python -m turboquant_mlx.demo_kv \
--model openai/gpt-oss-20b \
--prompt "Why is the sky blue?" \
--max-tokens 200 --tq-bits 3
# TQ-compressed model + KV cache compression (full stack)
python -m turboquant_mlx.demo_kv \
--model ./gpt-oss-120b-tq3 \
--prompt "Why is the sky blue?" \
--max-tokens 200 --tq-bits 3
# Side-by-side comparison: FP16 KV vs TurboQuant KV
python -m turboquant_mlx.demo_kv \
--model ./gpt-oss-120b-tq3 \
--prompt "Why is the sky blue?" \
--max-tokens 200 --compare
KV Cache Compression
TurboQuant KV cache compression applies the same Hadamard rotation + Lloyd-Max codebook pipeline to KV vectors at runtime. The compressed cache is dequantized to float16 only when attention needs it, so it routes through MLX's standard scaled_dot_product_attention and is compatible with attention sinks, sliding windows, and linear attention layers.
Programmatic usage
from turboquant_mlx.layers import convert_cache_to_turboquant
from mlx_lm.models.cache import make_prompt_cache
# 1. Process the prompt with FP16 KV cache (exact)
cache = make_prompt_cache(model)
model(prompt_tokens, cache=cache)
# 2. Convert to TurboQuant KV cache for generation
cache = convert_cache_to_turboquant(cache, tq_bits=3, group_size=64)
# 3. Continue generation — cache is now compressed
for token in generate_loop(model, cache):
...
Choosing a bit-width
| Weights | KV cache | Recommendation |
|---|---|---|
| FP16 / BF16 | TQ 3-bit | Default sweet spot at every model size |
| TQ-compressed (~20B) | TQ 4-bit | Use 4-bit when stacking on TQ weights — small models have a tighter noise budget |
| TQ-compressed (100B+) | TQ 3-bit | 3-bit on 3-bit works cleanly on GPT-OSS-120B and Qwen3.5-122B — 100B+ models have enough redundancy to absorb the stacked noise |
The speed flip
On small fast models (~20B), KV cache compression is a quality-vs-speed tradeoff: the dequant overhead dominates because the model is fast to begin with. On large slow models (100B+), the 4x smaller KV cache reduces memory bandwidth more than dequant adds — generation is faster than the FP16 baseline:
| Model | FP16 KV | TQ 3-bit KV | Direction |
|---|---|---|---|
| GPT-OSS-20B | 90.6 tok/s | 29.9 tok/s | TQ is 3x slower |
| GPT-OSS-120B | 6.4 tok/s | 8.7 tok/s | TQ is 1.4x faster |
Compatibility
| Feature | Supported | Notes |
|---|---|---|
| Attention sinks | Yes | GPT-OSS sink vectors flow through standard SDPA |
| Sliding window attention | Yes | RotatingKVCache layers are left untouched |
| Linear attention | Yes | ArraysCache (Qwen3.5 GatedDeltaNet) is left untouched |
| Hybrid architectures | Yes | Per-layer cache type is preserved |
| Prompt-first conversion | Yes | Process prompt with FP16, convert before generation |
Running GPT-OSS MoE Models on Apple Silicon
GPT-OSS-20B (21B total, 32 experts, 3.6B active)
Hardware: Apple M4 Max 64GB (or any Apple Silicon with 16GB+ unified memory at 3-bit)
Step 1: Convert to TurboQuant 3-bit (recommended)
python -m turboquant_mlx.convert \
--hf-path openai/gpt-oss-20b \
--mlx-path ./gpt-oss-20b-tq3 \
--bits 3 --group-size 32
Model size: 9.3 GB (vs 12.8 GB MXFP4 original — 28% smaller, lower perplexity)
The converter automatically:
- Detects MoE architecture (SwitchLinear / QuantizedSwitchLinear layers)
- Dequantizes MXFP4 expert weights to float
- Applies Hadamard rotation + Lloyd-Max codebook quantization
- Keeps router weights and attention at full precision
- Handles blockwise Hadamard for 2880-dim experts (2880 = 9 x 320)
Step 2: Generate text
python -m turboquant_mlx.generate \
--model ./gpt-oss-20b-tq3 \
--prompt "Explain quantum entanglement to a 10-year-old." \
--max-tokens 256
Expected: ~73 tok/s generation, ~85 tok/s prefill on M4 Max
Step 3: Run a quick quality check
python -m turboquant_mlx.evaluate \
--hf-path openai/gpt-oss-20b \
--bits 3 \
--no-affine --no-qjl \
--num-samples 64 --seq-len 512
All bit-widths for GPT-OSS-20B
| Method | Bits | Size | Peak RAM | Gen Speed | Quality |
|---|---|---|---|---|---|
| Affine (mlx-lm) | 4 | 11.2 GB | ~14 GB | 148 tok/s | Coherent (but see note below) |
| TurboQuant | 4 | 11.2 GB | ~14 GB | — | Best (PPL 72.63, beats MXFP4) |
| TurboQuant | 3 | 9.3 GB | ~12 GB | 73 tok/s | Recommended (PPL 78.60, beats MXFP4, coherent) |
| TurboQuant | 2 | 7.5 GB | ~10 GB | — | Poor (incoherent generation on pre-quantized models) |
Speed vs quality tradeoff: Affine 4-bit is ~2x faster on the 20B model due to simpler dequantization, but TurboQuant 3-bit is 28% smaller with lower perplexity than both affine 4-bit and OpenAI's own MXFP4. Crucially, affine 4-bit cannot scale to 120B on 64GB hardware — TurboQuant 3-bit is the only option there.
# 4-bit (best quality, beats OpenAI's MXFP4)
python -m turboquant_mlx.convert \
--hf-path openai/gpt-oss-20b \
--mlx-path ./gpt-oss-20b-tq4 \
--bits 4 --group-size 32
GPT-OSS-120B (120B total, 128 experts, ~13B active)
Hardware: Apple M4 Max 64GB — neither the original MXFP4 (63.5 GB) nor the mlx-community 4-bit affine (65.8 GB) fit on a 64GB machine. TurboQuant 3-bit is the only way to run this model on consumer hardware.
Step 1: Convert to TurboQuant 3-bit (recommended)
python -m turboquant_mlx.convert \
--hf-path openai/gpt-oss-120b \
--mlx-path ./gpt-oss-120b-tq3 \
--bits 3 --group-size 64
Model size: 48 GB
Note: Conversion requires temporarily loading the full model. With 120B parameters, peak memory during conversion may reach ~50-55 GB. On a 64 GB machine this is tight — close all other applications before running. The converter processes layers sequentially and frees memory after each expert is quantized.
Step 2: Generate text
python -m turboquant_mlx.generate \
--model ./gpt-oss-120b-tq3 \
--prompt "Explain quantum computing in simple terms." \
--max-tokens 200
Expected: ~44 tok/s generation, ~9.5 tok/s prefill, 52 GB peak memory on M4 Max 64GB
Step 3: Quick quality check
python -m turboquant_mlx.evaluate \
--hf-path openai/gpt-oss-120b \
--bits 3 \
--no-affine --no-qjl \
--num-samples 32 --seq-len 512
All bit-widths for GPT-OSS-120B
| Method | Bits | Size | Peak RAM | Gen Speed | Fits 64 GB? | Quality |
|---|---|---|---|---|---|---|
| mlx-community 4-bit | 4 (affine) | 65.8 GB | — | — | No | — |
| MXFP4 (original) | 4 (mxfp) | 63.5 GB | ~70 GB | — | No | — |
| TurboQuant | 3 | 48 GB | 52.3 GB | 44 tok/s | Yes | Coherent, well-structured |
| TurboQuant | 2 | 32 GB | 34.9 GB | 51 tok/s | Yes | Incoherent after ~20 tokens |
Neither the original MXFP4 format (63.5 GB) nor the mlx-community affine 4-bit re-quantization (65.8 GB) fit on a 64GB Mac. TurboQuant 3-bit (48 GB) is the only way to run GPT-OSS-120B on consumer hardware — and at 44 tok/s, it's interactive speed. At 2-bit, the model fits easily but generation quality degrades rapidly — 3-bit is the minimum for coherent output on pre-quantized MoE models.
Qwen3.5-122B-A10B (122B total, 256 experts, 8 active, ~10B active)
Hardware: Apple M4 Max 64GB — the original BF16 model is ~240 GB. TurboQuant 3-bit compresses it to ~50 GB, fitting on a 64GB machine.
This is a brand-new architecture featuring 256 MoE experts (the most of any model we've tested), hybrid attention (GatedDeltaNet linear attention + standard softmax attention), and thinking/reasoning capability. The model also has a shared expert per layer alongside the routed experts.
Step 1: Convert to TurboQuant 3-bit
python -m turboquant_mlx.convert \
--hf-path Qwen/Qwen3.5-122B-A10B \
--mlx-path ./qwen3.5-122b-tq3 \
--bits 3 --group-size 64
Model size: ~50 GB | Conversion time: ~90 seconds
Note: Conversion requires ~55 GB peak memory. Close all other applications before running. The converter uses memory-efficient processing — each expert layer is replaced immediately after quantization with aggressive garbage collection to handle the 256 experts per layer.
Step 2: Generate text
python -m turboquant_mlx.generate \
--model ./qwen3.5-122b-tq3 \
--prompt "Why is the sky blue? Explain in simple terms." \
--max-tokens 200
Expected: ~26.5 tok/s generation, 55 GB peak memory on M4 Max 64GB
Benchmark
| Method | Bits | Size | Peak RAM | Gen Speed | Fits 64 GB? | Quality |
|---|---|---|---|---|---|---|
| BF16 (original) | 16 | ~240 GB | — | — | No | — |
| TurboQuant | 3 | ~50 GB | 54.9 GB | 26.5 tok/s | Yes | Coherent reasoning with structured thinking |
Qwen3.5-122B-A10B is the largest and most complex model TurboQuant has been tested on: 122B parameters, 256 experts (8 active per token), hybrid GatedDeltaNet + softmax attention, and a shared expert per MoE layer. At 3-bit, the model produces structured reasoning with proper analysis steps — demonstrating that TurboQuant preserves thinking capability at extreme compression.
How It Works
TurboQuant is a two-stage, calibration-free quantization pipeline:
-
Hadamard Rotation — Multiply weights by a randomized Hadamard matrix, transforming any weight distribution into a near-Gaussian shape. This is data-oblivious (no calibration data needed).
-
Lloyd-Max Codebook — Apply information-theoretically optimal quantization for Gaussian distributions. The codebook is a mathematical constant, precomputed once.
The result: near-zero quality loss at 3-bit, and usable 2-bit quantization where standard affine completely breaks down.
For MoE models, all experts within a layer share the same rotation signs and codebook, keeping storage efficient.
CLI Options
python -m turboquant_mlx.convert --help
Options:
--hf-path TEXT HuggingFace model path or local path (required)
--mlx-path TEXT Output directory (default: mlx_model)
--bits {2,3,4} Quantization bit-width (default: 3)
--group-size {32,64,128} Elements per quantization group (default: 64)
--rotation TEXT Rotation method: hadamard, blockwise_hadamard, none
--use-qjl Enable 1-bit QJL residual correction (+1 bit overhead)
--dtype TEXT Model dtype before quantization: float16, bfloat16
Supported Architectures
| Architecture | Model Type | MoE | Status |
|---|---|---|---|
| LLaMA / Llama 3 | llama |
No | Tested |
| Qwen2 / Qwen2.5 | qwen2 |
No | Tested |
| Qwen3.5 | qwen3_5 |
No | Tested |
| Mistral | mistral |
No | Tested |
| Qwen1.5-MoE | qwen2_moe |
Yes | Tested |
| GPT-OSS | gpt_oss |
Yes | Tested |
| Qwen3.5-MoE | qwen3_5_moe |
Yes (256 experts) | Tested (122B) |
Project Structure
turboquant_mlx/
config.py # TurboQuantConfig
convert.py # CLI: HF model -> TurboQuant MLX
generate.py # Text generation with TurboQuant models
evaluate.py # Perplexity evaluation
quantize_model.py # Model traversal & layer replacement
demo_kv.py # Streaming generation demo with KV cache compression
test_kv_cache.py # KV cache roundtrip + integration tests
core/
codebook.py # Lloyd-Max codebooks for Gaussian
rotation.py # Randomized Hadamard rotation
polar_quantize.py # Rotate + codebook quantize
packing.py # Bit-packing into uint32
qjl.py # QJL residual correction
layers/
polar_linear.py # PolarQuantizedLinear (dense)
polar_switch_linear.py # PolarQuantizedSwitchLinear (MoE)
polar_kv_cache.py # TurboQuantKVCache (runtime KV compression)
kernels/
polar_qmv.py # Fused Metal kernel (dense decode)
polar_gather_qmv.py # Fused Metal kernel (MoE shared input)
polar_multi_gather_qmv.py # Fused Metal kernel (MoE per-expert input)
csrc/
polar_kernels.metal # Native Metal shaders (SIMD group reduction)
polar_ops.h/cpp # C++ MLX Primitive classes
bindings.cpp # nanobind Python bindings
CMakeLists.txt # Build system
integration/
rotation_configs.py # Per-architecture rotation configs
Citation
@misc{turboquant_mlx,
title={TurboQuant-MLX: Extreme Weight and KV Cache Compression for Apple Silicon},
year={2025},
note={MLX implementation of TurboQuant (Zandieh et al., 2025) for both weight quantization and runtime KV cache compression}
}
License
MIT
Acknowledgments
- TurboQuant — Zandieh, Han, Daliri, Karbasi (2025)
- MLX — Apple Machine Learning Research
- mlx-lm — MLX language model utilities
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file turboquant_mlx_full-0.1.0.tar.gz.
File metadata
- Download URL: turboquant_mlx_full-0.1.0.tar.gz
- Upload date:
- Size: 60.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
737fa4fdb28c874d8be0aef50d6dd80aa55d3f2f279b2e2f05315ff911695a45
|
|
| MD5 |
44804b6684dd423b59009c2696e27799
|
|
| BLAKE2b-256 |
f2c7c23eb2bcf62891a923d972c93141fd2aa1bb645af6695aef8ae78153aed7
|