Extreme weight and KV cache compression for LLMs on Apple Silicon (MLX implementation of Google's TurboQuant)

TurboQuant-MLX

Extreme weight and KV cache compression for LLMs on Apple Silicon. MLX implementation of Google's TurboQuant (Zandieh et al., 2025) — Hadamard rotation + Lloyd-Max codebooks applied both to weights (compile time) and the KV cache (run time).

Supports dense models (LLaMA, Qwen, Mistral) and Mixture-of-Experts (Qwen-MoE, GPT-OSS, Qwen3.5-MoE). Compatible with hybrid attention architectures, attention sinks, sliding-window attention, and linear attention layers.

With both weight and KV cache compression at 3-bit, GPT-OSS-120B fits its full 131K context window in 50 GB on a 64 GB MacBook — and KV cache compression actually makes generation faster on the 120B (8.7 vs 6.4 tok/s) because the smaller cache cuts memory bandwidth more than dequant costs.

Key Results — Weight Compression

Model	Method	Bits	PPL	Size	Gen Speed (M4 Max)
Qwen2.5-7B	TurboQuant	3	8.92	3.5 GB	—
Qwen2.5-7B	Affine	3	13.37	3.3 GB	—
GPT-OSS-20B	Affine (mlx-lm)	4	—	11.2 GB	148 tok/s
GPT-OSS-20B	MXFP4 (original)	4	83.04	12.8 GB	—
GPT-OSS-20B	TurboQuant	4	72.63	11.2 GB	—
GPT-OSS-20B	TurboQuant	3	78.60	9.3 GB	73 tok/s
GPT-OSS-120B	Affine 4-bit (mlx-community)	4	—	65.8 GB	Doesn't fit 64GB
GPT-OSS-120B	MXFP4 (original)	4	—	63.5 GB	Doesn't fit 64GB
GPT-OSS-120B	TurboQuant	3	—	48 GB	44 tok/s
GPT-OSS-120B	TurboQuant	2	—	32 GB	51 tok/s (poor quality)
Qwen3.5-122B-A10B	BF16 (original)	16	—	~240 GB	Doesn't fit 64GB
Qwen3.5-122B-A10B	TurboQuant	3	—	~50 GB	26.5 tok/s

Key Results — KV Cache Compression

Model	KV cache config	KV size	Speed	Notes
GPT-OSS-20B (FP16 weights)	FP16 KV	27.0 MB	90.6 tok/s	baseline
GPT-OSS-20B (FP16 weights)	TQ 3-bit KV	7.79 MB	29.9 tok/s	3.5x cache savings
GPT-OSS-120B (TQ 3-bit weights)	FP16 KV	45.0 MB	6.4 tok/s	baseline
GPT-OSS-120B (TQ 3-bit weights)	TQ 3-bit KV	11.83 MB	8.7 tok/s	3.8x cache savings, and faster than FP16
GPT-OSS-120B (TQ 3-bit weights)	TQ 4-bit KV	12.21 MB	16.0 tok/s	also clean
Qwen3.5-122B (TQ 3-bit weights)	FP16 KV	161.06 MB	5.4 tok/s	baseline
Qwen3.5-122B (TQ 3-bit weights)	TQ 3-bit KV	150.17 MB	5.7 tok/s	output identical to FP16

KV cache compression projects to ~7 GB RAM saved at 131K context on GPT-OSS-120B and ~5 GB at 262K on Qwen3.5-122B. Roundtrip cosine similarity vs FP16: 0.983 at 3-bit, 0.995 at 4-bit.

Install

pip install turboquant-mlx-full

The package is published as turboquant-mlx-full on PyPI, but importable as turboquant_mlx (without the -full suffix) — this matches the original project name and the examples in the Medium articles.

import turboquant_mlx
from turboquant_mlx.layers import TurboQuantKVCache, convert_cache_to_turboquant
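
Because the distribution name and the import name differ, it can help to confirm what is installed by querying the package metadata under its PyPI name. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of the package.

```python
from importlib import metadata


def installed_version(dist_name: str = "turboquant-mlx-full"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Metadata is looked up by the PyPI name; the import name stays turboquant_mlx.
    print("turboquant-mlx-full:", installed_version() or "not installed")
```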

Requirements

macOS with Apple Silicon (M1/M2/M3/M4)
Python 3.10+
64 GB unified memory recommended for 20B+ models
Xcode Command Line Tools and CMake 3.27+ (the package builds a small Metal extension on install)
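
A short preflight script can catch a missing requirement before the Metal extension build starts. This is a sketch under stated assumptions: the function names are illustrative, it only checks that `cmake` is on the PATH (not that it is 3.27+), and it does not check unified memory size.

```python
import platform
import shutil
import subprocess
import sys


def has_xcode_clt() -> bool:
    """True if `xcode-select -p` reports an active developer directory."""
    if shutil.which("xcode-select") is None:
        return False
    return subprocess.run(["xcode-select", "-p"], capture_output=True).returncode == 0


def preflight() -> dict:
    """Check the requirements listed above; returns a name -> bool mapping."""
    return {
        "macOS": platform.system() == "Darwin",
        "Apple Silicon (arm64)": platform.machine() == "arm64",
        "Python 3.10+": sys.version_info >= (3, 10),
        "cmake on PATH": shutil.which("cmake") is not None,
        "Xcode Command Line Tools": has_xcode_clt(),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'ok  ' if ok else 'FAIL'}  {name}")
```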

Install from source (for development)

git clone https://github.com/manjunathshiva/turboquant-mlx.git
cd turboquant-mlx
pip install -e .

For evaluation utilities (perplexity benchmarking), also install the optional dependencies:

pip install "turboquant-mlx-full[eval]"

Quick Start

1. Convert a model to TurboQuant format

# Dense model (e.g., LLaMA 3.2 1B at 3-bit)
python -m turboquant_mlx.convert \
    --hf-path meta-llama/Llama-3.2-1B \
    --mlx-path ./llama-3.2-1b-tq3 \
    --bits 3 --group-size 64

# MoE model (e.g., GPT-OSS-20B at 2-bit)
python -m turboquant_mlx.convert \
    --hf-path openai/gpt-oss-20b \
    --mlx-path ./gpt-oss-20b-tq2 \
    --bits 2 --group-size 64

2. Generate text

python -m turboquant_mlx.generate \
    --model ./gpt-oss-20b-tq2 \
    --prompt "Why is the sky blue? Explain in simple terms." \
    --max-tokens 200

3. Evaluate perplexity

python -m turboquant_mlx.evaluate \
    --hf-path openai/gpt-oss-20b \
    --bits 2 3 4 \
    --num-samples 256 --seq-len 512
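
For readers unfamiliar with the PPL column: perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below shows that definition in plain Python. It is not the package's `evaluate` implementation, and the `logits` / `targets` inputs are assumed to be ordinary Python lists.

```python
import math


def perplexity(logits, targets):
    """exp(mean negative log-likelihood).

    logits:  list of per-token score lists (one score per vocabulary entry)
    targets: list of the correct token id for each position
    """
    total_nll = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-softmax
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total_nll += log_z - row[target]
    return math.exp(total_nll / len(targets))


if __name__ == "__main__":
    vocab = 50
    # A model with no preference (all-zero logits) has perplexity equal to the vocab size.
    uniform = [[0.0] * vocab for _ in range(8)]
    print(perplexity(uniform, list(range(8))))
```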

4. Generate with KV cache compression

# Standard model + KV cache compression
python -m turboquant_mlx.demo_kv \
    --model openai/gpt-oss-20b \
    --prompt "Why is the sky blue?" \
    --max-tokens 200 --tq-bits 3

# TQ-compressed model + KV cache compression (full stack)
python -m turboquant_mlx.demo_kv \
    --model ./gpt-oss-120b-tq3 \
    --prompt "Why is the sky blue?" \
    --max-tokens 200 --tq-bits 3

# Side-by-side comparison: FP16 KV vs TurboQuant KV
python -m turboquant_mlx.demo_kv \
    --model ./gpt-oss-120b-tq3 \
    --prompt "Why is the sky blue?" \
    --max-tokens 200 --compare

KV Cache Compression

TurboQuant KV cache compression applies the same Hadamard rotation + Lloyd-Max codebook pipeline to KV vectors at runtime. The compressed cache is dequantized to float16 only when attention needs it, so it routes through MLX's standard scaled_dot_product_attention and is compatible with attention sinks, sliding windows, and linear attention layers.
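
The rotate, quantize, dequantize roundtrip described above can be sketched in plain Python for a single 64-element group. This is an illustration, not the package's kernel: the per-group RMS scale, the function names, and the hardcoded table (the standard 8-level Lloyd-Max levels for a unit Gaussian) are assumptions made for the example.

```python
import math
import random

# 8-level (3-bit) Lloyd-Max reconstruction levels for a unit Gaussian.
CODEBOOK_3BIT = [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520]


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out, h = list(vec), 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                out[j], out[j + h] = out[j] + out[j + h], out[j] - out[j + h]
        h *= 2
    scale = 1.0 / math.sqrt(len(out))
    return [x * scale for x in out]


def compress(vec, signs):
    """Random sign flip + Hadamard, then one scale and a 3-bit index per element."""
    rotated = fwht([s * x for s, x in zip(signs, vec)])
    scale = math.sqrt(sum(x * x for x in rotated) / len(rotated)) or 1.0
    codes = [min(range(8), key=lambda k: abs(x / scale - CODEBOOK_3BIT[k])) for x in rotated]
    return codes, scale


def decompress(codes, scale, signs):
    """Look up codebook values, then undo the rotation (the transform is its own inverse)."""
    rotated = [CODEBOOK_3BIT[c] * scale for c in codes]
    return [s * x for s, x in zip(signs, fwht(rotated))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 64  # one quantization group
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    # One outlier channel, which the rotation spreads across the whole group.
    key = [rng.gauss(0, 1) * (10 if i == 3 else 1) for i in range(dim)]
    codes, scale = compress(key, signs)
    print("cosine similarity:", round(cosine(key, decompress(codes, scale, signs)), 4))
```

For Gaussian inputs, the theoretical 3-bit Lloyd-Max distortion of about 0.035 corresponds to a cosine similarity near 0.98, in line with the roundtrip figure quoted earlier.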

Programmatic usage

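import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache
from turboquant_mlx.layers import convert_cache_to_turboquant

# Load a standard mlx-lm model and tokenize the prompt (batch of 1)
model, tokenizer = load("openai/gpt-oss-20b")
prompt_tokens = mx.array(tokenizer.encode("Why is the sky blue?"))[None]

# 1. Process the prompt with FP16 KV cache (exact)
cache = make_prompt_cache(model)
logits = model(prompt_tokens, cache=cache)

# 2. Convert to TurboQuant KV cache for generation
cache = convert_cache_to_turboquant(cache, tq_bits=3, group_size=64)

# 3. Continue generation (greedy, no EOS handling); cache is now compressed
token = mx.argmax(logits[:, -1, :], axis=-1)
for _ in range(200):
    print(tokenizer.decode([token.item()]), end="", flush=True)
    logits = model(token[None], cache=cache)
    token = mx.argmax(logits[:, -1, :], axis=-1)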

Choosing a bit-width

Weights	KV cache	Recommendation
FP16 / BF16	TQ 3-bit	Default sweet spot at every model size
TQ-compressed (~20B)	TQ 4-bit	Use 4-bit when stacking on TQ weights — small models have a tighter noise budget
TQ-compressed (100B+)	TQ 3-bit	3-bit on 3-bit works cleanly on GPT-OSS-120B and Qwen3.5-122B — 100B+ models have enough redundancy to absorb the stacked noise
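
The table can be encoded as a small lookup when scripting runs across several models. The helper below is hypothetical and not part of the package. The table does not cover TQ-compressed models between roughly 20B and 100B, so the sketch assumes the more conservative 4-bit setting for that range.

```python
def recommended_kv_bits(weights_tq_compressed: bool, params_billion: float) -> int:
    """Encode the recommendation table above as a lookup."""
    if not weights_tq_compressed:
        return 3  # FP16 / BF16 weights: 3-bit KV at every model size
    if params_billion >= 100:
        return 3  # 100B+ models absorb the stacked noise
    return 4      # smaller TQ-compressed models: tighter noise budget (assumed below 100B)


if __name__ == "__main__":
    for name, compressed, size in [
        ("GPT-OSS-20B, FP16 weights", False, 21),
        ("GPT-OSS-20B, TQ weights", True, 21),
        ("GPT-OSS-120B, TQ weights", True, 120),
    ]:
        print(f"{name}: TQ {recommended_kv_bits(compressed, size)}-bit KV")
```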

The speed flip

On small, fast models (~20B), KV cache compression trades speed for memory: the model is fast to begin with, so dequantization overhead dominates. On large, slow models (100B+), the roughly 3.8x smaller KV cache saves more memory bandwidth than dequantization costs, so generation is faster than the FP16 baseline:

Model	FP16 KV	TQ 3-bit KV	Direction
GPT-OSS-20B	90.6 tok/s	29.9 tok/s	TQ is 3x slower
GPT-OSS-120B	6.4 tok/s	8.7 tok/s	TQ is 1.4x faster
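
The ratios quoted in this section follow directly from the measured figures. A short worked calculation, with the numbers copied from the KV cache results table and nothing else assumed, makes the flip explicit:

```python
# Figures copied from the KV cache tables above, as (FP16 KV, TQ 3-bit KV).
RESULTS = {
    "GPT-OSS-20B": {"kv_mb": (27.0, 7.79), "tok_s": (90.6, 29.9)},
    "GPT-OSS-120B": {"kv_mb": (45.0, 11.83), "tok_s": (6.4, 8.7)},
}


def summarize(results):
    """Return {model: (cache shrink factor, speed ratio TQ / FP16)}."""
    rows = {}
    for model, r in results.items():
        cache_ratio = r["kv_mb"][0] / r["kv_mb"][1]  # > 1 means a smaller cache
        speed_ratio = r["tok_s"][1] / r["tok_s"][0]  # > 1 means TQ is faster
        rows[model] = (round(cache_ratio, 2), round(speed_ratio, 2))
    return rows


if __name__ == "__main__":
    for model, (cache_ratio, speed_ratio) in summarize(RESULTS).items():
        direction = "faster" if speed_ratio > 1 else "slower"
        print(f"{model}: cache {cache_ratio}x smaller, speed x{speed_ratio} ({direction})")
```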

Compatibility

Feature	Supported	Notes
Attention sinks	Yes	GPT-OSS sink vectors flow through standard SDPA
Sliding window attention	Yes	`RotatingKVCache` layers are left untouched
Linear attention	Yes	`ArraysCache` (Qwen3.5 GatedDeltaNet) is left untouched
Hybrid architectures	Yes	Per-layer cache type is preserved
Prompt-first conversion	Yes	Process prompt with FP16, convert before generation
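
The "left untouched" behaviour in the table amounts to a per-layer type check. The sketch below illustrates that idea with stand-in classes so it runs without MLX installed. The class bodies and the `convert_per_layer` helper are hypothetical and do not reproduce the package's `convert_cache_to_turboquant`.

```python
class KVCache: ...            # stand-in for the standard per-layer KV cache
class RotatingKVCache: ...    # stand-in for sliding-window layers
class ArraysCache: ...        # stand-in for linear-attention state


class CompressedKVCache:
    """Stand-in for a TurboQuant-style compressed cache."""

    def __init__(self, source, bits):
        self.source, self.bits = source, bits


def convert_per_layer(caches, bits=3):
    """Swap only standard KV caches; keep every other per-layer cache object as-is."""
    return [CompressedKVCache(c, bits) if type(c) is KVCache else c for c in caches]


if __name__ == "__main__":
    layers = [KVCache(), RotatingKVCache(), ArraysCache(), KVCache()]
    for before, after in zip(layers, convert_per_layer(layers)):
        print(f"{type(before).__name__:>16} -> {type(after).__name__}")
```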

Running GPT-OSS MoE Models on Apple Silicon

GPT-OSS-20B (21B total, 32 experts, 3.6B active)

Hardware: Apple M4 Max 64GB (or any Apple Silicon with 16GB+ unified memory at 3-bit)

Step 1: Convert to TurboQuant 3-bit (recommended)

python -m turboquant_mlx.convert \
    --hf-path openai/gpt-oss-20b \
    --mlx-path ./gpt-oss-20b-tq3 \
    --bits 3 --group-size 32

Model size: 9.3 GB (vs 12.8 GB MXFP4 original — 28% smaller, lower perplexity)

The converter automatically:

Detects MoE architecture (SwitchLinear / QuantizedSwitchLinear layers)
Dequantizes MXFP4 expert weights to float
Applies Hadamard rotation + Lloyd-Max codebook quantization
Keeps router weights and attention at full precision
Handles blockwise Hadamard for 2880-dim experts (2880 = 9 x 320)

Step 2: Generate text

python -m turboquant_mlx.generate \
    --model ./gpt-oss-20b-tq3 \
    --prompt "Explain quantum entanglement to a 10-year-old." \
    --max-tokens 256

Expected: ~73 tok/s generation, ~85 tok/s prefill on M4 Max

Step 3: Run a quick quality check

python -m turboquant_mlx.evaluate \
    --hf-path openai/gpt-oss-20b \
    --bits 3 \
    --no-affine --no-qjl \
    --num-samples 64 --seq-len 512

All bit-widths for GPT-OSS-20B

Method	Bits	Size	Peak RAM	Gen Speed	Quality
Affine (mlx-lm)	4	11.2 GB	~14 GB	148 tok/s	Coherent (but see note below)
TurboQuant	4	11.2 GB	~14 GB	—	Best (PPL 72.63, beats MXFP4)
TurboQuant	3	9.3 GB	~12 GB	73 tok/s	Recommended (PPL 78.60, beats MXFP4, coherent)
TurboQuant	2	7.5 GB	~10 GB	—	Poor (incoherent generation on pre-quantized models)

Speed vs quality tradeoff: Affine 4-bit is ~2x faster on the 20B model (148 vs 73 tok/s) due to simpler dequantization, but TurboQuant 3-bit is smaller than both affine 4-bit (9.3 GB vs 11.2 GB) and OpenAI's own MXFP4 (9.3 GB vs 12.8 GB, 28% smaller), with lower perplexity than MXFP4 (78.60 vs 83.04). Crucially, affine 4-bit cannot scale to 120B on 64GB hardware; of the formats compared here, only TurboQuant fits.

# 4-bit (best quality, beats OpenAI's MXFP4)
python -m turboquant_mlx.convert \
    --hf-path openai/gpt-oss-20b \
    --mlx-path ./gpt-oss-20b-tq4 \
    --bits 4 --group-size 32

GPT-OSS-120B (117B total, 128 experts, ~5.1B active)

Hardware: Apple M4 Max 64GB. Neither the original MXFP4 (63.5 GB) nor the mlx-community 4-bit affine (65.8 GB) fits on a 64GB machine. Of the formats compared here, TurboQuant 3-bit is the only one that runs this model on a 64GB Mac with coherent output.

Step 1: Convert to TurboQuant 3-bit (recommended)

python -m turboquant_mlx.convert \
    --hf-path openai/gpt-oss-120b \
    --mlx-path ./gpt-oss-120b-tq3 \
    --bits 3 --group-size 64

Model size: 48 GB

Note: Conversion requires temporarily loading the full model. With 120B parameters, peak memory during conversion may reach ~50-55 GB. On a 64 GB machine this is tight — close all other applications before running. The converter processes layers sequentially and frees memory after each expert is quantized.

Step 2: Generate text

python -m turboquant_mlx.generate \
    --model ./gpt-oss-120b-tq3 \
    --prompt "Explain quantum computing in simple terms." \
    --max-tokens 200

Expected: ~44 tok/s generation, ~9.5 tok/s prefill, 52 GB peak memory on M4 Max 64GB

Step 3: Quick quality check

python -m turboquant_mlx.evaluate \
    --hf-path openai/gpt-oss-120b \
    --bits 3 \
    --no-affine --no-qjl \
    --num-samples 32 --seq-len 512

All bit-widths for GPT-OSS-120B

Method	Bits	Size	Peak RAM	Gen Speed	Fits 64 GB?	Quality
mlx-community 4-bit	4 (affine)	65.8 GB	—	—	No	—
MXFP4 (original)	4 (mxfp)	63.5 GB	~70 GB	—	No	—
TurboQuant	3	48 GB	52.3 GB	44 tok/s	Yes	Coherent, well-structured
TurboQuant	2	32 GB	34.9 GB	51 tok/s	Yes	Incoherent after ~20 tokens

Neither the original MXFP4 format (63.5 GB) nor the mlx-community affine 4-bit re-quantization (65.8 GB) fits on a 64GB Mac. TurboQuant 3-bit (48 GB) does, and at 44 tok/s it runs at interactive speed. At 2-bit the model fits easily, but generation quality degrades rapidly: 3-bit is the minimum for coherent output on pre-quantized MoE models.

Qwen3.5-122B-A10B (122B total, 256 experts, 8 active, ~10B active)

Hardware: Apple M4 Max 64GB — the original BF16 model is ~240 GB. TurboQuant 3-bit compresses it to ~50 GB, fitting on a 64GB machine.

This is a brand-new architecture featuring 256 MoE experts (the most of any model we've tested), hybrid attention (GatedDeltaNet linear attention + standard softmax attention), and thinking/reasoning capability. The model also has a shared expert per layer alongside the routed experts.

Step 1: Convert to TurboQuant 3-bit

python -m turboquant_mlx.convert \
    --hf-path Qwen/Qwen3.5-122B-A10B \
    --mlx-path ./qwen3.5-122b-tq3 \
    --bits 3 --group-size 64

Model size: ~50 GB | Conversion time: ~90 seconds

Note: Conversion requires ~55 GB peak memory. Close all other applications before running. The converter uses memory-efficient processing — each expert layer is replaced immediately after quantization with aggressive garbage collection to handle the 256 experts per layer.

Step 2: Generate text

python -m turboquant_mlx.generate \
    --model ./qwen3.5-122b-tq3 \
    --prompt "Why is the sky blue? Explain in simple terms." \
    --max-tokens 200

Expected: ~26.5 tok/s generation, 55 GB peak memory on M4 Max 64GB

Benchmark

Method	Bits	Size	Peak RAM	Gen Speed	Fits 64 GB?	Quality
BF16 (original)	16	~240 GB	—	—	No	—
TurboQuant	3	~50 GB	54.9 GB	26.5 tok/s	Yes	Coherent reasoning with structured thinking

Qwen3.5-122B-A10B is the largest and most complex model TurboQuant has been tested on: 122B parameters, 256 experts (8 active per token), hybrid GatedDeltaNet + softmax attention, and a shared expert per MoE layer. At 3-bit, the model produces structured reasoning with proper analysis steps — demonstrating that TurboQuant preserves thinking capability at extreme compression.

How It Works

TurboQuant is a two-stage, calibration-free quantization pipeline:

1. Hadamard Rotation: Multiply weights by a randomized Hadamard matrix, which pushes any weight distribution toward a near-Gaussian shape. This is data-oblivious (no calibration data needed).
2. Lloyd-Max Codebook: Quantize each rotated value with a Lloyd-Max codebook, the minimum-mean-squared-error scalar quantizer for a Gaussian distribution. The codebook is a mathematical constant, precomputed once.

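The result: near-zero quality loss at 3-bit, and 2-bit quantization that can remain usable where standard affine breaks down completely. The pre-quantized GPT-OSS models above are an exception: their 2-bit output was incoherent.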
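
To make the second stage concrete, the sketch below computes a Lloyd-Max codebook for a unit Gaussian with the classic Lloyd iteration: boundaries sit halfway between neighbouring levels, and each level moves to the conditional mean of its cell. It uses closed-form Gaussian integrals from the standard library. The function names and the iteration count are choices made for this example, not the package's `core/codebook.py`.

```python
import math


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(bits, iterations=500):
    """Reconstruction levels minimizing MSE for a unit Gaussian, via Lloyd iteration."""
    n = 2 ** bits
    # Start from evenly spaced levels on [-2, 2].
    levels = [-2.0 + 4.0 * (i + 0.5) / n for i in range(n)]
    for _ in range(iterations):
        # Decision boundaries sit halfway between neighbouring levels.
        edges = [-math.inf] + [(a + b) / 2 for a, b in zip(levels, levels[1:])] + [math.inf]
        # Each level moves to the centroid (conditional mean) of its cell.
        levels = [
            (_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
            for lo, hi in zip(edges, edges[1:])
        ]
    return levels


if __name__ == "__main__":
    for bits in (2, 3, 4):
        print(bits, [round(v, 4) for v in lloyd_max_gaussian(bits)])
```

Because the rotated values are approximately Gaussian regardless of the model, the same table can be reused for every layer, which is why no calibration data is needed.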

For MoE models, all experts within a layer share the same rotation signs and codebook, keeping storage efficient.

CLI Options

python -m turboquant_mlx.convert --help

Options:
  --hf-path TEXT       HuggingFace model path or local path (required)
  --mlx-path TEXT      Output directory (default: mlx_model)
  --bits {2,3,4}       Quantization bit-width (default: 3)
  --group-size {32,64,128}  Elements per quantization group (default: 64)
  --rotation TEXT      Rotation method: hadamard, blockwise_hadamard, none
  --use-qjl           Enable 1-bit QJL residual correction (+1 bit overhead)
  --dtype TEXT         Model dtype before quantization: float16, bfloat16
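
As a reading aid, the option list above maps onto an `argparse` declaration like the following. This is a reconstruction from the help text only, not the package's actual parser. The help output does not state defaults for `--rotation` and `--dtype`, so the sketch leaves them unset.

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(prog="python -m turboquant_mlx.convert")
    p.add_argument("--hf-path", required=True, help="HuggingFace model path or local path")
    p.add_argument("--mlx-path", default="mlx_model", help="Output directory")
    p.add_argument("--bits", type=int, choices=[2, 3, 4], default=3)
    p.add_argument("--group-size", type=int, choices=[32, 64, 128], default=64)
    p.add_argument("--rotation", choices=["hadamard", "blockwise_hadamard", "none"])
    p.add_argument("--use-qjl", action="store_true", help="1-bit QJL residual correction")
    p.add_argument("--dtype", choices=["float16", "bfloat16"])
    return p


if __name__ == "__main__":
    args = build_parser().parse_args(["--hf-path", "openai/gpt-oss-20b", "--group-size", "32"])
    print(vars(args))
```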

Supported Architectures

Architecture	Model Type	MoE	Status
LLaMA / Llama 3	`llama`	No	Tested
Qwen2 / Qwen2.5	`qwen2`	No	Tested
Qwen3.5	`qwen3_5`	No	Tested
Mistral	`mistral`	No	Tested
Qwen1.5-MoE	`qwen2_moe`	Yes	Tested
GPT-OSS	`gpt_oss`	Yes	Tested
Qwen3.5-MoE	`qwen3_5_moe`	Yes (256 experts)	Tested (122B)

Project Structure

turboquant_mlx/
    config.py                 # TurboQuantConfig
    convert.py                # CLI: HF model -> TurboQuant MLX
    generate.py               # Text generation with TurboQuant models
    evaluate.py               # Perplexity evaluation
    quantize_model.py         # Model traversal & layer replacement
    demo_kv.py                # Streaming generation demo with KV cache compression
    test_kv_cache.py          # KV cache roundtrip + integration tests
    core/
        codebook.py           # Lloyd-Max codebooks for Gaussian
        rotation.py           # Randomized Hadamard rotation
        polar_quantize.py     # Rotate + codebook quantize
        packing.py            # Bit-packing into uint32
        qjl.py                # QJL residual correction
    layers/
        polar_linear.py       # PolarQuantizedLinear (dense)
        polar_switch_linear.py # PolarQuantizedSwitchLinear (MoE)
        polar_kv_cache.py     # TurboQuantKVCache (runtime KV compression)
    kernels/
        polar_qmv.py          # Fused Metal kernel (dense decode)
        polar_gather_qmv.py   # Fused Metal kernel (MoE shared input)
        polar_multi_gather_qmv.py  # Fused Metal kernel (MoE per-expert input)
    csrc/
        polar_kernels.metal   # Native Metal shaders (SIMD group reduction)
        polar_ops.h/cpp       # C++ MLX Primitive classes
        bindings.cpp          # nanobind Python bindings
        CMakeLists.txt        # Build system
    integration/
        rotation_configs.py   # Per-architecture rotation configs
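
To illustrate what a bit-packing step such as `core/packing.py` has to do, the sketch below packs small codebook indices into 32-bit words and unpacks them again. The layout (one contiguous little-endian bit stream, with codes allowed to straddle word boundaries) is an assumption made for this example. The package's actual layout may differ.

```python
def pack_codes(codes, bits):
    """Pack small integers into 32-bit words as one contiguous little-endian bit stream."""
    stream = 0
    for i, code in enumerate(codes):
        if not 0 <= code < (1 << bits):
            raise ValueError(f"code {code} does not fit in {bits} bits")
        stream |= code << (i * bits)
    n_words = (len(codes) * bits + 31) // 32
    return [(stream >> (32 * w)) & 0xFFFFFFFF for w in range(n_words)]


def unpack_codes(words, bits, count):
    """Inverse of pack_codes: recover `count` codes of width `bits`."""
    stream = 0
    for w, word in enumerate(words):
        stream |= word << (32 * w)
    mask = (1 << bits) - 1
    return [(stream >> (i * bits)) & mask for i in range(count)]


if __name__ == "__main__":
    codes = [i % 8 for i in range(64)]  # one 64-element group of 3-bit indices
    words = pack_codes(codes, bits=3)
    # 64 codes * 3 bits = 192 bits = 6 words of 32 bits.
    print(len(words), unpack_codes(words, 3, len(codes)) == codes)
```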

Citation

@misc{turboquant_mlx,
    title={TurboQuant-MLX: Extreme Weight and KV Cache Compression for Apple Silicon},
    year={2025},
    note={MLX implementation of TurboQuant (Zandieh et al., 2025) for both weight quantization and runtime KV cache compression}
}

License

MIT

Acknowledgments

TurboQuant — Zandieh, Han, Daliri, Karbasi (2025)
MLX — Apple Machine Learning Research
mlx-lm — MLX language model utilities

Release history

0.6.1: May 30, 2026
0.6.0: May 28, 2026
0.5.0: May 26, 2026
0.4.1: May 25, 2026
0.4.0: May 25, 2026 (yanked). Reason: packaging error; the turboquant_mlx.stream (expert streaming) module was omitted from 0.4.0. Fixed in 0.4.1. Please upgrade: pip install "turboquant-mlx-full>=0.4.1"
0.3.0: May 3, 2026
0.2.0: May 2, 2026
0.1.6: Apr 27, 2026
0.1.5: Apr 26, 2026
0.1.4: Apr 26, 2026
0.1.3: Apr 23, 2026
0.1.2: Apr 22, 2026
0.1.1: Apr 9, 2026
0.1.0: Apr 7, 2026 (this version)

Download files

Source Distribution

turboquant_mlx_full-0.1.0.tar.gz (60.1 kB)

Upload date: Apr 7, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Algorithm	Hash digest
SHA256	`737fa4fdb28c874d8be0aef50d6dd80aa55d3f2f279b2e2f05315ff911695a45`
MD5	`44804b6684dd423b59009c2696e27799`
BLAKE2b-256	`f2c7c23eb2bcf62891a923d972c93141fd2aa1bb645af6695aef8ae78153aed7`
