mlx-optiq
Mixed-precision quantization optimizer for MLX models on Apple Silicon.
OptiQ produces more accurate quantized models that mlx-lm loads natively. It also provides a TurboQuant KV cache: rotation-based vector quantization that preserves attention inner products better than standard affine quantization.
Install
pip install mlx-optiq
What It Does
1. Mixed-Precision Weight Quantization
Instead of uniform quantization (all layers at 4-bit), OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths.
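As a concrete illustration, the sensitivity score can be thought of as the KL divergence between the full-precision model's next-token distribution and the distribution after quantizing a single layer. A minimal numpy sketch, with toy logits standing in for real calibration outputs (the function name is illustrative, not the package API):

```python
import numpy as np

def mean_kl(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """Mean KL(P || Q) between token distributions of the FP16 model
    and the model with one layer quantized."""
    p = np.exp(p_logits - p_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    q = np.exp(q_logits - q_logits.max(-1, keepdims=True))
    q /= q.sum(-1, keepdims=True)
    return float((p * (np.log(p) - np.log(q))).sum(-1).mean())

# Toy calibration logits: a layer whose quantization barely moves the
# output distribution gets a low score (low sensitivity = safe at 4-bit).
rng = np.random.default_rng(0)
fp16_logits = rng.normal(size=(8, 128))                        # (tokens, vocab)
quant_logits = fp16_logits + rng.normal(scale=0.05, size=(8, 128))
print(mean_kl(fp16_logits, quant_logits))                      # small value
```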
Qwen3.5-0.8B (GSM8K, 200 samples):
| Model | GSM8K | Size |
|---|---|---|
| OptiQ mixed (4.5 BPW) | 27.0% | 570 MB |
| Uniform 4-bit | 11.5% | 404 MB |
2.3x better accuracy for a modest (~40%) size increase. The resulting models work with standard mlx-lm; no special code is needed:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
2. TurboQuant KV Cache
Implements rotation-based vector quantization from TurboQuant (ICLR 2026) for KV cache compression. Random orthogonal rotation before scalar quantization preserves the inner products that attention's Q·K^T computation needs.
Rotated-space attention eliminates per-key rotation overhead — only the query and output are rotated once each (O(d²)), while all stored keys/values are accessed via cheap centroid lookups (O(d)). Result: near-zero speed overhead.
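The key property is that an orthogonal rotation leaves inner products unchanged, so attention scores computed against rotated (then quantized) keys match the unrotated ones. A quick numpy check of this invariance (illustrative, not the package code):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 256                                        # head_dim
# Random orthogonal rotation: QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

q, k = rng.normal(size=d), rng.normal(size=d)  # query and key vectors
# R is an isometry, so Q.K^T is preserved in rotated space
print(np.allclose(q @ k, (R @ q) @ (R @ k)))   # True
```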
Qwen3.5-0.8B (6 self-attention layers):
| Method | PPL | Needle Retrieval | Speed vs FP16 |
|---|---|---|---|
| FP16 KV (reference) | 22.50 | 73% | baseline |
| Affine 4-bit KV | 22.98 | 80% | -0% |
| TurboQuant 4-bit KV | 22.87 | 100% | -2% |
| TurboQuant 3-bit KV | 23.66 | 100% | +4% |
- TurboQuant 4-bit beats affine on PPL (+0.37 vs +0.48) and needle retrieval (100% vs 80%)
- TurboQuant enables a 3-bit KV cache where affine quantization can't (3-bit packing is incompatible with head_dim=256)
- GSM8K reasoning preserved: TQ 4-bit gets 32% vs FP16's 30% (50-sample test)
from mlx_lm import load, generate
from optiq.core.turbo_kv_cache import TurboQuantKVCache, patch_attention

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
patch_attention()  # Install rotated-space attention (once)

# Replace self-attention KV caches with TurboQuant;
# GatedDeltaNet layers keep their default (recurrent-state) cache
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim, bits=4, seed=42 + i
        )

# Use as normal
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, prompt_cache=cache)
Pre-built Models
Available on HuggingFace and usable with standard mlx-lm, e.g. mlx-community/Qwen3.5-0.8B-OptiQ-4bit.
Convert Your Own Models
pip install mlx-optiq[convert]
# Mixed-precision quantization
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
# Evaluate
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit
How It Works
Weight quantization pipeline (a bit-assignment sketch follows the list):
- Load PyTorch model from HuggingFace
- Per-layer KL divergence sensitivity analysis on calibration data
- Greedy knapsack optimization to assign bit-widths within BPW budget
- MLX conversion via mlx-lm with custom per-layer quantization config
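A sketch of the greedy bit assignment under a bits-per-weight budget. Everything here is illustrative (the layer names, sensitivity scores, and sensitivity-per-parameter heuristic are assumptions, not the package internals):

```python
def assign_bits(sensitivity, n_params, target_bpw, candidate_bits=(4, 8)):
    """Greedy knapsack: start all layers at the lowest bit-width, then
    upgrade layers with the best sensitivity-per-parameter ratio until
    the average bits-per-weight budget is exhausted."""
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = {name: lo for name in sensitivity}
    total = sum(n_params.values())
    spent, budget = lo * total, target_bpw * total
    for name in sorted(sensitivity, key=lambda n: sensitivity[n] / n_params[n],
                       reverse=True):
        cost = (hi - lo) * n_params[name]          # extra bits to go 4 -> 8
        if spent + cost <= budget:
            bits[name] = hi
            spent += cost
    return bits

# Toy example: two small attention layers, two large MLP layers
n_params = {"attn.0": 1e6, "mlp.0": 4e6, "attn.1": 1e6, "mlp.1": 4e6}
scores = {"attn.0": 0.9, "mlp.0": 0.1, "attn.1": 0.7, "mlp.1": 0.2}
print(assign_bits(scores, n_params, target_bpw=4.5))
# {'attn.0': 8, 'mlp.0': 4, 'attn.1': 4, 'mlp.1': 4}
```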
TurboQuant KV cache (a toy round-trip sketch follows the list):
- Random orthogonal rotation makes vector coordinates near-independent
- Optimal Lloyd-Max scalar quantization per coordinate
- Rotated-space attention: pre-rotate Q, compute SDPA in centroid space, post-rotate output
- Incremental quantization: only new tokens are processed each step
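A toy end-to-end round trip of the scheme. Uniform scalar quantization stands in for the Lloyd-Max centroids, and all names here are illustrative rather than the package code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 256, 4
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal rotation

def scalar_quantize(v, levels):
    # Uniform stand-in for Lloyd-Max: Lloyd-Max would place the levels
    # optimally for the near-Gaussian coordinates the rotation produces.
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return np.round((v - lo) / step) * step + lo

k = rng.normal(size=d)                         # incoming key vector
k_hat = scalar_quantize(R @ k, 2 ** bits)      # stored: 4-bit codes, rotated space

q = rng.normal(size=d)
print(q @ k)                                   # exact attention score
print((R @ q) @ k_hat)                         # rotated query vs. quantized key
# The two scores agree closely: rotation preserves Q.K^T, and the
# per-coordinate quantization error stays small.
```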
Architecture Note (Hybrid Models)
Qwen3.5 uses 18 GatedDeltaNet layers (recurrent state) + 6 standard self-attention layers (KV cache). TurboQuant is applied to the KV cache layers only. The recurrent state uses a read-modify-write pattern where quantization errors accumulate — keeping it at FP16 is recommended for generation tasks.
Article
Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon
Requirements
- Python >= 3.11
- Apple Silicon Mac (for MLX)
- mlx-lm >= 0.20