mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon.

OptiQ produces more accurate quantized models that mlx-lm loads natively. It also provides a TurboQuant KV cache — rotation-based vector quantization that preserves attention inner products better than standard affine quantization.

Install

pip install mlx-optiq

What It Does

1. Mixed-Precision Weight Quantization

Instead of uniform quantization (all layers at 4-bit), OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths.
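
As a sketch of the measurement, sensitivity can be computed as the KL divergence between the full-precision model's output distribution and the output with a single layer quantized. The helper below is illustrative (kl_sensitivity is not part of the package's API) and assumes you already have both sets of logits on calibration data:

import mlx.core as mx

def kl_sensitivity(logits_fp16, logits_quant):
    """Mean KL(P_fp16 || P_quant) over calibration tokens.

    A layer whose quantization shifts the output distribution more
    (higher KL) is more sensitive and should keep more bits.
    """
    log_p = logits_fp16 - mx.logsumexp(logits_fp16, axis=-1, keepdims=True)
    log_q = logits_quant - mx.logsumexp(logits_quant, axis=-1, keepdims=True)
    return mx.mean(mx.sum(mx.exp(log_p) * (log_p - log_q), axis=-1)).item()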

Qwen3.5-0.8B (GSM8K, 200 samples):

Model                   GSM8K   Size
OptiQ mixed (4.5 BPW)   27.0%   570 MB
Uniform 4-bit           11.5%   404 MB

2.3x the accuracy for a modest size increase. Models work with standard mlx-lm — no special code needed:

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)

2. TurboQuant KV Cache

Implements rotation-based vector quantization from TurboQuant (ICLR 2026) for KV cache compression. Random orthogonal rotation before scalar quantization preserves the inner products that attention's Q·K^T computation needs.

Rotated-space attention eliminates per-key rotation overhead — only the query and output are rotated once each (O(d²)), while all stored keys/values are accessed via cheap centroid lookups (O(d)). Result: near-zero speed overhead.
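
The key property: for an orthogonal rotation R, (Rq)·(Rk) = qᵀRᵀRk = q·k, so the rotation itself changes nothing exactly — what it changes is the quantization error, because outlier coordinates get spread across all dimensions and scalar quantization wastes less of its range. A toy check of both effects (the 4-bit round trip and dimensions here are illustrative, not the package's internals):

import mlx.core as mx

def affine_4bit(v):
    # Toy per-vector affine quantize/dequantize round trip (16 levels)
    lo, hi = v.min().item(), v.max().item()
    scale = (hi - lo) / 15 or 1.0
    return mx.round((v - lo) / scale) * scale + lo

d = 256
mx.random.seed(0)
q = mx.random.normal((d,))
# A key with one outlier coordinate, as seen in real K activations
k = mx.concatenate([mx.array([8.0]), 0.1 * mx.random.normal((d - 1,))])

# Random orthogonal rotation via QR (MLX linalg runs on the CPU stream)
R = mx.linalg.qr(mx.random.normal((d, d)), stream=mx.cpu)[0]

exact   = mx.sum(q * k).item()                          # true q.k
plain   = mx.sum(q * affine_4bit(k)).item()             # quantize raw k
rotated = mx.sum((R @ q) * affine_4bit(R @ k)).item()   # quantize rotated k
print(exact, plain, rotated)  # the rotated error is typically much smaller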

Qwen3.5-0.8B (6 self-attention layers):

Method                PPL     Needle Retrieval   Speed vs FP16
FP16 KV (reference)   22.50   73%                baseline
Affine 4-bit KV       22.98   80%                -0%
TurboQuant 4-bit KV   22.87   100%               -2%
TurboQuant 3-bit KV   23.66   100%               +4%
  • TurboQuant 4-bit beats affine on PPL (+0.37 vs +0.48 over FP16) and needle retrieval (100% vs 80%)
  • TurboQuant enables a 3-bit KV cache where affine can't (3-bit affine packing is incompatible with head_dim=256)
  • GSM8K reasoning is preserved: TurboQuant 4-bit scores 32% vs FP16's 30% (50-sample test)

Usage:

from mlx_lm import load, generate
from optiq.core.turbo_kv_cache import TurboQuantKVCache, patch_attention

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
patch_attention()  # Install rotated-space attention (once)

# Replace self-attention KV caches with TurboQuant
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim, bits=4, seed=42+i
        )

# Use as normal
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, prompt_cache=cache)

3. YOLO26 Object Detection

Mixed-precision quantized YOLO26 models for real-time object detection on Apple Silicon. Roughly 4x compression with near-zero detection loss for most sizes (see the table below).

pip install mlx-optiq[yolo]

from optiq.models.yolo import load_quantized_yolo

model = load_quantized_yolo("mlx-community/YOLO26n-OptiQ-6bit")
results = model.predict("image.jpg")

COCO128 Benchmark:

Model     Original   OptiQ     Compression   Detection Delta
YOLO26n   9.9 MB     2.5 MB    3.9x          -1.6%
YOLO26s   38.4 MB    8.9 MB    4.3x          -7.0%
YOLO26m   83.8 MB    18.9 MB   4.4x          +0.1%
YOLO26l   100.7 MB   22.9 MB   4.4x          0.0%
YOLO26x   225.5 MB   50.6 MB   4.5x          -1.1%

Pre-built Models

Available on HuggingFace:

LLMs (work with standard mlx-lm):

YOLO26 models (require mlx-optiq[yolo]):

Convert Your Own Models

pip install mlx-optiq[convert]

# Mixed-precision quantization
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8

# Evaluate
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit

How It Works

Weight quantization pipeline:

  1. Load PyTorch model from HuggingFace
  2. Per-layer KL divergence sensitivity analysis on calibration data
  3. Greedy knapsack optimization to assign bit-widths within the BPW budget (sketched after this list)
  4. MLX conversion via mlx-lm with custom per-layer quantization config
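
The knapsack step trades sensitivity against size: every layer starts at the lowest candidate bit-width, and the extra-bit budget implied by the BPW target is spent on the most sensitive layers first. A minimal sketch of that greedy formulation (assign_bits and its arguments are illustrative, not the package's API):

def assign_bits(sensitivity, sizes, candidate_bits=(4, 8), target_bpw=4.5):
    """Greedy knapsack over layers.

    sensitivity: per-layer KL divergence measured at the low bit-width
    sizes:       per-layer parameter counts
    Returns per-layer bit-widths whose weighted average stays <= target_bpw.
    """
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = {name: lo for name in sensitivity}
    budget = (target_bpw - lo) * sum(sizes.values())  # extra bits to spend
    # Upgrade layers with the highest sensitivity per extra bit first
    for name in sorted(sensitivity, key=lambda n: sensitivity[n] / sizes[n],
                       reverse=True):
        cost = (hi - lo) * sizes[name]
        if cost <= budget:
            bits[name] = hi
            budget -= cost
    return bits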

TurboQuant KV cache:

  1. Random orthogonal rotation makes vector coordinates near-independent
  2. Optimal Lloyd-Max scalar quantization per coordinate (sketched after this list)
  3. Rotated-space attention: pre-rotate Q, compute SDPA in centroid space, post-rotate output
  4. Incremental quantization: only new tokens are processed each step
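
Lloyd-Max on scalars is just 1-D k-means: alternate nearest-centroid assignment with centroid re-estimation. A toy version for one coordinate (illustrative only; since rotated coordinates are near-Gaussian, in practice a codebook fit once to a standard Gaussian can be reused, which is what makes decode a cheap centroid lookup):

import mlx.core as mx

def lloyd_max(x, bits=4, iters=20):
    """1-D Lloyd-Max quantizer (k-means on scalars)."""
    k = 2 ** bits
    centroids = mx.linspace(x.min().item(), x.max().item(), k)
    for _ in range(iters):
        # Assign each sample to its nearest centroid...
        idx = mx.argmin(mx.abs(x[:, None] - centroids[None, :]), axis=1)
        # ...then move each centroid to the mean of its assigned samples
        one_hot = (idx[:, None] == mx.arange(k)[None, :]).astype(mx.float32)
        counts = mx.maximum(one_hot.sum(axis=0), 1)
        centroids = (one_hot * x[:, None]).sum(axis=0) / counts
    return centroids, idx  # codebook and per-sample codes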

Architecture Note (Hybrid Models)

Qwen3.5 uses 18 GatedDeltaNet layers (recurrent state) + 6 standard self-attention layers (KV cache). TurboQuant is applied to the KV cache layers only. The recurrent state uses a read-modify-write pattern where quantization errors accumulate — keeping it at FP16 is recommended for generation tasks.
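
The distinction matters because a KV entry is quantized once and then only read, while recurrent state is re-quantized on every step, so errors compound roughly like a random walk. A toy illustration of that accumulation (the 4-bit round trip below is illustrative):

import mlx.core as mx

def q4(v):
    # Toy 4-bit affine quantize/dequantize round trip
    lo, hi = v.min().item(), v.max().item()
    scale = (hi - lo) / 15 or 1.0
    return mx.round((v - lo) / scale) * scale + lo

mx.random.seed(0)
exact = state = mx.random.normal((64,))
for _ in range(100):
    delta = 0.01 * mx.random.normal((64,))
    exact = exact + delta
    state = q4(state + delta)  # read-modify-write: re-quantized each step
# The accumulated error grows roughly like sqrt(steps); a KV entry
# quantized once would carry only a single step's worth of error.
print(mx.sqrt(mx.sum((state - exact) ** 2)).item())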

Article

Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Requirements

  • Python >= 3.11
  • Apple Silicon Mac (for MLX)
  • mlx-lm >= 0.20
