mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon.

OptiQ produces better quantized models that mlx-lm loads natively. It also provides TurboQuant KV cache — rotation-based vector quantization that preserves attention inner products better than standard affine quantization.

Install

pip install mlx-optiq

What It Does

1. Mixed-Precision Weight Quantization

Instead of uniform quantization (all layers at 4-bit), OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths.
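
The sensitivity idea can be sketched in a few lines (a toy illustration with random "logits" and made-up helper names, not OptiQ's implementation): compare each layer-quantized model's output distribution against the full-precision one, and rank layers by the KL gap.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between the softmax distributions of two logit vectors."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def layer_sensitivity(fp_logits, quant_logits):
    """Mean KL over calibration samples; higher = layer is more sensitive."""
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(fp_logits, quant_logits)]))

# Toy calibration batch: 8 samples, 32-way logits.
rng = np.random.default_rng(0)
fp = rng.standard_normal((8, 32))
mild = fp + 0.01 * rng.standard_normal((8, 32))    # robust layer quantized
severe = fp + 1.0 * rng.standard_normal((8, 32))   # sensitive layer quantized
assert layer_sensitivity(fp, mild) < layer_sensitivity(fp, severe)
```

Layers with a large KL gap keep more bits; layers the model barely notices can drop to 4-bit.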

Qwen3.5-0.8B (GSM8K, 200 samples):

Model                    GSM8K   Size
OptiQ mixed (4.5 BPW)    27.0%   570 MB
Uniform 4-bit            11.5%   404 MB

2.3x better accuracy at a modest size increase. Models work with standard mlx-lm — no special code needed:

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)

2. TurboQuant KV Cache

Implements rotation-based vector quantization from TurboQuant (ICLR 2026) for KV cache compression. Random orthogonal rotation before scalar quantization preserves the inner products that attention's Q·K^T computation needs.

Rotated-space attention eliminates per-key rotation overhead — only the query and output are rotated once each (O(d²)), while all stored keys/values are accessed via cheap centroid lookups (O(d)). Result: near-zero speed overhead.
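
Why rotation is safe can be checked directly: an orthogonal rotation leaves inner products, and hence Q·K^T attention scores, unchanged. A minimal numpy check (illustrative, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Random orthogonal matrix via QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# (Rq) · (Rk) = q · (R^T R) k = q · k, so attention scores computed
# entirely in rotated space match the unrotated ones exactly.
assert np.allclose((R @ q) @ (R @ k), q @ k)
```

This is what lets the cache store keys/values in rotated form permanently and rotate only the query once per step.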

Qwen3.5-0.8B (6 self-attention layers):

Method                 PPL     Needle Retrieval   Speed vs FP16
FP16 KV (reference)    22.50   73%                baseline
Affine 4-bit KV        22.98   80%                -0%
TurboQuant 4-bit KV    22.87   100%               -2%
TurboQuant 3-bit KV    23.66   100%               +4%
  • TurboQuant 4-bit beats affine on PPL (+0.37 vs +0.48) and needle retrieval (100% vs 80%)
  • TurboQuant enables 3-bit KV cache where affine can't (head_dim=256 packing incompatibility)
  • GSM8K reasoning preserved: TQ 4-bit gets 32% vs FP16's 30% (50-sample test)

Example usage:
from mlx_lm import load, generate
from optiq.core.turbo_kv_cache import TurboQuantKVCache, patch_attention

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
patch_attention()  # Install rotated-space attention (once)

# Replace self-attention KV caches with TurboQuant
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim, bits=4, seed=42+i
        )

# Use as normal
response = generate(model, tokenizer, prompt="Hello", max_tokens=100, prompt_cache=cache)

Pre-built Models

Pre-built models are available on HuggingFace and work with standard mlx-lm.

Convert Your Own Models

pip install mlx-optiq[convert]

# Mixed-precision quantization
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8

# Evaluate
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit

How It Works

Weight quantization pipeline:

  1. Load PyTorch model from HuggingFace
  2. Per-layer KL divergence sensitivity analysis on calibration data
  3. Greedy knapsack optimization to assign bit-widths within BPW budget
  4. MLX conversion via mlx-lm with custom per-layer quantization config
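
Step 3 can be sketched as a greedy upgrade loop (a simplified model with an assumed diminishing-returns heuristic, not OptiQ's exact optimizer): start every layer at the lowest candidate bit-width, then repeatedly upgrade whichever layer buys the most sensitivity relief per extra bit, until the BPW budget is spent.

```python
def assign_bits(sensitivity, sizes, candidate_bits, target_bpw):
    """Greedy knapsack: upgrade the best-value layer until the budget is spent."""
    n = len(sensitivity)
    bits = [min(candidate_bits)] * n
    budget = target_bpw * sum(sizes)                 # total bit budget
    used = sum(b * s for b, s in zip(bits, sizes))
    while True:
        best, best_gain, best_cost, best_nb = None, 0.0, 0, 0
        for i in range(n):
            higher = [b for b in candidate_bits if b > bits[i]]
            if not higher:
                continue
            nb = min(higher)                         # next bit-width up
            cost = (nb - bits[i]) * sizes[i]         # extra bits spent
            gain = sensitivity[i] / cost             # assumed value model
            if used + cost <= budget and gain > best_gain:
                best, best_gain, best_cost, best_nb = i, gain, cost, nb
        if best is None:
            break
        bits[best], used = best_nb, used + best_cost
        sensitivity[best] *= 0.5   # diminishing returns after an upgrade (assumption)
    return bits

# Two equal-size layers, 4/8-bit candidates, 6 BPW budget:
# the more sensitive layer gets the 8-bit slot.
print(assign_bits([0.9, 0.1], [100, 100], [4, 8], 6))  # → [8, 4]
```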

TurboQuant KV cache:

  1. Random orthogonal rotation makes vector coordinates near-independent
  2. Optimal Lloyd-Max scalar quantization per coordinate
  3. Rotated-space attention: pre-rotate Q, compute SDPA in centroid space, post-rotate output
  4. Incremental quantization: only new tokens are processed each step
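
Steps 1–3 amount to: rotate, quantize each coordinate, and undo the rotation on decode. A minimal roundtrip sketch (using a uniform scalar quantizer as a stand-in for Lloyd-Max; not the library's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, bits = 64, 4
levels = 2 ** bits

# 1. Random orthogonal rotation (QR of a Gaussian matrix)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
k = rng.standard_normal(d)
k_rot = R @ k

# 2. Per-coordinate scalar quantization to 4-bit codes
lo, hi = k_rot.min(), k_rot.max()
step = (hi - lo) / (levels - 1)
codes = np.round((k_rot - lo) / step).astype(np.uint8)   # stored 4-bit codes
k_hat_rot = lo + codes * step                            # centroid lookup

# 3. Decode back to the original space; R is orthogonal, so R^T = R^-1
k_hat = R.T @ k_hat_rot

# Per-coordinate error is at most step/2, and the orthogonal de-rotation
# preserves the L2 norm of the error, so reconstruction stays tight.
assert np.linalg.norm(k - k_hat) < step * np.sqrt(d)
```

In rotated-space attention, step 3 never runs on keys at all: scores are computed against `k_hat_rot` directly, and only the query and output cross between spaces.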

Architecture Note (Hybrid Models)

Qwen3.5 uses 18 GatedDeltaNet layers (recurrent state) + 6 standard self-attention layers (KV cache). TurboQuant is applied to the KV cache layers only. The recurrent state uses a read-modify-write pattern where quantization errors accumulate — keeping it at FP16 is recommended for generation tasks.

Article

Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Requirements

  • Python >= 3.11
  • Apple Silicon Mac (for MLX)
  • mlx-lm >= 0.20
