
mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon.

Website: https://thinsignal.com/optiq/  |  PyPI: https://pypi.org/project/mlx-optiq/

OptIQ turns "uniform 4-bit" into a data-driven, per-layer budget. Sensitive layers stay at 8-bit; the rest get 4-bit. This applies to both the model weights and the KV cache at serving time, and the package ships an OpenAI-compatible server (optiq serve) that wraps mlx_lm.server with the mixed-precision KV path built in.

Install

pip install mlx-optiq

What you get

  • optiq convert: Per-layer sensitivity analysis + mixed-precision weight quantization. Output is a stock-mlx-lm-compatible MLX model. (Experiments →)
  • optiq kv-cache: Per-layer KV-cache sensitivity. Writes kv_config.json with per-layer bit-widths. (Experiments →)
  • optiq serve: OpenAI-compatible HTTP server with the mixed-precision KV path wired in. Drop-in mlx_lm.server replacement. (Results →)
  • optiq.core.turbo_kv_cache: TurboQuant rotated-space KV (library). Research path for attention-inner-product-preserving quantization. (Experiments →)

Pre-built OptIQ-quantized models on HuggingFace: Models →

Quickstart

Use a pre-built model (stock mlx-lm, no OptIQ code required):

from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)

Serve with mixed-precision KV (new in v0.0.5):

# One-time sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# OpenAI-compatible server on :8080
optiq serve \
  --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
  --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20
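
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch with requests, assuming optiq serve exposes the same /v1/chat/completions route as the mlx_lm.server it wraps:

# Minimal client sketch against the OpenAI-compatible endpoint on :8080.
# Assumes the standard /v1/chat/completions route from mlx_lm.server.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 100,
    },
)
print(resp.json()["choices"][0]["message"]["content"])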

Convert a new model:

pip install mlx-optiq[convert]
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit

Headline numbers

Weight quantization — GSM8K vs uniform 4-bit:

  • Qwen3.5-0.8B: 27.0% vs 11.5% (+15.5pp)
  • gemma-4-e4b-it: 55.5% vs 23.5% (+32.0pp)

KV-cache serving — decode tok/s at 64k context (Apple M3 Max 36GB):

  • Qwen3.5-2B: 41.8 vs fp16 27.9 (+50%)
  • Qwen3.5-4B: 13.1 vs fp16 8.1 (+62%)
  • Qwen3.5-9B: 27.1 vs fp16 20.7 (+31%)

Full tables, methodology, and per-layer configs on the Results page.

How it works

Weight quantization pipeline:

  1. Load PyTorch model from HuggingFace.
  2. Per-layer KL-divergence sensitivity on calibration data (WikiText-2 for LLMs).
  3. Greedy knapsack: start every layer at the minimum bit-width, then upgrade layers by KL-reduction-per-bit until the BPW budget is spent (see the sketch after this list). Sensitive layers like lm_head, embed_tokens, and the first/last blocks are protected at the max bit-width.
  4. MLX conversion via mlx-lm with per-layer quant_predicate.
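
A minimal sketch of the greedy knapsack in step 3, assuming per-layer KL sensitivities have already been measured. The names here (allocate_bits, sensitivity, param_counts) are illustrative, not OptIQ's actual internals:

# Illustrative greedy bit-budget allocation (step 3), not OptIQ's real code.
# sensitivity[layer][bits] = measured KL divergence of the model output when
# only that layer is quantized to `bits`; lower is better.
def allocate_bits(sensitivity, param_counts, candidate_bits=(4, 8),
                  target_bpw=4.5, protected=("lm_head", "embed_tokens")):
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = {name: (hi if name in protected else lo) for name in sensitivity}

    total_params = sum(param_counts[n] for n in bits)
    budget = target_bpw * total_params  # total bit budget for the whole model

    def used():
        return sum(bits[n] * param_counts[n] for n in bits)

    def gain(name):
        # KL reduction per extra bit spent when upgrading lo -> hi.
        kl_drop = sensitivity[name][lo] - sensitivity[name][hi]
        return kl_drop / ((hi - lo) * param_counts[name])

    # Upgrade the highest KL-reduction-per-bit layers first until the budget is spent.
    for name in sorted((n for n in bits if bits[n] == lo), key=gain, reverse=True):
        extra = (hi - lo) * param_counts[name]
        if used() + extra <= budget:
            bits[name] = hi
    return bits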

KV-cache serving pipeline:

  1. optiq kv-cache runs the same sensitivity analysis, but for KV quantization: for each full-attention layer, it measures the KL divergence when that layer's KV is quantized to each candidate bit-width.
  2. optiq serve monkey-patches mlx_lm.server's generation loop to use mlx_lm.models.cache.QuantizedKVCache at per-layer bit-widths (via maybe_quantize_kv_cache). The patched hook replaces mlx-lm's uniform kv_bits with a per-layer dict (see the sketch after this list).
  3. At SDPA time, mx.quantized_matmul reads packed KV directly — no fp16 materialization. On Apple Silicon, the 8-bit kernel path is faster than the 4-bit one, so protecting a single sensitive layer at 8-bit gives both quality preservation and a decode speedup.
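
A minimal sketch of that per-layer hook, assuming mlx-lm's KVCache.to_quantized(group_size, bits) helper; the function name, the kv_bits_per_layer mapping, and the threshold argument are illustrative, not the actual optiq serve internals:

# Illustrative per-layer KV quantization hook, not OptIQ's real implementation.
# Mirrors the idea behind mlx-lm's maybe_quantize_kv_cache, but looks up the
# bit-width per layer instead of applying one uniform kv_bits value.
from mlx_lm.models.cache import KVCache, QuantizedKVCache

def quantize_kv_per_layer(prompt_cache, kv_bits_per_layer, group_size=64,
                          quantized_kv_start=0):
    # kv_bits_per_layer: full-attention layer index -> bits, e.g. {3: 8, 7: 4, ...}
    for i, c in enumerate(prompt_cache):
        bits = kv_bits_per_layer.get(i)  # None => linear-attention layer, no KV cache
        if bits is None or isinstance(c, QuantizedKVCache):
            continue
        if isinstance(c, KVCache) and c.offset >= quantized_kv_start:
            # Converts the fp16 cache into packed, per-group quantized K/V.
            prompt_cache[i] = c.to_quantized(group_size=group_size, bits=bits)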

TurboQuant KV (research path — not default in optiq serve):

  1. Random orthogonal rotation makes coordinates near-independent.
  2. Optimal Lloyd-Max scalar quantization per coordinate.
  3. Rotated-space attention: rotate Q once and output once, work in centroid space in between. O(d²) fixed cost vs O(seq_len × d²) for naïve rotated-then-dequant.
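
The rotation composes cleanly with attention because orthogonal matrices preserve inner products, so q·k computed on rotated (then quantized) vectors approximates the original score. A minimal MLX check of that invariance; the QR-based rotation here is illustrative, not TurboQuant's construction:

# Orthogonal rotation preserves q·k, so attention logits survive rotated-space
# quantization. Illustrative check only.
import mlx.core as mx

d = 128
# Random orthogonal matrix from the QR decomposition of a Gaussian matrix
# (MLX's QR runs on the CPU stream).
rot, _ = mx.linalg.qr(mx.random.normal((d, d)), stream=mx.cpu)

q = mx.random.normal((d,))
k = mx.random.normal((d,))

score = mx.sum(q * k)                      # original attention logit
score_rot = mx.sum((rot @ q) * (rot @ k))  # identical up to floating-point error
print(score.item(), score_rot.item())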

Hybrid-attention note

Qwen3.5 interleaves linear-attention (GatedDeltaNet) and full-attention layers in a repeating 3:1 pattern, so only 1 in 4 layers has a KV cache. optiq kv-cache automatically skips the linear-attention layers. On Qwen3.5-4B/9B this means 8 of 32 layers get per-layer KV bit assignments; the typical output is 7 layers at 4-bit plus 1 at 8-bit, protecting layer 3 (the first full-attention layer).
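
For illustration, a per-layer assignment of that shape, assuming full-attention layers sit at every fourth index starting from layer 3 and boiling kv_config.json down to a layer-index-to-bits mapping (the real file's schema may differ):

# Hypothetical per-layer KV bit assignment for a Qwen3.5-4B/9B-style model:
# 7 full-attention layers at 4-bit, layer 3 (the first full-attention layer) at 8-bit.
kv_bits_per_layer = {3: 8, 7: 4, 11: 4, 15: 4, 19: 4, 23: 4, 27: 4, 31: 4}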

Status / roadmap

  • ✅ Weight quantization: production
  • ✅ KV cache serving (Qwen3.5): production in v0.0.5
  • 🚧 Gemma-4 KV serving: blocked on upstream mlx-lm shared-KV attention not supporting QuantizedKVCache
  • 🔬 TurboQuant serving with a fused Metal kernel: research

Article

Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Requirements

  • Python ≥ 3.11
  • Apple Silicon Mac (for MLX)
  • mlx-lm ≥ 0.20

License

MIT
