mlx-optiq
Mixed-precision quantization optimizer for MLX models on Apple Silicon
Optimized deployment of LLMs, VLMs, and vision models on Apple Silicon.
Website: https://mlx-optiq.pages.dev/ | PyPI: https://pypi.org/project/mlx-optiq/ | Models: https://huggingface.co/mlx-community?search_models=OptiQ
OptIQ is an optimizing compiler and runtime for MLX. It takes a full-precision model and turns it into the best version for a given memory/latency budget on your Mac, using per-layer sensitivity measurements instead of "uniform 4-bit everywhere". The same sensitivity signal drives every layer of the stack — weights, KV cache, LoRA fine-tuning, runtime adapter swapping.
pip install mlx-optiq
Why mlx-optiq
Stock mlx-lm treats every layer of a quantized model the same. OptIQ doesn't:
- Some layers are 50× more sensitive to quantization than others. OptIQ measures this once per model and assigns bits per-layer, holding the same average bits-per-weight while cutting quality loss. On GSM8K, this recovers +15–32 percentage points over uniform 4-bit on the same model, same quant budget (results).
- The same is true of the KV cache. Some attention layers' KV are catastrophic to quantize (layer 0 KV is ~56× more sensitive than the median), others are essentially lossless at 4-bit.
  `optiq serve` runs a per-layer KV quant pipeline that preserves quality while cutting decode memory: up to +62% decode tok/s at 64k context vs fp16 KV on an M3 Max.
- LoRA fine-tuning should reuse that sensitivity signal too. `optiq lora train` assigns higher adapter rank to layers OptIQ identified as sensitive, and lower rank to robust ones, so your adapter budget goes where it helps most.
- Multi-adapter serving shouldn't reload the base model every time. `optiq serve` implements reversible mounted LoRA: mount multiple adapters on one base, switch per request via a ContextVar-isolated activation gate, all without touching the frozen base weights.
Plus everything a deployment framework actually needs: vision-stripping for pure-text variants of VLMs, TurboQuant rotated-space KV compression (research path), YOLO26 quantization for object detection, and a roofline latency model calibrated to Apple Silicon bandwidth.
The full stack at a glance
| Feature | CLI | Description |
|---|---|---|
| Weight quantization | `optiq convert` | Per-layer sensitivity + greedy knapsack. Auto-strips vision/audio metadata when quantizing multi-modal base models for text-only use. |
| KV cache quantization | `optiq kv-cache` | Writes per-layer `kv_config.json`. Same sensitivity method applied to the attention cache. |
| TurboQuant compression | `optiq.core.turbo_quant` | Rotation + optimal Lloyd-Max scalar quantization. Library API for research/custom pipelines. |
| OpenAI-compatible server | `optiq serve` | Drop-in `mlx_lm.server` replacement with mixed-precision KV, mounted LoRA adapters, and `--adapter <HF repo id>` auto-download. |
| Sensitivity-aware LoRA | `optiq lora train` | Per-layer rank scaling from OptIQ's bit assignments. PEFT-compatible output + OptIQ sidecar metadata. |
| Mounted hot-swap adapters | `optiq.adapters.mount` | Reversible per-request adapter activation via ContextVar. N adapters co-resident with one base. |
| VLM → text-only | `optiq convert --strip-unused-modalities` | Drops vision/audio weights + config cleanup. Output routes through `gemma4_text` / `qwen3_5_text` instead of the VLM wrapper. |
| YOLO26 quantization | `optiq convert --model-type yolo` | Full pipeline including per-layer detection-output KL sensitivity. Outputs a yolo-mlx-compatible model. |
| Latency prediction | `optiq latency` | Roofline model calibrated to Apple Silicon memory bandwidth. Predicts decode tok/s for a given model + bit layout before running it. |
| Benchmarking | `optiq benchmark` / `optiq eval` | GSM8K, AI2D, WER, perplexity. Side-by-side vs baselines. |
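The roofline intuition behind `optiq latency` is that decode is memory-bound: each generated token must read every packed weight plus the whole KV cache, so predicted tok/s is roughly effective bandwidth divided by bytes touched per token. A minimal sketch of that model (simplified, with hypothetical parameter names; not OptIQ's calibrated implementation):

```python
def predict_decode_tps(n_params, avg_bpw, n_kv_layers, kv_heads, head_dim,
                       context_len, kv_bits, bandwidth_gbs, efficiency=0.8):
    """Rough roofline estimate for memory-bound decode: throughput is
    achievable bandwidth divided by bytes read per generated token."""
    weight_bytes = n_params * avg_bpw / 8            # packed weights, read once per token
    kv_bytes = (n_kv_layers * 2 * kv_heads * head_dim
                * context_len * kv_bits / 8)          # K and V for the whole context
    bytes_per_token = weight_bytes + kv_bytes
    return efficiency * bandwidth_gbs * 1e9 / bytes_per_token
```

This also shows why KV quantization dominates at long context: at 64k tokens the KV term outgrows the weight term, so halving KV bits directly lifts tok/s.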
Quickstart
Running a pre-built OptIQ model
Every mlx-community OptiQ model works out of the box with stock mlx-lm:
from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=50))
Installing mlx-optiq unlocks the rest — mixed-precision KV serving, LoRA fine-tuning, runtime hot-swap adapters. Bit-identical inference quality either way.
Serving with mixed-precision KV
# One-time per-layer KV sensitivity analysis
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
--target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b
# OpenAI-compatible server on :8080
optiq serve \
--kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
--model mlx-community/Qwen3.5-9B-OptiQ-4bit \
--max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20
Fine-tuning with sensitivity-aware LoRA
# Train a LoRA adapter. --rank-scaling by_bits assigns rank proportional
# to OptIQ's per-layer bit assignments: 8-bit layers get 2× the rank of
# 4-bit layers, at the same average.
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
--data ./my_training_data \
--rank 8 --rank-scaling by_bits \
--iters 1000 -o ./my_adapter
# Inspect what was adapted (shows per-layer rank distribution)
optiq lora info ./my_adapter
Adapter output is PEFT-compatible (adapter_config.json + adapters.safetensors) plus an OptIQ sidecar (optiq_lora_config.json) that records per-layer rank. Loads with any PEFT tool.
Serving with hot-swap adapters
# Preload an adapter at startup (HF repo id or local path)
optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
--adapter ./my_adapter
# ...or use the mount API directly for multi-adapter serving:
# mount N adapters on one base, switch per request in the same Python
# process without reloading the model. See optiq/adapters/mount.py.
Converting a fresh model
pip install mlx-optiq[convert]
optiq convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8
optiq eval ./optiq_model --task gsm8k --baseline ./uniform_4bit
For multi-modal base models quantized for text-only deployment, OptIQ auto-strips vision/audio metadata by default. Pass --keep-unused-modalities to disable.
YOLO26
from optiq.models.yolo import run_yolo_pipeline
results = run_yolo_pipeline(
"optiq_output/yolo26n.safetensors",
"optiq_output/yolo26n_optiq",
)
Full pipeline: per-layer sensitivity on detection outputs → greedy knapsack → yolo-mlx-compatible output.
Headline numbers
Weight quantization on GSM8K — OptIQ vs uniform 4-bit (same avg BPW):
| Model | Uniform-4b | OptIQ-4b | Δ |
|---|---|---|---|
| Qwen3.5-0.8B | 11.5% | 27.0% | +15.5pp |
| gemma-4-e4b-it | 23.5% | 55.5% | +32.0pp |
Decode throughput at 64k context — optiq serve mixed-precision KV vs fp16 on M3 Max 36GB:
| Model | fp16 tok/s | OptIQ tok/s | Δ |
|---|---|---|---|
| Qwen3.5-2B | 27.9 | 41.8 | +50% |
| Qwen3.5-4B | 8.1 | 13.1 | +62% |
| Qwen3.5-9B | 20.7 | 27.1 | +31% |
Full tables + methodology: Results page.
How it works
Weight quantization pipeline (`optiq convert`):
- Load the base model from HuggingFace via PyTorch.
- For each linear layer × each candidate bit-width, simulate quantization and measure KL divergence between full-precision and quantized logits on a calibration set (WikiText-2 for LLMs, COCO-captions for VLMs).
- Greedy knapsack: start every layer at the minimum bits, upgrade the layer with the best "KL-reduction-per-bit" ratio each step until the target BPW budget is spent. Protected layers (`lm_head`, `embed_tokens`, first/last transformer blocks) always get the max bit-width.
- MLX conversion via `mlx_lm.convert()` with the per-layer `quant_predicate` from step 3.
- For multi-modal base models, strip vision/audio metadata from config.json (auto; opt out with `--keep-unused-modalities`).
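The greedy knapsack step can be sketched in a few lines (an illustrative implementation with made-up sensitivity numbers, not OptIQ's actual code):

```python
def assign_bits(kl_by_bits, sizes, target_bpw, candidate_bits=(4, 8)):
    """Greedy knapsack: start every layer at the minimum width, then
    repeatedly upgrade the layer with the best KL-reduction-per-bit
    ratio until the average-BPW budget is spent.

    kl_by_bits: {layer: {bits: measured KL at that width}}
    sizes:      {layer: number of weights}
    """
    lo, hi = min(candidate_bits), max(candidate_bits)
    bits = {layer: lo for layer in kl_by_bits}       # everyone starts cheap
    total = sum(sizes.values())
    budget = target_bpw * total - lo * total          # extra bits to hand out

    while True:
        best, best_ratio = None, 0.0
        for layer, b in bits.items():
            if b == hi:
                continue
            cost = (hi - b) * sizes[layer]            # extra bits this upgrade costs
            if cost > budget:
                continue
            gain = kl_by_bits[layer][b] - kl_by_bits[layer][hi]
            if gain / cost > best_ratio:
                best, best_ratio = layer, gain / cost
        if best is None:                              # budget spent or nothing helps
            break
        budget -= (hi - bits[best]) * sizes[best]
        bits[best] = hi
    return bits
```

With a 6.0 BPW target over two equally sized layers and 4/8-bit candidates, the layer whose KL drops most per bit gets the single available upgrade; the robust layer stays at 4-bit.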
KV cache pipeline (`optiq kv-cache` + `optiq serve`):
- Same sensitivity measurement but applied to the KV cache: for each full-attention layer, replace that layer's KV with a quantized copy and measure KL on held-out prompts.
- `optiq serve` monkey-patches `mlx_lm.server.stream_generate` to use `mlx_lm.models.cache.QuantizedKVCache` at per-layer bit-widths (via a patched `maybe_quantize_kv_cache` hook).
- At attention time, `mx.quantized_matmul` reads packed KV directly, with no fp16 materialization. On Apple Silicon, the 8-bit kernel is faster than the 4-bit kernel, so protecting one sensitive layer at 8-bit gives both quality and a throughput bump.
TurboQuant KV (optiq.core.turbo_kv_cache — research path):
- Random orthogonal rotation makes coordinates near-independent → better-conditioned quantization.
- Optimal Lloyd-Max scalar quantization per coordinate (1/2/3/4-bit centroid tables).
- Rotated-space attention: rotate Q once and output once, work in centroid space in between. Attention cost stays O(seq × d); rotation is O(d²) fixed.
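The rotation step can be sketched with NumPy (an illustrative sketch using a uniform scalar quantizer in rotated space; OptIQ's Lloyd-Max centroid tables and fused attention path are more involved):

```python
import numpy as np

def random_rotation(d, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))    # sign fix so the rotation is uniformly distributed

def quantize_rotated(x, rot, bits=4):
    """Rotate, uniformly quantize each coordinate, rotate back (for comparison).
    The rotation spreads energy across coordinates, conditioning the quantizer."""
    z = x @ rot
    levels = 2 ** bits - 1
    lo, hi = z.min(axis=0), z.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((z - lo) / scale)                # per-coordinate integer codes
    z_hat = codes * scale + lo
    return z_hat @ rot.T
```

Because the rotation is orthogonal, quantization error is preserved under the inverse rotation, and the O(d²) rotation cost is paid once per vector rather than per attention step.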
Sensitivity-aware LoRA (`optiq lora train`):
- Read `optiq_metadata.json.per_layer`, OptIQ's per-layer bit assignment.
- Per target linear, derive rank: `rank_scaling=by_bits` gives `r = base_rank × (bits / 4)`, so 8-bit layers get 2× the rank of 4-bit layers at the same base.
- Apply mounted LoRA across all target blocks with the per-layer rank.
- Train via `mlx_lm.tuner.trainer.train`, with the `mx.compile` decorator monkey-patched out to avoid a known Metal OOM on 9B-class models.
- Save in PEFT-compatible format plus an OptIQ sidecar recording the per-layer rank distribution.
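The `by_bits` rule is just a linear scale on the base rank (a hypothetical helper mirroring the formula above, not OptIQ's API):

```python
def scaled_rank(base_rank, bits, ref_bits=4):
    """rank_scaling=by_bits: rank grows linearly with the layer's assigned
    bit-width, so an 8-bit (sensitive) layer gets 2x the rank of a 4-bit one."""
    return max(1, round(base_rank * bits / ref_bits))
```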
Mounted LoRA hot-swap (`optiq.adapters.mount`):
- `prepare_model_for_mounted_lora` walks every transformer block and wraps each target linear (`q_proj`, `v_proj` by default) in a `MountedLoRALinear` that holds a dict of `{adapter_id: (A, B, scale)}` plus the frozen base linear.
- `mount_adapter_on_model(model, adapter_id, adapter_dir)` loads adapter weights off disk and registers them on every `MountedLoRALinear`.
- At inference time, a `ContextVar` decides which adapter is active. `None` means base only; inside `with AdapterActivation("A"):` the forward pass adds adapter A's residual.
- ContextVar semantics mean concurrent asyncio tasks/threads with different active adapters don't step on each other.
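The activation-gate mechanism can be sketched in plain Python (a minimal mock with hypothetical names following the description above; the real `MountedLoRALinear` operates on MLX arrays):

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Which adapter the current task/thread has activated; None = base only.
_active_adapter: ContextVar = ContextVar("active_adapter", default=None)

@contextmanager
def adapter_activation(adapter_id):
    """Activate one adapter for the current context only, then restore."""
    token = _active_adapter.set(adapter_id)
    try:
        yield
    finally:
        _active_adapter.reset(token)

class MountedLinear:
    """Frozen base transform plus a dict of mounted adapter residuals."""
    def __init__(self, base_fn):
        self.base_fn = base_fn
        self.adapters = {}            # adapter_id -> residual function

    def mount(self, adapter_id, residual_fn):
        self.adapters[adapter_id] = residual_fn

    def __call__(self, x):
        y = self.base_fn(x)
        active = _active_adapter.get()
        if active in self.adapters:   # None never matches a mounted id
            y = y + self.adapters[active](x)
        return y
```

Because `ContextVar` values are per-task under asyncio, two concurrent requests can run through the same `MountedLinear` with different adapters active and never see each other's residuals.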
When you need mlx-optiq vs bare mlx-lm
| Scenario | Bare mlx-lm | mlx-optiq |
|---|---|---|
| Load + generate from an OptIQ HF model | ✅ | ✅ |
| Mixed-precision KV cache at serve time | — | ✅ |
| LoRA fine-tuning that uses OptIQ's sensitivity data | — | ✅ |
| Hot-swappable adapters in one serving process | — | ✅ |
| Fresh conversion of a new base model with OptIQ | — | ✅ |
| TurboQuant research pipelines | — | ✅ |
| YOLO26 quantization | — | ✅ |
For pure inference on published OptIQ models: bare mlx-lm is enough and gets bit-identical output. For everything else, install mlx-optiq.
Hybrid-attention note
Qwen3.5 interleaves linear-attention (GatedDeltaNet) and full-attention layers on a 4:1 ratio — only 1 in 4 layers has a KV cache. optiq kv-cache skips the linear layers automatically. On Qwen3.5-4B/9B, you end up with 8 of 32 layers getting per-layer KV bit assignments, typically 7 @ 4-bit + 1 @ 8-bit protecting layer 3 (the first full-attention layer).
Status / roadmap
- ✅ Weight quantization: production
- ✅ KV-cache serving (Qwen3.5): production since v0.0.5
- ✅ Sensitivity-aware LoRA + mounted hot-swap: production since v0.0.8
- ✅ VLM-to-text metadata stripping: production since v0.0.8
- ✅ YOLO26 pipeline: production
- 🚧 Gemma-4 KV serving: blocked on upstream mlx-lm shared-KV attention not supporting `QuantizedKVCache`
- 🚧 Per-request adapter routing in the HTTP layer: the mount/swap API is production; HTTP `X-OptIQ-Adapter` header plumbing is next
- 🔬 TurboQuant serving path with a fused Metal kernel: research
Article
Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon
Requirements
- Python ≥ 3.11
- Apple Silicon Mac (for MLX)
- mlx-lm ≥ 0.30
License
MIT